Building Custom Q&A Applications Using LangChain and Pinecone Vector Database

24 July 2024


Introduction

The advent of large language models (LLMs) is one of the most exciting technological developments of our time. It has opened up new possibilities in artificial intelligence, offering solutions to real-world problems across various industries. One fascinating application of these models is building custom question-answering applications or chatbots that draw from personal or organizational data sources. However, since LLMs are trained on publicly available general data, their answers may not always be specific or useful to the end user. To solve this issue, we can use frameworks such as LangChain to develop custom chatbots that provide specific answers based on our own data. In this article, we will learn how to build a custom Q&A application and deploy it on Streamlit Cloud.

Image credits: Mirantha Jayathilaka, PhD

Learning objectives

Before diving deep into the article, let’s outline the key learning objectives:

Learn the entire workflow of custom question answering and the role of each component in the workflow
Know the advantages of a Q&A application over fine-tuning a custom LLM
Learn the basics of the Pinecone vector database to store and retrieve vectors
Build a semantic search pipeline using OpenAI LLMs, LangChain, and the Pinecone vector database to develop a Streamlit application

This article was published as a part of the Data Science Blogathon.

Introduction
Overview of Q&A Applications
Q&A Applications Workflow
Advantages of Custom Q&A Applications Over a Model Fine-tuning
What is a Pinecone Vector Database?
Building a Semantic Search Pipeline Using OpenAI and Pinecone
Custom Question Answering Application with Streamlit
Industry Use-cases of Custom Q&A Applications
Conclusion
Frequently Asked Questions

Overview of Q&A Applications

Question answering, or “chat over your data”, is a widespread use case of LLMs and LangChain. LangChain provides a series of components to load the data sources you need for your use case. It supports many data sources and transformers that convert the data into a series of strings to store in vector databases. Once the data is stored in a database, one can query it using components called retrievers. Moreover, by using LLMs, we can get accurate, chatbot-style answers without sifting through tons of documents.

LangChain supports the following data sources. As you can see in the image, it offers over 120 integrations to connect to the data sources you may have.

Image credit: LangChain Docs

Q&A Applications Workflow

We learned about the data sources supported by LangChain, which lets us develop a question-answering pipeline from the components it provides. Below are the components used for document loading, storage, retrieval, and output generation by the LLM.

Document loaders: To load user documents for vectorization and storage purposes
Text splitters: Document transformers that split documents into chunks of a fixed maximum length so they can be embedded and stored efficiently
Vector storage: Vector database integrations to store vector embeddings of the input texts
Document retrieval: To retrieve texts from the database based on user queries, using similarity search techniques
Model output: The final answer to the user query, generated by the LLM from a prompt that combines the query and the retrieved texts

This is the high-level workflow of the question-answering pipeline, which can solve many kinds of real-world problems. I haven’t gone deep into each LangChain component here; if you want to learn more, check out my previous article published on Analytics Vidhya.

Q&A app: a workflow diagram (Image by Author)

Advantages of Custom Q&A Applications Over a Model Fine-tuning

Context-specific answers
Adaptable to new input documents
No need to fine-tune the model, which saves the cost of model training
More accurate and specific answers rather than general answers

What is a Pinecone Vector Database?

Pinecone is a popular vector database used in building LLM-powered applications. It is versatile and scalable for high-performance AI applications. It’s a fully managed, cloud-native vector database, so users don’t have to manage any infrastructure.

LLM-based applications involve large amounts of unstructured data, which require sophisticated long-term memory to retrieve information accurately. Generative AI applications rely on semantic search over vector embeddings to return suitable context based on user input.

Pinecone is well suited for such applications and is optimized to store and query large numbers of vectors with low latency, which helps in building user-friendly applications. Let’s learn how to create a Pinecone vector database for our question-answering application.

# install pinecone-client (this walkthrough uses the pinecone-client 2.x API)
pip install pinecone-client

# import pinecone and initialize with your API key and environment name
import pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# create your first index to get started with storing vectors
# (index names may only contain lowercase letters, digits, and hyphens)
pinecone.create_index("first-index", dimension=8, metric="cosine")

# connect to the index before writing to it
index = pinecone.Index("first-index")

# Upsert sample data (5 8-dimensional vectors)
index.upsert([
    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),
    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]),
    ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]),
    ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]),
    ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
])

# Use the list_indexes() method to list the indexes available in the database
pinecone.list_indexes()

[Output]>>> ['first-index']

In the above demonstration, we install the Pinecone client and initialize the connection to the vector database in our project environment. Once it is initialized, we can create an index with the required dimension and metric and insert vector embeddings into it. In the next section, we will develop a semantic search pipeline using Pinecone and LangChain for our application.
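To make the metric="cosine" setting more concrete, here is a minimal pure-Python sketch of cosine similarity. It is not Pinecone's implementation, and the helper name cosine_similarity and the contrast vector vec_x are made up for illustration. It also shows one property worth knowing: sample vectors such as [0.1] * 8 and [0.5] * 8 point in the same direction, so the cosine metric treats them as practically identical even though their lengths differ.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Two of the sample vectors: same direction, different magnitude.
vec_a = [0.1] * 8
vec_e = [0.5] * 8

# A hypothetical vector pointing in a different direction, for contrast.
vec_x = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

same_direction = cosine_similarity(vec_a, vec_e)
different_direction = cosine_similarity(vec_a, vec_x)

print(same_direction)       # very close to 1
print(different_direction)  # clearly smaller
```

Real embeddings differ in direction, which is what lets a cosine-based index rank them meaningfully against a query vector.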

Building a Semantic Search Pipeline Using OpenAI and Pinecone

We learned that there are 5 steps in the question-answering application workflow. In this section, we will perform the first 4 steps: document loaders, text splitters, vector storage, and document retrieval.

To perform these steps in your local environment or a cloud-based notebook environment like Google Colab, you must install some libraries and create accounts on OpenAI and Pinecone to obtain their API keys. Let’s start with the environment setup:

Step 1: Installing Required Libraries

# install langchain and openai with other dependencies
!pip install --upgrade langchain openai -q
!pip install pillow==6.2.2
!pip install unstructured -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/[email protected] /
                                            #egg=detectron2 -q
!apt-get install poppler-utils
!pip install pinecone-client -q
!pip install tiktoken -q

# setup openai environment
import os
os.environ["OPENAI_API_KEY"] = "YOUR-API-KEY"

# importing libraries (os is already imported above)
import openai
import pinecone
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

After the installation setup, import all the libraries mentioned in the above code snippet. Then, follow the next steps below:

Step 2: Load the Documents

In this step, we will load the documents from the directory as a starting point for the AI project pipeline. We have a few documents in our directory, which we will load into our project environment.

#load the documents from content/data dir
directory = '/content/data'

# load_docs loads every document in the directory using LangChain's DirectoryLoader
def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)
[Output]>>> 5

Step 3: Split the Texts Data

Text embeddings and LLMs work better when each piece of text has a bounded, consistent length. Thus, splitting texts into chunks of similar size is necessary for most LLM use cases. We will use ‘RecursiveCharacterTextSplitter’ to split the documents into chunks of at most chunk_size characters.

# split the docs using recursive text splitter
def split_docs(documents, chunk_size=200, chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

# split the docs
docs = split_docs(documents)
print(len(docs))
[Output]>>>12
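To see how chunk_size and chunk_overlap interact, here is a simplified sketch that slides a fixed character window over the text. This is not how RecursiveCharacterTextSplitter works internally (it tries to break on separators such as paragraph breaks and spaces first, so its chunks can be shorter than chunk_size); the function name chunk_text is hypothetical and only illustrates the two parameters.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=20):
    """Split text into character windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters of toy text
chunks = chunk_text(sample, chunk_size=20, chunk_overlap=5)

for chunk in chunks:
    print(len(chunk), chunk)
```

Each chunk repeats the last few characters of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.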

Step 4: Store the Data in Vector Storage

Once the documents are split, we will store their embeddings in the vector database using OpenAI embeddings.

# initialize the OpenAI embedding model
embeddings = OpenAIEmbeddings()

# initialize Pinecone
pinecone.init(
    api_key="YOUR-API-KEY",
    environment="YOUR-ENV"
)

# define index name
index_name = "langchain-project"

# store the data and embeddings into pinecone index
index = Pinecone.from_documents(docs, embeddings, index_name=index_name)
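Two practical details sit behind this step: every vector written to an index must have the dimension the index was created with (1536 for OpenAI's text-embedding-ada-002 model), and large collections are usually written in batches rather than in one request. The sketch below shows both checks in plain Python. The helper names batched and check_dimensions, the toy dimension of 4, and the default batch size of 100 are assumptions for illustration, not part of the LangChain or Pinecone APIs.

```python
def batched(items, batch_size=100):
    """Yield successive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def check_dimensions(vectors, expected_dim):
    """Raise if any (id, values) pair has the wrong number of values."""
    for vec_id, values in vectors:
        if len(values) != expected_dim:
            raise ValueError(
                f"vector {vec_id!r} has dimension {len(values)}, "
                f"expected {expected_dim}"
            )


# Toy stand-in for real embeddings: 12 vectors of dimension 4.
EXPECTED_DIM = 4
vectors = [(f"chunk-{i}", [0.0] * EXPECTED_DIM) for i in range(12)]

check_dimensions(vectors, EXPECTED_DIM)
batch_sizes = [len(batch) for batch in batched(vectors, batch_size=5)]
print(batch_sizes)  # [5, 5, 2]
```

In a real pipeline, each batch would be passed to the index's upsert call instead of just being counted.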

Step 5: Retrieve Data from the Vector Database

We will retrieve the documents at this stage using a semantic search over our vector database. We have vectors stored in an index called “langchain-project”, and once we query it as shown below, we get the most similar documents from the database.

# An example query to our database
query = "What are the different types of pet animals are there?"

# do a similarity search and store the documents in result variable 
result = index.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)
---------------------------------[Output]--------------------------------------
result
[Document(page_content='Small mammals like hamsters, guinea pigs, 
and rabbits are often chosen for their
low maintenance needs. Birds offer beauty and song,
and reptiles like turtles and lizards can make intriguing pets.', 
metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
 Document(page_content='Pet animals come in all shapes and sizes, each suited 
to different lifestyles and home environments. Dogs and cats are the most 
common, known for their companionship and unique personalities. Small', 
metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
 Document(page_content='intriguing pets. Even fish, with their calming presence
, can be wonderful pets.', 
metadata={'source': '/content/data/Different Types of Pet Animals.txt'})]

We can retrieve the documents based on a similarity search from the vector store, as shown in the above code snippet. If you want to learn more about semantic search applications, I highly recommend reading my previous article on this topic.
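The retrieval step can be pictured as "embed the query, score every stored chunk, keep the top k". The toy sketch below does this with word-count vectors in place of real embeddings, so it only matches exact words and is not truly semantic. The function names embed, cosine, and similarity_search are illustrative and are not the LangChain implementation.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (a real app calls an embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_search(chunks, query, k=3):
    """Return the k (score, chunk) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(embed(chunk), query_vec), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunks = [
    "Dogs and cats are the most common pets.",
    "Birds offer beauty and song.",
    "Pinecone stores vectors in an index.",
]

results = similarity_search(chunks, "Which pets are the most common?", k=2)
for score, chunk in results:
    print(round(score, 3), chunk)
```

A vector database performs the same ranking, but over dense embeddings and with indexing structures that avoid scoring every stored vector.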

Custom Question Answering Application with Streamlit

In the final stage of the question-answering application, we will integrate every workflow component to build a custom Q&A application that lets users supply various data sources, such as web-based articles, PDFs, and CSVs, and chat with them, making users more productive in their daily activities. We need to create a GitHub repository and add the following files.

Repo structure

Add these Project Files

main.py — A python file containing streamlit front-end code
qanda.py — Prompt design and Model output function to return an answer to users’ query
utils.py — Utility functions to load and split input documents
vector_search.py — Text embeddings and Vector storage function
requirements.txt — Project dependencies to run the application in streamlit public cloud

We are supporting two types of data sources in this project demonstration:

Web URL-based text data
Online PDF files

These two types contain a wide range of text data and are the most common for many use cases. You can see the main.py Python code below to understand the app’s user interface.

# import necessary libraries
import streamlit as st
import openai
import qanda
from vector_search import *
from utils import *
from io  import StringIO

# take openai api key in
api_key = st.sidebar.text_input("Enter your OpenAI API key:", type='password')
# open ai key
openai.api_key = str(api_key)

# header of the app
_ , col2,_ = st.columns([1,7,1])
with col2:
    col2 = st.header("Simplchat: Chat with your data")
    url = False
    query = False
    pdf = False
    data = False
    # select option based on user need
    options = st.selectbox("Select the type of data source",
                            options=['Web URL','PDF','Existing data source'])
    #ask a query based on options of data sources
    if options == 'Web URL':
        url = st.text_input("Enter the URL of the data source")
        query = st.text_input("Enter your query")
        button = st.button("Submit")
    elif options == 'PDF':
        pdf = st.text_input("Enter your PDF link here") 
        query = st.text_input("Enter your query")
        button = st.button("Submit")
    elif options == 'Existing data source':
        data= True
        query = st.text_input("Enter your query")
        button = st.button("Submit") 

# write code to get the output based on given query and data sources   
if button and url:
    with st.spinner("Updating the database..."):
        corpusData = scrape_text(url)
        encodeaddData(corpusData,url=url,pdf=False)
        st.success("Database Updated")
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query,2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context,query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: "+ answer)

# write a code to get output on given query and data sources
if button and pdf:
    with st.spinner("Updating the database..."):
        corpusData = pdf_text(pdf=pdf)
        encodeaddData(corpusData,pdf=pdf,url=False)
        st.success("Database Updated")
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query,2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context,query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: "+ answer)
        
if button and data:
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query,2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context,query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: "+ answer)
        
        
# delete the vectors from the database (button placed inside the expander)
with st.expander("Delete the indexes from the database"):
    button1 = st.button("Delete the current vectors")
    if button1:
        index.delete(deleteAll='true')

To check the other code files, please visit the project’s GitHub repository.
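Since qanda.py is not reproduced in this article, here is a minimal sketch of what a prompt-design function of this kind might look like. The name build_prompt and the instruction wording are assumptions, not the repository's actual code; the point is only to show how the retrieved context and the user's query end up in a single prompt string.

```python
def build_prompt(context, query):
    """Combine retrieved text and the user's question into one prompt string."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Toy retrieved chunks standing in for the vector search results.
retrieved = [
    "Dogs and cats are the most common pets.",
    "Birds offer beauty and song.",
]

# Same joining step that main.py applies to the search results.
context = "\n\n".join(retrieved)
prompt = build_prompt(context, "Which pets are the most common?")
print(prompt)
```

Telling the model to rely only on the supplied context is what keeps the answers specific to your documents rather than to the model's general training data.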

Deployment of the Q&A App on Streamlit Cloud

Application UI

Streamlit provides a community cloud to host applications free of cost. Moreover, Streamlit is easy to use thanks to its automated CI/CD pipeline features. To learn more about building apps with Streamlit, please see my previous article on Analytics Vidhya.

Industry Use-cases of Custom Q&A Applications

Custom question-answering applications are being adopted in many industries as new and innovative use cases emerge in this field. Let’s look at some of these use cases:

Customer Support Assistance

The revolution in customer support has begun with the rise of LLMs. Whether in e-commerce, telecommunications, or finance, customer service bots built on a company’s documents can help customers make faster and more informed decisions, which can increase revenue.

Healthcare Industry

Timely information is crucial for patients to get treatment for certain diseases. Healthcare companies can develop an interactive chatbot to provide medical information, drug information, symptom explanations, and treatment guidelines in natural language without needing an actual person.

Legal Industry

Lawyers deal with vast amounts of legal information and documents to solve court cases. Custom LLM applications developed using such large amounts of data can help lawyers to be more efficient and solve cases much faster.

Technology Industry

The biggest game-changing use case of Q&A applications is programming assistance. Tech companies can build such apps on their internal code base to help programmers with problem-solving, understanding code syntax, debugging errors, and implementing specific functionalities.

Government and Public Services

Government policies and schemes contain vast amounts of information that can overwhelm many people. By developing custom applications for such government services, agencies can help citizens get information on government programs and regulations. Such applications can also help citizens fill out government forms and applications correctly.

Conclusion

In conclusion, we have explored the exciting possibilities of building a custom question-answering application using LangChain and the Pinecone vector database. This blog has taken us through the fundamental concepts, from an overview of the question-answering application to understanding the capabilities of the Pinecone vector database. By combining a semantic search pipeline built on OpenAI embeddings with Pinecone’s efficient indexing and retrieval system, we have created a robust and accurate question-answering solution with Streamlit. Let’s look at the key takeaways from the article:

Key Takeaways

Large language models (LLMs) have revolutionized AI, enabling diverse applications. Customizing chatbots with personal or organizational data is a powerful approach.
While general LLMs offer a broad understanding of language, tailored question-answering applications offer a distinct advantage over fine-tuned personalized LLMs due to their flexibility and cost-effectiveness.
By incorporating the Pinecone vector database, OpenAI LLMs, and LangChain, we learned how to develop a semantic search pipeline and deploy it on a cloud-based platform like Streamlit.

Frequently Asked Questions

Q1: What are Pinecone and LangChain?

A: Pinecone is a scalable long-term memory vector database to store text embeddings for LLM-powered applications, while LangChain is a framework that allows developers to build LLM-powered applications.

Q2: What is the application of NLP question answering?

A: Question-answering applications are used in customer support chatbots, academic research, e-learning, etc.

Q3: Why should I use LangChain?

A: LangChain gives developers a range of components to integrate LLMs in a developer-friendly way, which helps them ship products faster.

Q4: What are the steps to build a Q&A application?

A: The steps to build a Q&A application are document loading, text splitting, vector storage, retrieval, and model output.

Q5: What are LangChain tools?

A: LangChain has the following tools: Document loaders, Document transformers, Vector stores, Chains, Memory, and Agents.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


Avikumar Talaviya

27 Oct 2023
