Introduction
The advent of large language models (LLMs) is one of the most exciting technological developments of our time. It has opened up new possibilities in artificial intelligence and offers solutions to real-world problems across industries. One fascinating application of these models is building custom question-answering applications or chatbots that draw on personal or organizational data sources. However, because LLMs are trained on publicly available general data, their answers may not always be specific or useful to the end user. To solve this, we can use frameworks such as LangChain to develop custom chatbots that give specific answers based on our own data. In this article, we will learn how to build a custom Q&A application and deploy it on Streamlit Cloud.
Learning objectives
Before diving deep into the article, let’s outline the key learning objectives:
- Learn the entire custom question-answering workflow and the role of each component in it
- Understand the advantages of a Q&A application over fine-tuning a custom LLM
- Learn the basics of the Pinecone vector database for storing and retrieving vectors
- Build a semantic search pipeline using OpenAI LLMs, LangChain, and the Pinecone vector database to develop a Streamlit application
This article was published as a part of the Data Science Blogathon.
Table of contents
- Introduction
- Overview of Q&A Applications
- Q&A Applications Workflow
- Advantages of Custom Q&A Applications Over Model Fine-tuning
- What is a Pinecone Vector Database?
- Building a Semantic Search Pipeline Using OpenAI and Pinecone
- Custom Question Answering Application with Streamlit
- Industry Use-cases of Custom Q&A Applications
- Conclusion
- Frequently Asked Questions
Overview of Q&A Applications
Question-answering, or “chat over your data”, is a widespread use case of LLMs and LangChain. LangChain provides a series of components to load the data sources your use case requires. It supports many data sources, along with transformers that convert the loaded data into a series of strings for storage in vector databases. Once the data is stored in a database, we can query it using components called retrievers. Combined with an LLM, this gives chatbot-style answers grounded in our documents, without having to sift through piles of documents by hand.
LangChain supports the following data sources. As the image shows, it offers over 120 integrations to connect to nearly any data source you may have.
Q&A Applications Workflow
We learned about the data sources supported by LangChain, which let us develop a question-answering pipeline from the components available in LangChain. Below are the components used for document loading, storage, retrieval, and generating the LLM output.
- Document loaders: To load user documents for vectorization and storage purposes
- Text splitters: These are the document transformers that split documents into chunks of bounded length so they can be stored and retrieved efficiently
- Vector storage: Vector database integrations to store vector embeddings of the input texts
- Document retrieval: To retrieve texts from the database based on user queries, using similarity search techniques
- Model output: The final answer to the user query, generated by the model from a prompt that combines the query and the retrieved texts
This is the high-level workflow of the question-answering pipeline, which can solve many real-world problems. I haven’t gone deep into each LangChain component here; if you want to learn more about them, check out my previous article published on Analytics Vidhya (Link: Click Here)
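To make the five stages concrete, here is a minimal, dependency-free sketch of the same workflow. It is only an illustration under stated assumptions: the word-count "embedding" is a toy stand-in for OpenAI embeddings, the plain Python list is a stand-in for a vector database, and every function name here is hypothetical rather than part of LangChain.

```python
import math
import re
from collections import Counter


def load_documents():
    """Stand-in for a document loader: returns raw text instead of reading files."""
    return [
        "Dogs and cats are the most common pets. Dogs are known for companionship.",
        "Pinecone is a managed vector database. It stores and queries embeddings.",
    ]


def split_text(text):
    """Stand-in for a text splitter: one chunk per sentence."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text):
    """Toy embedding: a sparse word-count vector (NOT a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Combine retrieved chunks and the question into the text sent to the LLM."""
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Stages 1-3: load, split, and store (chunk, vector) pairs in a plain list.
store = [(chunk, embed(chunk)) for doc in load_documents() for chunk in split_text(doc)]

# Stages 4-5: retrieve context and build the prompt an LLM would answer.
question = "Which database stores embeddings?"
top_chunks = retrieve(question, store, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In the real pipeline, each stand-in is replaced by a LangChain component, and the final prompt is sent to an LLM instead of being printed.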
Advantages of Custom Q&A Applications Over Model Fine-tuning
- Context-specific answers
- Adaptable to new input documents
- No need to fine-tune the model, which saves the cost of model training
- More accurate and specific answers rather than general answers
What is a Pinecone Vector Database?
Pinecone is a popular vector database used in building LLM-powered applications. It is versatile and scalable for high-performance AI applications. It’s a fully managed, cloud-native vector database, so users have no infrastructure to maintain.
LLM-based applications involve large amounts of unstructured data, which require sophisticated long-term memory to retrieve information with maximum accuracy. Generative AI applications rely on semantic search over vector embeddings to return suitable context based on user input.
Pinecone is well suited to such applications and is optimized to store and query large numbers of vectors with low latency, which helps in building responsive applications. Let’s learn how to create a Pinecone vector database for our question-answering application.
# install pinecone-client (run in a shell, or prefix with ! in a notebook)
pip install pinecone-client
# import pinecone and initialize with your API key and environment name
import pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
# create your first index to get started with storing vectors
# (Pinecone index names allow only lowercase letters, digits, and hyphens)
pinecone.create_index("first-index", dimension=8, metric="cosine")
# connect to the index before writing to it
index = pinecone.Index("first-index")
# Upsert sample data (five 8-dimensional vectors)
index.upsert([
    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),
    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]),
    ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]),
    ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]),
    ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
])
# Use list_indexes() to list the indexes available in the database
pinecone.list_indexes()
[Output]>>> ['first-index']
In the above demonstration, we install the Pinecone client and initialize it in our project environment. Once the client is initialized, we can create an index with the required dimension and metric, and then insert vector embeddings into it. In the next section, we will develop a semantic search pipeline using Pinecone and LangChain for our application.
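To see what the cosine metric measures, the following sketch computes cosine similarity in plain Python. It is an illustration of the formula only, not Pinecone's implementation, and the `mixed` vector is a hypothetical example that does not appear in the index above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


a = [0.1] * 8            # sample vector "A" from the upsert call
e = [0.5] * 8            # sample vector "E" from the upsert call
mixed = [0.5, 0.0] * 4   # hypothetical vector pointing in a different direction

# A and E point in the same direction, so their similarity is (approximately) 1.
print(cosine_similarity(a, e))
# The mixed vector points elsewhere, so the similarity drops below 1.
print(cosine_similarity(a, mixed))
```

Because cosine similarity ignores vector length, the five sample vectors are scalar multiples of each other and therefore look identical under this metric. That is fine for checking that the index works, but real embeddings need to differ in direction for a search to rank them meaningfully.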
Building a Semantic Search Pipeline Using OpenAI and Pinecone
We learned that there are 5 steps in the question-answering application workflow. In this section, we will perform the first 4 steps: document loaders, text splitters, vector storage, and document retrieval.
To perform these steps in your local environment or in a cloud-based notebook environment like Google Colab, you must install some libraries and create accounts on OpenAI and Pinecone to obtain their API keys. Let’s start with the environment setup:
Step 1: Installing Required Libraries
# install langchain and openai with other dependencies
!pip install --upgrade langchain openai -q
!pip install pillow==6.2.2
!pip install unstructured -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/[email protected] /
#egg=detectron2 -q
!apt-get install poppler-utils
!pip install pinecone-client -q
!pip install tiktoken -q
# setup openai environment
import os
os.environ["OPENAI_API_KEY"] = "YOUR-API-KEY"
# importing libraries
import os
import openai
import pinecone
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
After the installation setup, import all the libraries mentioned in the above code snippet. Then, follow the next steps below:
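Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. As a small sketch of one alternative, the helper below reads a key from an environment variable and fails with a clear message if it is missing. The function name `get_api_key` and the `demo_env` dictionary are hypothetical and used only for illustration.

```python
import os


def get_api_key(name, env=None):
    """Return the key stored under `name`, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return value


# Demo with an in-memory mapping so the sketch runs without real credentials.
demo_env = {"OPENAI_API_KEY": "sk-demo"}
print(get_api_key("OPENAI_API_KEY", env=demo_env))
```

In a real session, calling `get_api_key("OPENAI_API_KEY")` without the `env` argument reads from the process environment instead.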
Step 2: Load the Documents
In this step, we will load the documents from the directory as a starting point for the AI project pipeline. We have 2 documents in our directory, which we will load into our project environment.
# load the documents from content/data dir
directory = '/content/data'
# load_docs function to load documents using the LangChain DirectoryLoader
def load_docs(directory):
    loader = DirectoryLoader(directory)
    documents = loader.load()
    return documents
documents = load_docs(directory)
len(documents)
[Output]>>> 5
Step 3: Split the Texts Data
Embedding models and LLMs accept only a limited amount of input text, and retrieval tends to work better on small, focused passages. Splitting documents into chunks is therefore a standard step in LLM applications. We will use ‘RecursiveCharacterTextSplitter’ to split the documents into chunks of at most chunk_size characters, with a small overlap between neighboring chunks.
# split the docs using recursive text splitter
def split_docs(documents, chunk_size=200, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs
# split the docs
docs = split_docs(documents)
print(len(docs))
[Output]>>>12
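To illustrate what chunk_size and chunk_overlap control, here is a simplified sketch of fixed-window splitting with overlap. It is not the algorithm used by RecursiveCharacterTextSplitter, which prefers to break on separators such as paragraphs and spaces; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=20):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(50))  # 50 characters of digits
chunks = split_with_overlap(sample, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(chunk)
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary is less likely to lose its context in both chunks.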
Step 4: Store the Data in Vector Storage
Once the documents are split, we will generate their embeddings with OpenAI embeddings and store them in the vector database.
# create the OpenAI embeddings object used to vectorize the chunks
embeddings = OpenAIEmbeddings()
# initialize Pinecone
pinecone.init(
    api_key="YOUR-API-KEY",
    environment="YOUR-ENV"
)
# define index name (the index must already exist in your Pinecone project)
index_name = "langchain-project"
# store the data and embeddings into pinecone index
index = Pinecone.from_documents(docs, embeddings, index_name=index_name)
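The call above handles embedding and uploading for us. As a rough sketch of the bookkeeping involved when doing this by hand, the helpers below pair each chunk with its vector, check that every vector matches the index dimension, and group the records into batches. The names `make_records` and `batched`, the id scheme, and the batch size are all assumptions for illustration; the tiny 4-dimensional vectors stand in for real embeddings (OpenAI's text-embedding-ada-002 model, for example, produces 1536-dimensional vectors, so an index for it would need dimension=1536).

```python
from itertools import islice


def make_records(chunks, vectors, dimension):
    """Build (id, vector, metadata) tuples and verify each vector's dimension."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    records = []
    for i, (chunk, vector) in enumerate(zip(chunks, vectors)):
        if len(vector) != dimension:
            raise ValueError(f"vector {i} has dimension {len(vector)}, expected {dimension}")
        records.append((f"chunk-{i}", list(vector), {"text": chunk}))
    return records


def batched(records, batch_size=100):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Toy data: three chunks with made-up 4-dimensional vectors.
chunks = ["Dogs are loyal.", "Cats are independent.", "Fish are calming."]
vectors = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], [0.0, 0.5, 0.5, 0.0]]
records = make_records(chunks, vectors, dimension=4)
for batch in batched(records, batch_size=2):
    print([record_id for record_id, _, _ in batch])
```

Each batch could then be passed to an upsert call on the index; a dimension mismatch is caught locally instead of surfacing as an error from the database.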
Step 5: Retrieve Data from the Vector Database
At this stage, we will retrieve documents from our vector database using semantic search. Our vectors are stored in an index called “langchain-project”, and querying it as shown below returns the most similar documents from the database.
# An example query to our database
query = "What are the different types of pet animals are there?"
# do a similarity search and store the documents in result variable
result = index.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)
---------------------------------[Output]--------------------------------------
result
[Document(page_content='Small mammals like hamsters, guinea pigs,
and rabbits are often chosen for their
low maintenance needs. Birds offer beauty and song,
and reptiles like turtles and lizards can make intriguing pets.',
metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
Document(page_content='Pet animals come in all shapes and sizes, each suited
to different lifestyles and home environments. Dogs and cats are the most
common, known for their companionship and unique personalities. Small',
metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
Document(page_content='intriguing pets. Even fish, with their calming presence
, can be wonderful pets.',
metadata={'source': '/content/data/Different Types of Pet Animals.txt'})]
We can retrieve the documents based on a similarity search from the vector store, as shown in the above code snippet. If you want to learn more about semantic search applications, I highly recommend reading my previous article on this topic (link: click here)
Custom Question Answering Application with Streamlit
In the final stage of the question-answering application, we will integrate every workflow component to build a custom Q&A application that lets users chat with various data sources such as web-based articles, PDFs, and CSVs, making them more productive in their daily activities. We need to create a GitHub repository and add the following files.
Add these Project Files
- main.py — A python file containing streamlit front-end code
- qanda.py — Prompt design and Model output function to return an answer to users’ query
- utils.py — Utility functions to load and split input documents
- vector_search.py — Text embeddings and Vector storage function
- requirements.txt — Project dependencies to run the application in streamlit public cloud
We are supporting two types of data sources in this project demonstration:
- Web URL-based text data
- Online PDF files
These two types cover a wide range of text data and are the most common in many use cases. You can see the main.py Python code below to understand the app’s user interface.
# import necessary libraries
import streamlit as st
import openai
import qanda
from vector_search import *
from utils import *
from io import StringIO
# take openai api key in
api_key = st.sidebar.text_input("Enter your OpenAI API key:", type='password')
# open ai key
openai.api_key = str(api_key)
# header of the app
_, col2, _ = st.columns([1, 7, 1])
with col2:
    st.header("Simplchat: Chat with your data")
url = False
query = False
pdf = False
data = False
# select option based on user need
options = st.selectbox("Select the type of data source",
                       options=['Web URL', 'PDF', 'Existing data source'])
# ask a query based on options of data sources
if options == 'Web URL':
    url = st.text_input("Enter the URL of the data source")
    query = st.text_input("Enter your query")
    button = st.button("Submit")
elif options == 'PDF':
    pdf = st.text_input("Enter your PDF link here")
    query = st.text_input("Enter your query")
    button = st.button("Submit")
elif options == 'Existing data source':
    data = True
    query = st.text_input("Enter your query")
    button = st.button("Submit")
# get the output for a web URL: update the database, then answer the query
if button and url:
    with st.spinner("Updating the database..."):
        corpusData = scrape_text(url)
        encodeaddData(corpusData, url=url, pdf=False)
        st.success("Database Updated")
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query, 2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context, query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: " + answer)
# get the output for a PDF link: update the database, then answer the query
if button and pdf:
    with st.spinner("Updating the database..."):
        corpusData = pdf_text(pdf=pdf)
        encodeaddData(corpusData, pdf=pdf, url=False)
        st.success("Database Updated")
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query, 2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context, query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: " + answer)
# answer the query from the data already stored in the database
if button and data:
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query, 2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context, query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: " + answer)
# delete the vectors from the database
st.expander("Delete the indexes from the database")
button1 = st.button("Delete the current vectors")
if button1:
    index.delete(deleteAll='true')
To check other code files, please visit the project’s GitHub repository. (Link: Click Here)
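The actual qanda.py lives in the repository and is not reproduced here. As a hedged sketch of what a prompt builder and a shared "answer" helper might look like, the code below mirrors the call pattern used in main.py, where a search function returns titles and texts and a generate function turns a prompt into an answer. The names `build_prompt` and `answer_query`, the prompt wording, and both stub functions are hypothetical.

```python
def build_prompt(context, query):
    """Assemble the text sent to the LLM from retrieved context and the user query."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer_query(query, search, generate, k=2):
    """Retrieve k passages, build the prompt, and return (context, answer).

    search(query, k) must return (titles, texts); generate(prompt) must return a string.
    """
    _titles, texts = search(query, k)
    context = "\n\n".join(texts)
    return context, generate(build_prompt(context, query))


# Stubs so the sketch runs without a vector database or an LLM.
def fake_search(query, k):
    texts = ["Dogs and cats are common pets.", "Fish can be calming pets."]
    return ["pets"], texts[:k]


def fake_generate(prompt):
    return f"(stub answer for a prompt of {len(prompt)} characters)"


context, answer = answer_query("Which pets are common?", fake_search, fake_generate)
print(context)
print(answer)
```

Passing the search and generation functions in as arguments is a design choice for this sketch: it would let the three branches in main.py share one helper and makes the logic testable without network calls.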
Deployment of the Q&A App on Streamlit Cloud
Streamlit provides a community cloud to host applications free of cost. Moreover, Streamlit is easy to use thanks to its automated CI/CD pipeline features. To learn more about building apps with Streamlit, please see my previous article on Analytics Vidhya (Link: Click Here)
Industry Use-cases of Custom Q&A Applications
Custom question-answering applications are being adopted in many industries as new and innovative use cases emerge in this field. Let’s look at some of these use cases:
Customer Support Assistance
The revolution in customer support has begun with the rise of LLMs. Whether in e-commerce, telecommunications, or finance, customer service bots built on a company’s documents can help customers make faster and more informed decisions, which can in turn increase revenue.
Healthcare Industry
Timely information is crucial for patients to get treatment for certain diseases. Healthcare companies can develop an interactive chatbot to provide medical information, drug information, symptom explanations, and treatment guidelines in natural language without needing an actual person.
Legal Industry
Lawyers deal with vast amounts of legal information and documents to solve court cases. Custom LLM applications developed using such large amounts of data can help lawyers to be more efficient and solve cases much faster.
Technology Industry
One of the most game-changing use cases of Q&A applications is programming assistance. Tech companies can build such apps on their internal code base to help programmers with problem-solving, understanding code syntax, debugging errors, and implementing specific functionality.
Government and Public Services
Government policies and schemes contain vast amounts of information that can overwhelm many people. Custom applications built for such government services can help citizens find information on government programs and regulations. They can also help citizens fill out government forms and applications correctly.
Conclusion
In conclusion, we have explored the exciting possibilities of building a custom question-answering application using LangChain and the Pinecone vector database. This blog has taken us through the fundamental concepts, from an overview of question-answering applications to the capabilities of the Pinecone vector database. By combining OpenAI embeddings and LLMs with Pinecone’s efficient indexing and retrieval, we built a semantic search pipeline and a question-answering application with Streamlit. Let’s look at the key takeaways from the article:
Key Takeaways
- Large language models (LLMs) have revolutionized AI, enabling diverse applications. Customizing chatbots with personal or organizational data is a powerful approach.
- While general LLMs offer a broad understanding of language, tailored question-answering applications offer a distinct advantage over fine-tuned personalized LLMs due to their flexibility and cost-effectiveness.
- By incorporating the Pinecone vector database, OpenAI LLMs, and LangChain, we learned how to develop a semantic search pipeline and deploy it on a cloud-based platform like Streamlit Cloud.
Frequently Asked Questions
A: Pinecone is a scalable long-term memory vector database to store text embeddings for LLM-powered applications, while LangChain is a framework that allows developers to build LLM-powered applications.
A: Question-answering applications can be used in customer support chatbots, academic research, e-learning, and more.
A: LangChain lets developers use various components to integrate these LLMs in a developer-friendly way, which helps ship products faster.
A: The steps to build a Q&A application are document loading, text splitting, vector storage, retrieval, and model output.
A: LangChain has the following tools: Document loaders, Document transformers, Vector stores, Chains, Memory, and Agents.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.