Let's Talk About RAGs
RAG stands for Retrieval-Augmented Generation. It's a way to make a model use your own data sources without having to fine-tune the thing yourself.
The process is simple:
- gather your data - in this case we are scraping some websites
- transform the data into embeddings which are numerical representations of the data
- put those embeddings into a vector data store/database - here we are using an in-memory data store
- create a prompt that tells the model how to answer your questions
When you ask the model about your data, it consults your vector data store to retrieve the answers.
This helps to put guardrails around the model so it only gives you answers related to the information you are interested in.
Since we are scraping websites about prompt engineering, LLM attacks & agents, the app will answer the question "What is prompt engineering?" with the correct answer.
If you ask it "What is the square root of pi?" it will say something like "I don't know the square root of pi".
Llama 3.1 definitely knows the square root of pi, but it won't give you the answer because you have asked it to answer your questions based on the data you added to your vector data store.
Can the model still hallucinate? Yep. But providing the data and a custom prompt helps to prevent some of that.
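If it helps to see the flow as code before we build the real thing, here is a toy sketch of the retrieve-then-generate loop. Everything in it is made up for illustration - the two "documents" and the word-overlap "similarity" score are stand-ins for the scraped pages, embeddings and vector store we build below.
# rag_sketch.py - a toy illustration of the RAG flow (no model, no embeddings)
docs = [
    "Prompt engineering is the practice of crafting inputs that steer an LLM.",
    "Adversarial attacks try to make an LLM produce unintended output.",
]

def score(question, doc):
    # Toy "similarity": count the words the question and document share.
    # The real app uses embeddings and a vector store for this step.
    return len(set(question.lower().split()) & set(doc.lower().split()))

question = "What is prompt engineering?"
best_doc = max(docs, key=lambda d: score(question, d))

# The retrieved text gets pasted into the prompt we send to the model.
prompt = f"Use this document to answer the question.\nDocument: {best_doc}\nQuestion: {question}"
print(prompt)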
Getting Started
This code is based on the excellent tutorial published by Ryan Ong on datacamp.com.
We'll be using data scraped from Lilian Weng's blog (@lilianweng).
I'm using this app as a jumping-off point to eventually build a chat app for an LLM (stay tuned).
I like to copy/paste code and I know you do too. That's why I'm updating the original tutorial to make some of the instructions more explicit.
We will also be using a free embeddings package rather than OpenAI, so you can put your credit card back in your wallet.
Before we begin you will need to make sure you have Python installed on your machine.
You'll also need pip if you don't already have it installed.
You can find the pip installation instructions here.
I use conda to manage Python packages and environments. If you are on a Mac you can download it with Homebrew.
You'll also need to let your shell know about conda. I am using zsh so this is the code I run.
brew install --cask miniconda
conda init zsh
source ~/.zshrc
You can find more download options in the Anaconda docs.
You'll also need to download Ollama.
On a Mac, you can install it with Homebrew.
brew install ollama
Once it's installed you'll need to download Llama 3.1 to your laptop.
ollama pull llama3.1
If you want to ask Llama some questions just to test it out you can run this command and wait for the prompt in your terminal:
ollama run llama3.1
To exit Ollama, type Ctrl + d or /bye.
Set Up Your Environment
Before we install any packages, we need to set up a Conda environment.
conda create -n llama_rag_env python=3.9
conda activate llama_rag_env
Now let's add the dependencies.
pip install langchain langchain_community scikit-learn langchain-ollama bs4 python-dotenv tiktoken sentence-transformers
Here's what we added:
- langchain / langchain_community / langchain-ollama: The core libraries to build your RAG pipeline and interact with Ollama.
- scikit-learn: Works with the SKLearnVectorStore for storing embeddings.
- sentence-transformers: Needed for local embeddings.
- bs4: For handling scraped HTML documents.
- python-dotenv: Loads environment variables from a .env file.
- tiktoken: Tokenizer used by LangChain for chunking text.
In this app we are using SKLearnVectorStore as our vector database. It is an in-memory data store. This isn't as heavy duty as Qdrant or Weaviate, but it's perfect for running our RAG locally.
SKLearnVectorStore won't persist our data between sessions. So every time you run the app, the data will be processed from scratch.
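If you do eventually want the vectors to stick around between runs, SKLearnVectorStore can optionally write them to disk. Here's a minimal sketch, assuming a recent langchain_community and reusing the doc_splits and embedding objects we build later in this post - double-check the persist_path and serializer arguments against the docs for your installed version. We won't use this in the app.
from langchain_community.vectorstores import SKLearnVectorStore

# Assumes `doc_splits` (chunked documents) and `embedding` (an embeddings
# object like the LocalEmbeddings class we write below) already exist.
vectorstore = SKLearnVectorStore.from_documents(
    documents=doc_splits,
    embedding=embedding,
    persist_path="rag_vectors.json",  # where the vectors get saved
    serializer="json",
)
vectorstore.persist()  # writes the store to persist_path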
Now crank up your code editor and we'll create the files.
Our Env Variable
We only need one environment variable: USER_AGENT. You can put it in a .env file like I did, or you can export it directly in your shell.
When you scrape websites programmatically with something like WebBaseLoader, many sites check for a valid "User-Agent" header before allowing your request. If you don't send one, you'll probably get refused.
Setting a "User-Agent" header makes our website requests appear to come from a normal browser or a well-defined bot rather than just anonymous traffic.
You can use this one.
#.env
USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
The Document Loader
We need a way to scrape the URLs and then chop up the data so we can store it in our vector database. Let's create a file called document_loader.py and add this code.
"""
document_loader.py
"""
"""
This line imports the load_dotenv function which helps us read
secret information from a file called .env
"""
from dotenv import load_dotenv
"""
This imports the os module which lets us work with
environment variables and files
"""
import os
"""
Try to load secret information from a file called .env
"""
load_dotenv()
"""
Get the USER_AGENT value from our secret file
this is like an ID card for our program
when it visits websites
"""
user_agent = os.getenv("USER_AGENT")
"""
Store this USER_AGENT where our program can use it
"""
os.environ["USER_AGENT"] = user_agent
"""
Import tools we need to download and process web pages
WebBaseLoader helps us download web pages
RecursiveCharacterTextSplitter helps us break
big texts into smaller pieces
"""
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
def load_and_split_documents(urls):
"""
This function does two main things:
1. Downloads web pages from a list of URLs
2. Breaks these web pages into smaller
chunks that are easier to work with
Args:
urls (list): A list of website addresses we want to download
Returns:
list: The web pages broken up into smaller pieces
For each website address (URL) in our list:
1. Create a WebBaseLoader for that URL
2. Use .load() to download the content
This creates a list of downloaded documents
"""
docs = [WebBaseLoader(url).load() for url in urls]
"""
The above step gives us a complex list (a list of lists)
This line flattens it into a simple list we can work with
For example: [[1,2],[3,4]] becomes [1,2,3,4]
"""
docs_list = [item for sublist in docs for item in sublist]
"""
Create a tool that will help us split documents into smaller pieces
- chunk_size=250 means each piece will be about 250 characters long
- chunk_overlap=0 means the pieces won't overlap with each other
"""
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
chunk_size=250,
chunk_overlap=0
)
"""
Use our text splitter to break the documents into smaller pieces
"""
doc_splits = text_splitter.split_documents(docs_list)
"""
Return our list of small document pieces
"""
return doc_splits
This function is going to scrape the sites, chunk the data and return our list of chunks.
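If you want to sanity-check the loader on its own before going any further, a quick script like this works (the URL is one of the three we scrape later; the exact chunk count depends on the page):
# check_loader.py - assumes it sits next to document_loader.py
from document_loader import load_and_split_documents

chunks = load_and_split_documents(
    ["https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/"]
)
print(f"Got {len(chunks)} chunks")
print(chunks[0].page_content[:200])  # peek at the first chunk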
Why do we chunk?
We need to split up the text (chunk it) into smaller pieces for several reasons.
- Memory management: Many AI models and text processing tools have limits on how much text they can handle at once. Most models have a maximum "context window" - splitting longer texts lets us process documents that would otherwise be too large.
- Processing efficiency: Working with smaller chunks of text is often faster and uses fewer resources than processing one massive piece of text.
- Better analysis: In many cases, analyzing text in smaller pieces can actually give you better results. It's a little like reading a book - it's easier to understand if you focus on one paragraph at a time rather than trying to absorb the whole chapter at once.
You'll notice we are setting chunk_size to 250. Because we build the splitter with from_tiktoken_encoder, that 250 is measured in tokens (roughly word pieces) rather than characters, so each chunk will hold at most about 250 tokens.
What's an overlapping chunk?
You'll see in the code that we are setting chunk_overlap=0 in order to prevent overlapping.
We are telling the function to save discrete chunks without duplicating pieces of other chunks.
Let's say we have a sentence: "Forget it, Donny. You're out of your element!".
To keep the illustration simple we'll count in characters: set chunk_size to 10 (so each chunk is 10 characters) and chunk_overlap to 0 (no overlap). We'll get something like this:
Chunk 1: "Forget it,"
Chunk 2: " Donny. Yo"
Chunk 3: "u're out o"
Chunk 4: "f your ele"
Chunk 5: "ment!"
Each chunk is a maximum of 10 characters and no characters are repeated.
But if we set chunk_size to 10 and chunk_overlap to 2 (2 characters overlapping with the next chunk), we'll get something like this:
Chunk 1: "Forget it,"
Chunk 2: "t, Donny. "
Chunk 3: ". You're o"
Chunk 4: " out of yo"
Chunk 5: "your eleme"
Chunk 6: "ment!"
Each chunk now starts with the last 2 characters of the previous one. That repetition preserves context that would otherwise be cut in half at a chunk boundary, which gives the LLM more to work with.
When you want an LLM to answer questions about a document in a production app, you might use a chunk_overlap of 100+ characters to make sure you don't miss answers that might span the boundary between chunks.
Our app is super simple so we don't need to overlap characters.
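If you want to watch chunk_size and chunk_overlap interact without firing up LangChain, here's a plain-Python toy that slices by characters the way the examples above do. It's only an illustration - the real RecursiveCharacterTextSplitter also tries to break on separators like newlines and spaces, and our splitter measures size in tokens.
def toy_chunks(text, chunk_size, chunk_overlap):
    # Slide a window of chunk_size characters, stepping forward by
    # chunk_size - chunk_overlap so consecutive chunks share
    # chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sentence = "Forget it, Donny. You're out of your element!"
print(toy_chunks(sentence, 10, 0))  # discrete chunks, nothing repeated
print(toy_chunks(sentence, 10, 2))  # each chunk repeats 2 characters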
The Document Retriever
Computers are smart but they can't directly understand words like "prompt engineering" or "LLM attacks" - they only understand numbers.
In order to search our text data by meaning, we need to transform it into numbers. We call these numerical representations of our data "vectors" or "embeddings".
The numbers we use for vectors aren't random - they are carefully chosen so that similar words and phrases end up with similar numbers.
Here's a super simplified example.
Imagine we could represent words with just 3 numbers.
- "apple" = [0.8, 0.2, 0.1]
- "banana" = [0.7, 0.3, 0.1]
- "cherry" = [0.6, 0.4, 0.3]
- "airplane" = [-0.5, 0.5, -0.4]
- "car" = [-0.4, 0.6, -0.5]
In this example, "apple" and "banana" have very similar numbers because they are related concepts. Similarly, "airplane" and "car" have similar numbers because they are also related concepts.
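"Similar numbers" has a concrete meaning: the vectors point in nearly the same direction, which we can measure with cosine similarity. Here's that calculation on the made-up 3-number vectors above (plain NumPy, purely for illustration - real embeddings have hundreds of dimensions):
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means "pointing the same way", values near 0 mean unrelated,
    # negative values mean roughly opposite directions.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

apple = [0.8, 0.2, 0.1]
banana = [0.7, 0.3, 0.1]
airplane = [-0.5, 0.5, -0.4]

print(cosine_similarity(apple, banana))    # ~0.99: related concepts
print(cosine_similarity(apple, airplane))  # ~-0.50: unrelated concepts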
AI models have been trained on millions of text documents. And through this training they have learned to assign similar numbers to words and phrases that are used in similar contexts.
When you convert text to vectors, you are converting it to numbers based on the patterns the AI model learned from the training data.
We need to convert the text we scraped into vectors so the retriever can find the chunks that are relevant to a question.
In this app we are using a small pre-trained model called MiniLM to transform text to embeddings. We are going to store these embeddings in a vector store called SKLearnVectorStore.
The vector store can quickly figure out which text chunks are similar to a user's question based on their numerical representations.
We are going to create a function that returns a retriever that, given a user's question, will return the top 4 most relevant text chunks from our vector store.
Let's create a file called retriever.py and add this code.
"""
retriever.py
"""
from sentence_transformers import SentenceTransformer
class LocalEmbeddings:
"""
This class converts text into numerical vectors
(embeddings) using a pre-trained AI model
It uses a free, open-source model that runs
locally on your computer
"""
def __init__(self, model_name="all-MiniLM-L6-v2"):
"""
Load a pre-trained model
by default uses a small but effective model called MiniLM
"""
self.model = SentenceTransformer(model_name)
def embed_documents(self, texts):
"""
Takes a list of texts and converts each one
into a numerical vector
Shows a progress bar since this might take a while
with many documents
"""
return self.model.encode(texts, show_progress_bar=True)
def embed_query(self, query):
"""
Converts a single search query into a numerical vector
No progress bar needed since it's just one piece of text
"""
return self.model.encode(query, show_progress_bar=False)
def create_retriever(doc_splits):
"""
This function sets up a system to find relevant
documents based on a search query
It works in 2 steps:
1. Convert all documents into numerical vectors using
LocalEmbeddings
2. Store these vectors in a simple database (SKLearnVectorStore)
that can quickly find similar texts
doc_splits: List of text chunks from your documents
Returns: A retriever that can find the 4 most relevant
document chunks for any query
Create the embedding converter
"""
embedding = LocalEmbeddings()
"""
Set up the vector database and return it as a retriever
"""
from langchain_community.vectorstores import SKLearnVectorStore
vectorstore = SKLearnVectorStore.from_documents(
documents=doc_splits,
embedding=embedding,
)
"""
search_kwargs={"k": 4} tells the retriever to return
the 4 most similar document chunks for each query
"""
return vectorstore.as_retriever(search_kwargs={"k": 4})
The LocalEmbeddings class is a wrapper around the SentenceTransformer model. It's going to transform our text into vectors.
SentenceTransformer comes from the free, open-source sentence-transformers library and runs locally on your computer. Here it loads the all-MiniLM-L6-v2 (MiniLM) model.
The embed_query function is going to take a single query (like the user's question) and transform it into a vector. By encoding our documents and the user's question into the same "vector space", we can use math to figure out how close or far apart they are within that space.
The concept of vector space is important because it allows us to calculate the distance between two vectors within that space.
Similarity in vector space translates to semantic similarity in language. Vectors for related concepts end up close together in the vector space.
We are using the SKLearnVectorStore as an in-memory database for our vectors. It is lightweight and useful for local development. But for production apps you might want to use a more robust vector database like Qdrant or Weaviate.
We call .as_retriever(search_kwargs={"k": 4}) to return the 4 documents that are most similar to the user's question.
You can think of the retriever as the "brain" of our RAG app. It's the part that does the actual searching and retrieval of information.
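Before wiring the retriever into the full pipeline, you can poke at it on its own. Something like this works, run from the project folder (it re-scrapes and re-embeds the page every time, so give it a minute):
# check_retriever.py - assumes document_loader.py and retriever.py are here
from document_loader import load_and_split_documents
from retriever import create_retriever

doc_splits = load_and_split_documents(
    ["https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/"]
)
retriever = create_retriever(doc_splits)

# Fetch the chunks most similar to a question and peek at the first one
docs = retriever.invoke("What is prompt engineering?")
print(len(docs), "chunks retrieved")
print(docs[0].page_content[:200])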
The RAG Pipeline
Now we are ready to create our RAG pipeline. This script starts by loading and splitting up the documents we scraped. It creates our chunks.
Next we call create_retriever to create our retriever. Internally it embeds each chunk into vectors, stores them in a vector store, and returns a retriever that can find the most relevant chunks for any user's question.
There's a PromptTemplate that we use to create a prompt for the model. This prompt is going to tell the model how to answer the user's question. This is similar to the instructions you give to ChatGPT, Claude or Cursor before you ask your question.
Here we are telling the model to use our documents to answer the user's question. We are also telling it how long the answer should be. To help prevent hallucinations, we are telling the model to say "I don't know" if it can't find the answer in our documents.
We are going to use Llama 3.1 8B as our model. It's an oldie but a goodie at this point. You could use any model that is supported by Ollama for this app.
The RAGApplication class is the main part of our RAG pipeline. It uses the retriever, prompt, and model to answer the user's question.
Finally, in the __main__ section, it all comes together. We are hardcoding questions since this is a proof-of-concept app. For each question, the app will fetch the relevant chunks and pass them to the LLM. The answer returned is based on the retrieved text.
Create a file called llama_rag.py and add this code.
"""
llama_rag.py
"""
from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from document_loader import load_and_split_documents
from retriever import create_retriever
from dotenv import load_dotenv
import os
load_dotenv()
"""
Define the prompt template for the language model
"""
prompt = PromptTemplate(
template="""You are an assistant for question-answering tasks.
Use the following documents to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise:
Question: {question}
Documents: {documents}
Answer:
""",
input_variables=["question", "documents"],
)
"""
Initialize the LLM with Llama 3.1 model
"""
llm = ChatOllama(
model="llama3.1",
temperature=0,
)
"""
Combine the prompt and the LLM into a single chain
"""
rag_chain = prompt | llm | StrOutputParser()
class RAGApplication:
"""
RAG (Retrieval-Augmented Generation) application
for question-answering tasks.
"""
def __init__(self, retriever, rag_chain):
self.retriever = retriever
self.rag_chain = rag_chain
def run(self, question):
"""
Answers a question using retrieved
documents and the language model.
Args:
question (str): The question to answer.
Returns:
str: The generated answer.
"""
# Retrieve relevant documents
documents = self.retriever.invoke(question)
# Extract content from the retrieved documents
doc_texts = "\n".join([doc.page_content for doc in documents])
# Get the answer from the LLM
answer = self.rag_chain.invoke({"question": question, "documents": doc_texts})
return answer
# Main script execution
if __name__ == "__main__":
# URLs to load documents from
urls = [
"https://lilianweng.github.io/posts/2023-06-23-agent/",
"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
"https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]
# Load and split documents
doc_splits = load_and_split_documents(urls)
# Create the retriever (local embeddings, so no API key is needed)
retriever = create_retriever(doc_splits)
# Initialize the RAG application
rag_application = RAGApplication(retriever, rag_chain)
# Example question
question = "What is prompt engineering?"
# Run the RAG application
answer = rag_application.run(question)
question2 = "What are types of attacks on LLMs?"
answer2 = rag_application.run(question2)
question3 = "What is the square root of pi?"
answer3 = rag_application.run(question3)
# Print the result
print("Question:", question)
print("Answer:", answer)
print("Question2:", question2)
print("Answer2:", answer2)
print("Question3:", question3)
print("Answer3:", answer3)
Running the App
And now let's see if this thing works. Hop into your terminal. Make sure you are in the same directory as your llama_rag.py file and make sure you have activated your llama_rag_env conda environment.
Run this command:
python llama_rag.py
You should see something like this in your terminal (ignore my code highlighter - it is confused):
Batches: 100%|███████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 9.25it/s]
Question: What is prompt engineering?
Answer: Prompt engineering refers to methods for communicating with Large Language Models (LLMs) to steer their behavior towards desired outcomes without updating the model weights. It's an empirical science that requires experimentation and heuristics, aiming for alignment and model steerability. The goal is to optimize believability in a given context.
Question2: What are types of attacks on LLMs?
Answer2: There are five types of adversarial attacks on LLMs, including Token Manipulation, Gradient based Attacks, Jailbreak Prompting, Humans in the Loop Red-teaming, and Model Red-teaming. These attacks aim to manipulate the model's output by providing input that is slightly different from the original input. The attackers may have access to an API-like service or have full knowledge of the model's architecture.
Question3: What is the square root of pi?
Answer3: I don't know the square root of pi.
What's Next?
While this is cool, a console app isn't super useful. For my next project I'm going to give this thing a real database and a chat interface.
In the meantime you can find all this code on GitHub here. If you make something cool with it, let me know: @semiprocoder.