RAG Explained Simply: Retrieval Augmented Generation with Python Code



You’ve probably noticed that while ChatGPT and Claude are brilliant, they have a “memory” problem. If you ask them about a private document you wrote yesterday or a niche internal company policy, they start to hallucinate. They give you an answer that sounds confident but is factually untethered.

This is where Retrieval-Augmented Generation (RAG) changes the game.

Instead of relying solely on what the model learned during its initial training, RAG allows an AI to look up specific information from a reliable source before it speaks. Think of it like a student taking an open-book exam instead of trying to memorize the entire library.

What is Retrieval-Augmented Generation (RAG)?

At its core, Retrieval-Augmented Generation is a framework that connects a Large Language Model (LLM) to an external data source.

When you submit a query, the system doesn’t just send that text to the LLM. First, it searches a database for the most relevant “chunks” of information related to your question. It then hands those chunks to the LLM along with your original question, saying: “Use this specific data to answer the user.”

This process solves two massive headaches in the AI world:

  1. Hallucinations: The model is grounded in facts you provide.
  2. Data Freshness: You don’t need to retrain a multi-billion dollar model to update its knowledge; you just update your database.

Why RAG Beats Traditional Fine-Tuning

A year ago, everyone thought “Fine-Tuning” was the only way to make AI smarter on specific topics. We were wrong. Fine-tuning is like teaching a person a new language; it changes how they speak. RAG is like giving that person a specialized textbook; it changes what they know.

| Feature | Fine-Tuning | Retrieval-Augmented Generation (RAG) |
| --- | --- | --- |
| Cost | High (compute intensive) | Low (database storage / API calls) |
| Knowledge update | Requires retraining | Instant (update the source docs) |
| Transparency | Black box (hard to verify sources) | High (can cite specific documents) |
| Hallucination risk | Moderate | Low |

The Three Pillars of a RAG Pipeline

Building a RAG system involves three main stages. I like to think of them as the “Ingestion,” the “Retrieval,” and the “Generation.”

1. Ingestion: Prepping the Data

You can’t just dump a 500-page PDF into a model. You have to break it down. We “chunk” the text into smaller pieces — maybe 500 words each — and then convert those pieces into embeddings.
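The chunking step can be sketched with a simple word-based splitter. The 500-word size and the overlap value below are illustrative assumptions; production pipelines usually split by tokens or characters instead:

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into overlapping word-based chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = ("word " * 1200).strip()  # a 1,200-word dummy document
chunks = chunk_words(doc)
print(len(chunks))  # 3 overlapping chunks
```

The overlap means neighboring chunks share some words, so a sentence cut at a chunk boundary still appears whole in at least one chunk.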

Embeddings are essentially long lists of numbers (vectors) that represent the meaning of the text. If two sentences are numerically similar, they are contextually similar.
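Here is a toy illustration of that idea, using made-up 4-dimensional vectors in place of real embeddings (real models produce hundreds or thousands of dimensions). Cosine similarity is a standard way to score how close two vectors are:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- the numbers are invented for illustration
cat = [0.9, 0.1, 0.05, 0.0]       # "The cat sat on the mat"
kitten = [0.85, 0.15, 0.1, 0.05]  # "A kitten rested on the rug"
invoice = [0.0, 0.05, 0.9, 0.8]   # "Please pay the attached invoice"

# The two pet sentences score as far more similar than the billing one
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```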

2. Retrieval: Finding the Needle in the Haystack

When a user asks a question, we convert that question into an embedding too. We then perform a mathematical “similarity search” against our database. We grab the top 3 or 5 chunks that most closely match the intent of the question.
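Conceptually, that similarity search is just "score every stored chunk against the query vector and keep the top few." A brute-force sketch with invented chunks and vectors (real vector databases use approximate-nearest-neighbor indexes to do this at scale):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy vector store: chunk text -> pretend embedding
store = {
    "Llamas must be leashed in the office": [0.9, 0.2, 0.1],
    "Expense reports are due Fridays": [0.1, 0.9, 0.2],
    "Llama grooming is reimbursable": [0.8, 0.3, 0.2],
}

query_vec = [0.85, 0.25, 0.1]  # pretend embedding of "What's the llama policy?"

# Rank all chunks by similarity to the query and keep the top 2
top_2 = sorted(store, key=lambda chunk: cosine(query_vec, store[chunk]), reverse=True)[:2]
print(top_2)
```

Both llama-related chunks outrank the expense-report chunk, which is exactly the behavior you want from retrieval.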

3. Generation: The Final Polish

We take the user’s question and those retrieved chunks and wrap them in a prompt.

“Answer the following question: [User Question]. Use only the following context: [Retrieved Chunks].”

The LLM then generates a natural, human-sounding response based strictly on that context.
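Assembling that prompt is a few lines of string formatting. The function name and the numbered chunk labels are my own; the template wording follows the example above:

```python
def build_rag_prompt(question, chunks):
    """Wrap the user question and retrieved chunks in a grounding prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        f"Answer the following question: {question}\n"
        f"Use only the following context:\n{context}"
    )

prompt = build_rag_prompt(
    "What is the llama policy?",
    ["Llamas must be leashed in the office.", "Llama grooming is reimbursable."],
)
print(prompt)
```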

Building a Simple RAG System with Python

Let’s get our hands dirty. We’ll use LangChain, a popular library for this, and a simple vector store. For this example, imagine we have a text file containing “Secret Company Policies” that the AI wouldn’t otherwise know.

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# 1. Load your private data
loader = TextLoader("company_secrets.txt")
documents = loader.load()

# 2. Split the text into manageable chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# 3. Create embeddings and store them in a vector database
# Note: you'll need an OPENAI_API_KEY in your environment variables
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(texts, embeddings)

# 4. Set up the RetrievalQA chain
llm = ChatOpenAI(model="gpt-4", temperature=0)
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
)

# 5. Ask a question!
query = "What is the policy on bringing llamas to the office?"
response = rag_chain.invoke(query)
print(response["result"])
```

Breaking Down the Code

  • TextLoader: This pulls in your raw data.
  • CharacterTextSplitter: We split the data so we don’t hit “context window” limits.
  • Chroma: This is our vector database. It lives in memory for this example but can be persisted to disk.
  • RetrievalQA: This is the “brain” that connects the retriever to the LLM.

Real-World Examples of RAG in Action

I recently worked with a legal firm that had thirty years of case files. They didn’t want to train a model — they wanted to ask, “In which cases did we defend against copyright claims involving digital art in the 90s?”

By implementing a RAG pipeline, they could query their own history. The AI would pull up the three most relevant case files, summarize the outcomes, and even provide the file paths for the original documents.

Another common use case is Customer Support Bots. Instead of a bot saying “I don’t know,” it can scan your latest product manual in milliseconds and provide a step-by-step troubleshooting guide that was written only an hour ago.

The Challenges You’ll Face

RAG isn’t magic. It has its own set of hurdles.

Chunking Strategy: If your chunks are too small, they lose context. If they are too large, they dilute the specific answer you need. Finding the “Goldilocks” size is an art.

Quality of Embeddings: If your embedding model doesn’t understand the nuances of your industry (like medical or legal jargon), the “retrieval” part will fail. It will pull back irrelevant documents, leading to a garbage-in, garbage-out situation.

Vector Database Scaling: As you move from 100 documents to 1,000,000, you’ll need robust solutions like Pinecone, Weaviate, or Milvus to keep searches fast.

Frequently Asked Questions

Does RAG mean I don’t need to fine-tune my model?

Not necessarily. Fine-tuning is great for teaching a model a specific style or a very specialized vocabulary. However, for 90% of business use cases where you just want the AI to know your data, RAG is faster, cheaper, and more effective.

Can I use RAG with open-source models?

Absolutely. You can use models like Llama 3 or Mistral hosted locally via Ollama. Combine them with an open-source vector store like Chroma or FAISS, and you have a fully private RAG system that doesn’t send data to the cloud.

How do I stop the AI from answering questions outside my data?

You do this through “System Prompting.” You tell the LLM: “You are a helpful assistant. If the answer to the user’s question is not found in the provided context, politely state that you do not have that information.”
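With a chat-style API, that instruction becomes the system message. A minimal sketch, assuming the common role/content message format; the guardrail wording is taken from the answer above:

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. If the answer to the user's question "
    "is not found in the provided context, politely state that you do "
    "not have that information."
)

def build_messages(question, context):
    """Build a chat-style message list with the guardrail system prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# A question the context can't answer -- the system prompt tells the model to say so
messages = build_messages("Who won the 1998 World Cup?", "Llamas must be leashed.")
print(messages[0]["role"])  # system
```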

Is RAG secure for sensitive data?

It depends on your setup. If you use a local LLM and a local vector database, your data never leaves your infrastructure. If you use OpenAI’s API, you are sending the “retrieved chunks” to their servers, so you should ensure you have an enterprise agreement that covers data privacy.

RAG is the bridge between generic AI and specialized intelligence. It turns a general-purpose model into a subject matter expert on your specific world.

If you’re ready to stop dealing with AI hallucinations and start getting actual utility out of LLMs, building a RAG pipeline is your next step.

