Retrieval-Augmented Generation (RAG) is one of the more practical ways to give large language models access to knowledge they weren’t trained on. It’s a simple idea with a lot of power behind it: instead of relying only on what the model “knows,” you first retrieve relevant context from your own documents and let the model use that information to generate better answers.

This post walks through a working example using plain text files, a local vector database, and an OpenAI-compatible LLM endpoint. It’s meant to be clear, minimal, and useful — something you can actually build on or adapt to your own projects.

What is RAG, in Plain Terms?

The RAG pattern combines search and generation. Here’s the basic process:

  1. Store your source documents in a vector database.
  2. When a user asks a question, convert it into an embedding.
  3. Search for the most relevant documents using cosine similarity.
  4. Add those results as context to the LLM’s prompt.
  5. Generate the answer based on that retrieved information.

That’s it. No complex indexing, no multi-stage pipelines — just retrieval and generation tied together in a straightforward way.
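
To make the shape of that loop concrete, here's a rough Python sketch. Everything in it is a placeholder — embed, store.search, and complete stand in for whatever embedding model, vector database, and LLM client you choose — and a more concrete version appears in the "How It Works" section below.

def rag_answer(question, store, embed, complete, top_k=3):
    # Step 2: turn the user's question into an embedding.
    query_vector = embed(question)
    # Step 3: find the most similar stored documents (cosine similarity).
    hits = store.search(query_vector, top_k=top_k)
    # Step 4: add the retrieved text to the prompt as context.
    context = "\n\n".join(hit.text for hit in hits)
    prompt = f"Use this context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
    # Step 5: generate the answer from the augmented prompt.
    return complete(prompt)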

The Setup

In this example, I dropped three small text files into a project folder:

  • john.txt: John is a cool guy who lives in Connecticut and likes pizza.
  • sarah.txt: Sarah is John’s girlfriend and likes horses.
  • humphry.txt: Humphry lives in London.

After running the ingestion step, I started an interactive session and asked the model:

Query: Tell me everything you know relating to John.

1. Location: John lives in Connecticut.
2. Personality: John is described as a "cool guy."
3. Interests: John likes pizza.
4. Relationships: John has a girlfriend named Sarah.

The model used not just the file about John, but also the file about Sarah to pull together a more complete answer. It connected the dots without needing explicit structure.

Here’s another query:

Query: Who does John know?

John has a girlfriend named Sarah. Therefore, John knows Sarah.

It’s a basic example, but it demonstrates how even lightweight context can improve a model’s response.

How It Works

This setup uses three main scripts:

  • ingest.py: Loads entire text files into a vector store with their embeddings.
  • ingest_chunk.py: Optionally splits longer documents into overlapping chunks before ingesting (a sketch of the chunking idea follows this list).
  • infer.py: Runs an interactive Q&A loop — embed the query, retrieve context, call the LLM, return the answer.
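
The chunked variant just slides a fixed-size window over the text with some overlap, so a sentence cut off at one boundary still appears intact in the neighboring chunk. Here's a minimal sketch — the size and overlap values are illustrative, not necessarily what the repo uses:

def chunk_text(text, size=500, overlap=100):
    # Slide a fixed-size window across the text; neighboring chunks share
    # `overlap` characters so content at a boundary isn't lost.
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks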

Embeddings come from BAAI/bge-large-en, and the vector search uses cosine similarity. The LLM endpoint can be anything OpenAI-compatible — local or cloud-based.
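
Here's a sketch of what the whole-file ingestion step looks like. The embedding model is the one named above; the vector store and folder name are assumptions on my part (I'm using ChromaDB and a local "docs" directory for illustration — the repo's actual choices may differ):

from pathlib import Path
import chromadb
from sentence_transformers import SentenceTransformer

# BAAI/bge-large-en is the embedding model; ChromaDB is an assumed local store.
embedder = SentenceTransformer("BAAI/bge-large-en")
client = chromadb.PersistentClient(path="./vector_store")
collection = client.get_or_create_collection(
    "docs", metadata={"hnsw:space": "cosine"}  # use cosine distance for retrieval
)

# Embed each text file whole and store it under its filename.
for path in Path("docs").glob("*.txt"):
    text = path.read_text()
    collection.add(
        ids=[path.name],
        documents=[text],
        embeddings=[embedder.encode(text).tolist()],
    )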

Results are filtered by a relevance threshold, and the distances are shown so you can see what the system is matching on and how closely.
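
A minimal sketch of that interactive flow, under the same assumptions as the ingestion sketch (ChromaDB, sentence-transformers) plus the official openai client pointed at any OpenAI-compatible base URL — the endpoint, model name, and threshold here are illustrative:

import chromadb
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en")
collection = chromadb.PersistentClient(path="./vector_store").get_collection("docs")
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # any OpenAI-compatible endpoint

def ask(question, threshold=0.5):
    query_embedding = embedder.encode(question).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=3)
    # Show each match's distance, and keep only results under the threshold
    # (with cosine distance, lower means more similar).
    context_parts = []
    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        print(f"distance={dist:.3f}  {doc[:60]}")
        if dist <= threshold:
            context_parts.append(doc)
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n\n".join(context_parts) +
        f"\n\nQuestion: {question}"
    )
    response = llm.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content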

Use Cases

This type of RAG setup is a good fit for:

  • Learning how retrieval and prompting work together
  • Small knowledge bases or internal reference tools
  • Experimenting with open-source LLMs
  • Augmenting models with up-to-date or domain-specific information

It’s intentionally lightweight. That makes it easier to follow and easier to modify. If you want to extend it to PDFs, structured data, or a front end, you’ve got a clean starting point.

Wrap-Up

RAG doesn’t need to be complicated. With a few scripts and some text files, you can build a system that makes your LLM a lot more useful. This project shows the core ideas in action without adding unnecessary layers.

The code, documentation, and sample files used are available here:

https://github.com/AightBits/simple_rag

If you’re exploring how to give models access to your own data, this should help you get started.

Dave Ziegler

I’m a full-stack AI/LLM practitioner and solutions architect with 30+ years of experience in enterprise IT, application development, consulting, and technical communication.

While I currently engage in LLM consulting, application development, integration, local deployments, and technical training, my focus is on AI safety, ethics, education, and industry transparency.

Open to opportunities in technical education, system design consultation, practical deployment guidance, model evaluation, red teaming/adversarial prompting, and technical communication.

My passion is bridging the gap between theory and practice by making complex systems comprehensible and actionable.

Founding Member, AI Mental Health Collective

Community Moderator / SME, The Human Line Project

Let’s connect

Discord: AightBits