What Is RAG and How to Build It End to End: A Practical Architecture Guide

When you ask an AI “how many days of annual leave do I get under our company policy?”, you don’t want it to invent an answer from memory — you want it to look at your actual document and answer from there. That is exactly what RAG does: first find the relevant document, then write the answer by looking at it. In this article we’ll explain RAG intuitively, then build the whole pipeline — from document to answer — end to end, with small code examples.

What is RAG? (The open-book exam)

RAG stands for “Retrieval-Augmented Generation.” It sounds complicated, but the idea is remarkably simple.

First find the relevant documents, then write the answer by looking at those documents.

Picture two kinds of students. The first sits the exam relying solely on memory: knows most things, but makes things up with confidence when memory fails. The second sits an open-book exam: first finds the relevant page in the source material, then writes the answer by looking at it — and can say where it found it. A plain large language model is the first student; RAG is the second.

This simple distinction addresses an important problem: hallucination, where the model confidently invents information that isn’t real. Because RAG anchors the answer to real text placed in front of the model, it improves accuracy and makes the question “where did you find this?” answerable.

The five steps of the pipeline

A RAG system runs in two phases. Preparation time (offline): you process the documents once and make them searchable. Query time (online): when a user asks something, you retrieve the relevant pieces and generate the answer.

  1. Document: a PDF, a contract, a wiki page, an email archive — the raw source.
  2. Chunk: you split the long document into meaningful, smaller pieces.
  3. Embedding: you turn each piece into a list of numbers (a vector) that represents its meaning.
  4. Vector search: you embed the question too, and find the pieces closest in meaning.
  5. Generation with context: you hand those pieces to the model alongside the question and have it write the answer.

It helps to see the flow in a single diagram:

Preparation (once):
  documents ──► split ──► embed ──► write to vector database

Query (every question):
  question ──► embed ──► fetch top-K closest chunks
           └─► [chunks + question] ──► LLM ──► sourced answer

Document → chunk: the art of splitting

Why not just hand over the whole document? Because the context window is limited and retrieval precision matters. Instead of fetching all 80 pages of a contract, fetching the single paragraph that contains the answer is both cheaper and more accurate. Chunking is the act of defining “the smallest meaningful unit you can search for.”

Practical tips: split at paragraph or heading boundaries, leave a small overlap between chunks so context broken mid-sentence is preserved, and attach each chunk’s source (file name, page, heading) as metadata — you need this to cite sources in the answer.

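def split(text, size=800, overlap=120):
    """Split text into chunks of up to `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks, i = [], 0
    while i < len(text):
        chunks.append(text[i:i + size])
        if i + size >= len(text):
            break   # this chunk already reaches the end; avoid a duplicate tail chunk
        i += size - overlap   # step forward, keeping `overlap` characters of the previous chunk
    return chunks

Tip: Tune chunk size to the document type. For dense technical text, small chunks (300–600 characters) tend to improve precision; for flowing prose, larger chunks (800–1200) tend to preserve context better. There is no “correct” size; you find it by measuring.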
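
The practical tips above (split at paragraph boundaries, attach the source as metadata) can be sketched as follows. This is a minimal illustration, not a definitive chunker: the `Chunk` class, the `split_paragraphs` name, and the assumption that paragraphs are separated by blank lines are all introduced here for the example. It adds no overlap, and it keeps a paragraph longer than `max_chars` whole rather than cutting it.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str   # hypothetical metadata field, e.g. a file name
    index: int    # position of the chunk within the document


def split_paragraphs(text, source, max_chars=800):
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Assumes paragraphs are separated by blank lines. A single paragraph
    longer than max_chars is kept whole instead of being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # The next paragraph would overflow: close the current chunk.
            chunks.append(Chunk(current, source, len(chunks)))
            current = para
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, source, len(chunks)))
    return chunks
```

Because every chunk carries its `source` and `index`, the retrieval step can later report where a passage came from.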

Embeddings and vector search

An embedding turns a piece of text into a list of numbers that represents its meaning. “Removal from the home” and “eviction” are different words, but because they’re close in meaning, their vectors end up close too. Classic keyword search can’t match those two; semantic search does exactly that. This is the heart of RAG.

You write the chunk vectors into a vector database (such as FAISS, pgvector, or Qdrant). When a query arrives, you embed the question with the same model and fetch the top-K closest chunks (say, 4–6) from the database. Closeness is usually measured with cosine similarity.
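
To make “closeness” concrete, here is a small sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; real embedding vectors typically have hundreds or thousands of dimensions, and a vector database computes this for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example, not output of a real model.
eviction = [0.9, 0.1, 0.0]
removal_from_home = [0.8, 0.2, 0.1]
annual_leave = [0.0, 0.1, 0.9]

print(cosine_similarity(eviction, removal_from_home))   # high: similar direction
print(cosine_similarity(eviction, annual_leave))        # low: nearly orthogonal
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are unrelated.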

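# Conceptual flow (pseudocode; VectorDB and embed stand in for your database client and embedding model)
db = VectorDB()
for chunk in all_chunks:
    # store the chunk itself (text + source) next to its vector so retrieval can return it
    db.add(embed(chunk.text), payload=chunk)

def retrieve(question, k=5):
    q = embed(question)       # same embedding model as used for the chunks
    return db.nearest(q, k)   # the k most relevant chunks by meaning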

One important rule: the embedding model you use to vectorize the documents must be the same one you use to vectorize the question. Vector spaces from different models are not comparable.
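
For a version you can actually run, the sketch below implements the same flow with a brute-force in-memory index. Everything in it is an assumption made for illustration: `TinyVectorIndex` is a hypothetical class, the sample sentences are invented, and `embed` is a toy word-count vector that does not capture meaning the way a real embedding model does. What it does show is the rule above: one `embed` function serves both the documents and the question.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorIndex:
    """Brute-force in-memory index; suitable for small experiments only."""

    def __init__(self):
        self.items = []   # list of (vector, text, source)

    def add(self, text, source):
        self.items.append((embed(text), text, source))

    def nearest(self, question, k=5):
        q = embed(question)   # same embed() that add() uses
        scored = [(cosine(q, vec), text, source) for vec, text, source in self.items]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


# Invented sample documents.
index = TinyVectorIndex()
index.add("Employees receive 14 days of annual leave per year.", "hr-policy.txt")
index.add("Expense reports are due by the fifth of each month.", "finance.txt")
index.add("Remote work requires manager approval.", "hr-policy.txt")

for score, text, source in index.nearest("How many days of annual leave do I get?", k=2):
    print(f"{score:.2f}  {source}: {text}")
```

Swapping the toy `embed` for a real embedding model and the list for a vector database changes the quality and speed, not the shape of the code.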

Generation with context (prompt design)

In the final step, you hand the retrieved chunks to the model along with the question. The critical part here is the instruction: tell the model to use only the provided context and to say “I don’t know” when the information isn’t there. That short instruction goes a long way toward reducing hallucination.

SYSTEM = """Answer using ONLY the CONTEXT below.
If the answer isn't in the context, say "I couldn't find it in the documents."
Cite the source number."""

def answer(question):
    chunks = retrieve(question, k=5)
    context = "\n\n".join(f"[{i+1}] {c.text}" for i, c in enumerate(chunks))
    message = f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    return llm(system=SYSTEM, user=message)   # returns a sourced answer
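
Since the prompt asks the model to cite a source number, it helps to keep a mapping from each number back to the chunk's metadata. The sketch below is one possible way to do that; the `build_prompt` name, the dictionary keys `text` and `source`, and the sample chunks are assumptions for illustration. It only builds the message and does not call any model.

```python
def build_prompt(question, chunks):
    """Return (user_message, sources).

    sources maps each citation number used in the context to the
    source label of the chunk it refers to.
    """
    lines, sources = [], {}
    for number, chunk in enumerate(chunks, start=1):
        lines.append(f"[{number}] {chunk['text']}")
        sources[number] = chunk["source"]
    context = "\n\n".join(lines)
    message = f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    return message, sources


# Invented sample chunks, shaped as a retriever might return them.
sample_chunks = [
    {"text": "Employees receive 14 days of annual leave per year.", "source": "hr-policy.txt, p. 3"},
    {"text": "Unused leave may be carried over once.", "source": "hr-policy.txt, p. 4"},
]

message, sources = build_prompt("How many days of annual leave do I get?", sample_chunks)
print(message)
print(sources)
```

When the model answers with “[1]”, `sources[1]` tells you which file and page to show the user.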

For the generation layer, a modern language model is enough; models like Claude (Anthropic), for instance, handle long context well and stay faithful to the source. Architecturally, the model is a swappable part — your embedding and retrieval pipeline stays the same while you change the generation model.

What to watch when going to production

Building a demo is easy; building a reliable system takes a little more care. The points that most often need attention are:

  • Source attribution: show which chunk each answer came from. Trust comes from traceability.
  • Evaluation: “looks good” isn’t enough. Build a test set of known question–answer pairs and measure accuracy.
  • Retrieval quality: when the answer is bad, the culprit is usually retrieval, not generation. Look at which chunks came back first.
  • Freshness: update the vectors as documents change; a stale index produces stale answers.
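
For the freshness point, one common approach is to store a content hash for each document when it is embedded and compare hashes on the next run. The sketch below assumes documents are available as a dictionary of id to text; `plan_reindex` and its parameters are hypothetical names, and the actual re-embedding and deletion calls depend on your vector database.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents, indexed_hashes):
    """Decide which documents need work since the last indexing run.

    documents:      dict of document id -> current text
    indexed_hashes: dict of document id -> hash recorded when it was last embedded
    Returns (to_embed, to_delete) as lists of document ids.
    """
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)   # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_embed, to_delete
```

Only new or changed documents are re-embedded, which keeps the update cheap enough to run often.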

Key takeaways

  • RAG = first find the relevant document, then answer by looking at it; it reduces hallucination.
  • The pipeline has five steps: document → chunk → embedding → vector search → generation with context.
  • Chunking and retrieval quality are the two knobs that most determine answer quality.
  • Use the same embedding model for documents and questions; tell the model to say “I don’t know” when context is missing.

Frequently asked questions

What’s the difference between RAG and fine-tuning the model?

Fine-tuning permanently changes the model’s behavior/style, but it’s expensive and slow for adding knowledge. RAG supplies knowledge from the outside, at query time; you don’t need to retrain to add a document. For most “answer over my own documents” scenarios, the right starting point is RAG.

How many chunks (K) should I retrieve?

4–6 chunks is usually a good start. Retrieve too few and the answer is incomplete; retrieve too many and noise rises while cost climbs. The right number is found by measuring on your own test set.
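
One way to run that measurement is to compute, for several values of K, the fraction of test questions whose expected source shows up among the retrieved chunks. The sketch below uses a fake retriever with hard-coded rankings so that it runs on its own; `hit_rate_at_k`, `fake_retrieve`, and the data are all invented for illustration. It measures retrieval only, not answer quality or cost.

```python
def hit_rate_at_k(test_set, retrieve, k):
    """Fraction of questions whose expected source appears in the top-k results.

    test_set: list of (question, expected_source) pairs
    retrieve: function (question, k) -> list of source labels, best first
    """
    if not test_set:
        return 0.0
    hits = sum(
        1 for question, expected_source in test_set
        if expected_source in retrieve(question, k)
    )
    return hits / len(test_set)


# Hard-coded rankings standing in for a real retriever.
FAKE_RANKINGS = {
    "annual leave?": ["hr.txt", "finance.txt", "it.txt"],
    "expense deadline?": ["hr.txt", "it.txt", "finance.txt"],
}


def fake_retrieve(question, k):
    return FAKE_RANKINGS[question][:k]


test_set = [("annual leave?", "hr.txt"), ("expense deadline?", "finance.txt")]
for k in (1, 2, 3):
    print(k, hit_rate_at_k(test_set, fake_retrieve, k))
```

If the hit rate stops improving beyond some K, retrieving more chunks mostly adds noise and cost.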

Do I really need a vector database — won’t a regular one do?

For small experiments, vectors stored in a file plus a simple similarity search are enough. As scale grows, tools like FAISS, pgvector, or Qdrant make semantic search fast and practical. The logic is the same; only scale and speed change.


In short, RAG transforms AI from a “daydreaming” narrator into a “source-citing” researcher. In any field where accuracy matters, that is exactly where the difference lies. If you’d like to see this architecture come to life in a real product, take a look at İçtiHub — and you can read about the approach behind it on the EcoFluxion page.

İsmail Tarık Şenkal

EcoFluxion Teknoloji A.Ş. · Co-Founder

A developer and entrepreneur working on Turkish-focused AI products — the name behind EcoFluxion and İçtiHub.
