
Retrieval-Augmented Generation

Ground language models in external knowledge instead of relying on parametric memory alone. RAG exists because models hallucinate, training data goes stale, and context windows — however large — remain finite.

Four forces push toward retrieval:

| Force | Problem Without RAG | RAG's Answer |
|---|---|---|
| Knowledge freshness | Training data has a cutoff date | Retrieve current documents at query time |
| Grounding | Models confabulate plausible-sounding falsehoods | Anchor generation in source text |
| Context window limits | Large corpora exceed even million-token windows | Select only the relevant slices |
| Domain specificity | General models lack proprietary or niche knowledge | Inject domain documents on demand |

RAG does not eliminate hallucination. It reduces the surface area by giving the model evidence to cite rather than memory to fabricate from.

Every RAG system follows the same skeleton, however elaborate the implementation:

Ingest → Chunk → Embed → Store → Retrieve → Augment → Generate

| Stage | Input | Output | Key Decision |
|---|---|---|---|
| Ingest | Raw documents | Clean text | Parsing fidelity (PDF, HTML, code) |
| Chunk | Clean text | Sized passages | Strategy and granularity |
| Embed | Text passages | Dense vectors | Model choice, dimensions |
| Store | Vectors + metadata | Searchable index | Database topology |
| Retrieve | User query | Candidate chunks | Search method, top-k, reranking |
| Augment | Query + chunks | Assembled prompt | Context packing, ordering |
| Generate | Augmented prompt | Final answer | Citation, faithfulness guardrails |

Failure at any stage propagates forward. In practice, the large majority of RAG failures trace to ingestion and chunking, not the model.
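The skeleton above can be walked end to end with toy stand-ins for each stage. Everything here is deliberately naive (the "embedding" is a vowel-frequency vector, not a real model) so that only the data flow between stages is on display:

```python
# Toy walk through Ingest → Chunk → Embed → Store → Retrieve → Augment.
# Every implementation is an illustrative stand-in, not a real component.

def ingest(raw: str) -> str:
    return " ".join(raw.split())                 # normalize whitespace

def chunk(text: str, size: int = 48) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(passage: str) -> list[float]:
    # stand-in for a real embedding model: vowel-frequency vector
    return [passage.lower().count(c) / max(len(passage), 1) for c in "aeiou"]

def retrieve(query: str, index: list[tuple[list[float], str]], k: int = 2):
    qv = embed(query)
    score = lambda v: sum(a * b for a, b in zip(v, qv))
    return [p for _, p in sorted(index, key=lambda e: -score(e[0]))[:k]]

doc = "RAG grounds language models in retrieved external documents."
passages = chunk(ingest(doc))                     # ingest → chunk
index = [(embed(p), p) for p in passages]         # embed → store
evidence = retrieve("external documents", index)  # retrieve
prompt = "Context:\n" + "\n".join(evidence) + "\n\nQ: ..."  # augment
```

Swapping any stand-in for a production component leaves the interfaces between stages unchanged, which is why failures at one stage propagate so cleanly to the next.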

Chunking determines what the retriever can find. A chunk too large buries the signal in noise. A chunk too small strips the context that makes the signal interpretable.

| Strategy | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Fixed-size | Split every N tokens with optional overlap | Simple, predictable, fast | Cuts mid-sentence, ignores structure |
| Recursive character | Split on paragraphs → sentences → words → characters | Respects natural boundaries | Still structure-unaware for complex docs |
| Semantic | Group sentences by embedding similarity | Captures topic shifts | Expensive; needs minimum-size floor |
| Document-aware | Use headings, sections, or markup as split points | Preserves authorial structure | Requires parser per format |
| Contextual (Anthropic) | Prepend document-level context to each chunk before embedding | 35–67% fewer retrieval failures | Adds 50–100 tokens per chunk; preprocessing cost |

  • Recursive character splitting is the safest starting point.
  • 256–512 tokens per chunk with 10–20% overlap covers most corpora.
  • Overlap provides no measurable benefit with sparse retrieval (SPLADE/BM25); test before defaulting to 20%.
  • Contextual chunking — prepending a short summary of the surrounding document — reduced retrieval failures by 49% in Anthropic’s benchmarks (67% with reranking). The preprocessing cost is roughly $1 per million document tokens.

The chunking configuration influences retrieval quality as much as the embedding model. Tune chunks before swapping models.
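The recursive character strategy can be sketched in a few lines: try coarse separators first and fall back to finer ones until every chunk fits the size budget. The separator order and the character-based budget are illustrative defaults, not a library API:

```python
# Minimal sketch of recursive character splitting. Separators go from
# coarse (paragraphs) to fine (words); chunks are greedily packed up to
# max_chars before a finer separator is tried.

def recursive_split(text, max_chars=200, seps=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_chars or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    parts = text.split(sep)
    if len(parts) == 1:                      # separator absent: go finer
        return recursive_split(text, max_chars, rest)
    chunks, buf = [], ""
    for part in parts:
        candidate = (buf + sep + part) if buf else part
        if len(candidate) <= max_chars:
            buf = candidate                  # still fits: keep packing
        else:
            if buf:
                chunks.append(buf)
            buf = part if len(part) <= max_chars else ""
            if not buf:                      # part itself too big: recurse
                chunks.extend(recursive_split(part, max_chars, rest))
    if buf:
        chunks.append(buf)
    return [c for c in chunks if c.strip()]
```

A production splitter would count tokens rather than characters and add the 10–20% overlap discussed above, but the fallback structure is the same.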

An embedding model compresses text into a dense vector where geometric proximity approximates semantic similarity. The vector captures meaning, not keywords.

| Dimension | Why It Matters |
|---|---|
| Model quality | Determines how well meaning maps to geometry |
| Vector dimensions | Higher = more expressive, but more storage and slower search |
| Max input tokens | Caps chunk size; 8K is common, some models handle 32K |
| Domain fit | General-purpose models underperform on specialized corpora |

| Metric | Formula Intuition | When to Use |
|---|---|---|
| Cosine | Angle between vectors (ignores magnitude) | Default; works for normalized embeddings |
| Dot product | Cosine × magnitudes | When magnitude encodes importance |
| Euclidean (L2) | Straight-line distance | Rarely better than cosine for text |

Cosine similarity is the default for a reason: most embedding models normalize their output, making magnitude irrelevant.
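The equivalence is easy to verify: for unit-normalized vectors, cosine similarity and dot product coincide exactly.

```python
import math

# Cosine similarity, and the identity that makes it the default:
# after normalization, dot product and cosine give the same number.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
assert abs(cosine(a, b) - dot(normalize(a), normalize(b))) < 1e-12
```

This is also why pre-normalizing vectors at index time lets a store use the cheaper dot product internally while serving cosine semantics.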

Choose the store that matches your operational constraints, not the one with the most features.

| Criterion | pgvector | Qdrant | Pinecone | Chroma |
|---|---|---|---|---|
| Best for | Already on Postgres | Complex metadata filtering | Turnkey managed scale | Prototyping and learning |
| Deployment | Extension on existing DB | Self-hosted or cloud | Fully managed | Embedded or client-server |
| Scale ceiling | ~5M vectors before tuning needed | Billions (Rust, HNSW) | Billions (serverless) | ~5M before migration pressure |
| Filtering | SQL WHERE clauses | Rich payload filters | Metadata filters | Basic metadata |
| Operational cost | Zero new infrastructure | Moderate (another service) | Pay-per-query | Minimal |
| Tradeoff | Performance at scale | Infra to manage | Vendor lock-in, cost at volume | Not production-grade at scale |

  1. Already run Postgres? Start with pgvector. Same transactions, same tooling, no sync pipeline.
  2. Need complex filters + scale? Qdrant or Milvus. Purpose-built for the workload.
  3. Want zero ops? Pinecone. Pay more, manage less.
  4. Building a prototype? Chroma. Embedded, no config, swap later.

The self-hosted break-even against managed serverless sits around 80–100 million queries per month. Below that threshold, managed services save engineering time.

The retriever determines what evidence the model sees. A mediocre generator with excellent retrieval outperforms an excellent generator with mediocre retrieval.

| Strategy | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Similarity search | Nearest neighbors in embedding space | Simple, semantic matching | Misses exact keywords |
| BM25 (lexical) | Term frequency–inverse document frequency | Exact match, acronyms, proper nouns | No semantic understanding |
| Hybrid (dense + sparse) | Combine similarity + BM25 via rank fusion | Best of both; 49% fewer failures | Two indexes to maintain |
| Reranking | Cross-encoder scores query–chunk pairs | 18–42% precision boost over retrieval alone | Adds latency; runs on top-N only |
| MMR (Maximal Marginal Relevance) | Penalize redundancy in retrieved set | Diverse results | Tuning lambda is fiddly |

The strongest general-purpose configuration:

1. Retrieve top-150 from dense index (semantic recall)
2. Retrieve top-150 from BM25 index (lexical recall)
3. Fuse and deduplicate (rank fusion)
4. Rerank to top-20 with cross-encoder (precision)
5. Pack top-k into prompt (context assembly)

Anthropic’s benchmarks showed this pipeline reduces retrieval failures by 67% compared to naive embedding search alone.
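Step 3, fuse and deduplicate, is commonly implemented with reciprocal rank fusion (RRF), which merges the dense and sparse rankings without having to compare their incompatible scores. The technique choice and the constant `k = 60` are conventional defaults, not something the pipeline above mandates:

```python
# Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per
# document; documents appearing high in both lists rise to the top.
# Deduplication falls out for free because scores accumulate per doc ID.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]   # hypothetical top results from embeddings
sparse = ["d1", "d9", "d3"]   # hypothetical top results from BM25
fused = rrf([dense, sparse])  # d1 (ranks 2 and 1) edges out d3 (1 and 3)
```

The fused list then goes to the cross-encoder reranker, which only has to score the surviving candidates.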

Retrieved chunks are raw material. How you pack them into the prompt determines whether the model uses them well.

| Principle | Rationale |
|---|---|
| Order by relevance | Models attend more to the beginning and end of context |
| Include source metadata | Enables citation; “According to [doc, section]…” |
| Stay under 8K tokens | Shorter, precise context outperforms 50K-token dumps |
| Separate context from instruction | Clear delimiter between retrieved text and task |

    You are answering questions using the provided context.
    Only use information from the context. If the context does
    not contain the answer, say so.

    ## Context
    {retrieved_chunks_with_source_metadata}

    ## Question
    {user_query}

The instruction to say “I don’t know” when context lacks the answer is a faithfulness guardrail. Without it, the model fills gaps from parametric memory — exactly what RAG was supposed to prevent.
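The packing principles above can be sketched as a single assembly function. Token counting here is approximated by whitespace-split words; a real system would use the model's tokenizer, and the chunk tuple shape is an assumption of this sketch:

```python
# Context packing sketch: relevance order, source tags for citation,
# a token budget, and the faithfulness guardrail in the instruction.
# chunks: list of (score, source, text), assumed already reranked.

def assemble_prompt(query, chunks, budget_tokens=8000):
    packed, used = [], 0
    for score, source, text in sorted(chunks, key=lambda c: -c[0]):
        cost = len(text.split())          # crude stand-in for tokenizing
        if used + cost > budget_tokens:
            break                         # stay under the budget
        packed.append(f"[{source}]\n{text}")
        used += cost
    context = "\n\n".join(packed)
    return (
        "You are answering questions using the provided context.\n"
        "Only use information from the context. If the context does\n"
        "not contain the answer, say so.\n\n"
        f"## Context\n{context}\n\n## Question\n{query}"
    )
```

Because chunks are packed in descending relevance, the budget cut always discards the weakest evidence first.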

RAG evaluation splits into two independent axes: did you retrieve the right evidence, and did you generate a faithful answer from it?

| Metric | Measures | Target |
|---|---|---|
| Recall@k | Fraction of relevant docs found in top-k | > 0.8 |
| Precision@k | Fraction of top-k that are actually relevant | > 0.7 |
| MRR | Reciprocal rank of first relevant result | > 0.7 |
| nDCG@k | Ranking quality weighted by position | > 0.7 |

Precision below 70% signals a chunking or embedding problem. Fix retrieval before tuning generation.
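The first three metrics need nothing more than retrieved IDs and a labeled relevant set, so there is little excuse for skipping them:

```python
# Retrieval metrics over ID lists: enough to tell whether the retriever
# or the generator is the broken stage.

def recall_at_k(retrieved, relevant, k):
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved, relevant):
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d4", "d1", "d9", "d2"]   # hypothetical ranked results
relevant = {"d1", "d2"}                # labeled ground truth
# recall@4 = 1.0, precision@4 = 0.5, MRR = 0.5
```

In practice MRR is averaged over a query set; the per-query form above is the building block.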

| Metric | Measures | How to Assess |
|---|---|---|
| Faithfulness | Answer grounded in retrieved context | LLM-as-judge or human |
| Relevance | Answer addresses the actual question | LLM-as-judge or human |
| Completeness | All relevant information from context is used | Human evaluation |
| Hallucination | Claims not supported by any retrieved chunk | Automated + spot checks |

| Signal | Alert When |
|---|---|
| Faithfulness score | Drops > 15% from baseline |
| Retrieval latency | p99 exceeds 500ms |
| End-to-end latency | Simple RAG > 2s, agentic > 8s |
| Empty retrieval rate | > 5% of queries return no chunks |
| Error rate spike | > 25% increase over window |

RAG adds infrastructure, latency, and failure modes. Use it only when the alternatives are worse.

| Situation | Better Alternative |
|---|---|
| Corpus fits in context (< 200K tokens) | Prompt stuffing with caching |
| Data is structured (tables, schemas) | Text-to-SQL or direct API query |
| Answers require full-document reasoning | Long-context model with complete document |
| Knowledge is static and small | Fine-tune or few-shot examples |
| Exact lookup, not semantic search | Traditional database query or keyword search |
| Real-time transactional data | Direct database access, not a stale index |

The question is not “can RAG solve this?” but “is RAG the simplest architecture that solves this?” If a SQL query or a cached prompt answers the question, RAG is accidental complexity.

| Concern | Problem | Mitigation |
|---|---|---|
| Staleness | Index lags behind source documents | Incremental ingestion pipeline; track document hashes |
| Consistency | Different chunks from different document versions | Version metadata on chunks; atomic re-index per doc |
| Cost | Embedding + storage + retrieval + generation | Cache frequent queries; batch embeddings; rerank fewer candidates |
| Latency | Retrieval adds 100–500ms before generation | Pre-compute embeddings; use ANN with HNSW; cache hot queries |
| Security | Retrieved chunks may contain sensitive data | Chunk-level access control; filter by user permissions |
| Observability | Black-box retrieval hides failure modes | Log retrieved chunk IDs per query; track retrieval scores |
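Hash-based change detection, the staleness mitigation above, amounts to storing a content hash per document at index time and re-embedding only documents whose hash has changed. A minimal sketch (the dict shapes are assumptions of this example):

```python
import hashlib

# Staleness detection: re-index a document only when its content hash
# differs from the hash recorded at last indexing.

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(sources: dict[str, str], indexed: dict[str, str]) -> list[str]:
    """sources: doc_id -> current text; indexed: doc_id -> stored hash."""
    return [
        doc_id for doc_id, text in sources.items()
        if indexed.get(doc_id) != content_hash(text)   # new or changed
    ]
```

Paired with atomic per-document re-indexing, this also addresses the consistency concern: all chunks of a document are replaced together under one hash.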

The query “What are the tax implications of stock options?” shares little lexical or embedding overlap with the document that answers it. HyDE (Hypothetical Document Embeddings) bridges this gap:

1. Ask the LLM to generate a hypothetical answer (no retrieval)
2. Embed the hypothetical answer (not the original query)
3. Use that embedding to retrieve real documents

The hypothetical answer, even if factually wrong, occupies the same embedding neighborhood as real answers. This shifts the search from query-space to answer-space, where documents actually live.

When it helps: Vague queries, exploratory questions, domain gaps between user language and document language.

When it hurts: Adds an LLM call before retrieval. Skip it when queries already match document vocabulary.
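The three steps reduce to a short control-flow sketch. `generate`, `embed`, and `search` are placeholder callables for your LLM, embedding model, and vector index; only the flow is the point:

```python
# HyDE sketch: embed a hypothetical answer instead of the query, then
# search with that embedding. The hypothetical may be factually wrong;
# it only needs to land in the right embedding neighborhood.

def hyde_retrieve(query, generate, embed, search, k=10):
    # 1. Generate a hypothetical answer, no retrieval involved
    hypothetical = generate(f"Write a short passage that answers: {query}")
    # 2. Embed the hypothetical answer, not the original query
    vec = embed(hypothetical)
    # 3. Retrieve real documents from answer-space
    return search(vec, k=k)
```

The extra `generate` call before retrieval is exactly the latency cost noted above, which is why HyDE is worth gating on query vagueness rather than applying uniformly.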

Complex questions often require evidence from multiple chunks that no single retrieval can surface:

"How does our auth system compare to the OWASP recommendations?"
Decomposed:
1. "What is our current auth system architecture?"
2. "What are the OWASP authentication recommendations?"
3. "Where do they diverge?"

Retrieve for each sub-query independently, then synthesize. The cost is multiple retrieval passes; the gain is evidence that a single query would miss.
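The retrieve-per-sub-query pattern, with deduplication across passes, looks like this. `decompose`, `retrieve`, and `synthesize` are placeholders for an LLM decomposition call, the retrieval pipeline, and the final generation step:

```python
# Query decomposition sketch: independent retrieval per sub-query,
# dedupe by chunk ID across passes, then synthesize over the union.

def decompose_and_retrieve(query, decompose, retrieve, synthesize):
    evidence, seen = [], set()
    for sub_query in decompose(query):
        for chunk_id, text in retrieve(sub_query):
            if chunk_id not in seen:         # dedupe across passes
                seen.add(chunk_id)
                evidence.append(text)
    return synthesize(query, evidence)
```

Note that synthesis sees the original question, not the sub-queries: the decomposition exists only to widen retrieval.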

The retriever becomes a tool the agent decides when and how to call:

Agent loop:
1. Analyze query → decide if retrieval is needed
2. Formulate retrieval query (may differ from user query)
3. Evaluate retrieved chunks → decide if sufficient
4. If insufficient → reformulate query or try different source
5. Generate answer from accumulated evidence

Agentic RAG trades latency for recall. The agent can retry, decompose, and cross-reference — capabilities that a single-pass pipeline lacks. The cost is unpredictable latency and harder evaluation.
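Steps 2–5 of the loop can be sketched with a bounded retry budget; step 1 (deciding whether to retrieve at all) is omitted for brevity. `retrieve`, `sufficient`, `reformulate`, and `generate` are placeholder callables, and `max_rounds` is the knob that caps the unpredictable latency:

```python
# Agentic RAG sketch: retrieval as a tool called in a loop, with the
# agent judging sufficiency and reformulating the query on retries.

def agentic_rag(query, retrieve, sufficient, reformulate, generate,
                max_rounds=3):
    evidence, q = [], query
    for _ in range(max_rounds):
        evidence.extend(retrieve(q))         # accumulate across rounds
        if sufficient(query, evidence):      # agent judges the evidence
            break
        q = reformulate(query, evidence)     # retry with a new query
    return generate(query, evidence)
```

The `max_rounds` bound is what keeps the latency trade from becoming unbounded; without it, a query the corpus cannot answer loops forever.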

| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| Chunk and pray | No evaluation; assumes retrieval works | Measure recall@k and precision@k before shipping |
| Dump 50K tokens of context | Drowns signal; models attend poorly to mid-context | Cap at 8K; rerank aggressively |
| Embed everything, filter nothing | Noise outranks signal | Metadata filters, access controls, reranking |
| Skip reranking | ANN search optimizes speed, not precision | Cross-encoder reranking on top-N |
| One chunk size for all document types | PDFs, code, and chat logs have different structure | Document-aware chunking per source type |
| No staleness detection | Index answers from last month’s docs | Hash-based change detection; incremental re-index |
| RAG when context window suffices | Added complexity for no retrieval benefit | Measure corpus size first; use prompt caching |
| Evaluate generation without evaluating retrieval | Tuning the generator when the retriever is broken | Evaluate each stage independently |

  1. Fix retrieval before generation — a perfect model cannot compensate for missing evidence
  2. Start with hybrid search — dense + sparse retrieval outperforms either alone
  3. Rerank the top-N — approximate nearest neighbor optimizes for speed, not precision
  4. Measure before you tune — precision@k and recall@k diagnose which stage is failing
  5. Keep assembled context short — 8K tokens of precise evidence beats 50K of marginally relevant text
  6. Chunk configuration matters as much as model choice — tune chunk size and overlap before swapping embedding models
  7. Version your index — document hashes and metadata prevent stale answers
  8. Ask whether RAG is necessary — if the corpus fits in context, skip the pipeline
  • Agent Memory — Memory architectures where RAG serves as one retrieval layer
  • Orchestration — RAG as a step in agent workflows, context routing challenges
  • System Design — Where vector stores and retrieval pipelines fit in system architecture
  • Complexity — RAG adds accidental complexity; when the simpler approach wins