RAG pipelines in production: chunking strategy, vector search, and retrieval quality evaluation
RAG (Retrieval-Augmented Generation) is the most common architecture for grounding LLM answers in internal data, but most first implementations work well only in demos, not in production. Chunking strategy affects recall; the embedding model affects retrieval accuracy; and without a retrieval evaluation pipeline, there is no way to know where the system is failing. This article covers the three most important technical decisions and how to measure quality before deployment.
RAG performs well in demos because demos typically use a handful of carefully selected documents and simple queries, and no one measures recall or precision. In production the situation is different: the knowledge base holds thousands of documents of uneven quality, users phrase questions in unexpected ways, and a retrieval miss means the LLM gives a wrong answer or hallucinates, with no warning.
The quality of a RAG pipeline depends on three independent layers: indexing (how documents are processed and stored in the vector store), retrieval (how relevant chunks are found at query time), and generation (how the LLM synthesizes an answer from the returned context). A failure at any layer degrades the output — but each layer has different symptoms and requires a different debugging approach.
Chunking strategy — the first and most important decision
Chunking is the process of splitting documents into smaller segments for indexing and retrieval. Chunking strategy directly affects recall: chunks too small lose context, chunks too large dilute the signal and waste tokens in the LLM context window. There is no universally correct chunk size — it requires experimentation on real data:
- Fixed-size chunking with overlap: split by token count (512, 1024) with 10–20% overlap to avoid cutting across important sentences. The simplest to implement but ignores document structure — a table or code block may be cut in the middle
- Semantic chunking: split at natural semantic boundaries — headings, paragraphs, sections. Better at preserving context but produces chunks of uneven size. Well suited to structured documents like technical documentation, FAQs, and policy documents
- Hierarchical chunking: index both parent chunks (large, for context) and child chunks (small, for precision). Retrieve using child chunks but return the parent chunk to the LLM — preserves broader context without diluting the retrieval signal
- Document-aware chunking: for PDFs with tables, code blocks, or images, extract and process each element separately. Tables should be converted to structured text before chunking — embedding raw HTML of a table typically produces poor retrieval results
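As an illustration of the first strategy in the list, here is a minimal sketch of fixed-size chunking with overlap. It assumes the document has already been tokenized into a list; the function name `chunk_with_overlap` and the sample sizes (512 tokens with 64 tokens of overlap, about 12.5%) are illustrative choices, not a recommendation.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens stand in for the output of a real tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=64)
print([len(c) for c in chunks])  # [512, 512, 304]
```

The last 64 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole by at least one chunk as long as it is shorter than the overlap.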
A quick test to evaluate chunking: take 20 real user questions, manually find the document passage containing the answer, then check which chunk covers it. If the answer is split across two chunks — increase chunk size or add more overlap.
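The quick test above can be partly automated once the answer passages have been located by hand. The sketch below assumes both chunks and answer passages are recorded as character offsets `(start, end)` into the source document; the helper names and the sample offsets are hypothetical.

```python
def find_covering_chunks(answer_span, chunk_spans):
    """Return indices of chunks whose character range fully contains the answer."""
    a_start, a_end = answer_span
    return [
        i for i, (c_start, c_end) in enumerate(chunk_spans)
        if c_start <= a_start and a_end <= c_end
    ]


def coverage_report(answer_spans, chunk_spans):
    """Return (share of answers covered by a single chunk, list of split answers)."""
    covered = [bool(find_covering_chunks(s, chunk_spans)) for s in answer_spans]
    split = [s for s, ok in zip(answer_spans, covered) if not ok]
    return sum(covered) / len(covered), split


# Hypothetical offsets: three overlapping chunks and three hand-labelled answers.
chunk_spans = [(0, 1000), (900, 1900), (1800, 2800)]
answer_spans = [(100, 300), (950, 1050), (1700, 1950)]

rate, split = coverage_report(answer_spans, chunk_spans)
print(f"covered by one chunk: {rate:.0%}; split answers: {split}")
```

Here the third answer spans the boundary between the second and third chunks and is wider than their overlap, which is the signal to increase chunk size or overlap.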
Embedding model — precision depends on this choice
The embedding model converts text to vectors for semantic similarity comparison. Model selection affects retrieval accuracy — and the best model on a general benchmark isn't necessarily the best model for your specific domain:
- Domain-specific vs general: models trained on broad text (like OpenAI's `text-embedding-3-large` or `e5-large`) work well across content types. For highly specialized documents (medical, legal, industrial engineering), a model fine-tuned on that domain typically delivers significantly better recall
- Non-English languages: multilingual models (multilingual-e5, BGE-M3) are required when the knowledge base contains non-English content. An English-only model will produce poor-quality embeddings for non-English text, leading to low recall regardless of how good the chunking is
- Dimensionality vs cost: high-dimensional vectors (text-embedding-3-large defaults to 3072 dimensions) tend to offer better precision but require more storage and compute than 256 or 384 dimensions. For knowledge bases under 100K chunks, the storage difference is negligible, so prioritize quality. For millions of chunks, the tradeoff deserves careful consideration
- Consistency: the embedding model used at indexing time and query time must be the same. Changing models after indexing requires re-embedding the entire knowledge base — choose deliberately before indexing begins
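To put rough numbers on the dimensionality tradeoff, the following back-of-the-envelope sketch estimates raw vector storage. It assumes vectors are stored as float32 (4 bytes per value) and ignores index overhead such as HNSW graph links and metadata, so real footprints will be larger; the corpus sizes are illustrative.

```python
def raw_vector_storage_bytes(num_chunks, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming float32 values by default."""
    return num_chunks * dimensions * bytes_per_value


for num_chunks in (100_000, 5_000_000):
    for dims in (384, 1536):
        gb = raw_vector_storage_bytes(num_chunks, dims) / 1e9
        print(f"{num_chunks:>9,} chunks x {dims:>4} dims = {gb:6.2f} GB")
```

Under these assumptions, 100K chunks stay well under 1 GB at either dimensionality, while 5 million chunks differ by roughly 23 GB of raw vectors, which is where the tradeoff starts to matter.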
Vector search and hybrid retrieval
With good embeddings in place, efficient indexing and querying are needed. The two search types commonly used in production RAG have different characteristics and are typically combined:
- Dense retrieval (HNSW index): searches by semantic similarity — understands paraphrasing, synonyms, and questions with the same meaning expressed in different words. Weakness: poor at exact keyword matching — if a user asks for 'error code E-1042', dense retrieval may miss it if the code is rare in the embedding model's training data
- Sparse retrieval (BM25): keyword-based lexical matching that is excellent for product codes, proper nouns, and document reference numbers. It has no semantic understanding, but it reliably finds exact term matches as long as the tokenizer keeps the term intact
- Hybrid search: combining dense and sparse via Reciprocal Rank Fusion (RRF) or weighted sum. Most production RAG systems benefit from hybrid — the recall improvement over dense-only is typically 10–20% on real-world queries
- Metadata filtering before vector search: if the knowledge base contains multiple document types (manuals, policies, FAQs), filter by metadata (document_type, date, department) before searching to reduce the search space and increase precision. Weaviate, Qdrant, and Pinecone all support metadata filters applied as part of the vector query
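Reciprocal Rank Fusion is simple enough to sketch in a few lines: each chunk receives a score of 1 / (k + rank) from every ranking it appears in, and the scores are summed. The sketch below assumes each retriever returns an ordered list of chunk IDs; the IDs are made up, and k = 60 is a commonly used constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


dense_results = ["c7", "c2", "c9", "c4"]   # hypothetical dense (vector) ranking
sparse_results = ["c4", "c7", "c1"]        # hypothetical BM25 ranking

fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)  # ['c7', 'c4', 'c2', 'c1', 'c9']
```

Chunks that appear in both lists (c7 and c4) rise to the top even though neither retriever agreed on their exact position. Because RRF uses only ranks, it needs no normalization between BM25 scores and cosine similarities, which is one reason it is often preferred over a weighted sum.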
Reranking — the forgotten layer
Vector search returns top-K results ranked by similarity score, but similarity score doesn't perfectly correlate with relevance for a specific query. A reranker is a cross-encoder model that re-evaluates each (query, chunk) pair and reorders them — slower than bi-encoder embedding but significantly more accurate:
- Cross-encoder reranker (Cohere Rerank, BGE-Reranker): takes the top 20–50 results from vector search, reranks them, and keeps only the top 3–5 for the LLM context. The share of relevant chunks in the context increases noticeably, reducing hallucinations caused by diluted context
- When to add reranking: if the LLM frequently responds 'I couldn't find that information' despite the knowledge base containing the answer, or if the answer doesn't use the most important chunk — a reranker typically fixes this. Latency overhead is 50–200ms depending on the model
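The retrieve-then-rerank flow can be sketched independently of any particular model. In the sketch below, `score_fn` is a placeholder for whatever scores a (query, chunk) pair; `token_overlap_score` is a deliberately crude stand-in used only to make the example runnable and is not a cross-encoder. In a real pipeline that function would call the reranker model.

```python
def rerank(query, candidates, score_fn, top_n=4):
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(score_fn(query, text), idx) for idx, text in enumerate(candidates)]
    # Highest score first; the original retrieval order breaks ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidates[idx] for _, idx in scored[:top_n]]


def token_overlap_score(query, text):
    """Stand-in scorer (NOT a cross-encoder): share of query tokens present in text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


# Hypothetical candidates, as if returned by the first-stage vector search.
candidates = [
    "Store opening hours and holiday schedule",
    "How to process a refund without a receipt",
    "Refund policy for online orders",
    "Employee onboarding checklist",
]

top = rerank("refund without receipt", candidates, token_overlap_score, top_n=2)
print(top)
```

The design point is that the expensive scorer only ever sees the shortlist from the first stage, so its latency scales with the number of candidates rather than with the size of the knowledge base.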
Evaluation pipeline — you can't improve what you don't measure
The most commonly skipped part of a RAG implementation: a systematic evaluation pipeline. Without evaluation, every change (chunking, embedding, reranking) is guesswork — there is no way to know whether a change improved or degraded the system:
- RAGAS framework: evaluates RAG across four main metrics: Context Recall (does the retrieved context contain the information needed for the answer?), Context Precision (are the relevant chunks ranked near the top of the retrieved context?), Faithfulness (is the answer supported by the context, or does the LLM hallucinate?), and Answer Relevancy (does the answer address the question?)
- Golden dataset: a set of 50–200 (question, ground-truth answer) pairs created from real documents and used to evaluate the pipeline every time any component changes. An LLM can generate a draft golden dataset from the documents, which a human then reviews
- Offline vs online evaluation: offline evaluation against a golden dataset provides fast feedback during development. Online evaluation tracking thumbs up/down or follow-up question rates on real traffic is the most important signal of actual quality in production
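A minimal offline evaluation loop might look like the sketch below. It is a simplified, ID-based approximation of context recall, not the RAGAS implementation (which judges recall with an LLM against the ground-truth answer). It assumes each golden item has been annotated with the IDs of the chunks that contain the answer; the `relevant_ids` field, the `retrieve` callable, and the sample data are all hypothetical.

```python
def context_recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of the relevant chunks that appear in the top-k retrieved chunks."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each golden item needs at least one relevant chunk ID")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def evaluate(golden, retrieve, k=4):
    """Mean context recall at k over the golden dataset."""
    scores = [
        context_recall_at_k(retrieve(item["question"]), item["relevant_ids"], k)
        for item in golden
    ]
    return sum(scores) / len(scores)


# Hypothetical golden items and a fake retriever standing in for the real pipeline.
golden = [
    {"question": "q1", "relevant_ids": ["c1"]},
    {"question": "q2", "relevant_ids": ["c2", "c3"]},
    {"question": "q3", "relevant_ids": ["c9"]},
]
fake_index = {"q1": ["c1", "c5"], "q2": ["c3", "c8", "c2"], "q3": ["c4", "c6"]}

mean_recall = evaluate(golden, lambda question: fake_index[question], k=2)
print(f"mean context recall@2: {mean_recall:.2f}")  # 0.50
```

Running the same function before and after a change to chunking, embedding, or retrieval gives a single number to compare, which is what turns each change from guesswork into a measurement.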
Debugging order when RAG quality is poor: (1) Check Context Recall. If it is low, the problem is upstream of the LLM, in chunking, embedding, or retrieval. (2) If Recall is fine but Faithfulness is low, the problem is in the LLM prompt or in a context that is too long. (3) If both are fine but users are unsatisfied, the issue is Answer Relevancy or mismatched expectations.
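The debugging order can be written down as a small triage function so that it runs alongside every evaluation. This is only a sketch: the 0.8 thresholds are arbitrary placeholders that would need to be set per project, and the function name and return strings are illustrative.

```python
def diagnose(context_recall, faithfulness, recall_floor=0.8, faithfulness_floor=0.8):
    """Map evaluation metrics to the pipeline layer to inspect first."""
    if context_recall < recall_floor:
        # The right chunks never reach the LLM, so fix retrieval first.
        return "retrieval: revisit chunking or embedding"
    if faithfulness < faithfulness_floor:
        # The context is good, but the answer is not grounded in it.
        return "generation: revisit the prompt or shorten the context"
    return "product: check answer relevancy and user expectations"


print(diagnose(context_recall=0.61, faithfulness=0.71))
print(diagnose(context_recall=0.84, faithfulness=0.71))
print(diagnose(context_recall=0.84, faithfulness=0.89))
```

Checking recall before faithfulness matters because a low faithfulness score is hard to interpret when the context did not contain the answer in the first place.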
Real example: RAG for a technical support system
In an AI agent project for a 12-store retail chain, a RAG pipeline was built to answer staff questions about policies, procedures, and the product catalog. The first implementation used fixed-size 512-token chunks and OpenAI embeddings, and Context Recall on the golden dataset reached 61%. After switching to hierarchical chunking and hybrid search (dense + BM25), Recall rose to 84%. Adding BGE-Reranker and cutting the context passed to the LLM from the top 10 chunks to the top 4 after reranking brought Faithfulness from 71% to 89%, and the rate of questions users had to repeat dropped by 40%.
Conclusion
RAG is not a single component but a pipeline with multiple design decisions — chunking, embedding, retrieval, reranking, generation. Each must be designed based on the characteristics of the actual data and queries, not framework defaults. The evaluation pipeline is a non-negotiable component: if you can't measure it, you can't improve it. In KonexForge's AI & ML layer, RAG is one of the most frequently deployed use cases — and evaluation-first is the design principle from day one of every Pilot Build.