
RAG sounds simple — chunk documents, embed them, retrieve, generate. In practice, naive RAG fails in production in predictable ways. Here’s what I’ve learned building RAG systems for medical, legal, and compliance use cases.
Chunking strategy matters more than model choice
Most teams spend hours picking between GPT-4 and Claude while using fixed 512-token chunks. The chunking strategy (recursive, semantic, or document-aware) usually does more for retrieval quality, and therefore for the final answer, than the LLM you choose.
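To make "recursive" concrete, here is a minimal sketch of a recursive splitter. It is an illustration, not the splitter from any particular library: the separator list, the `max_chars` limit, and the function name `recursive_chunks` are my own assumptions, and it measures length in characters rather than tokens to stay dependency-free.

```python
from typing import List, Sequence

# Separators tried in order, from coarse (paragraph) to fine (word).
# This ordering is an assumption for illustration; tune it per corpus.
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_chunks(
    text: str, max_chars: int = 200, separators: Sequence[str] = SEPARATORS
) -> List[str]:
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]

    chunks: List[str] = []
    buffer = ""
    for piece in pieces:
        # Greedily pack neighbouring pieces into one chunk while they fit.
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= max_chars:
            buffer = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_chunks(piece, max_chars, rest))
    if buffer:
        chunks.append(buffer)
    return chunks


doc = (
    "Intro paragraph about dosage.\n\n"
    + "Sentence number one is here. " * 20
    + "\n\nShort closing note."
)
chunks = recursive_chunks(doc, max_chars=120)
```

The point of the design is that paragraph boundaries survive whenever they can, and only the oversized middle paragraph gets broken down by sentence. Note that the separator is dropped at chunk boundaries, so a real implementation would likely want to preserve punctuation and add overlap between chunks.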
Hybrid search: sparse + dense
Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Combining both (pgvector + full-text search in PostgreSQL, or Pinecone with sparse vectors) consistently outperforms either alone.
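One common way to merge the two result lists is reciprocal rank fusion (RRF). The sketch below assumes you already have two ranked lists of document IDs, one from keyword search and one from vector search; the IDs, the function name, and the constant `k = 60` (a frequently used default) are illustrative assumptions, not a description of how pgvector or Pinecone fuse results internally.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same query from the two retrievers.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF only looks at ranks, so you do not have to normalize BM25-style scores against cosine similarities. Documents that both retrievers agree on float to the top.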
Reranking is not optional for serious applications
Adding a reranker (Cohere or a cross-encoder) after initial retrieval noticeably improved answer quality in the Metabolic MD project. The cost is minimal; the gain is significant.
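The shape of the two-stage pipeline is simple: over-retrieve, score each (query, document) pair with a stronger model, keep the best few. The sketch below shows only that shape. `toy_overlap_score` is a deliberately crude stand-in I made up so the example runs without dependencies; in a real system you would pass a function that calls a cross-encoder or a hosted reranker instead.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the top_k best."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer (NOT a real reranker): fraction of query terms found in doc."""
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0


# Hypothetical first-stage results, in the order the retriever returned them.
candidates = [
    "history of the clinic",
    "adult dosage guidance for metformin and insulin",
    "metformin dosage for adults",
]
top = rerank("metformin dosage adults", candidates, toy_overlap_score, top_k=2)
```

Keeping the scorer behind a plain callable makes it easy to swap rerankers and compare them on the same candidate sets.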
Evaluate retrieval and generation separately
Use RAGAS or a custom eval set to score retrieval on its own (for example recall@k or precision@k) before you even write the generation prompt. If retrieval is broken, no prompt engineering will save you.
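A custom eval set can start very small. The sketch below computes recall@k over a hand-labeled set; it does not use RAGAS, and the query IDs, document IDs, and function names are hypothetical. It assumes that for each query you have labeled which document IDs are relevant (`gold`) and recorded what the retriever returned (`runs`).

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every labeled query; missing runs count as empty."""
    scores = [recall_at_k(runs.get(query, []), gold[query], k) for query in gold]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels and retriever output.
gold = {"q1": {"d1", "d2"}, "q2": {"d5"}}
runs = {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d8", "d5"]}

print(mean_recall_at_k(runs, gold, k=2))  # 0.25
print(mean_recall_at_k(runs, gold, k=3))  # 1.0
```

A gap like the one between k=2 and k=3 here tells you the right documents are being found but ranked too low, which is a retrieval or reranking problem, not a prompt problem.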
