Architecture

RAG Architecture That Actually Works at Scale

May 2025 / 8 min read

RAG sounds simple: chunk documents, embed them, retrieve, generate. In practice, naive RAG fails in production in predictable ways. Here’s what I’ve learned building RAG systems for medical, legal, and compliance use cases.

01.

Chunking strategy matters more than model choice

Most teams spend hours picking between GPT-4 and Claude while leaving chunking at a fixed 512 tokens. The chunking strategy (recursive, semantic, or document-aware) has a bigger impact on retrieval quality than the LLM you choose, because the model can only answer from the passages retrieval hands it.
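As a rough illustration of the recursive option, here is a minimal sketch that splits on paragraph breaks first and falls back to lines, sentences, and words only when a piece is still too long. The name `recursive_chunk`, the separator order, and the use of characters as a stand-in for tokens are assumptions made for the example, not a prescribed implementation.

```python
from typing import List, Tuple


def recursive_chunk(
    text: str,
    max_chars: int = 200,
    separators: Tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split on the coarsest separator first; recurse to finer separators
    only for pieces that are still longer than max_chars."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)  # flush the chunk that is already full
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_chunk(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks
```

In this sketch the separator is dropped wherever a chunk boundary falls, and length is measured in characters. Swapping `len` for a tokenizer-based count would be the natural next step if the budget is expressed in tokens.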

02.

Hybrid search: sparse + dense

Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Combining both (pgvector + full-text search in PostgreSQL, or Pinecone with sparse vectors) consistently outperforms either alone.
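One common way to combine the two result lists is reciprocal rank fusion (RRF), which only needs the rank positions from each retriever. The sketch below is an illustration under that assumption; the text above does not say which fusion method was used, and the function name, the document IDs, and the constant `k = 60` are choices made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword query and a vector query.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because RRF ignores raw scores, it sidesteps the problem that full-text relevance scores and vector distances live on different scales.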

03.

Reranking is not optional for serious applications

Adding a reranker (Cohere's, or a cross-encoder) after initial retrieval noticeably improved answer quality in the Metabolic MD project. The added cost is small relative to the gain.
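The shape of a rerank step can be sketched without committing to a provider: retrieve a generous candidate set, score each (query, passage) pair, and keep the top few. In the sketch below, `rerank` and `token_overlap` are hypothetical names, and the token-overlap scorer is only a toy stand-in so the example runs on its own; in a real system `score_fn` would call a cross-encoder model or a hosted rerank endpoint.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_n best."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def token_overlap(query: str, doc: str) -> float:
    """Toy scorer: fraction of query tokens that appear in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)


candidates = [
    "Exercise guidelines for adults",
    "Metformin dosage for adults with type 2 diabetes",
    "Metformin side effects",
]
top = rerank("metformin dosage adults", candidates, token_overlap, top_n=2)
```

Keeping the scorer behind a plain callable makes it easy to compare rerankers against the same candidate set before paying for one.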

04.

Evaluate retrieval and generation separately

Use RAGAS or a custom eval set to score retrieval at k (for example, recall@k) before you even write the generation prompt. If retrieval is broken, no prompt engineering will save you.
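For the custom eval set route, a retrieval-only check can be very small. The sketch below computes recall@k over a hand-labelled set; the function names, the query IDs, and the document IDs are made up for the example, and recall@k is one reasonable metric among several rather than the only option.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    results: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every query in the labelled eval set."""
    if not gold:
        return 0.0
    total = sum(recall_at_k(results.get(query, []), relevant, k)
                for query, relevant in gold.items())
    return total / len(gold)


# Hypothetical eval set: query ID -> IDs of documents that should be retrieved.
gold = {"q1": {"d1", "d2"}, "q2": {"d5"}}
# Hypothetical retriever output: query ID -> ranked document IDs.
results = {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d8", "d5"]}

score_at_2 = mean_recall_at_k(results, gold, k=2)
score_at_3 = mean_recall_at_k(results, gold, k=3)
```

Tracking this number while changing chunking, fusion, or reranking shows whether a change helped retrieval itself, independent of how the generation prompt is written.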

Usman Ghani, Full-Stack Developer & AI Engineer

Building production-grade AI systems and web applications for international clients. 3+ years shipping end-to-end products across the US and Australia.
