
Why Your RAG System Sucks (And How to Fix It)

I’ve reviewed dozens of RAG implementations. Most of them suck. Not because the teams are bad — but because RAG has deceptively simple tutorials and brutally hard production requirements.

Here are the 7 failures I see over and over, and exactly how to fix each one.

1. Your Chunks Are Wrong

The problem: Fixed-size chunking (e.g., 500 tokens) splits sentences mid-thought, separates headers from content, and puts table rows in different chunks.

The fix: Use document-aware recursive chunking:

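# Newer LangChain releases expose this class from the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size and chunk_overlap count characters by default, not tokens.
# Use RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) for token-based sizes.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    # Try paragraph breaks first, then line breaks, then sentences, then words.
    separators=["\n\n", "\n", ". ", " "],
)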

Better yet, use semantic chunking — split when the embedding similarity between consecutive sentences drops below a threshold.
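Here is a minimal sketch of that idea, not a production implementation. The function names (`semantic_chunks`, `bag_of_words`, `cosine`) and the 0.2 threshold are my own assumptions for illustration, and `bag_of_words` is a toy stand-in for a real embedding model so the example runs without downloads. Swap in your own `embed` function and tune the threshold on your data.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def bag_of_words(sentence: str) -> Vector:
    """Toy stand-in for an embedding model: lowercase word counts."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    return dict(Counter(w for w in words if w))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector] = bag_of_words,
    threshold: float = 0.2,  # assumed value; tune on your own corpus
) -> List[str]:
    """Start a new chunk whenever consecutive sentences are dissimilar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) < threshold:
            chunks.append([sentence])    # similarity dropped: topic change
        else:
            chunks[-1].append(sentence)  # same topic: extend current chunk
        previous = current
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Refunds are issued within 30 days.",
    "Refunds are issued to the original card.",
    "Our office is closed on public holidays.",
]
chunks = semantic_chunks(sentences)
# The two refund sentences stay together; the office sentence starts a new chunk.
```

In practice you would also cap the chunk length, since a long run of similar sentences could otherwise grow without bound.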

2. You’re Not Re-ranking

The problem: Vector similarity search returns the top-K closest chunks, but “closest embedding” ≠ “most relevant answer.”

The fix: Add a cross-encoder re-ranker after retrieval:

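from sentence_transformers import CrossEncoder

# Score each (query, chunk) pair jointly; a higher score means more relevant.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk.text) for chunk in results])
# Sort the retrieved chunks by re-ranker score, best first.
reranked = sorted(zip(results, scores), key=lambda pair: pair[1], reverse=True)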

This alone improves answer quality by 20-30% in my experience.

3. You’re Only Using Vector Search

The problem: Vector search misses exact keyword matches. If a user asks “What is policy ABC-123?”, semantic search might return policies ABC-124 and ABC-122 instead.

The fix: Hybrid search — combine BM25 keyword search with vector similarity:

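# Simplified pseudocode: adapt these calls to your actual Pinecone and Elasticsearch clients.
vector_results = pinecone.query(embedding, top_k=20)      # semantic candidates
keyword_results = elasticsearch.search(query, top_k=20)   # BM25 keyword candidates
# Merge the two ranked lists into one ranking.
merged = reciprocal_rank_fusion(vector_results, keyword_results)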

Hybrid search outperforms either approach alone in every benchmark I’ve tested.
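The `reciprocal_rank_fusion` helper isn't a library function, so here is one way it could look. This sketch assumes each result list has already been reduced to document IDs in ranked order (best first), which is my simplification. The constant `k=60` is a commonly used default, not something tuned for your data.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(*rankings: Sequence[Hashable], k: int = 60) -> List[Hashable]:
    """Merge ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


vector_ids = ["ABC-124", "ABC-123", "ABC-122"]  # semantic neighbours
keyword_ids = ["ABC-123", "ABC-999"]            # exact keyword hits
merged_ids = reciprocal_rank_fusion(vector_ids, keyword_ids)
print(merged_ids)  # ['ABC-123', 'ABC-124', 'ABC-999', 'ABC-122']
```

ABC-123 wins because it is the only document both retrievers agree on. Since the method uses ranks rather than raw scores, you don't have to normalize BM25 scores against cosine similarities.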

4. You Have No Evaluation Pipeline

The problem: You shipped RAG to production and have no idea if it’s working well.

The fix: Set up automated evaluation with RAGAS or custom metrics:

  • Faithfulness — does the answer only use facts from the context?
  • Relevancy — are the retrieved chunks relevant to the question?
  • Answer correctness — is the final answer accurate?

Run this on a golden dataset of 100+ question-answer pairs weekly.
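If you go the custom-metrics route, the simplest place to start is a retrieval check. The sketch below is not RAGAS and does not measure faithfulness or answer correctness. It is a rough proxy for retrieval relevancy that assumes each golden example records the ID of the chunk that should be retrieved. The field names (`question`, `expected_chunk_id`) and the `retrieve` callable are hypothetical.

```python
from typing import Callable, Dict, List


def retrieval_hit_rate(
    golden: List[Dict[str, str]],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Fraction of golden questions whose expected chunk is in the top-k results."""
    if not golden:
        return 0.0
    hits = 0
    for example in golden:
        top_ids = retrieve(example["question"])[:k]
        if example["expected_chunk_id"] in top_ids:
            hits += 1
    return hits / len(golden)


# Tiny illustrative golden set and a fake retriever backed by a lookup table.
golden_set = [
    {"question": "What is policy ABC-123?", "expected_chunk_id": "c1"},
    {"question": "How long is data retained?", "expected_chunk_id": "c7"},
]
fake_index = {
    "What is policy ABC-123?": ["c1", "c2"],
    "How long is data retained?": ["c3", "c4"],
}
rate = retrieval_hit_rate(golden_set, lambda q: fake_index.get(q, []))
print(f"hit rate: {rate:.2f}")
```

Log this number on every run so a chunking or embedding change that hurts retrieval shows up as a drop, not as a user complaint.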

5. You’re Stuffing Too Much Context

The problem: Retrieving 20 chunks and stuffing them all into the prompt. The LLM gets confused, latency spikes, and costs balloon.

The fix: Retrieve 20, re-rank, and pass only the top 3-5 most relevant chunks. Less is more.
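The trimming step is only a few lines. This sketch assumes you already have one re-ranker score per retrieved chunk. The `select_context` name and the default of 4 are my own choices for illustration.

```python
from typing import List, Sequence


def select_context(chunks: Sequence[str], scores: Sequence[float], top_n: int = 4) -> List[str]:
    """Keep only the top_n chunks by re-ranker score, best first."""
    if len(chunks) != len(scores):
        raise ValueError("chunks and scores must have the same length")
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


retrieved = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"]
rerank_scores = [0.10, 0.92, 0.35, 0.88, 0.05]
context = select_context(retrieved, rerank_scores, top_n=3)
prompt_context = "\n\n".join(context)  # only these chunks go into the prompt
```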

6. You’re Not Tracking Metadata

The problem: All chunks look the same — no source, date, or category information.

The fix: Attach metadata to every chunk at indexing time:

{
    "text": "chunk content...",
    "source": "policy-handbook-v3.pdf",
    "page": 42,
    "section": "Data Retention",
    "last_updated": "2025-06-15",
    "document_type": "policy"
}

This enables filtered retrieval — search only within specific document types or date ranges.
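To make the idea concrete, here is an in-memory sketch of a metadata filter over chunk records shaped like the one above. In a real system you would pass the filter to your vector database so it applies during search. The `filter_chunks` function and its parameters are hypothetical names for illustration.

```python
from datetime import date
from typing import Any, Dict, List, Optional


def filter_chunks(
    chunks: List[Dict[str, Any]],
    document_type: Optional[str] = None,
    updated_since: Optional[date] = None,
) -> List[Dict[str, Any]]:
    """Return only the chunks matching the given metadata constraints."""
    selected = []
    for chunk in chunks:
        if document_type is not None and chunk.get("document_type") != document_type:
            continue
        if updated_since is not None:
            # Assumes last_updated is an ISO date string such as "2025-06-15".
            if date.fromisoformat(chunk["last_updated"]) < updated_since:
                continue
        selected.append(chunk)
    return selected


indexed_chunks = [
    {"text": "Retention rules...", "document_type": "policy", "last_updated": "2025-06-15"},
    {"text": "Old retention rules...", "document_type": "policy", "last_updated": "2022-01-10"},
    {"text": "Release notes...", "document_type": "changelog", "last_updated": "2025-07-01"},
]
recent_policies = filter_chunks(
    indexed_chunks, document_type="policy", updated_since=date(2024, 1, 1)
)
```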

7. You’re Not Handling “I Don’t Know”

The problem: When the answer isn’t in the documents, the LLM hallucinates a plausible-sounding response.

The fix: Add a confidence gate:

system_prompt = """
Answer the question using ONLY the provided context.
If the context doesn't contain enough information, respond with:
"I don't have enough information to answer this question."
Never make up information.
"""

Also: track how often users get “I don’t know” responses. If it’s > 15%, your document coverage has gaps.
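Tracking that rate can be as simple as counting the fallback sentence in your response logs. The sketch below assumes the model returns the exact fallback wording from the prompt, which won't always hold, so treat the substring match as a starting point. The function names are hypothetical.

```python
from typing import Iterable

FALLBACK = "I don't have enough information to answer this question."


def fallback_rate(responses: Iterable[str], fallback: str = FALLBACK) -> float:
    """Fraction of responses that contain the fallback sentence."""
    responses = list(responses)
    if not responses:
        return 0.0
    misses = sum(1 for response in responses if fallback in response)
    return misses / len(responses)


def has_coverage_gap(responses: Iterable[str], threshold: float = 0.15) -> bool:
    """True when the fallback rate exceeds the threshold (15% by default)."""
    return fallback_rate(responses) > threshold


logged_responses = ["The retention period is 7 years."] * 8 + [FALLBACK] * 2
rate = fallback_rate(logged_responses)
gap = has_coverage_gap(logged_responses)
```

Keep the questions that triggered the fallback as well. They tell you which documents are missing from the index.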

The Bottom Line

A production RAG system isn’t a vector database + GPT-4. It’s a retrieval engineering problem that requires chunking strategy, hybrid search, re-ranking, evaluation, and continuous monitoring.

Get these 7 things right and your RAG system goes from “cool demo” to “business-critical tool.”