Why Your RAG System Sucks (And How to Fix It)
I’ve reviewed dozens of RAG implementations. Most of them suck. Not because the teams are bad — but because RAG has deceptively simple tutorials and brutally hard production requirements.
Here are the 7 failures I see over and over, and exactly how to fix each one.
1. Your Chunks Are Wrong
The problem: Fixed-size chunking (e.g., 500 tokens) splits sentences mid-thought, separates headers from content, and puts table rows in different chunks.
The fix: Use document-aware recursive chunking:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " "],
)
Better yet, use semantic chunking — split when the embedding similarity between consecutive sentences drops below a threshold.
2. You’re Not Re-ranking
The problem: Vector similarity search returns the top-K closest chunks, but “closest embedding” ≠ “most relevant answer.”
The fix: Add a cross-encoder re-ranker after retrieval:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk.text) for chunk in results])
reranked = sorted(zip(results, scores), key=lambda x: -x[1])
This alone improves answer quality by 20-30% in my experience.
3. You’re Using Pure Vector Search
The problem: Vector search misses exact keyword matches. If a user asks “What is policy ABC-123?”, semantic search might return policies ABC-124 and ABC-122 instead.
The fix: Hybrid search — combine BM25 keyword search with vector similarity:
vector_results = pinecone.query(embedding, top_k=20)
keyword_results = elasticsearch.search(query, top_k=20)
merged = reciprocal_rank_fusion(vector_results, keyword_results)
Hybrid search outperforms either approach alone in every benchmark I’ve tested.
4. You Have No Evaluation Pipeline
The problem: You shipped RAG to production and have no idea if it’s working well.
The fix: Set up automated evaluation with RAGAS or custom metrics:
- Faithfulness — does the answer only use facts from the context?
- Relevancy — are the retrieved chunks relevant to the question?
- Answer correctness — is the final answer accurate?
Run this on a golden dataset of 100+ question-answer pairs weekly.
5. You’re Stuffing Too Much Context
The problem: Retrieving 20 chunks and stuffing them all into the prompt. The LLM gets confused, latency spikes, and costs balloon.
The fix: Retrieve 20, re-rank, and pass only the top 3-5 most relevant chunks. Less is more.
6. You’re Not Tracking Metadata
The problem: All chunks look the same — no source, date, or category information.
The fix: Attach metadata to every chunk at indexing time:
{
"text": "chunk content...",
"source": "policy-handbook-v3.pdf",
"page": 42,
"section": "Data Retention",
"last_updated": "2025-06-15",
"document_type": "policy"
}
This enables filtered retrieval — search only within specific document types or date ranges.
7. You’re Not Handling “I Don’t Know”
The problem: When the answer isn’t in the documents, the LLM hallucinates a plausible-sounding response.
The fix: Add a confidence gate:
system_prompt = """
Answer the question using ONLY the provided context.
If the context doesn't contain enough information, respond with:
"I don't have enough information to answer this question."
Never make up information.
"""
Also: track how often users get “I don’t know” responses. If it’s > 15%, your document coverage has gaps.
The Bottom Line
A production RAG system isn’t a vector database + GPT-4. It’s a retrieval engineering problem that requires chunking strategy, hybrid search, re-ranking, evaluation, and continuous monitoring.
Get these 7 things right and your RAG system goes from “cool demo” to “business-critical tool.”