RAG that survives production
Most RAG demos melt the second a real user touches them. Here's the minimum viable eval + guardrail setup we use on every project.
We've shipped about thirty RAG systems over the last eighteen months. Most were fine. A few were terrible. The difference wasn't the model — it was whether the team put the boring plumbing in place before launch. This is our cheat sheet for that plumbing.
The minimum viable RAG stack
You don't need a vector DB with seventeen knobs. For most projects, this is enough:
- Postgres with the pgvector extension. One less service to run. Good enough up to about 10M documents.
- A chunker that respects document structure (paragraphs, headings) instead of blindly splitting by tokens.
- An embedding model you can swap. Start with a hosted one, re-embed when you switch.
- Hybrid search: vector similarity + BM25. Dense alone misses rare keywords.
- A re-ranker for the top 20 results. Cheap, enormous quality win.
That's it. You can add HyDE, query rewriting, graph retrieval, and the rest later — once you have evals telling you they help.
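One way to combine the vector and BM25 result lists from the hybrid search bullet is reciprocal rank fusion. The sketch below is a minimal illustration, not necessarily how any given project wires it up: the function name, the chunk IDs, and the constant `k=60` (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID so output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical top-3 results from each retriever.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c9", "c3", "c1"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Rank-based fusion sidesteps the fact that cosine similarity and BM25 scores live on different scales, so neither list needs normalizing before the re-ranker sees the merged candidates.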
Evals you actually need
“We eyeballed it and it looked good” is how RAG systems end up with users yelling at support. You need three evals on day one:
1. Retrieval eval: given a question, is the right chunk in the top-k? Build it from 50–200 human-labeled examples.
2. Answer eval: given the retrieved chunks, is the final answer faithful? LLM-as-judge with a rubric works if you validate it against humans on a sample.
3. Refusal eval: when the docs don't answer the question, does the system refuse cleanly? This one catches more hallucinations than the other two combined.
Run them in CI. Block the merge if quality drops by more than a few percent against the last accepted baseline. It feels like overkill in month one and will save your product in month six.
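As a rough sketch of what the retrieval eval, the refusal eval, and the CI gate can look like in code: the function names, the convention that a refusal is returned as `None`, and the 0.03 threshold (read here as three absolute points) are assumptions for illustration. The answer eval is left out because it depends on the judge model you pick.

```python
def recall_at_k(examples, retrieve, k=5):
    """Fraction of labeled questions whose gold chunk is in the top-k results.

    examples: list of (question, gold_chunk_id) pairs.
    retrieve: callable returning a ranked list of chunk IDs for a question.
    """
    if not examples:
        raise ValueError("eval set is empty")
    hits = sum(1 for question, gold_id in examples
               if gold_id in retrieve(question)[:k])
    return hits / len(examples)


def refusal_rate(unanswerable, answer):
    """Fraction of out-of-scope questions the system declines.

    Assumes answer() returns None when the system refuses.
    """
    if not unanswerable:
        raise ValueError("eval set is empty")
    refused = sum(1 for question in unanswerable if answer(question) is None)
    return refused / len(unanswerable)


def passes_gate(current, baseline, max_drop=0.03):
    """True if the score has not dropped more than max_drop below baseline."""
    return current >= baseline - max_drop


# Toy data standing in for a labeled eval set and a real retriever.
labeled = [("q1", "a"), ("q2", "b"), ("q3", "c"), ("q4", "d")]
fake_index = {"q1": ["a", "x"], "q2": ["y", "b"], "q3": ["z", "w"], "q4": ["d"]}

score = recall_at_k(labeled, lambda q: fake_index[q], k=2)
print(score, passes_gate(score, baseline=0.80))
```

In CI, a script like this would exit non-zero when `passes_gate` returns False, which is what blocks the merge.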
Guardrails that matter
Most RAG guardrails people build are theater. These are the ones that pay off:
- Always cite the source chunk(s) in the response. Hallucinations become obvious.
- If retrieval score is below a threshold, refuse instead of answering. “I don't know” is a feature.
- Log every query with the retrieved chunks. You need a trail when a user complains.
- Rate-limit per user. One overly-curious enterprise customer can eat a month's inference budget.
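The first two guardrails (cite the chunks, refuse below a score threshold) fit in a few lines. This is a minimal sketch under assumptions: the `Chunk` type, the `generate` callable, the refusal wording, and the `min_score=0.5` cutoff are all hypothetical, and a real threshold would come from your eval set.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    score: float  # retrieval or re-ranker score, higher is better


REFUSAL = "I don't know. The documents I have don't cover this."


def answer_with_guardrails(question, chunks, generate, min_score=0.5):
    """Refuse when no chunk clears the threshold; otherwise answer with citations."""
    usable = [c for c in chunks if c.score >= min_score]
    if not usable:
        return {"answer": REFUSAL, "citations": [], "refused": True}
    text = generate(question, usable)
    return {
        "answer": text,
        "citations": [c.chunk_id for c in usable],
        "refused": False,
    }


def fake_generate(question, chunks):
    # Stand-in for the real LLM call.
    return "Answer based on " + ", ".join(c.chunk_id for c in chunks)


strong = [Chunk("c1", "reset steps", 0.82), Chunk("c2", "unrelated", 0.31)]
weak = [Chunk("c9", "unrelated", 0.12)]

print(answer_with_guardrails("How do I reset my password?", strong, fake_generate))
print(answer_with_guardrails("What is the CEO's salary?", weak, fake_generate))
```

Returning the citations as structured data rather than only inside the answer text also makes the logging guardrail simpler: the same dictionary can be written to the query log as-is.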
Monitoring
Hook up the three things you'll actually want to look at:
1. Per-query latency, broken out by retrieval / rerank / generation.
2. Per-query cost. You'll be shocked by the long tail of expensive queries.
3. User feedback (thumbs up/down) stored next to the query and retrieved chunks. This becomes your next eval set.
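A small sketch of how those three signals can end up in one record per query. The record layout, the `timed` helper, and the placeholder cost figure are assumptions for illustration; in practice the record would go to whatever log store you already run.

```python
import time
from contextlib import contextmanager


def new_record(query):
    """One record per query: latency by stage, cost, chunks, and feedback."""
    return {
        "query": query,
        "latency_s": {},
        "cost_usd": 0.0,
        "chunk_ids": [],
        "feedback": None,  # later set to "up" or "down" by the UI
    }


@contextmanager
def timed(record, stage):
    """Store wall-clock seconds spent in a stage, even if the stage raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record["latency_s"][stage] = time.perf_counter() - start


record = new_record("How do I reset my password?")

with timed(record, "retrieval"):
    record["chunk_ids"] = ["c3", "c1"]  # stand-in for the real retriever

with timed(record, "rerank"):
    record["chunk_ids"].sort()  # stand-in for the real re-ranker

with timed(record, "generation"):
    record["cost_usd"] += 0.002  # placeholder value, not a real price

record["feedback"] = "up"
print(record)
```

Because feedback sits next to the query and chunk IDs, turning thumbs-down records into new labeled eval examples is a filter over this log rather than a separate data pipeline.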
When to reach for agents
Almost never. Most “agent” problems are really RAG plus a tool call. Add the tool call, keep the orchestration deterministic. You get debuggability, predictable cost, and a system you can explain to your CEO.
The places agents earn their complexity: multi-step research where the next step genuinely depends on the last, and workflows where a human can correct mid-flight. Everything else, keep it boring.
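To make "RAG plus a tool call" concrete, here is a sketch of a fixed pipeline with one explicit routing rule. Everything in it is hypothetical: the `order_status` tool, the keyword rule, and the stub functions are stand-ins for whatever your product needs.

```python
def handle(question, retrieve, call_tool, generate):
    """Fixed pipeline: retrieve, optionally call one tool, then generate.

    The control flow is ordinary code, so every run takes one of two known
    paths and makes at most one tool call.
    """
    chunks = retrieve(question)
    tool_result = None
    # A plain, inspectable routing rule instead of a model deciding what to do.
    if "order status" in question.lower():
        tool_result = call_tool("order_status", question)
    return generate(question, chunks, tool_result)


# Stubs standing in for the real retriever, tool layer, and LLM call.
def fake_retrieve(question):
    return ["c1"]


def fake_tool(name, question):
    return {"tool": name, "status": "shipped"}


def fake_generate(question, chunks, tool_result):
    return {"chunks": chunks, "tool_result": tool_result}


print(handle("What is my order status?", fake_retrieve, fake_tool, fake_generate))
print(handle("How do refunds work?", fake_retrieve, fake_tool, fake_generate))
```

The routing rule is deliberately dumb. If it needs to get smarter, a classifier can replace the keyword check without turning the rest of the pipeline into a loop.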