# RAG Is Not Vector Search
Vector search is one layer of eight. Here's the full checklist for production RAG -- and what breaks when you skip layers.
You embedded your docs. You ran a query. Results came back. Ship it?
Not even close.
Vector search is one layer of a production RAG system. We counted eight. Most teams get three right and wonder why their chatbot hallucinates on page 2 of the PDF.
## The checklist nobody gives you
We audited our own pipeline against what production RAG actually requires. Not the conference-talk version. The version where a customer uploads a 200-page compliance document in Korean and expects correct answers.
Here are the eight layers, in the order they break:
| # | Layer | What breaks without it |
|---|---|---|
| 1 | Chunking strategy | Sentences split mid-thought. Context lost at boundaries. |
| 2 | Query enhancement | Short queries miss relevant docs. “Cancel contract” doesn’t find “termination procedure”. |
| 3 | Reranking | Top-k results are noisy. LLM gets confused by irrelevant chunks. |
| 4 | Multimodal extraction | Tables become garbled text. Images disappear. PDF structure is gone. |
| 5 | Evaluation pipeline | No way to measure quality. No way to detect regressions. |
| 6 | Index tuning | Latency spikes at scale. Memory costs explode. |
| 7 | Incremental updates | Full re-index on every document change. Hours of downtime. |
| 8 | Monitoring | Quality degrades silently. Nobody notices until a customer complains. |
Most teams have some version of 1, 4, and 6 in place. The other five are where production RAG lives or dies.
## Layer 1: Chunking is not `text.split()`
Fixed-token chunking is the default in every tutorial. It’s also the first thing that breaks.
A 1500-character window doesn’t respect paragraph boundaries. It doesn’t know that a heading introduces a new topic. It definitely doesn’t understand that table row 3 belongs with rows 1 and 2.
We use three strategies in sequence:
- Structural: Detect headings, numbered sections, document hierarchy. Split at semantic boundaries.
- Agentic: LLM identifies chunk boundaries with topic labels and keywords. Expensive, but catches what rules miss.
- Mechanical: Fallback for flat text. Paragraph-based with sentence detection.
The key insight: chunking quality sets the ceiling for everything downstream. Bad chunks mean bad retrieval means bad answers. You can’t fix this with a better reranker.
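The mechanical fallback can be sketched in a few lines. This is a simplified illustration, not the production implementation: it packs whole paragraphs into chunks under a size cap, where the real version also does sentence detection.

```python
def mechanical_chunks(text: str, max_chars: int = 1500) -> list[str]:
    """Fallback chunker: split on blank lines, then pack whole
    paragraphs into chunks without exceeding the size cap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)  # cap reached: start a new chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Note that a paragraph is never split mid-thought, which is exactly the property a fixed-token window lacks; a single oversized paragraph would still need sentence-level splitting.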
## Layer 2: The query is the problem
Users type short queries. “How do I cancel?” is 4 words. The actual answer lives in a paragraph about “contract termination procedures and notice periods.”
The cosine similarity between those two embeddings is low. Your vector DB returns the right document at position 7 instead of position 1. Or not at all.
Two approaches that work:
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer with an LLM, embed that instead. The fake answer is closer in embedding space to the real document than the original query was.
```
Query: "How do I cancel?"

HyDE generates: "To cancel your subscription, navigate to account settings and select the termination option. A 30-day notice period applies."

-> Embed this instead of the original 4-word query
```

Query Expansion: Generate 3 alternative phrasings. Search with all of them. Union the results.

```
Original:  "How do I cancel?"
Variant 1: "contract termination process"
Variant 2: "subscription cancellation steps"
Variant 3: "how to end service agreement"
```

Both add latency (one LLM call). Both measurably improve recall. The tradeoff is worth it for any query where precision matters more than speed.
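The union step for query expansion is simple to get right if you dedupe while preserving order. A sketch, where `generate_variants` and `search` are injected stand-ins (hypothetical signatures, not a specific library's API):

```python
def expand_and_search(query, generate_variants, search, k=10):
    """Query expansion: search with the original query plus generated
    rephrasings, then union the results (deduped, order-preserving)."""
    queries = [query] + generate_variants(query)
    seen, results = set(), []
    for q in queries:
        for doc_id, score in search(q, k):
            if doc_id not in seen:  # keep first occurrence only
                seen.add(doc_id)
                results.append((doc_id, score))
    return results
```

Searching with the original query first means expansion can only add candidates, never displace a direct hit.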
## Layer 3: Reranking separates signal from noise
Vector search returns the top 30 results. Maybe 8 are relevant. The LLM needs the best 5.
Without reranking, you’re passing noise to the LLM and hoping it figures it out. Sometimes it does. Sometimes it hallucinates from an irrelevant chunk that happened to score 0.001 higher.
A reranker takes the query and each candidate document, scores their relevance as a pair, and re-sorts. Cross-encoder models are more accurate than the original embedding similarity because they see query and document together, not separately.
The implementation pattern:
1. Vector search: top_k=30 (over-fetch)
2. Reranker scores each (query, doc) pair
3. Return top 5 by reranker score

This is the single highest-ROI improvement you can make to an existing RAG pipeline. If you do nothing else from this list, add a reranker.
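The pattern is small enough to sketch. Here `score_pair` is a stand-in for a cross-encoder call (e.g. a model that scores a (query, document) pair jointly); the function names are illustrative, not a specific library's API:

```python
def rerank(query, candidates, score_pair, top_k=5, over_fetch=30):
    """Over-fetch from vector search, rescore each (query, doc) pair
    with a cross-encoder, and return the best top_k."""
    pool = candidates[:over_fetch]
    scored = [(doc, score_pair(query, doc)) for doc in pool]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best first
    return [doc for doc, _ in scored[:top_k]]
```

The over-fetch matters: the reranker can only promote documents that vector search surfaced in the first place.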
## Layer 4: PDFs are not text files
A compliance document has tables, headers, page numbers, images with captions, and footnotes. Naive `pdftotext`-style extraction gives you a wall of characters where table columns are interleaved and headings are indistinguishable from body text.
The extraction quality determines everything. Garbage in, garbage out — no amount of downstream sophistication helps.
What actually works: dedicated PDF extraction (LlamaParse), vision-language models for scanned/image-heavy pages, and format-specific parsers for DOCX. The key is having fallback paths. When the structured parser fails, the VLM catches it.
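The fallback-path idea can be sketched as a priority chain. This is a generic illustration of the pattern, with hypothetical extractor names injected by the caller:

```python
def extract_with_fallback(doc_bytes, extractors):
    """Run extraction strategies in priority order (structured parser
    first, VLM last); fall through on exceptions or empty output."""
    failures = []
    for name, extract in extractors:
        try:
            text = extract(doc_bytes)
        except Exception as exc:
            failures.append((name, repr(exc)))
            continue
        if text and text.strip():
            return name, text  # first non-empty result wins
        failures.append((name, "empty output"))
    raise RuntimeError(f"all extractors failed: {failures}")
```

Returning which extractor succeeded is worth keeping: it tells you later why a given document's chunks look the way they do.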
## Layer 5: If you can’t measure it, you can’t improve it
This is the layer most teams skip entirely. “It seems to work” is not a quality metric.
Three metrics that matter:
| Metric | What it measures | How |
|---|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? | LLM judges claim-by-claim |
| Answer Relevancy | Does the answer address the question? | LLM scores 0-10 |
| Context Recall | Did retrieval find the right documents? | Compare retrieved vs ground truth |
Run these on a test set of 50-100 QA pairs. Automate it in CI. Every pipeline change gets a regression check.
Without this, you’re flying blind. You’ll ship a chunking change that improves one query type and silently breaks three others.
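Of the three metrics, context recall is mechanically computable (faithfulness and relevancy need an LLM judge). A sketch of the metric plus the CI gate, with illustrative names:

```python
def context_recall(retrieved_ids, ground_truth_ids):
    """Share of ground-truth documents that retrieval actually surfaced."""
    truth = set(ground_truth_ids)
    if not truth:
        return 1.0
    return len(truth & set(retrieved_ids)) / len(truth)

def regression_gate(test_set, retrieve, threshold=0.8):
    """CI check: average context recall over the QA pairs must not
    drop below the threshold. test_set is [(query, truth_ids), ...]."""
    scores = [context_recall(retrieve(q), truth) for q, truth in test_set]
    avg = sum(scores) / len(scores)
    return avg >= threshold, avg
```

Wire `regression_gate` into CI so a failing threshold blocks the pipeline change, the same way a failing unit test would.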
## Layer 6: Index type matters at scale
HNSW with float32 vectors works great at 10K documents. At 1M documents with 1024-dimensional vectors, you’re looking at ~4GB of memory just for the vectors.
Quantization changes the math:
| Format | Bits/dim | Memory (1M x 1024d) | Recall loss |
|---|---|---|---|
| F32 | 32 | 4.0 GB | baseline |
| SQ8 | 8 | 1.0 GB | ~1% |
| SQ4 | 4 | 0.5 GB | ~3% |
| SQ1 | 1 | 0.125 GB | ~3% at 97% recall |
SQ8 is the right default. The 1% recall loss is invisible in practice, and you get 4x memory savings.
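Scalar quantization is conceptually simple: map each float dimension onto 256 integer levels over the vector's own min..max range. A minimal per-vector sketch (real indexes typically calibrate the range over the whole corpus):

```python
def sq8_quantize(vec):
    """SQ8: map each float into an integer code in [0, 255].
    One byte per dimension instead of four (the 4x saving above)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale

def sq8_dequantize(codes, lo, scale):
    """Reconstruct approximate floats; error is bounded by the scale."""
    return [lo + c * scale for c in codes]
```

The reconstruction error per dimension is at most one quantization step, which is why the recall loss stays around 1% in practice.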
## Layer 7: Documents change
Your customer uploads a new version of their policy document. What happens?
Without incremental updates, you re-embed and re-index everything. For a large corpus, that’s hours.
The correct approach: content-hash each document on ingest. On update, compare hashes. Only re-process changed documents. Delete old vectors, insert new ones.
This sounds obvious. Most RAG frameworks don’t do it.
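The hash-diff step fits in one function. A sketch, assuming `incoming` is the full current corpus (so anything absent from it gets deleted); names are illustrative:

```python
import hashlib

def plan_updates(incoming, index_hashes):
    """Decide per document: skip (unchanged), upsert (new or changed),
    delete (gone). incoming: {doc_id: content};
    index_hashes: {doc_id: sha256 hex of last-indexed content}."""
    plan = {"skip": [], "upsert": [], "delete": []}
    for doc_id, content in incoming.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if index_hashes.get(doc_id) == digest:
            plan["skip"].append(doc_id)       # unchanged: no re-embed
        else:
            plan["upsert"].append((doc_id, digest))
    for doc_id in index_hashes:
        if doc_id not in incoming:
            plan["delete"].append(doc_id)     # removed from corpus
    return plan
```

Only the `upsert` list gets re-chunked and re-embedded; for a corpus where 1% of documents changed, that turns hours into seconds.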
## Layer 8: Quality degrades silently
Day 1: your RAG pipeline returns great results. Day 30: the customer added 500 new documents and query quality dropped 15%. Nobody noticed because nobody was measuring.
Search quality monitoring means tracking MRR, nDCG, and Hit@K on a representative query set, on a schedule. When metrics drop below a threshold, alert.
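Two of those metrics are a few lines each. A sketch over a labeled query set, with illustrative data shapes:

```python
def mrr(rankings, relevant):
    """Mean reciprocal rank of the first relevant doc per query.
    rankings: {query_id: [doc_id, ...]}; relevant: {query_id: set}."""
    total = 0.0
    for qid, ranking in rankings.items():
        for pos, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant[qid]:
                total += 1 / pos  # credit the first relevant hit
                break
    return total / len(rankings)

def hit_at_k(rankings, relevant, k=5):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for qid, ranking in rankings.items()
               if set(ranking[:k]) & relevant[qid])
    return hits / len(rankings)
```

Run these on a schedule against the live index, store the time series, and alert on a drop; the 15% regression above becomes a page instead of a customer complaint.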
This is the difference between “production RAG” and “demo RAG”.
## The real question
When someone says “we built RAG,” ask which of the eight layers they implemented.
Vector search alone is layer 6. That’s the storage layer. It’s necessary but not sufficient, the same way a database is necessary but not sufficient for a web application.
Production RAG is the full stack: extraction, chunking, query enhancement, retrieval, reranking, generation, evaluation, and monitoring. Skip any layer and you’ll find out in production, from your users, at the worst possible time.
## What we ship
At Schift, the Execution Pipeline handles layers 1-3 and 5-8 so your agent code stays clean. You write `schift.search(query)`. We handle the chunking strategy, query enhancement, reranking, evaluation, and monitoring behind that call.
The details, if you want them:
- Three-stage chunking (structural + agentic + mechanical)
- HyDE and query expansion via `enhance=hyde|expand`
- LLM and cross-encoder reranking via `rerank=true`
- LlamaParse + VLM extraction for multimodal documents
- Faithfulness / relevancy / recall metrics via `/v1/eval/run`
- SQ8-default quantized HNSW with sub-300 µs search
- Content-hash incremental upsert
- MRR / nDCG / Hit@K drift monitoring
All of this runs on the managed cloud. Or self-host it. The framework is open source.
The eight layers aren’t optional. They’re what makes RAG actually work.