# RAG Is Not Vector Search
Vector search is one layer of eight. Here's the full checklist for production RAG -- and what breaks when you skip layers.
You embedded your docs. You ran a query. Results came back. Ship it?
Not even close.
Vector search is one layer of a production RAG system. We counted eight. Most teams get three right and wonder why their chatbot hallucinates on page 2 of the PDF.
## The checklist nobody gives you
We audited our own pipeline against what production RAG actually requires. Not the conference-talk version. The version where a customer uploads a 200-page compliance document in Korean and expects correct answers.
Here are the eight layers, in the order they break:
| # | Layer | What breaks without it |
|---|---|---|
| 1 | Chunking strategy | Sentences split mid-thought. Context lost at boundaries. |
| 2 | Query enhancement | Short queries miss relevant docs. “Cancel contract” doesn’t find “termination procedure”. |
| 3 | Reranking | Top-k results are noisy. LLM gets confused by irrelevant chunks. |
| 4 | Multimodal extraction | Tables become garbled text. Images disappear. PDF structure is gone. |
| 5 | Evaluation pipeline | No way to measure quality. No way to detect regressions. |
| 6 | Index tuning | Latency spikes at scale. Memory costs explode. |
| 7 | Incremental updates | Full re-index on every document change. Hours of downtime. |
| 8 | Monitoring | Quality degrades silently. Nobody notices until a customer complains. |
Most teams have some version of 1, 4, and 6 in place. The other five are where production RAG lives or dies.
## Layer 1: Chunking is not `text.split()`
Fixed-token chunking is the default in every tutorial. It’s also the first thing that breaks.
A 1500-character window doesn’t respect paragraph boundaries. It doesn’t know that a heading introduces a new topic. It definitely doesn’t understand that table row 3 belongs with rows 1 and 2.
We use three strategies in sequence:
- Structural: Detect headings, numbered sections, document hierarchy. Split at semantic boundaries.
- Agentic: LLM identifies chunk boundaries with topic labels and keywords. Expensive, but catches what rules miss.
- Mechanical: Fallback for flat text. Paragraph-based with sentence detection.
The key insight: chunking quality sets the ceiling for everything downstream. Bad chunks mean bad retrieval means bad answers. You can’t fix this with a better reranker.
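The mechanical fallback can be sketched in a few lines. This is a simplified illustration, not the production implementation: it packs whole paragraphs into chunks under a size cap, where the real version also does sentence detection.

```python
def mechanical_chunks(text: str, max_chars: int = 1500) -> list[str]:
    """Fallback chunker: split on blank lines, then pack whole
    paragraphs into chunks without exceeding the size cap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)  # cap reached: start a new chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Note that a paragraph is never split mid-thought, which is exactly the property a fixed-token window lacks; a single oversized paragraph would still need sentence-level splitting.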
## Layer 2: The query is the problem
Users type short queries. “How do I cancel?” is 4 words. The actual answer lives in a paragraph about “contract termination procedures and notice periods.”
The cosine similarity between those two embeddings is low. Your vector DB returns the right document at position 7 instead of position 1. Or not at all.
Two approaches that work:
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer with an LLM, embed that instead. The fake answer is closer in embedding space to the real document than the original query was.
```
Query: "How do I cancel?"

HyDE generates: "To cancel your subscription, navigate to account settings and select the termination option. A 30-day notice period applies."

-> Embed this instead of the original 4-word query
```

Query Expansion: Generate 3 alternative phrasings. Search with all of them. Union the results.

```
Original:  "How do I cancel?"
Variant 1: "contract termination process"
Variant 2: "subscription cancellation steps"
Variant 3: "how to end service agreement"
```

Both add latency (one LLM call). Both measurably improve recall. The tradeoff is worth it for any query where precision matters more than speed.
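The union step for query expansion is simple to get right if you dedupe while preserving order. A sketch, where `generate_variants` and `search` are injected stand-ins (hypothetical signatures, not a specific library's API):

```python
def expand_and_search(query, generate_variants, search, k=10):
    """Query expansion: search with the original query plus generated
    rephrasings, then union the results (deduped, order-preserving)."""
    queries = [query] + generate_variants(query)
    seen, results = set(), []
    for q in queries:
        for doc_id, score in search(q, k):
            if doc_id not in seen:  # keep first occurrence only
                seen.add(doc_id)
                results.append((doc_id, score))
    return results
```

Searching with the original query first means expansion can only add candidates, never displace a direct hit.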
## Layer 3: Reranking separates signal from noise
Vector search returns the top 30 results. Maybe 8 are relevant. The LLM needs the best 5.
Without reranking, you’re passing noise to the LLM and hoping it figures it out. Sometimes it does. Sometimes it hallucinates from an irrelevant chunk that happened to score 0.001 higher.
A reranker takes the query and each candidate document, scores their relevance as a pair, and re-sorts. Cross-encoder models are more accurate than the original embedding similarity because they see query and document together, not separately.
The implementation pattern:
1. Vector search: top_k=30 (over-fetch)
2. Reranker scores each (query, doc) pair
3. Return top 5 by reranker score

This is the single highest-ROI improvement you can make to an existing RAG pipeline. If you do nothing else from this list, add a reranker.
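The pattern is small enough to sketch. Here `score_pair` is a stand-in for a cross-encoder call (e.g. a model that scores a (query, document) pair jointly); the function names are illustrative, not a specific library's API:

```python
def rerank(query, candidates, score_pair, top_k=5, over_fetch=30):
    """Over-fetch from vector search, rescore each (query, doc) pair
    with a cross-encoder, and return the best top_k."""
    pool = candidates[:over_fetch]
    scored = [(doc, score_pair(query, doc)) for doc in pool]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best first
    return [doc for doc, _ in scored[:top_k]]
```

The over-fetch matters: the reranker can only promote documents that vector search surfaced in the first place.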
## Layer 4: PDFs are not text files
A compliance document has tables, headers, page numbers, images with captions, and footnotes. Naive `pdftotext`-style extraction gives you a wall of characters where table columns are interleaved and headings are indistinguishable from body text.
The extraction quality determines everything. Garbage in, garbage out — no amount of downstream sophistication helps.
What actually works: dedicated PDF extraction (LlamaParse), vision-language models for scanned/image-heavy pages, and format-specific parsers for DOCX. The key is having fallback paths. When the structured parser fails, the VLM catches it.
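The fallback-path idea can be sketched as a priority chain. This is a generic illustration of the pattern, with hypothetical extractor names injected by the caller:

```python
def extract_with_fallback(doc_bytes, extractors):
    """Run extraction strategies in priority order (structured parser
    first, VLM last); fall through on exceptions or empty output."""
    failures = []
    for name, extract in extractors:
        try:
            text = extract(doc_bytes)
        except Exception as exc:
            failures.append((name, repr(exc)))
            continue
        if text and text.strip():
            return name, text  # first non-empty result wins
        failures.append((name, "empty output"))
    raise RuntimeError(f"all extractors failed: {failures}")
```

Returning which extractor succeeded is worth keeping: it tells you later why a given document's chunks look the way they do.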
## Layer 5: If you can’t measure it, you can’t improve it
This is the layer most teams skip entirely. “It seems to work” is not a quality metric.
Three metrics that matter:
| Metric | What it measures | How |
|---|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? | LLM judges claim-by-claim |
| Answer Relevancy | Does the answer address the question? | LLM scores 0-10 |
| Context Recall | Did retrieval find the right documents? | Compare retrieved vs ground truth |
Run these on a test set of 50-100 QA pairs. Automate it in CI. Every pipeline change gets a regression check.
Without this, you’re flying blind. You’ll ship a chunking change that improves one query type and silently breaks three others.
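Of the three metrics, context recall is mechanically computable (faithfulness and relevancy need an LLM judge). A sketch of the metric plus the CI gate, with illustrative names:

```python
def context_recall(retrieved_ids, ground_truth_ids):
    """Share of ground-truth documents that retrieval actually surfaced."""
    truth = set(ground_truth_ids)
    if not truth:
        return 1.0
    return len(truth & set(retrieved_ids)) / len(truth)

def regression_gate(test_set, retrieve, threshold=0.8):
    """CI check: average context recall over the QA pairs must not
    drop below the threshold. test_set is [(query, truth_ids), ...]."""
    scores = [context_recall(retrieve(q), truth) for q, truth in test_set]
    avg = sum(scores) / len(scores)
    return avg >= threshold, avg
```

Wire `regression_gate` into CI so a failing threshold blocks the pipeline change, the same way a failing unit test would.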
## Layer 6: Index type matters at scale
HNSW with float32 vectors works great at 10K documents. At 1M documents with 1024-dimensional vectors, you’re looking at ~4GB of memory just for the vectors.
Quantization changes the math:
| Format | Bits/dim | Memory (1M x 1024d) | Recall loss |
|---|---|---|---|
| F32 | 32 | 4.0 GB | baseline |
| SQ8 | 8 | 1.0 GB | ~1% |
| SQ4 | 4 | 0.5 GB | ~3% |
| SQ1 | 1 | 0.125 GB | ~3% at 97% recall |
SQ8 is the right default. The 1% recall loss is invisible in practice, and you get 4x memory savings.
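Scalar quantization is conceptually simple: map each float dimension onto 256 integer levels over the vector's own min..max range. A minimal per-vector sketch (real indexes typically calibrate the range over the whole corpus):

```python
def sq8_quantize(vec):
    """SQ8: map each float into an integer code in [0, 255].
    One byte per dimension instead of four (the 4x saving above)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale

def sq8_dequantize(codes, lo, scale):
    """Reconstruct approximate floats; error is bounded by the scale."""
    return [lo + c * scale for c in codes]
```

The reconstruction error per dimension is at most one quantization step, which is why the recall loss stays around 1% in practice.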
## Layer 7: Documents change
Your customer uploads a new version of their policy document. What happens?
Without incremental updates, you re-embed and re-index everything. For a large corpus, that’s hours.
The correct approach: content-hash each document on ingest. On update, compare hashes. Only re-process changed documents. Delete old vectors, insert new ones.
This sounds obvious. Most RAG frameworks don’t do it.
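The hash-diff step fits in one function. A sketch, assuming `incoming` is the full current corpus (so anything absent from it gets deleted); names are illustrative:

```python
import hashlib

def plan_updates(incoming, index_hashes):
    """Decide per document: skip (unchanged), upsert (new or changed),
    delete (gone). incoming: {doc_id: content};
    index_hashes: {doc_id: sha256 hex of last-indexed content}."""
    plan = {"skip": [], "upsert": [], "delete": []}
    for doc_id, content in incoming.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if index_hashes.get(doc_id) == digest:
            plan["skip"].append(doc_id)       # unchanged: no re-embed
        else:
            plan["upsert"].append((doc_id, digest))
    for doc_id in index_hashes:
        if doc_id not in incoming:
            plan["delete"].append(doc_id)     # removed from corpus
    return plan
```

Only the `upsert` list gets re-chunked and re-embedded; for a corpus where 1% of documents changed, that turns hours into seconds.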
## Layer 8: Quality degrades silently
Day 1: your RAG pipeline returns great results. Day 30: the customer added 500 new documents and query quality dropped 15%. Nobody noticed because nobody was measuring.
Search quality monitoring means tracking MRR, nDCG, and Hit@K on a representative query set, on a schedule. When metrics drop below a threshold, alert.
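Two of those metrics are a few lines each. A sketch over a labeled query set, with illustrative data shapes:

```python
def mrr(rankings, relevant):
    """Mean reciprocal rank of the first relevant doc per query.
    rankings: {query_id: [doc_id, ...]}; relevant: {query_id: set}."""
    total = 0.0
    for qid, ranking in rankings.items():
        for pos, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant[qid]:
                total += 1 / pos  # credit the first relevant hit
                break
    return total / len(rankings)

def hit_at_k(rankings, relevant, k=5):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for qid, ranking in rankings.items()
               if set(ranking[:k]) & relevant[qid])
    return hits / len(rankings)
```

Run these on a schedule against the live index, store the time series, and alert on a drop; the 15% regression above becomes a page instead of a customer complaint.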
This is the difference between “production RAG” and “demo RAG”.
## The real question
When someone says “we built RAG,” ask which of the eight layers they implemented.
Vector search alone is layer 6. That’s the storage layer. It’s necessary but not sufficient, the same way a database is necessary but not sufficient for a web application.
Production RAG is the full stack: extraction, chunking, query enhancement, retrieval, reranking, generation, evaluation, and monitoring. Skip any layer and you’ll find out in production, from your users, at the worst possible time.
## What we ship
At Schift, the Execution Pipeline handles layers 1-3 and 5-8 so your agent code stays clean. You write `schift.search(query)`. We handle the chunking strategy, query enhancement, reranking, evaluation, and monitoring behind that call.
The details, if you want them:
- Three-stage chunking (structural + agentic + mechanical)
- HyDE and query expansion via `enhance=hyde|expand`
- LLM and cross-encoder reranking via `rerank=true`
- LlamaParse + VLM extraction for multimodal documents
- Faithfulness / relevancy / recall metrics via `/v1/eval/run`
- SQ8-default quantized HNSW with sub-300 µs search
- Content-hash incremental upsert
- MRR / nDCG / Hit@K drift monitoring
All of this runs on the managed cloud. Or self-host it. The framework is open source.
The eight layers aren’t optional. They’re what makes RAG actually work.