Engineering
LongMemEval: 96% R@5 with our own stack
We ran the LongMemEval conversation memory benchmark on Schift Engine with our own embedding model. No ChromaDB, no external dependencies. Here's what worked and what didn't.
MemPalace recently posted 96.6% Recall@5 on LongMemEval, a benchmark that tests whether a retrieval system can find the right conversation session from a haystack of dozens. They use ChromaDB with all-MiniLM-L6-v2.
We wanted to know how our stack compares. Not ChromaDB. Not someone else’s embedding model. Schift Engine and schift-embed-1, end to end.
The result: 96.0% R@5 on pure vector search. Close, but not the point. The point is what we learned running seven different retrieval strategies on the same dataset.
The benchmark
LongMemEval gives you a question, a pile of conversation sessions (the haystack), and ground truth: which session contains the answer. The task is retrieval, not generation. Find the right session.
We sampled 100 questions balanced across six categories: knowledge-update, temporal-reasoning, multi-session, single-session-user, single-session-assistant, and single-session-preference.
For each question, we create a fresh collection in Schift Engine, embed the haystack sessions with schift-embed-1 via embed.schift.io, run the query, measure Recall@K and NDCG@10, then tear down the collection.
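The Recall@K and NDCG@10 numbers reported below follow the standard binary-relevance definitions. A minimal sketch (function names are ours, not part of the benchmark harness):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of ground-truth sessions found in the top-k results.
    For LongMemEval there is typically one relevant session per question."""
    hits = sum(1 for sid in ranked_ids[:k] if sid in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_10(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: log-discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, sid in enumerate(ranked_ids[:k]) if sid in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0
```

With one relevant session per question, R@K over the sample is just the fraction of questions whose correct session appears in the top K.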
What we tested
Seven modes. Same dataset, same engine, same embedding model. Only the retrieval strategy changes.
| Mode | What it does |
|---|---|
| vector | Pure vector search on user turns (L1) |
| L# Cache | 3-tier hierarchy: L0 (full session), L1 (user turns), L2 (lead turns). Weighted merge. |
| vector+NLI | Vector search + NLI-inferred graph edges + GraphSearch expansion |
| L#+NLI | L# Cache docs + NLI edges + GraphSearch |
| temporal | Vector search with Engine’s temporal filter (AS_OF question date) |
| 128d / 384d | Dimension ablation on pure vector mode |
The numbers
All runs use schift-embed-1 at 1024 dimensions unless noted.
| Mode | R@1 | R@5 | R@10 | NDCG@10 |
|---|---|---|---|---|
| L# Cache | 88% | 96% | 98% | 0.923 |
| vector | 85% | 96% | 98% | 0.904 |
| vector+NLI | 85% | 96% | 98% | 0.904 |
| vector (384d) | 79% | 94% | 99% | 0.880 |
| vector (128d) | 71% | 93% | 96% | 0.827 |
| L#+NLI (graph) | 80% | 93% | 96% | 0.866 |
| temporal | 82% | 93% | 95% | 0.874 |
The key number: L# Cache hits 88% R@1 versus 85% for plain vector. When you need the right answer in the first result, the hierarchy matters.
For comparison, MemPalace reports 96.6% R@5 with ChromaDB + all-MiniLM-L6-v2 (384d). Our pure vector search lands at 96.0%. The gap is 0.6 percentage points.
What L# Cache is
Three levels of the same conversation, stored as separate vectors with a level metadata tag:
- L0: Full session text (user + assistant turns). Maximum context, maximum noise.
- L1: User turns only. Strips assistant verbosity. This is what plain vector mode uses.
- L2: First three user turns. A zero-cost summary proxy. No LLM call.
At query time, we search each level independently, then merge with weights: 0.5 * L1 + 0.3 * L2 + 0.2 * L0. L1 carries the most weight because user intent is the strongest retrieval signal. L2 adds recall breadth. L0 breaks ties.
The storage cost is 3x vectors. The search cost is three filtered queries instead of one. In this benchmark with 45 sessions per question, the overhead is negligible. At scale with thousands of sessions, L2 becomes a coarse filter that avoids scanning the full L0/L1 index.
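The weighted merge itself is simple to sketch. Assuming each level's filtered search returns a session-id → cosine-similarity map (names hypothetical, sessions absent from a level contribute zero):

```python
def merge_levels(l0, l1, l2, weights=(0.2, 0.5, 0.3)):
    """Blend per-level similarity scores into one ranking.

    Default weights mirror the blend described above:
    0.5 * L1 + 0.3 * L2 + 0.2 * L0.
    """
    w0, w1, w2 = weights
    ids = set(l0) | set(l1) | set(l2)
    scores = {sid: w0 * l0.get(sid, 0.0)
                   + w1 * l1.get(sid, 0.0)
                   + w2 * l2.get(sid, 0.0)
              for sid in ids}
    # Highest blended score first
    return sorted(scores, key=scores.get, reverse=True)
```

Note that a session matching strongly on L1 (user intent) can outrank one matching only on full-session text, which is exactly the behavior behind the R@1 gain.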
What didn’t work
NLI-inferred graph edges. We used our fine-tuned NLI classifier (based on nli-deberta-v3-xsmall, 22M params, runs on CPU) to detect contradiction and entailment between session summaries, then created CONTRADICTS and SUPERSEDES edges in the Engine’s knowledge graph. GraphSearch expanded results along these edges.
The result was worse. R@5 dropped from 96% to 93%. The graph boost pulled in related-but-wrong sessions. On a 45-session haystack, the graph is too dense and the edges are too noisy to help.
This doesn’t mean graph search is useless. It means the signal-to-noise ratio matters. NLI on short lead-turn summaries produces edges that are semantically correct but not retrieval-relevant. A contradiction between two sessions doesn’t mean one is the answer to the question.
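A toy illustration of the failure mode: if every hit leaks a fraction of its score to its graph neighbors, related-but-wrong sessions climb the ranking. The `boost` factor here is illustrative, not our production value:

```python
def expand_along_edges(scores, edges, boost=0.15):
    """Propagate a fraction of each hit's score to its graph neighbors.

    scores: session_id -> vector score for the initial hits.
    edges:  session_id -> list of neighbor session_ids (e.g. sessions
            connected by CONTRADICTS / SUPERSEDES edges).
    On a dense, noisy graph every hit leaks score to neighbors that are
    semantically related but not answers to the question.
    """
    out = dict(scores)
    for sid, score in scores.items():
        for nbr in edges.get(sid, []):
            out[nbr] = out.get(nbr, 0.0) + boost * score
    return out
```

On a 45-session haystack where most summaries entail or contradict several others, this kind of expansion adds far more candidates than it disambiguates.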
Temporal filtering. Schift Engine supports temporal queries natively: TEMPORAL_AS_OF, TEMPORAL_BEFORE, TEMPORAL_BETWEEN. We used AS_OF(question_date) for temporal-reasoning questions, hoping to exclude future sessions and boost recency.
Temporal-reasoning R@5 dropped from 81% to 63%. The filter was too aggressive. Some ground-truth sessions had dates that didn’t align cleanly with the question date, so the correct session got filtered out entirely.
Temporal filtering needs a more nuanced approach: soft recency boosting rather than hard cutoffs.
Dimension tradeoffs
We tested four embedding dimensions from the same model (schift-embed-1 supports Matryoshka dimensions).
| Dim | R@1 | R@5 | R@10 | NDCG@10 |
|---|---|---|---|---|
| 2048 | 82% | 95% | 98% | 0.888 |
| 1024 | 85% | 96% | 98% | 0.904 |
| 384 | 79% | 94% | 99% | 0.880 |
| 128 | 71% | 93% | 96% | 0.827 |
1024d is the sweet spot. Higher dimensions didn’t help; the extra capacity likely captures noise rather than signal for this task. 384d is surprisingly competitive at R@10 (99%), suggesting the ranking is slightly different but the correct session is still nearby. 128d is viable for edge/on-device scenarios where you need 93% R@5 in a fraction of the memory.
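Using a Matryoshka dimension is just prefix truncation plus re-normalization, so smaller indexes reuse the same model output. A minimal sketch:

```python
import math

def truncate_embedding(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` components,
    then L2-renormalize so cosine similarity stays well-defined."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]
```

The 128d index is one-eighth the memory of 1024d, which is the tradeoff behind the edge/on-device point above.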
By question type
Where things break down.
| Type | R@5 (vector 1024d) | R@5 (L# Cache) |
|---|---|---|
| knowledge-update | 100% | 100% |
| multi-session | 100% | 100% |
| single-session-assistant | 100% | 100% |
| single-session-preference | 100% | 100% |
| single-session-user | 94% | 94% |
| temporal-reasoning | 81% | 81% |
Four categories are solved. The remaining gap is temporal-reasoning (81%) and single-session-user (94%). Temporal questions require understanding time sequences. Pure vector similarity doesn’t encode “before” and “after.” The temporal filter approach we tried was too blunt. A better path might be encoding temporal distance into the score at rerank time rather than filtering at search time.
What we’d do differently
- Soft temporal reranking instead of hard temporal filters. Multiply the vector score by a recency decay factor rather than excluding sessions entirely.
- Selective graph expansion. Instead of boosting all graph neighbors, only expand along SUPERSEDES edges for knowledge-update questions. The edge type should match the question type.
- Test at scale. The 45-session haystack is small. L# Cache and graph search are designed for thousands of sessions where brute-force vector search starts to degrade. This benchmark doesn’t stress that scenario.