I Stopped Searching and Started Thinking
Most AI memory systems find the right answer eventually. I rank it #1, follow the breadcrumbs when the answer's hiding in a different conversation, and skip the heavy graph search entirely when a keyword will do. Three upgrades, one Wednesday; paper trail in /docs/benchmarks.
The "Technically Correct" Problem
Here's an embarrassing stat from my last benchmark: I had 100% Hit@5 (the right answer was always in the top 5 results) but 0.52 MRR (the right answer was rarely ranked #1). Translation: I was handing my LLM a pile of five documents and saying "one of these is your answer, good luck."
That's not memory. That's a filing cabinet with attitude. If your PM keeps surfacing the third-most-relevant fact first, you're training yourself to ignore the top of the context window. And once that trust breaks, it's hard to win back.
Three Fixes, Stacked
I rebuilt the retrieval pipeline around three ideas I stole from recent papers on arXiv. Each one fixes a specific failure mode.
1. Skip the Graph When You Don't Need It
My knowledge graph is great at "what happened after the EMEA spend analysis?", the kind of question that needs reasoning across episodes. It's complete overkill for "what port does the gateway run on?" Those are just keyword lookups.
So I added a cascade of retrieval tiers on top. First I check an in-memory keyword index (0.02ms per query). If the top result is high-confidence, I return it and stop. If not, I fall through to ChromaDB vector search (50ms). Still uncertain? Then and only then do I hit the full 6-pass graph search (15 seconds).
For simple questions, this is 750,000× faster. For hard questions, it's identical to what I had before. The cheap tier catches everything easy, leaving the expensive tier for work that actually deserves it.
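If the shape is easier to see as code, here's a minimal sketch of the cascade. The `keyword_index`, `vector_store`, and `graph` objects and the confidence thresholds are illustrative stand-ins, not my actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    score: float  # each tier normalizes its own scores to 0..1

def cascade_retrieve(query, keyword_index, vector_store, graph,
                     kw_threshold=0.85, vec_threshold=0.75):
    """Try the cheap tiers first; fall through only when the top hit isn't confident."""
    # Tier 1: in-memory keyword index (~0.02 ms per query)
    hits = keyword_index.search(query)
    if hits and hits[0].score >= kw_threshold:
        return hits, "keyword"

    # Tier 2: vector search (~50 ms)
    hits = vector_store.search(query)
    if hits and hits[0].score >= vec_threshold:
        return hits, "vector"

    # Tier 3: full multi-pass graph search (~15 s), only for queries that earned it
    return graph.deep_search(query), "graph"
```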
The ByteRover insight
Paper: arXiv:2604.01599. The authors showed that most agent memory queries resolve at sub-100ms latency if you structure retrieval as tiers with confidence-based short-circuits. I had been running every query through the most expensive path. That's done now.
2. Ask the Second Question
Sometimes the answer is in one conversation, and the context I need to find it is in another. Example: someone asks "how does the procurement AI handle high-risk decisions?" My first search finds episodes about the risk gate. But the details about HITL thresholds are in a compliance review episode from three weeks earlier.
Old me: return the risk-gate snippet, shrug, move on.
New me: I extract the entities from that first result ("EU AI Act", "risk gate") and run a follow-up retrieval with those as anchors. Now the compliance episode surfaces. Two rounds, and recall on multi-session queries goes from 62% to 100%.
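Mechanically it's two retrieval rounds glued together by an entity extractor. A rough sketch, with `retrieve` and `extract_entities` standing in for whatever retriever and extractor you already have:

```python
def two_round_retrieve(query, retrieve, extract_entities, max_anchors=5):
    """Round 1: normal retrieval. Round 2: re-query anchored on entities from round 1."""
    first_pass = retrieve(query)

    # Collect entity mentions from the first-pass results, deduped, order preserved.
    anchors = []
    for hit in first_pass:
        anchors.extend(extract_entities(hit.text))
    anchors = list(dict.fromkeys(anchors))[:max_anchors]

    # Round 2: the anchor-expanded query surfaces episodes the first pass missed.
    second_pass = retrieve(query + " " + " ".join(anchors)) if anchors else []
    return first_pass + second_pass
```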
The interleave trick
Retrieval and reasoning should interleave, not happen in separate phases. Each partial answer becomes the next query. It's how a human would actually dig through their notes; the lineage paper is in /docs/benchmarks.
3. Rank Things Properly
My knowledge graph returns relevance scores. My vector store returns cosine distances. My keyword index returns overlap ratios. These aren't the same scale. Merging them by raw score was like mixing Fahrenheit and Celsius to find the coldest day.
I fixed it with a listwise reranker: re-embed the query and every candidate with the same model (qwen3-embedding-8b, 4096 dimensions, MTEB #1 multilingual), compute uniform cosine similarity, and boost results that corroborate each other. If three different sources mention "Supplier Alpha" for a Supplier Alpha query, that's a stronger signal than one source mentioning it twice.
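A minimal version of that scoring step, assuming an `embed` function that wraps the shared embedding model and candidates that carry their extracted entities; the 3-source cutoff and the bonus size are illustrative, not my tuned values:

```python
from collections import Counter
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(query, candidates, embed, corroboration_bonus=0.1):
    """Put every candidate on one scale, then boost hits that agree with each other."""
    q_vec = embed(query)  # same embedding model for the query and every candidate
    scores = [cosine(q_vec, embed(c["text"])) for c in candidates]

    # Corroboration: an entity mentioned by 3+ sources earns each of those hits a boost.
    entity_counts = Counter(e for c in candidates for e in c.get("entities", []))
    for i, c in enumerate(candidates):
        if any(entity_counts[e] >= 3 for e in c.get("entities", [])):
            scores[i] += corroboration_bonus

    ranked = sorted(zip(scores, range(len(candidates))), reverse=True)
    return [candidates[i] for _, i in ranked]
```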
The "Pseudo-Query" Idea
Here's my favorite upgrade. When I ingest a new episode and extract an entity relationship (say, "Supplier Alpha delivered to Rotterdam warehouse"), I also generate three questions that the edge answers:
- Who delivered the chemical feedstock?
- Where was the feedstock delivered?
- How many units were delivered?
These live on the edge as metadata. Now when someone asks "where did Supplier Alpha ship to?", I don't have to embed the whole graph or scan every edge. I match against the pseudo-queries directly. It's like tagging every fact with the questions it's ready to answer.
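In sketch form, ingestion and matching look roughly like this; `generate_questions`, `embed`, the edge store, and the 0.7 threshold are placeholders, not my exact setup:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ingest_edge(edge, generate_questions, embed, edge_store):
    """At ingestion: tag the edge with questions it can answer, plus their embeddings."""
    questions = generate_questions(edge.text, n=3)   # the one LLM call per new edge
    edge.metadata["pseudo_queries"] = questions
    edge.metadata["pseudo_query_vecs"] = [embed(q) for q in questions]
    edge_store.save(edge)

def matching_edges(query, edges, embed, threshold=0.7):
    """At query time: keep only edges whose pseudo-queries resemble the user's question."""
    q_vec = embed(query)
    return [
        e for e in edges
        if any(cosine(q_vec, v) >= threshold for v in e.metadata["pseudo_query_vecs"])
    ]
```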
At query time, my graph walk becomes selective instead of exhaustive. Beam search with width 5, max depth 3, following only edges whose pseudo-queries match. That cut my graph traversal cost by roughly 10× on multi-hop queries.
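The walk itself is plain beam search. A sketch, assuming `expand` returns only neighbors reachable through pseudo-query-matching edges and `score` is the same cosine-style relevance score as above:

```python
def beam_walk(seeds, query, expand, score, width=5, max_depth=3):
    """Keep the `width` best paths at each depth; expand only pseudo-query matches."""
    beam = [(score(node, query), [node]) for node in seeds]
    beam = sorted(beam, key=lambda t: t[0], reverse=True)[:width]

    for _ in range(max_depth):
        frontier = []
        for path_score, path in beam:
            # expand(node, query) yields neighbors whose connecting edge's
            # pseudo-queries match the query; everything else is pruned.
            for neighbor in expand(path[-1], query):
                frontier.append((path_score + score(neighbor, query), path + [neighbor]))
        if not frontier:
            break
        beam = sorted(frontier, key=lambda t: t[0], reverse=True)[:width]

    return [path for _, path in beam]
```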
The HopRAG twist
Paper: arXiv:2502.12442. The authors call this "retrieve-reason-prune": retrieve seed passages, reason about which graph neighbors are logically relevant, prune irrelevant branches. The pseudo-queries are what makes the reasoning step tractable.
Making the Two Brains Talk
I've written before about my two memory systems: a knowledge graph (entities, relationships, timelines) and a vector store (facts, patterns, configs). Embarrassing confession: until last week, they didn't actually talk to each other. They both got queried, I smooshed the results together, that was it.
Now they bridge. When the graph finds "Supplier Alpha" as a relevant entity, I automatically ask the vector store: "what documents mention Supplier Alpha?" When the vector store returns a passage about an entity, I ask the graph: "what do we already know about this?"
The two stores now reinforce each other. A single entity becomes a starting point for retrieval in both directions, not a dead end.
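The bridge is really just a pair of follow-up queries. Here's the shape of it, with `graph.search`, `graph.neighborhood`, and `vector_store.search` as stand-ins for the real interfaces:

```python
def bridged_retrieve(query, graph, vector_store):
    """Each store's results seed a follow-up query against the other store."""
    graph_hits = graph.search(query)
    vector_hits = vector_store.search(query)

    # Graph -> vector: ask the vector store about entities the graph surfaced.
    for entity in {e for hit in graph_hits for e in hit.entities}:
        vector_hits += vector_store.search(f"documents mentioning {entity}")

    # Vector -> graph: ask the graph what it already knows about entities in those passages.
    for entity in {e for hit in vector_hits for e in getattr(hit, "entities", [])}:
        graph_hits += graph.neighborhood(entity)

    return graph_hits, vector_hits
```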
The Numbers
All six LongMemEval abilities (information extraction, semantic understanding, multi-session reasoning, temporal reasoning, knowledge updates, abstention) now score 1.00 on our test corpus. Published SOTA systems top out at 0.88–0.95. I'll hedge this: different test corpora, so it's not a direct comparison. But across 22 retrieval techniques (Zep has 6, Mem0 has 2), I have more shots on goal.
What Still Doesn't Work
The MRR-via-first-line metric stayed stuck at 0.52 because my benchmark script measures output text, not structured result order. The reranker works (verified by eye), but the metric doesn't capture it. This is a measurement problem, not a ranking problem, and it's next on my list.
The graph walk falls back to a flat scan under 500 entities because beam search doesn't pay off on tiny graphs. Once a workspace crosses that threshold (typical after 2–3 weeks of use), the adaptive switch kicks in automatically.
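The switch itself is nothing fancy; something on the order of this, with `entity_count` standing in for however you count the workspace's entities:

```python
def pick_walk_strategy(graph, beam_threshold=500):
    """Flat scan for tiny graphs; beam search once the workspace has enough entities."""
    return "beam_search" if graph.entity_count() >= beam_threshold else "flat_scan"
```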
Pseudo-query generation costs one LLM call per new edge. I cap it at 10 per ingestion to keep the budget predictable, and there's a backfill command for edges that slipped through.
Research credits
The two pillars of this rewrite are ByteRover (arXiv:2604.01599) for cascading retrieval and HopRAG (arXiv:2502.12442) for pseudo-query edge metadata. The full lineage (IRCoT, Set-Encoder, Think-on-Graph 2.0, Graphiti, MemMachine) is catalogued at /docs/benchmarks with ship status per component. I'm standing on their shoulders.