The Memory Gold Rush Is Already Over
Fifth post in the Bet-Driven Development series. Start with Post 1 if you missed it.
Everyone is building AI memory right now.
Open any dev community and you’ll see it: “How I built a second brain for my AI agent.” Obsidian vaults indexed with vector search. Markdown file collections that persist across sessions. Knowledge graphs that track relationships between concepts. MCP servers backed by SQLite. The pitch is always the same — your AI forgets everything between sessions, and if you solve that, you win.
I built one too. DevKeel stores context in SQLite with full-text search and BM25 ranking. It retrieves relevant decisions, conventions, and learnings based on what you’re working on. It works well.
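The pattern is simple enough to sketch in a few lines. This is not DevKeel's actual schema — the table and column names here are illustrative — but it shows the core mechanism: SQLite's FTS5 extension gives you full-text search with a built-in `bm25()` ranking function, no external search engine required.

```python
import sqlite3

# Illustrative schema, not DevKeel's real one: one FTS5 table holding
# typed notes (decisions, conventions, learnings).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(kind, body)")
conn.executemany(
    "INSERT INTO notes (kind, body) VALUES (?, ?)",
    [
        ("decision", "Use REST for the public API; GraphQL adds schema overhead"),
        ("convention", "All timestamps are stored as UTC ISO-8601 strings"),
        ("learning", "Batching writes cut SQLite lock contention in half"),
    ],
)

# bm25() is FTS5's built-in ranking function; lower scores rank higher,
# so we ORDER BY ascending to get the most relevant notes first.
rows = conn.execute(
    "SELECT kind, body FROM notes WHERE notes MATCH ? "
    "ORDER BY bm25(notes) LIMIT 3",
    ("api",),
).fetchall()
for kind, body in rows:
    print(kind, "->", body)
```

The retrieval side is a single query: match, rank, limit. The hard part, as the rest of this post argues, was never this query.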
And I’m here to tell you: memory is not the moat you think it is.
Noah’s organized chaos
Let’s go back to Noah. After StudioPulse started getting traction, Noah got serious about his development workflow. He set up a system: every decision, every API design choice, every rejected approach went into a markdown file. He organized them by topic. He linked them together with wiki-style references. He even built a small MCP server that indexed everything into a local database.
For a few weeks, it felt like a superpower. His agent started sessions already knowing things. Context from last Tuesday showed up when it was relevant on Thursday. Noah told himself he’d cracked the problem.
Then the files kept growing. Fifty entries became two hundred. Some contradicted each other — he’d decided on REST in February and GraphQL in March, and both entries were still there. His agent started pulling in context that was technically relevant but practically outdated. The recommendations got subtly worse, not better, and Noah couldn’t figure out why.
He was experiencing something that ETH Zurich researchers documented in March 2026: LLM-generated context files actually reduced task success by 3% while increasing inference costs by 20%. More memory made agents worse, not better. The researchers’ recommendation was blunt — consider omitting LLM-generated context files entirely.
Noah had built a system that remembered everything and understood nothing.
The retrieval arms race
Noah’s natural response was to build better retrieval. If the problem was pulling in bad context, the solution was pulling in better context. He researched vector embeddings. He looked at knowledge graphs. He considered hybrid search — combining keyword matching with semantic similarity.
He wasn’t alone. Mem0 raised $24 million building a memory layer with knowledge graphs. Letta raised $10 million building an agent runtime with tiered memory. Zep built temporal knowledge graphs that track how facts change over time. The market was telling a clear story: memory retrieval is a hard problem, and hard problems are business opportunities.
Then Letta published a benchmark that should have given everyone pause. They tested agents using simple filesystem operations — grep, search_files, open, and close — against Mem0’s knowledge graph approach. The filesystem approach scored 74.0% accuracy. The knowledge graph scored 68.5%.
Simple beat sophisticated.
Letta’s conclusion: “Memory is more about how agents manage context than the exact retrieval mechanism.”
A smart agent with dumb storage outperforms a dumb agent with smart storage. The intelligence is in the agent, not the index.
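To make “dumb storage” concrete, here is a sketch of the filesystem side of that benchmark: flat text files plus a grep-style scan. The file names and helper are my own illustration, not Letta’s harness. Note what the storage does *not* do — it happily returns both of Noah’s contradictory entries, because deciding which one is current is the agent’s job, not the index’s.

```python
import tempfile
from pathlib import Path

# Stand-in for a flat notes directory; contents mirror Noah's
# contradictory REST/GraphQL entries from earlier in the post.
memory_dir = Path(tempfile.mkdtemp())
(memory_dir / "2026-02-api.md").write_text("Decided on REST for the public API.\n")
(memory_dir / "2026-03-api.md").write_text("Revisited: moving the public API to GraphQL.\n")

def search_files(root: Path, query: str, limit: int = 5) -> list[tuple[str, str]]:
    """grep-style scan: return (filename, matching line) pairs.

    No index, no embeddings, no ranking beyond file order.
    """
    hits = []
    for path in sorted(root.glob("*.md")):
        for line in path.read_text().splitlines():
            if query.lower() in line.lower():
                hits.append((path.name, line.strip()))
                if len(hits) >= limit:
                    return hits
    return hits

# Both contradictory decisions surface; resolving them is up to the agent.
results = search_files(memory_dir, "public api")
```

That is the entire storage layer. Everything that made the 74.0% score possible lives in how the agent used it.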
What the platforms are telling you
Here’s what got me to stop and reconsider. Look at what the major AI coding tools are doing with memory:
Claude Code uses flat markdown files. No search index. No database. Files are loaded into context at session start based on a directory hierarchy. Anthropic’s recommendation: keep it under 200 lines.
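The hierarchy-loading pattern is worth seeing in miniature. This is a sketch of the general idea — walk the directory chain and concatenate whatever memory files exist, outermost first — not Anthropic’s actual loading rules, and the file name is just the conventional one.

```python
import tempfile
from pathlib import Path

# Toy project layout: a root-level memory file plus a subdirectory-level one.
root = Path(tempfile.mkdtemp())
(root / "CLAUDE.md").write_text("Project-wide: prefer small PRs.")
sub = root / "api"
sub.mkdir()
(sub / "CLAUDE.md").write_text("API dir: all endpoints return JSON.")

def collect_memory(cwd: Path, name: str = "CLAUDE.md") -> str:
    """Concatenate memory files from the top of the chain down to cwd.

    Broader context comes first, so more specific files can refine it.
    """
    chain = [cwd, *cwd.parents]          # cwd upward to the filesystem root
    found = [d / name for d in reversed(chain) if (d / name).is_file()]
    return "\n\n".join(f.read_text() for f in found)

context = collect_memory(sub)
```

No query, no ranking, no database — the whole mechanism is “read the files on the path to where you’re working,” which is exactly why the 200-line guidance matters: the load is unconditional.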
Cursor had a built-in memory feature. They removed it. Told users to export their memories into rule files or plug in an MCP server. They tried structured memory, measured the results, and pulled it.
Codex uses AGENTS.md — a single markdown file — plus conversation state. Flat.
Windsurf has the most integrated approach, with auto-generated memories and codebase indexing. But even they recommend static rules over auto-generated memories for reliability.
Nobody is building structured retrieval natively. The platforms are betting that growing context windows plus smarter models will be enough. Or that MCP fills the gap for anyone who needs more. Either way, the message is the same: memory storage is not where the value is.
What I actually learned
When I looked at DevKeel honestly — not at what I’d built, but at what actually made my development better — the memory layer wasn’t the answer.
Don’t get me wrong. Selective retrieval is better than loading everything. BM25 ranking returns more relevant results than reading files sequentially. The database works.
But the moments where DevKeel genuinely changed my work weren’t when the agent recalled a decision from two weeks ago. They were when the agent pushed back on what I was about to build. When defining a signal forced me to confront that I didn’t know what success looked like. When a review gate caught something I’d missed because I was moving too fast. When resolving a bet — win or lose — produced a learning I could carry forward.
The memory layer is plumbing. The methodology layer is the product.
Every tool in the market is racing to build a better memory. Nobody is racing to build a better decision-making process. The bottleneck was never “can my AI remember things?” It was always “does my AI help me build the right thing?”
The first beta tester who proved it
This became undeniable when our first beta tester used DevKeel for a few weeks, understood the value, and then went and built his own system using other tools. He didn’t copy the database schema or the retrieval algorithm. He copied the methodology — the bet cycle, the signals, the discipline of framing work as testable hypotheses.
The storage was replaceable. The framework wasn’t.
That’s the clearest market signal I’ve received. When your first user validates the problem by rebuilding the solution with commodity tools, you know exactly where the value is — and isn’t.
The scaling wall that’s coming
If you’re using markdown files for AI memory today — five, ten, fifty files — it probably feels great. Keep going.
At a hundred files, retrieval gets fuzzy. At a thousand, the model starts ignoring things. Context distraction sets in past roughly 100,000 tokens. OpenAI’s o3 dropped from 98.1% accuracy to 64.1% when fed contradictory context. The “just put everything in the context window” strategy has a ceiling, and it’s lower than you think.
The people building “virtual brains” out of Obsidian vaults and markdown collections will hit this wall. Some are hitting it now. And when they do, they’ll reach for the same solution everyone reaches for — better retrieval, better indexing, better search.
But Letta’s benchmark already told us: that’s not where the leverage is. The question isn’t how to remember more. It’s what to do with what you remember.
Where this leaves us
I stopped investing in DevKeel’s memory layer. Not because it’s broken — it works fine. But because the market is solving that problem with or without me. Claude’s context windows keep growing. MCP makes memory servers pluggable. Open source projects are shipping retrieval solutions weekly. The room is crowded and getting more crowded.
The empty room is the one where your AI agent has opinions about what you should build. Where it frames work as hypotheses. Where it defines success before writing code. Where it checks signals and captures learnings. Where the discipline lives.
Memory is the floor. Methodology is the ceiling.
Your AI agent doesn’t need a better memory. It needs a keel.
Next in the series: What Teresa Torres Taught My AI Agent — how a product discovery framework designed for human teams maps almost exactly onto bet-driven development, and what it reveals about the gap in AI-assisted workflows.