RAG Is a Memory Problem, Not Search

Every few weeks, someone on my team or a client's team asks the same question: "Should we use GraphRAG or Self-RAG? What about HyDE? We heard Corrective RAG handles hallucinations better."

I've sat in this meeting a dozen times. Smart engineers with whiteboards full of retrieval pipeline diagrams, benchmarking embedding models, A/B testing chunk sizes. All of it productive-looking. None of it asking the right question.

And if the meetings weren't enough, your LinkedIn feed is full of armchair retrieval architects who've never instantiated a vector store, screaming that GraphRAG is "the future" and Self-RAG is "dead," speedrunning every acronym in the RAG cinematic universe to farm impressions. The only thing they're retrieving is engagement.

The RAG conversation has become a framework selection exercise. Pick your variant, wire it up, ship it, tune until the eval metrics stop embarrassing you. But after spending years architecting production AI infrastructure — including systems that serve real-time financial intelligence across multi-agent pipelines — I've come to believe the entire framing is wrong.

RAG isn't a search problem. It's a memory management problem. And until you see it that way, you're optimizing the wrong layer.

The Taxonomy Trap

The RAG ecosystem has exploded into a zoo of acronyms. GraphRAG, Self-RAG, HyDE, Corrective RAG, Adaptive RAG, Agentic RAG, RAG-Fusion, Speculative RAG, LongRAG, Cache-Augmented Generation. Each paper positions itself as a meaningful advance over vanilla RAG. Each one gets a Medium post and a LangChain integration.

But look at what each variant actually does, not what it calls itself. They're all answering the same underlying question: what information gets loaded into the LLM's limited working memory, and what stays outside of it?

That question has nothing to do with search. It has everything to do with memory management.

Vanilla RAG is the simplest page fault handler imaginable. A query comes in, you look up the closest vectors, you load those chunks into the context window. Demand paging with a nearest-neighbor lookup. Works fine until your corpus gets complex enough that "nearest vectors" stops correlating with "actually relevant."
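To make the demand-paging picture concrete, here is a minimal sketch. It assumes toy three-dimensional vectors in place of real embeddings, and the names `cosine` and `page_in` are hypothetical, not from any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def page_in(query_vec, index, k=2):
    """Return the ids of the k chunks closest to the query (the 'pages' to load)."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]

# Toy index: chunk id -> made-up embedding (illustrative values only).
index = {
    "q3_earnings": [0.9, 0.1, 0.0],
    "q2_earnings": [0.8, 0.2, 0.1],
    "office_policy": [0.0, 0.1, 0.9],
}
loaded = page_in([1.0, 0.0, 0.0], index, k=2)
print(loaded)
```

The policy here is entirely "closest vector wins." Nothing in it knows whether the chunk is current, authoritative, or even related to the question beyond embedding geometry.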

GraphRAG swaps the flat vector index for a graph structure: entities, relationships, community summaries. A query arrives, you traverse the graph for relevant subgraphs, load those into context. That's not a better search. That's a different page replacement policy, one where "relevance" means structural proximity in a knowledge graph instead of cosine similarity in embedding space.
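A rough sketch of the traversal half of that idea, assuming a tiny hand-written adjacency list and a hypothetical `neighborhood` helper. Real GraphRAG pipelines also build community summaries, which this leaves out.

```python
from collections import deque

def neighborhood(graph, seeds, max_hops=1):
    """Breadth-first expansion: every node within max_hops of any seed entity."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

# Illustrative entity graph (made-up names).
graph = {
    "AcmeCorp": ["Q3 earnings", "Jane Doe"],
    "Jane Doe": ["AcmeCorp", "BetaFund"],
    "Q3 earnings": ["AcmeCorp"],
    "BetaFund": ["Jane Doe"],
}
one_hop = neighborhood(graph, ["AcmeCorp"], max_hops=1)
two_hop = neighborhood(graph, ["AcmeCorp"], max_hops=2)
```

The hop limit plays the role that top-k plays in vector search: it decides how much of the neighborhood gets paged in.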

HyDE generates a hypothetical answer first, then uses that embedding to retrieve real documents. Speculative prefetching. The system predicts what the working set will look like and pre-loads accordingly. Clever, but the speculative answer is generated without ground truth. It prefetches based on the model's priors, not reality. For anything time-sensitive — earnings data, policy changes, market signals — those priors are stale on arrival.
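Here is a small sketch of that prefetch loop. Everything in it is a stand-in: `fake_llm` returns a canned guess instead of calling a model, and `embed` is a bag-of-words counter rather than a real embedding model.

```python
from collections import Counter
import math

def embed(text):
    """Stand-in embedder: bag-of-words counts (a real system would call an embedding model)."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_llm(question):
    """Stand-in for the model's guess; it reflects priors, not ground truth."""
    return "revenue grew because cloud subscriptions increased"

def hyde_retrieve(question, docs, k=1):
    hypothetical = fake_llm(question)   # step 1: speculate an answer
    probe = embed(hypothetical)         # step 2: embed the guess, not the question
    ranked = sorted(docs, key=lambda d: similarity(probe, embed(d)), reverse=True)
    return ranked[:k]                   # step 3: fetch real documents near the guess

docs = [
    "cloud subscriptions increased and revenue grew in the third quarter",
    "the office cafeteria menu changes on mondays",
]
top = hyde_retrieve("why did sales go up?", docs)
```

In this toy setup the question shares no words with either document, so a direct lookup has nothing to match on. The guess bridges the vocabulary gap, which is the whole appeal, and also the risk: a wrong guess prefetches the wrong pages.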

Self-RAG adds a reflection step. The model decides whether retrieval is needed at all, then critiques whether the retrieved chunks are relevant and whether they actually support the answer. A validity check before committing a page to cache. Keeps the working set clean, but costs you extra inference work per cycle.
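A deliberately crude sketch of the validity check. In Self-RAG the critique comes from the model itself; here the critic is a hypothetical word-overlap function (`supports`) so the example stays runnable without a model.

```python
def supports(chunk, answer, min_overlap=2):
    """Stand-in critic: does the chunk share at least min_overlap words with the answer?"""
    shared = set(chunk.lower().split()) & set(answer.lower().split())
    return len(shared) >= min_overlap

def reflect(chunks, answer, critic=supports):
    """Keep only the chunks the critic judges as supporting the generated answer."""
    return [c for c in chunks if critic(c, answer)]

chunks = [
    "acme revenue rose eight percent in q3",
    "the cafeteria is closed on fridays",
]
answer = "acme revenue rose in q3"
kept = reflect(chunks, answer)
```

Swap in any critic you like. The point is the shape: nothing stays in the working set without passing a check.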

Cache-Augmented Generation skips retrieval entirely by pre-computing the entire knowledge base into the model's KV cache at startup. Pin your entire working set in RAM. Fast, zero retrieval latency, but only works when your corpus fits in the context window. And if the knowledge base changes frequently, the cache goes stale and has to be recomputed.

Corrective RAG evaluates retrieval quality before generation and falls back to web search if the retrieved documents score too low. A cache miss handler with a fallback to a slower, broader storage tier.
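The miss-handler shape can be sketched in a few lines. The grader (`word_overlap`), the two stores, and the 0.5 threshold are all assumptions for illustration; a real Corrective RAG setup would use a trained retrieval evaluator and an actual web search.

```python
def corrective_retrieve(query, primary, fallback, score, threshold=0.5):
    """Use the primary store; if its best hit scores below threshold, fall back."""
    hits = primary(query)
    if hits and max(score(query, h) for h in hits) >= threshold:
        return "primary", hits
    return "fallback", fallback(query)

def word_overlap(query, doc):
    """Stand-in grader: fraction of query words present in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q) if q else 0.0

def local_store(query):
    """Hypothetical fast tier that always returns the same chunk."""
    return ["acme q3 revenue rose eight percent"]

def web_search(query):
    """Hypothetical slower, broader tier."""
    return ["(result from a slower, broader tier)"]

tier_a, _ = corrective_retrieve("acme q3 revenue", local_store, web_search, word_overlap)
tier_b, _ = corrective_retrieve("beta fund merger rumor", local_store, web_search, word_overlap)
```

The threshold is the policy knob. Set it too low and bad pages get loaded; set it too high and every request pays the slow-tier latency.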

Every one of these is a policy decision about memory management wearing a retrieval costume.

The Reframe: Your Context Window Is RAM

Andrej Karpathy has been making this case since 2023. First in a post on X, then in his "Intro to Large Language Models" talk: the LLM is the CPU, the context window is RAM, tools are system calls. At Y Combinator's AI Startup School, he formalized it further. LLM as CPU, context window as RAM, prompting as programming. The Software 3.0 thesis.

The UC Berkeley team behind MemGPT (now Letta) took it further and actually built the virtual memory layer. Main context as RAM, recall storage as disk, archival storage as cold storage, function calls as memory management operations. Their agents outperform naive approaches on document analysis and multi-session conversations precisely because they manage memory explicitly instead of hoping the context window sorts itself out.

What neither Karpathy's framing nor MemGPT's implementation does is connect this back to RAG. Karpathy gives us the architecture diagram. MemGPT gives us the paging mechanism. But nobody has asked: if retrieval is memory management, then what are all these RAG variants actually doing in OS terms? That's where the taxonomy above starts to matter, not as a framework selection guide, but as a field guide to page replacement policies.

Your LLM's context window is RAM. Fixed, expensive, fast-access working memory. Everything the model can reason about in a single inference pass lives here. Hard capacity limit, whether that's 8K, 128K, or 1M tokens. Every token you put in displaces something else that might have been more relevant.

Your document corpus is disk. Large, slow to access, organized by some indexing scheme that may or may not match the access patterns of your actual workload.

Your retrieval pipeline is the page fault handler. The model needs information that isn't in context, a fault fires, and the retrieval system decides what to load from disk into RAM.

Belady's optimal algorithm, described in 1966, sets the ceiling: the best possible page replacement policy is to evict the page that won't be needed for the longest time in the future. You can't implement this in practice because it requires knowing the future. So real operating systems approximate it: LRU evicts the least recently used page, LFU the least frequently used, clock algorithms sweep a circular list of reference bits. All approximations of the same ideal.
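To see the gap between the ideal and an approximation, here is a small simulation that counts page faults for Belady's policy and for LRU on the same reference string. The function names and the reference string are my own choices for illustration; the Belady version cheats by reading the full future sequence, which is exactly what a live system cannot do.

```python
from collections import OrderedDict

def belady_faults(refs, frames):
    """Count faults for Belady's policy: evict the page whose next use is farthest away."""
    memory, faults = set(), 0
    for i, page in enumerate(refs):
        if page in memory:
            continue
        faults += 1
        if len(memory) == frames:
            future = refs[i + 1:]
            def next_use(p):
                # Pages never used again sort as infinitely far away.
                return future.index(p) if p in future else float("inf")
            memory.remove(max(memory, key=next_use))
        memory.add(page)
    return faults

def lru_faults(refs, frames):
    """Count faults for LRU: evict the page that was used least recently."""
    memory, faults = OrderedDict(), 0
    for page in refs:
        if page in memory:
            memory.move_to_end(page)      # mark as most recently used
            continue
        faults += 1
        if len(memory) == frames:
            memory.popitem(last=False)    # drop the least recently used page
        memory[page] = True
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(belady_faults(refs, 3), lru_faults(refs, 3))  # prints "7 10"
```

Same workload, same capacity, 7 faults versus 10. The difference is purely policy, which is the argument for treating context loading as a policy problem.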

RAG systems have the same problem. When the context window fills up and new information needs to come in, something has to go. Most systems handle this by truncating the oldest turns, which is FIFO page replacement, or by dropping the lowest-scoring chunks. Some just stuff everything in and hope the model figures it out. Simple. Often wrong.
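A toy comparison of two eviction policies under a token budget. The items, token counts, and relevance scores are invented for illustration, and `evict_fifo` and `evict_lowest_score` are hypothetical helpers.

```python
def evict_fifo(items, budget):
    """Drop the oldest items until the total token count fits the budget."""
    kept = list(items)
    while kept and sum(i["tokens"] for i in kept) > budget:
        kept.pop(0)
    return kept

def evict_lowest_score(items, budget):
    """Drop the lowest-scoring items first, preserving the original order of the rest."""
    kept = list(items)
    while kept and sum(i["tokens"] for i in kept) > budget:
        kept.remove(min(kept, key=lambda i: i["score"]))
    return kept

context = [
    {"id": "user_goal", "tokens": 40, "score": 0.9},   # oldest, but still essential
    {"id": "small_talk", "tokens": 60, "score": 0.1},
    {"id": "tool_output", "tokens": 50, "score": 0.7},
]
fifo = [i["id"] for i in evict_fifo(context, budget=100)]
scored = [i["id"] for i in evict_lowest_score(context, budget=100)]
```

FIFO throws away the user's original goal because it happens to be oldest. Score-based eviction keeps it, but only if the scores are any good, which moves the hard problem into the scoring function.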

Can we build a better approximation of Belady's for context windows? I think so, and it involves knowledge graphs in a way that most GraphRAG implementations haven't explored. More on that in the next post.

What Changes When You See It This Way

Once you adopt the memory management lens, you stop evaluating RAG variants as point solutions and start designing a coherent memory hierarchy. The question becomes: what does our memory architecture look like end to end?

Think of it in four tiers.

The system prompt is your L1 cache. Fastest, most persistent. It loads on every inference call and never gets evicted. Whatever you put here had better be worth the token cost, because it competes directly with the working context for every single request. Most teams stuff their system prompts with boilerplate instructions that could be retrieved on demand. That's wasting your most expensive memory tier on low-value content.

Tool results and recent conversation turns are your L2 cache. Short-lived, high relevance, scoped to the current session. This is your working set for the active reasoning thread. Recent tool outputs, the last few turns, any artifacts the model just generated. When turns go stale, you summarize or evict them. Don't retain verbatim.

Retrieved chunks are main memory. This is where your RAG pipeline operates. The quality of your retrieval strategy determines how well this tier serves the current reasoning needs. A bad retrieval policy here is like running a database with a terrible cache hit rate. Technically functional, but burning cycles on irrelevant I/O.

The full corpus is disk. Everything you've ingested. The structure of this tier, whether it's a flat vector store, a knowledge graph, a relational database, or a markdown wiki, determines what retrieval strategies are even possible. A flat chunk store limits you to vector similarity. A knowledge graph opens up structural traversal. The storage layer constrains the memory management policy.

Most teams design these tiers independently. One person writes the system prompt. Another builds the retrieval pipeline. Conversation management is an afterthought. Nobody is thinking about how information flows between layers or what the eviction policy should be at each boundary.

That's the architectural gap. Not "which RAG variant." Who manages the full hierarchy.
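As a rough sketch of what a single owner for the in-context tiers could look like: one assembler that applies a token budget per tier and reports usage. The `Tier` class, the budgets, and the token counts are all hypothetical, and the corpus tier stays outside because it never enters the context window directly.

```python
from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    budget: int                                  # max tokens this tier may occupy
    items: list = field(default_factory=list)    # (text, tokens), highest priority first

    def admit(self):
        """Return the items that fit this tier's budget, in priority order."""
        used, admitted = 0, []
        for text, tokens in self.items:
            if used + tokens <= self.budget:
                admitted.append(text)
                used += tokens
        return admitted, used

def build_context(tiers):
    """Assemble one prompt from every in-context tier and report token use per tier."""
    parts, usage = [], {}
    for tier in tiers:
        admitted, used = tier.admit()
        parts.extend(admitted)
        usage[tier.name] = used
    return parts, usage

tiers = [
    Tier("system_prompt", 50, [("You are a financial analyst.", 8)]),
    Tier("session", 100, [("latest tool output", 60), ("older turn", 70)]),
    Tier("retrieved", 120, [("chunk A", 80), ("chunk B", 30), ("chunk C", 40)]),
]
parts, usage = build_context(tiers)
```

The interesting design work is everything this sketch leaves out: who sets the budgets, how priority is computed, and what happens to the items that did not fit.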

The Missing Piece

Nobody has built the complete operating system for this yet. That keeps nagging at me.

We have individual components. Vector databases are mature. Knowledge graph tooling is getting better, slowly. Agentic frameworks can orchestrate multi-step retrieval. But the orchestration layer — the thing that watches the current reasoning thread, predicts what's needed next, loads it proactively, evicts what's gone stale — is still duct-taped together with prompt engineering and hardcoded retrieval calls.

In operating systems, that orchestration layer is the kernel's virtual memory manager. One of the most performance-critical subsystems in the entire stack. Took decades of research to get right.

For LLM systems, we're still manually swapping memory segments in and out and calling it good enough.

There is one architectural primitive that I think gets us closer, and it's knowledge graphs. Not in the way GraphRAG currently uses them. The insight has to do with graph distance as a semantic proximity metric and its relationship to a well-known result from the paging literature. I'll unpack that in the next post.

The Takeaway

If your team is sitting in a meeting debating which RAG variant to adopt, you're having the wrong meeting.

The question isn't "GraphRAG or Self-RAG?" The question is: what does our memory hierarchy look like, who manages the boundaries between tiers, and what's our eviction policy when the context window fills up?

Get that architecture right and the retrieval strategy becomes an implementation detail. Get it wrong and no amount of framework-hopping will fix your production hallucination rate.

Stop optimizing retrieval. Start designing memory.

CogniArk designs and operates cloud infrastructure for AI-native companies — model serving, multi-agent pipelines, platform engineering, and FinOps on AWS.