Agentic RAG is the New Baseline: Context Engineering Shifts from Component Hacks to Full System Design

The speed of innovation in applied AI has collapsed the timeline for new disciplines, forcing practitioners to move from prototype to production almost overnight. For context engineering, the process of reliably supplying large language models (LLMs) with necessary external information, 2025 felt like "six months compressed into a year." This rapid evolution is driving a fundamental shift in focus: away from optimizing individual components and toward establishing robust, end-to-end system architectures capable of operating at enterprise scale.

This was the core insight delivered by Nina Lopatina, Lead Developer Advocate at Contextual AI, who spoke with Swyx, Editor of Latent Space, live at NeurIPS 2025. Lopatina, whose background spans neuroscience and reward learning, highlighted the industry’s scramble to turn context engineering from a collection of design patterns into a full-stack discipline, complete with benchmarks and tooling designed for real-world complexity.

The most immediate change observed in the field is the obsolescence of basic Retrieval-Augmented Generation (RAG). Simple retrieval is no longer sufficient for complex enterprise queries. Lopatina confirmed that "agentic RAG is now the baseline: query reformulation into subqueries improved performance so dramatically it became the new standard (normal RAG is dead)." This shift reflects the necessity of having the LLM dynamically break down a user’s complex query into multiple, targeted subqueries, retrieve diverse documents, and then synthesize the answer—a process that demands sophisticated control flow and robust infrastructure.
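The agentic loop described above can be sketched in a few lines. Everything here is illustrative: `call_llm` and `search_index` are hypothetical stand-ins for a real model API and a real retriever, stubbed out so the control flow is visible.

```python
# Minimal sketch of agentic RAG: the LLM decomposes the query into
# subqueries, each subquery retrieves independently, and the results
# are synthesized into one answer.

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; fakes the decomposition step.
    if prompt.startswith("Decompose"):
        return "revenue 2024; revenue 2023"
    return "Synthesized answer from: " + prompt

def search_index(subquery: str, k: int = 3) -> list[str]:
    # Stand-in for a dense/hybrid retriever over the document store.
    return [f"doc about {subquery} #{i}" for i in range(k)]

def agentic_rag(query: str) -> str:
    # 1. Reformulate the complex query into targeted subqueries.
    plan = call_llm(f"Decompose into subqueries: {query}")
    subqueries = [s.strip() for s in plan.split(";")]
    # 2. Retrieve diverse documents per subquery.
    docs = [d for sq in subqueries for d in search_index(sq)]
    # 3. Synthesize a final answer grounded in the retrieved set.
    return call_llm("\n".join(docs))

answer = agentic_rag("How did revenue change from 2023 to 2024?")
```

The control flow, not the stubs, is the point: the model decides how to split the question, and retrieval happens once per subquery rather than once per user query.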

Yet, introducing agency necessitates strict guardrails. The industry is quickly learning that autonomous agents require explicit constraints to maintain reliability and performance at scale. During a recent Retail Universe hackathon, Lopatina’s team worked with a dataset comprising nearly 100,000 documents, including PDFs, CSVs, and log files—a real-world data landscape far removed from academic toy examples. They found that sub-agents needed defined turn limits and validation loops because "unlimited agency degrades performance and causes hallucinations." The inherent drive of an agent to exhaustively search every possible avenue or continuously check its own work quickly becomes an anti-pattern when dealing with massive, production-scale datasets.
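The guardrails Lopatina describes amount to a bounded loop with an exit condition. A minimal sketch, where `run_step` and `validate` are hypothetical placeholders for a real agent step and a real output validator:

```python
# Constraining a sub-agent with a hard turn limit and a validation
# loop, per the observation that unlimited agency degrades performance.

MAX_TURNS = 5  # hard cap on agency

def run_step(state: dict) -> dict:
    # Stand-in for one agent turn (tool call, reasoning, draft answer).
    state["turns"] += 1
    state["answer"] = f"draft-{state['turns']}"
    return state

def validate(state: dict) -> bool:
    # A real validator might check citations or schema conformance;
    # here we accept the third draft for illustration.
    return state["turns"] >= 3

def constrained_agent(task: str) -> str:
    state = {"task": task, "turns": 0, "answer": ""}
    while state["turns"] < MAX_TURNS:
        state = run_step(state)
        if validate(state):        # stop as soon as output passes
            return state["answer"]
    return state["answer"]         # best effort at the cap

result = constrained_agent("summarize log files")
```

The cap matters on both ends: the agent cannot exhaustively search a 100,000-document corpus, and it cannot re-check its own work indefinitely.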

Scaling these systems introduces challenges that researchers are only beginning to quantify. The issue of context rot—where models ignore relevant information buried deep within long context windows—is universally acknowledged, yet concrete, actionable data remains scarce. Lopatina noted that "context rot is cited in every blog but industry benchmarks at real scale (100k+ documents, billions of tokens) are still rare." Anthropic’s recent work, which put hard numbers on the problem—showing retrieval dropping to 30% when relevant context is placed at 700k tokens in a 1M window—is finally making the problem quantifiable and forcing developers to be intentional about context placement and compression.

The emergence of Model Context Protocol (MCP) servers, which let developers register and discover tools via large JSON schemas, has been a double-edged sword. While MCP servers accelerate rapid prototyping by abstracting tool management, they also contribute significantly to context bloat, consuming precious tokens merely to describe the available functions. The long-term trend is optimization: moving away from verbose schemas and toward leaner, direct API calls once the system design is validated.
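The token cost of the tradeoff is easy to make concrete. In this sketch the tool schema is a made-up example and tokens are approximated by whitespace-split words, but the asymmetry it shows is the real one: every registered tool pays its schema cost inside the context window on every request, while a direct API call pays that cost once, in code.

```python
import json

# Context-bloat tradeoff: a verbose MCP-style tool schema vs. a lean
# direct call. Token counts approximated as whitespace-split words.

verbose_schema = {
    "name": "get_weather",
    "description": ("Fetch the current weather for a given city, "
                    "returning temperature, humidity, and conditions."),
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string",
                     "description": "City name, e.g. 'Paris'"},
            "units": {"type": "string",
                      "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

lean_call = "get_weather(city, units='celsius')"

schema_tokens = len(json.dumps(verbose_schema).split())
lean_tokens = len(lean_call.split())
# The schema is an order of magnitude more expensive per request,
# and a server may register dozens of such tools.
```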

Instruction-following re-rankers are becoming critical components of high-performance pipelines. These smaller, specialized models sit between the initial dense retrieval phase and the final context window supplied to the LLM, ensuring that while initial retrieval prioritizes high recall, the final context window maintains high precision. This is particularly important for dynamic agents reasoning over large databases.
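The recall-then-precision handoff can be sketched as a two-stage pipeline. Here the first pass and the `score` function are crude term-overlap proxies; a real instruction-following re-ranker is a cross-encoder conditioned on both the query and the instruction.

```python
# Two-stage pipeline: a high-recall first pass, then a re-ranker that
# keeps only the most instruction-relevant documents for the context
# window. Scoring is a toy proxy for a real cross-encoder.

def first_pass_retrieve(query: str, corpus: list[str],
                        k: int = 10) -> list[str]:
    # Recall-oriented: keep any doc sharing a term with the query.
    words = set(query.lower().split())
    return [d for d in corpus if words & set(d.lower().split())][:k]

def rerank(query: str, instruction: str, docs: list[str],
           top_n: int = 3) -> list[str]:
    # Precision-oriented: rescore candidates against query + instruction.
    terms = set((query + " " + instruction).lower().split())
    def score(doc: str) -> int:
        return len(terms & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:top_n]

corpus = [
    "quarterly revenue report 2024",
    "employee handbook",
    "revenue forecast and guidance for 2024",
]
candidates = first_pass_retrieve("revenue 2024", corpus)
final_context = rerank("revenue 2024", "prefer official reports",
                       candidates)
```

The design choice is the split itself: the first stage is allowed to over-retrieve cheaply, and the re-ranker spends its smaller, more expensive model budget only on the candidates that survive.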

A key optimization technique for multi-turn agents is the strategic management of the Key-Value (KV) cache. The decision-making framework is simple: stuff that doesn’t change (like the system prompt or early conversational turns) goes up front in the cache, while dynamically generated context (like recent turns or tool outputs) goes at the bottom. This approach stabilizes the agent’s core identity and instructions across turns while maintaining efficiency. Ultimately, models are not yet sophisticated enough for automatic compaction, necessitating intentional context compression, even if it means proactively limiting conversational turns, as Lopatina does in her own development workflow.
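The ordering rule is simple enough to show in prompt-assembly code. This is a generic sketch, not any specific provider's caching API: the only requirement for a KV cache hit is that the prompt prefix be byte-identical across turns.

```python
# KV-cache-friendly prompt assembly: stable content (system prompt,
# early turns) forms an unchanging prefix the provider can cache;
# volatile content (recent turns, tool outputs) is appended last.

SYSTEM_PROMPT = "You are a careful enterprise assistant."

def build_prompt(stable_history: list[str], recent_turns: list[str],
                 tool_output: str) -> str:
    # Prefix: byte-identical across turns -> cached keys/values reused.
    prefix = "\n".join([SYSTEM_PROMPT] + stable_history)
    # Suffix: changes every turn -> recomputed, but kept short.
    suffix = "\n".join(recent_turns + [f"[tool] {tool_output}"])
    return prefix + "\n" + suffix

p1 = build_prompt(["turn 1", "turn 2"], ["turn 9"], "ok")
p2 = build_prompt(["turn 1", "turn 2"], ["turn 10"], "retry")
# Both prompts share the same cacheable prefix.
shared_prefix = len("\n".join([SYSTEM_PROMPT, "turn 1", "turn 2"]))
```

Inserting anything above the stable prefix, even a timestamp, invalidates the cache for the entire prompt, which is why the "unchanging content up front" rule is worth enforcing mechanically rather than by convention.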

Looking ahead, the discussion is moving past individual component breakthroughs. The next frontier in 2026, according to Lopatina, is full-system design, where "The conversation shifts from 'how do I optimize my re-ranker' to 'what does the end-to-end architecture look like for reasoning over billions of tokens in production?'" This systemic view, encompassing multimodal ingestion, hybrid search, constrained agents, and strategic context management, marks the maturation of context engineering into a true discipline.