AI Coding Benchmark Scores Skewed by Infrastructure

Infrastructure configuration, not just AI model prowess, can significantly skew benchmark results, complicating deployment decisions.


The race to build the most capable AI coding assistants is being complicated by a fundamental flaw: the testing grounds themselves. Infrastructure, from CPU allocation to memory limits, can swing benchmark results by several percentage points, sometimes more than the actual gap between leading AI models. This means that decisions about deploying AI assistants might be based on flawed data.

Benchmarks like SWE-bench and Terminal-Bench 2.0 are the front lines where AI models battle for supremacy in software engineering tasks. Top models often vie for leaderboard positions separated by mere points. However, Anthropic's research reveals that the environment these models run in is far from passive; it's an integral part of the problem-solving process.

The Hidden Variable: Infrastructure

Unlike static benchmarks that simply score output, agentic coding evaluations provide models with a full development environment. The AI writes code, runs tests, installs dependencies, and iterates over multiple turns. When two agents operate with different resource budgets and time limits, they aren't taking the same test.
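The multi-turn loop described above can be sketched in a few lines. This is a minimal, hypothetical harness, not the actual SWE-bench or Terminal-Bench code: `agent_step` and `run_tests` stand in for the model call and the sandboxed test run, which real benchmarks execute inside a full containerized environment.

```python
# Minimal sketch of a multi-turn agentic coding evaluation loop. The
# harness names (`agent_step`, `run_tests`) are hypothetical stand-ins;
# real benchmarks run these steps inside a containerized environment.

def run_agentic_eval(agent_step, run_tests, max_turns=10):
    """Iterate until the task's tests pass or the turn budget is exhausted."""
    history = []
    for turn in range(max_turns):
        action = agent_step(history)   # model proposes an edit or command
        result = run_tests(action)     # environment executes and scores it
        history.append((action, result))
        if result["passed"]:
            return {"solved": True, "turns": turn + 1}
    return {"solved": False, "turns": max_turns}

# Toy task: the "agent" refines its attempt each turn and succeeds on turn 3.
toy_agent = lambda history: len(history)
toy_tests = lambda action: {"passed": action == 2}
outcome = run_agentic_eval(toy_agent, toy_tests)
```

Because the environment executes every intermediate step, anything that interferes with those steps, such as an out-of-memory kill mid-install, is indistinguishable from a model failure in the final score.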

Anthropic discovered that their setup on Terminal-Bench 2.0 produced scores that didn't align with the official leaderboard, along with surprisingly high infrastructure error rates: up to 6% of tasks failed due to environmental issues rather than model limitations.

The discrepancy stemmed from how resources were managed. Kubernetes, in this case, treated per-task resource specifications as both a minimum guarantee and a hard kill limit. This left zero headroom for transient spikes in demand, meaning a momentary memory fluctuation could terminate a container that might otherwise have succeeded.

The benchmark's official leaderboard uses a more lenient sandboxing provider that allows temporary overallocation, prioritizing stability over strict limits. This difference alone can create significant score variations. Anthropic's experiments showed that increasing resource headroom directly correlated with higher success rates, primarily by reducing infrastructure errors.
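The difference between the two enforcement regimes comes down to whether a transient spike above the guaranteed allocation is fatal. The toy comparison below illustrates that mechanism; the numbers and the request/limit terminology are illustrative simplifications, not the exact Kubernetes or sandbox-provider semantics.

```python
# Toy comparison of strict vs. lenient enforcement. Numbers and the
# request/limit terminology are illustrative, not the exact Kubernetes
# or sandbox-provider semantics described in the research.

def survives_spike(peak_mem_mb, limit_mb):
    """A container is killed only if its peak usage exceeds the hard limit."""
    return peak_mem_mb <= limit_mb

REQUEST_MB = 2048   # guaranteed allocation for the task
SPIKE_MB = 2100     # transient peak slightly above the guarantee

# Strict setup: the limit is pinned to the request, so the spike is fatal.
strict_ok = survives_spike(SPIKE_MB, limit_mb=REQUEST_MB)

# Lenient setup: same guarantee, but temporary overallocation is tolerated.
lenient_ok = survives_spike(SPIKE_MB, limit_mb=3 * REQUEST_MB)
```

Under this simplification, the same agent run passes in the lenient setup and fails in the strict one, even though the model's behavior is identical in both.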

When More Resources Mean Better Performance

Across six different resource configurations for Terminal-Bench 2.0, Anthropic observed that success rates climbed as resource headroom increased. Infrastructure error rates dropped monotonically, from 5.8% under strict enforcement to just 0.5% with uncapped resources. The total lift over the strictest setting was a significant 6 percentage points.

This isn't just about preventing crashes; more resources enable AI agents to tackle tasks that inherently require substantial computational power. This includes pulling large dependencies, spawning expensive subprocesses, or running memory-intensive test suites. Tasks like rstan-to-pystan and compile-compcert saw notable improvements with increased memory headroom.

Redefining What Benchmarks Measure

Up to roughly three times the recommended Terminal-Bench specifications, additional resources primarily fix infrastructure reliability issues. Beyond that threshold, however, the extra compute actively helps agents solve problems they previously couldn't. This reveals a critical point: resource limits alter what the evaluation actually measures.

Tight constraints inadvertently reward hyper-efficient, lean coding strategies. Generous limits, conversely, favor agents that can leverage extensive resources, potentially using brute-force methods or heavyweight tools. Both approaches are valid, but collapsing them into a single score without specifying resource configuration obscures crucial differences and hinders interpretation of real-world generalizability.

For instance, a task like bn-fit-modify requires installing a large Python data science stack. Under generous limits, this works. Under tight ones, the environment runs out of memory during installation before the agent even writes solution code. A leaner, from-scratch implementation is possible, but not all models default to it, and resource configuration dictates which strategy succeeds.

This effect was observed across different Anthropic models and appears to hold for models beyond Claude. A crossover experiment on SWE-bench, varying RAM up to 5x the baseline, showed a similar trend, albeit with a smaller magnitude of 1.54 percentage points. While SWE-bench tasks are less resource-intensive, the finding still indicates that resource allocation is not neutral.

Beyond Resources: Other Confounding Factors

Resource allocation isn't the sole hidden variable. Time limits can also play a significant role. In essence, every element of the evaluation setup—from cluster health and hardware specs to concurrency levels and even network bandwidth—can act as a confounder.

Anecdotal evidence suggests pass rates fluctuate with the time of day, likely due to variable API latency influenced by traffic patterns. This illustrates a broader issue: the line between "model capability" and "infrastructure behavior" is far blurrier than a single benchmark score implies.

Public benchmarks, intended to measure pure model capabilities, risk conflating them with infrastructure quirks. While such conflation can be useful for end-to-end system testing, it's often undesirable for measuring raw AI prowess.

Recommendations for Rigor

The ideal scenario is running evaluations under identical hardware conditions for perfect reproducibility. However, this isn't always practical. Anthropic recommends that evaluations specify both guaranteed resource allocations and hard kill thresholds separately per task, rather than a single pinned value.

This approach provides containers with breathing room to avoid spurious kills from transient spikes while still enforcing a ceiling. For Terminal-Bench 2.0, a 3x ceiling reduced infra errors significantly while keeping score lifts within noise margins, effectively neutralizing the infrastructure confounder without removing meaningful resource pressure.
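The recommended split can be expressed as a simple policy: derive a guaranteed request and a separate, higher kill ceiling from each task's recommended specification. This is a sketch under stated assumptions: the 3x default mirrors the ceiling the article reports as neutralizing infra errors on Terminal-Bench 2.0, and the field names are illustrative, not a real scheduler API.

```python
# Sketch of the recommended split: a guaranteed allocation plus a
# separate, higher kill ceiling. The 3x default mirrors the ceiling the
# article reports for Terminal-Bench 2.0; field names are illustrative,
# not a real scheduler API.

def resource_spec(recommended_mb, headroom_factor=3):
    """Return a guaranteed request and a hard kill limit with headroom."""
    return {
        "request_mb": recommended_mb,                  # minimum guarantee
        "limit_mb": recommended_mb * headroom_factor,  # hard kill threshold
    }
```

For a task whose recommended spec is 2 GB, this yields a 2 GB guarantee with a 6 GB ceiling: enough slack to absorb transient spikes without removing resource pressure entirely.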

The impact of these findings is substantial. Benchmark scores increasingly inform critical deployment decisions, yet the rigor in how they are run and reported often lags. A seemingly small lead on a leaderboard might simply reflect beefier hardware or a more opportune execution time.

For AI labs, resource configuration in agentic evaluations must be treated as a first-class experimental variable, documented and controlled with the same rigor as prompt formats. For benchmark maintainers, specifying enforcement methodology alongside resource recommendations is crucial. For consumers of benchmark results, skepticism is warranted for score differences below 3 percentage points unless the evaluation configuration is transparently documented.