Science & Space

The Scaling Hypothesis at Ten Trillion: What the Benchmarks Mean

Claude Mythos did not emerge from nowhere. It emerged from a debate that has raged in AI research since 2020. The scholarship was right.

Claude Mythos Preview represents the largest validated test of neural scaling laws since Kaplan et al. first documented the power-law relationship in 2020.

This debate is not new. It has been running for six years, and most commentators treating Claude Mythos as a sudden rupture have not read the original papers. Situate the model in its intellectual history, and the picture clarifies.

In January 2020, Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models." The paper established a power-law relationship between model size, dataset size, compute budget, and performance. The central claim was precise: loss decreases as a smooth function of scale across seven orders of magnitude, and the relationship holds with remarkable predictability. That paper launched the scaling hypothesis as a research program.
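
The functional form is compact enough to state directly. The sketch below uses the paper's approximate fitted constants for the parameter-count law (roughly N_c = 8.8e13 and an exponent of 0.076 for non-embedding parameters); treat them as illustrative round figures rather than exact values.

```python
# Kaplan-style scaling law for loss as a function of parameter count.
# Functional form from Kaplan et al. (2020); the constants are the
# paper's approximate fits for non-embedding parameters, used here as
# illustrative round figures.

N_C = 8.8e13     # scale constant (approximate)
ALPHA_N = 0.076  # power-law exponent (approximate)

def loss(n_params: float) -> float:
    """L(N) = (N_C / N) ** ALPHA_N: loss falls smoothly as N grows."""
    return (N_C / n_params) ** ALPHA_N

# The signature of a power law: every 10x in parameters multiplies the
# loss by the same constant factor, 10 ** -ALPHA_N (about 0.84 here).
for n in (1e8, 1e9, 1e10, 1e11, 1e12, 1e13):
    print(f"N = {n:.0e}  loss = {loss(n):.3f}")
```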

Two years later, DeepMind published the Chinchilla paper, which corrected the compute-optimal ratio. Kaplan had suggested that models should grow faster than datasets. Chinchilla demonstrated that training tokens and parameters should scale in roughly equal proportion. The correction mattered because it changed how labs allocated their budgets, but it did not challenge the core hypothesis. Scale still worked. The question was how to spend the compute.
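
The budget arithmetic has a closed-form sketch. Assuming the standard C ≈ 6ND training-cost approximation and Chinchilla's rough finding of about 20 training tokens per parameter (both approximations, not the paper's exact fitted coefficients), a FLOP budget splits as follows.

```python
import math

# Compute-optimal allocation in the Chinchilla spirit. Assumes the
# standard C ~ 6 * N * D training-cost approximation and the rough
# ~20 tokens-per-parameter optimum; both are approximations.

TOKENS_PER_PARAM = 20.0

def optimal_split(compute_flops: float) -> tuple[float, float]:
    """From C = 6 * N * D and D = r * N: N = sqrt(C / (6r)), D = r * N."""
    n_params = math.sqrt(compute_flops / (6.0 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# Sanity check against the Chinchilla run itself (70B params, 1.4T
# tokens), using its approximate FLOP budget as the input.
n, d = optimal_split(6.0 * 7.0e10 * 1.4e12)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7.0e10 and ~1.4e12
```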

Kaplan et al. (2020) established that language model performance improves as a smooth power-law function of scale across seven orders of magnitude.

Verified

By 2024, a counter-narrative had gained traction. Researchers pointed to diminishing returns on standard benchmarks. Smaller models trained with better techniques closed gaps with larger ones. The "scaling is dead" thesis attracted significant attention. Some researchers argued that architecture innovations, data quality, and post-training methods had overtaken raw parameter count as the primary lever for improvement.

Claude Mythos answers that debate with data. Ten trillion parameters. A 93.9% on SWE-bench Verified, compared to 80.8% for Opus 4.6. A 77.8% on SWE-bench Pro, compared to 53.4%. An 82% on Terminal-Bench 2.0. A 97.6% on USAMO 2026. These are not incremental gains. The gap between Mythos and the next best model is larger than the gap between Opus 4.6 and models from two years earlier.
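
The size of those jumps is easier to see tabulated. The figures below are simply the ones quoted in this piece; the script does nothing but take differences.

```python
# Benchmark figures as quoted in this article, Mythos vs Opus 4.6,
# tabulated so the deltas are explicit.

results = {
    "SWE-bench Verified": (93.9, 80.8),
    "SWE-bench Pro": (77.8, 53.4),
}

for bench, (mythos, opus) in results.items():
    print(f"{bench}: {mythos:.1f} vs {opus:.1f}"
          f" -> gap of {mythos - opus:.1f} points")
```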

The gap between Mythos (93.9%) and Opus 4.6 (80.8%) on SWE-bench Verified is 13.1 percentage points, larger than the improvement from GPT-4 to Opus 4.6.

Verified

The scholarly record demands precision here. Mythos does not prove that raw parameter count is the only variable that matters. Anthropic combined scale with architecture improvements, better data curation, and post-training refinement. The correct reading of the evidence is that scaling continues to produce capability gains when paired with complementary improvements. The scaling hypothesis in its strongest form (scale alone is sufficient) may be too simple. The scaling hypothesis in its moderate form (scale remains necessary for frontier capability) appears validated.

The cybersecurity results add a dimension that benchmarks alone cannot capture. Mythos found a 27-year-old vulnerability in OpenBSD. It identified a 16-year-old flaw in FFmpeg that five million automated test runs missed. It chained Linux kernel vulnerabilities into a full privilege escalation. These are not pattern-matching tasks. They require reasoning about code structure, understanding system-level interactions, and constructing multi-step exploitation paths. The scholarly literature on emergent capabilities in large models predicted exactly this kind of qualitative shift at sufficient scale.

Capybara is a new name for a new tier of model: larger and more intelligent than our Opus models, which were, until now, our most powerful. -- Anthropic draft blog post, leaked March 2026

The intellectual genealogy of "emergence" in AI traces back to Jason Wei and colleagues at Google, who published "Emergent Abilities of Large Language Models" in 2022. The paper documented capabilities that appear unpredictably as models cross certain scale thresholds. Critics, including Rylan Schaeffer and colleagues, argued in 2023 that apparent emergence was an artifact of metric choice rather than a genuine phase transition. Mythos complicates both positions. The cybersecurity capabilities do not appear to be metric artifacts. A model either chains kernel exploits or it does not. But whether the capability emerged at a specific threshold or accumulated gradually remains an open empirical question.
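
The metric-artifact argument is easy to see in a toy sketch. If per-token accuracy improves smoothly with scale, an all-or-nothing exact-match score over a 30-token answer still looks like a sudden jump. The numbers below are invented for the illustration; only the shape of the argument comes from Schaeffer et al.

```python
# Toy version of the Schaeffer et al. (2023) metric-artifact argument:
# a smoothly improving per-token accuracy looks "emergent" under an
# all-or-nothing exact-match metric. All numbers here are invented.

K = 30  # answer length in tokens (hypothetical task)

# Per-token accuracy rising smoothly across six model scales.
per_token = [0.80, 0.85, 0.90, 0.95, 0.98, 0.995]

for p in per_token:
    exact_match = p ** K  # the whole answer must be correct
    print(f"per-token {p:.3f} -> exact-match {exact_match:.4f}")
```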

The competitive landscape provides additional context. DeepSeek V4, a one-trillion parameter mixture-of-experts model trained for 5.2 million dollars, achieves performance competitive with American frontier models at a fraction of the cost. Google Gemini 3.1 offers 2.5x faster processing. Alibaba Qwen 3.5-Omni provides native omnimodal capabilities in an open-weight model. The field has not converged on a single path. Some labs pursue raw scale. Others pursue efficiency. Some pursue both.

The FFmpeg vulnerability Mythos discovered had survived 16 years and five million automated test executions without detection.

Verified

The deeper lesson from the scholarly record is that the debate between "scale" and "efficiency" presents a false binary. The most capable systems combine both. Kaplan showed that scale works. Chinchilla showed how to scale efficiently. Mythos appears to incorporate both lessons at a new order of magnitude. The question now shifts from whether to scale to who gets to scale and under what conditions.

The scholarship on this question is more divided than most popular accounts suggest. The scaling hypothesis was always an empirical claim, not a mathematical proof. It held across seven orders of magnitude in 2020. It appears to hold at ten trillion parameters in 2026. Whether it continues to hold at a hundred trillion, or whether some other constraint emerges, remains genuinely unknown. The honest scholarly position is that Mythos is a strong data point in a running experiment, not a final answer. The record shows that anyone claiming certainty about the trajectory from here has outrun the evidence.

Key Entities

[ "Claude Mythos Preview""Scaling Laws""Kaplan et al.""Chinchilla""SWE-bench""Emergent Capabilities""DeepSeek V4" ]

Sources Cited

1. Kaplan et al. (arxiv.org)
2. Wei et al. (arxiv.org)
3. VentureBeat (venturebeat.com)
4. Fortune (fortune.com)
5. Anthropic (www.anthropic.com)