Train-to-Test (T²) Scaling Laws: Optimizing LLM Training and Inference

University of Wisconsin-Madison and Stanford researchers have introduced Train-to-Test (T²) scaling laws, a framework that jointly optimizes model size, training data volume, and inference-time sampling count to minimize total compute cost for reasoning-intensive AI workloads. The work demonstrates that small, overtrained models can outperform larger Chinchilla-optimal models under fixed budgets when generating multiple reasoning samples.

For years, AI scaling has operated under a dangerous fiction: that training and inference are separate problems. The Chinchilla optimum of 20 training tokens per parameter became gospel not because it was universally optimal, but because it minimized pre-training loss under isolated assumptions. Yet as agentic workflows proliferate, from code generation to mathematical reasoning, the real bottleneck has shifted to inference. When each sample's cost scales with model size, techniques like pass@k sampling become prohibitively expensive. T² scaling resolves this by treating training and inference as a single constrained optimization problem, where the objective is to maximize downstream accuracy per unit of total compute: pre-training plus inference.

The framework’s core insight is elegantly brutal: under a fixed compute budget C, total cost is modeled as 6ND (pre-training) + 2Nk (inference), where N is parameter count, D is training tokens, and k is the number of reasoning samples generated at test time. Performance is then approximated either via modified loss (for internal research) or pass@k (for practical deployment). By jointly optimizing N, D, and k, T² reveals that the compute-optimal frontier pivots hard toward extreme overtraining (think 100+ tokens per parameter) on compact models, then spends the saved FLOPs on dozens of inference samples. In their 5M-to-901M parameter testbed, overtrained 100M-parameter models routinely beat 300M-parameter Chinchilla baselines on ARC, GSM8K, and OpenBookQA when k≥8, despite identical pre-training compute.
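The cost side of this model takes only a few lines to sketch. The accuracy model is omitted here, and the two example configurations (a 1B-parameter Chinchilla-style run versus a 250M-parameter overtrained run) are illustrative numbers, not the paper's fitted values:

```python
def total_flops(n_params: float, n_tokens: float, k_samples: int,
                tokens_per_sample: float = 1.0) -> float:
    """Total compute per the article's model: 6ND pre-training FLOPs
    plus roughly 2N FLOPs per generated token across k inference samples."""
    return 6 * n_params * n_tokens + 2 * n_params * k_samples * tokens_per_sample

# Illustrative comparison (assumed configs, not the paper's numbers):
# a 1B-param / 20B-token Chinchilla-style run with a single sample
# versus an overtrained 250M-param / 50B-token run with 16 samples.
chinchilla = total_flops(1e9, 20e9, k_samples=1)
overtrained = total_flops(250e6, 50e9, k_samples=16)
```

Minimizing `total_flops` over (N, D, k) subject to an accuracy target is the joint optimization the framework describes; only the cost side is shown here.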

“We’re not saying bigger models are useless—we’re saying that for reasoning under latency or cost constraints, throwing more data at a smaller model and letting it ‘think longer’ at inference is often the smarter move.”

— Nicholas Roberts, co-author of the T² scaling paper and PhD candidate at Stanford’s AI Lab, speaking at the NeurIPS 2025 Workshop on Efficient Foundation Models.

What makes T² actionable today is its compatibility with existing inference optimizations. KV caching, already standard in Hugging Face Transformers and vLLM, amortizes the cost of re-processing prompts across multiple samples. For a 70B parameter model generating 16 samples, KV caching reduces memory bandwidth needs by ~12x versus naive recomputation, turning what was a bandwidth-bound nightmare into a feasible latency target. Even more promising, recent work from Cerebras on weight-stationary execution shows that for overtrained models with high reuse patterns, inference energy per token can drop below 10 pJ, closing in on the theoretical Landauer limit for synaptic operations.
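The amortization argument reduces to back-of-envelope arithmetic. This sketch counts only prompt-processing FLOPs; the 70B parameter count and 2048-token prompt are illustrative assumptions, and the exact ~12x figure quoted above will depend on prompt length, generation length, and batch layout:

```python
def prompt_flops(n_params: float, prompt_len: int, k: int,
                 shared_kv_cache: bool) -> float:
    """FLOPs spent processing the prompt for k samples: with a shared
    KV cache the prompt is processed once; naively, once per sample."""
    per_pass = 2 * n_params * prompt_len  # ~2N FLOPs per prompt token
    return per_pass * (1 if shared_kv_cache else k)

naive = prompt_flops(70e9, 2048, k=16, shared_kv_cache=False)
cached = prompt_flops(70e9, 2048, k=16, shared_kv_cache=True)
# In this toy model, sharing the prefix cache across 16 samples
# cuts prompt reprocessing by exactly 16x.
```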

This isn’t just an academic curiosity. The implications ripple through the AI supply chain. Cloud providers like AWS and GCP, whose GPU-hour pricing models implicitly reward Chinchilla-style training, may see demand shift toward compute-optimized overtraining workloads, favoring burstable, memory-optimized instances over raw FP8 throughput. Meanwhile, open-source communities stand to gain: if state-of-the-art reasoning no longer requires frontier models, the barrier to entry for fine-tuning Llama-3 or Mistral derivatives plummets. Hugging Face’s recent “Inference Scaling” leaderboard, which now includes pass@k metrics alongside traditional benchmarks, implicitly validates this shift, though it still lacks a T²-aware Pareto frontier optimizer.

Yet limits loom. Data walls are real. The team warns that beyond ~200 tokens per parameter, marginal returns diminish not from algorithmic limits but from data scarcity: high-quality, deduplicated text simply doesn’t scale indefinitely. Worse, extreme overtraining induces optimization pathologies: loss landscapes develop sharper minima, making fine-tuning brittle. In their ablation studies, supervised fine-tuning on overtrained models required 2–3× more steps to converge, though final performance was unaffected once convergence was reached. For enterprises, this means adopting T² isn’t just about compute allocation; it’s also about data curation pipelines and fine-tuning hygiene.

Where this hits hardest is in the enterprise AI stack. Consider a coding agent using pass@16 sampling to generate secure, bug-free patches. Under Chinchilla, you might deploy a 1B-parameter model at $0.004/query. With T², you could train a 250M-parameter model on 50B tokens (200 tokens/param) and achieve better pass@16 at $0.0015/query—a 62% reduction. Multiply that across millions of daily invocations, and the savings fund entire data engineering teams. Crucially, this doesn’t require new hardware; it works on existing H100s or even TPU v4 slices, provided your inference engine supports efficient sampling—something vLLM, TensorRT-LLM, and DeepSpeed-Inference all now offer via configurable sampling backends.
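The arithmetic in that worked example can be verified directly from the two per-query prices quoted above:

```python
# Per-query prices from the worked example in the text.
chinchilla_cost = 0.004   # 1B-param model, pass@16
t2_cost = 0.0015          # 250M-param model trained on 50B tokens, pass@16

savings = 1 - t2_cost / chinchilla_cost
print(f"{savings:.1%}")   # prints "62.5%", consistent with the ~62% cited
```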

The real victory here is democratization. For years, the narrative has been that only those with access to exaFLOP superclusters could train state-of-the-art reasoners. T² flips that: with great data and disciplined budget allocation, a mid-sized startup can now rival the reasoning performance of labs with 100× the compute, provided they’re willing to overtrain aggressively and sample wisely at inference. As Roberts put it bluntly: “You don’t need more GPUs. You need better arithmetic.”

Looking ahead, the next frontier is dynamic T²—where k adapts per-query based on estimated problem difficulty. Early prototypes from MIT’s Spinning Up team show that allocating more samples to hard problems (e.g., Olympiad-level math) while using k=1 for trivial queries can yield another 1.4× efficiency gain. But that’s a story for next quarter. For now, the message is clear: stop optimizing for training loss alone. Start optimizing for what your system actually does—reason, sample, decide—and let the compute follow.
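A dynamic-T² allocator could be as simple as a thresholded policy over an estimated difficulty score. The sketch below is hypothetical: the difficulty estimator, thresholds, and budgets are illustrative choices, not taken from any published prototype:

```python
def choose_k(difficulty: float, k_max: int = 16) -> int:
    """Hypothetical per-query sample allocation: an estimated difficulty
    score in [0, 1] maps to a sampling budget; trivial queries get k=1."""
    if difficulty < 0.2:
        return 1       # trivial query: single greedy sample
    if difficulty < 0.6:
        return 4       # moderate: small sampling budget
    return k_max       # hard (e.g. Olympiad-level math): full budget
```

In practice the difficulty score might come from a lightweight classifier or the model's own uncertainty; the win comes from spending the sampling budget only where extra reasoning attempts change the outcome.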

Sophie Lin - Technology Editor

Sophie is a tech innovator and acclaimed tech writer recognized by the Online News Association. She translates the fast-paced world of technology, AI, and digital trends into compelling stories for readers of all backgrounds.
