AI Inference: The Race for Tokens Per Watt & Goodput Efficiency

The economics of inference at scale are becoming increasingly critical. AI datacenters are often described as factories that take power in and put tokens out, and they are under enormous pressure to maximize efficiency: the more tokens generated for a given amount of power, the better. Generate enough tokens to cover infrastructure, power, and operational costs, and everything beyond that is profit. “For the datacenters, inference tokens per watt translates directly to the revenues of the CSPs” (cloud service providers), noted Nvidia CEO Jensen Huang in a recent earnings call.

This analogy to manufacturing highlights the competitive advantage that comes from optimizing the number of tokens produced per second, per dollar, and per watt (TPS/$/W). However, the reality of scaling inference is complex. It’s not merely a matter of adding more GPUs to generate more tokens.
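As a back-of-the-envelope illustration of the TPS/$/W framing, here is a minimal sketch; every throughput, power, and price figure in it is an invented placeholder, not a measurement of any real system.

```python
# Back-of-the-envelope TPS/$/W calculation. All numbers below are
# hypothetical placeholders, not measurements of any real deployment.

tokens_per_second = 50_000      # aggregate decode throughput of a deployment
power_draw_watts = 120_000      # total power under load
hourly_cost_usd = 350.0         # amortized hardware + power + ops per hour

tokens_per_watt = tokens_per_second / power_draw_watts
tokens_per_dollar = tokens_per_second * 3600 / hourly_cost_usd

print(f"tokens/s/W: {tokens_per_watt:.2f}")
print(f"tokens/$:   {tokens_per_dollar:,.0f}")

# Revenue side: at an assumed price per million output tokens, margin is
# whatever remains after the hourly cost.
price_per_m_tokens = 2.0
hourly_revenue = tokens_per_second * 3600 / 1e6 * price_per_m_tokens
print(f"revenue/hr: ${hourly_revenue:,.0f} vs cost/hr ${hourly_cost_usd:,.0f}")
```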

As highlighted by Dave Salvator, director of accelerated computing products at Nvidia, “It’s not one size fits all in terms of the answer. There are SLAs, there’s different application types.” This complexity means organizations must consider how many TPS/$/W they can generate for a given “goodput”: the portion of throughput that actually meets service-level targets such as time to first token or per-user generation rate.
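To make goodput concrete, the minimal sketch below counts only the tokens from requests that met an assumed time-to-first-token target; the request records are synthetic.

```python
# Goodput = throughput that actually meets the SLA. Requests that miss
# the latency target still burn GPU time but count for nothing.
# All request records below are synthetic illustrations.

TTFT_SLA_S = 0.5   # assumed service-level target: time to first token

# (time_to_first_token_s, tokens_generated) per request
requests = [
    (0.31, 420),
    (0.44, 810),
    (0.92, 650),   # misses the TTFT target
    (0.27, 300),
]

window_s = 20.0  # measurement window
raw_throughput = sum(tok for _, tok in requests) / window_s
goodput = sum(tok for ttft, tok in requests if ttft <= TTFT_SLA_S) / window_s

print(f"raw throughput: {raw_throughput:.0f} tok/s")
print(f"goodput:        {goodput:.0f} tok/s")
```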

Understanding Tokenomics and Goodput

The benchmark provided by SemiAnalysis’s InferenceX (formerly InferenceMax) offers insightful data on performance scaling and economics in generative AI inference. The efficiency Pareto curve illustrates the trade-offs between total token throughput per megawatt and user interactivity. The ideal performance lies in maximizing both aspects.
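The trade-off is easiest to see as a Pareto frontier over measured operating points. Below is a minimal sketch that keeps only the non-dominated (interactivity, throughput-per-megawatt) pairs; the numbers are invented for illustration, not InferenceX data.

```python
# Pareto frontier over operating points: each point is
# (tokens/s per user, total tokens/s per MW). Numbers are illustrative.

points = [
    (10, 9.0e6), (20, 7.5e6), (40, 5.0e6), (60, 3.0e6), (80, 1.2e6),
    (30, 4.0e6), (50, 2.0e6),  # these two are dominated by other points
]

def pareto_frontier(pts):
    """Keep points that no other point beats on both axes."""
    return [p for p in pts
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in pts)]

for interactivity, throughput in sorted(pareto_frontier(points)):
    print(f"{interactivity:3d} tok/s/user -> {throughput / 1e6:.1f}M tok/s/MW")
```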

The Pareto curve categorizes tokens into three groups: bulk tokens, which are cheaper but slower; low-latency tokens, which come at a premium; and the “Goldilocks zone,” where a balance of interactivity and throughput is achieved. This zone offers sufficient interactivity while still being cost-effective.
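To make the zones concrete, here is a minimal sketch that buckets an operating point by its per-user token rate; the threshold values are arbitrary assumptions for illustration, not InferenceX's definitions.

```python
# Classify an operating point into the three zones described above.
# The interactivity thresholds are assumed for illustration only.

def token_zone(tokens_per_s_per_user: float) -> str:
    if tokens_per_s_per_user < 15:
        return "bulk (cheap, slow)"
    if tokens_per_s_per_user <= 60:
        return "Goldilocks (interactive enough, still cost-effective)"
    return "low-latency (premium)"

for rate in (8, 30, 90):
    print(f"{rate:3d} tok/s/user -> {token_zone(rate)}")
```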

The Role of Software in AI Inference

Achieving optimal goodput is not solely dependent on hardware; software plays a significant role. For instance, vLLM, a popular inference-serving framework, performs well with certain models but may trail rival engines such as SGLang or TensorRT LLM on others. This variability is one reason Nvidia promotes its inference microservices (NIMs), which are designed to simplify deployment and enhance efficiency.
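For context on what such a framework looks like in practice, the snippet below shows vLLM's standard offline-generation entry point; the model name is a placeholder for whatever checkpoint you are serving.

```python
# Minimal vLLM offline generation example. The model name is a
# placeholder; substitute the checkpoint you are benchmarking.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/your-model")        # loads weights onto the GPU(s)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain goodput in one sentence."], params)
print(outputs[0].outputs[0].text)
```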

InferenceX data indicates that Nvidia’s TensorRT LLM running on B200 GPUs can serve models like DeepSeek R1 more efficiently than alternatives. Nevertheless, open-source inference engines remain valuable for hyperscalers as they can be tailored to specific workloads.

Disaggregated Compute: A New Approach

The introduction of disaggregated serving frameworks, such as Nvidia’s Dynamo and AMD’s MoRI, marks a significant advance in inference efficiency. These frameworks split a request’s lifecycle across separate GPU pools, separating the compute-bound prefill phase (processing the prompt) from the memory-bandwidth-bound decode phase (generating tokens) so each can run on resources provisioned for it.
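As a toy illustration of the idea (emphatically not the Dynamo or MoRI APIs), the sketch below stands in a prefill pool that builds a KV cache and a decode pool that consumes it.

```python
# Toy illustration of disaggregated serving: a prefill pool builds the
# KV cache (compute-bound), then hands it to a decode pool that streams
# tokens (bandwidth-bound). This mimics the concept, not any real API.
from dataclasses import dataclass

@dataclass
class KVCacheHandle:
    request_id: int
    prompt_tokens: int   # stand-in for the real per-layer KV tensors

def prefill_worker(request_id: int, prompt: str) -> KVCacheHandle:
    # Compute-heavy: one pass over the whole prompt.
    return KVCacheHandle(request_id, prompt_tokens=len(prompt.split()))

def decode_worker(kv: KVCacheHandle, max_new_tokens: int) -> list[str]:
    # Bandwidth-heavy: one token at a time, rereading the KV cache.
    return [f"tok{i}" for i in range(max_new_tokens)]

kv = prefill_worker(1, "Summarize the economics of AI inference.")
print(decode_worker(kv, max_new_tokens=4))
```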

The balance of prefill GPUs to decode GPUs varies with the model and the desired goodput. Latency-sensitive applications may require more decode GPUs, while high-volume workloads may benefit from more prefill GPUs. Tuning this ratio is how operators hit goodput targets without stranding hardware.
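Back-of-the-envelope capacity planning makes the trade-off concrete. All rates in the sketch below are assumed placeholders, not benchmark results.

```python
# Rough sizing of the prefill:decode GPU ratio.
# Every rate below is an assumed placeholder, not a measurement.

req_per_s = 40            # offered load
prompt_tokens = 2_000     # average prompt length
decode_tokens = 500       # average generated tokens per request

prefill_tok_per_s_per_gpu = 80_000   # compute-bound phase
decode_tok_per_s_per_gpu = 4_000     # bandwidth-bound phase

prefill_gpus = req_per_s * prompt_tokens / prefill_tok_per_s_per_gpu
decode_gpus = req_per_s * decode_tokens / decode_tok_per_s_per_gpu

print(f"prefill GPUs: {prefill_gpus:.1f}, decode GPUs: {decode_gpus:.1f}")
print(f"ratio prefill:decode ~ 1:{decode_gpus / prefill_gpus:.1f}")
# Longer prompts push the ratio toward prefill; tighter per-user latency
# targets push it toward decode.
```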

Rack-Scale Architectures and Their Impact

The shift towards rack-scale architectures, including systems like Nvidia’s NVL72 and AMD’s Helios, is changing how AI systems are built. These architectures feature multiple GPUs connected by high-speed fabrics that reduce latency and enhance throughput. Finding the ideal combination of expert, pipeline, data, and tensor parallelism is crucial for meeting goodput targets efficiently.
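To give a sense of how large that search space is, the sketch below enumerates factorizations of a 72-GPU domain (the NVL72 scale) into tensor, pipeline, expert, and data parallelism. The candidate degrees are assumed for illustration; real mappings face further constraints such as attention-head counts, expert counts, and memory capacity.

```python
# Enumerate ways to map 72 GPUs onto tensor (TP), pipeline (PP),
# expert (EP), and data (DP) parallelism. Which mapping is best depends
# on the model and the goodput target; this only lists candidates.
from itertools import product

N_GPUS = 72
configs = [(tp, pp, ep, dp)
           for tp, pp, ep, dp in product((1, 2, 4, 8),      # TP degrees
                                         (1, 2, 3),          # PP degrees
                                         (1, 2, 4, 8),       # EP degrees
                                         (1, 2, 3, 6, 9))    # DP degrees
           if tp * pp * ep * dp == N_GPUS]

for tp, pp, ep, dp in configs:
    print(f"TP={tp} PP={pp} EP={ep} DP={dp}")
print(f"{len(configs)} candidate mappings")
```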

Comparing Nvidia’s enterprise-focused B300s to its rack-scale GB300s reveals that while the smaller systems hold up well at low interactivity targets, they struggle under high demand. The rack-scale systems maintain interactivity without sacrificing throughput.

Currently, Nvidia leads the market with a mature rack-scale platform, but AMD’s MI455X-based Helios systems are expected to launch in the latter half of 2026, potentially offering comparable performance.

Cost Efficiency and Future Considerations

Cost efficiency in inference systems is paramount, particularly as organizations evaluate their hardware choices. Nvidia’s and AMD’s smaller systems remain competitive at higher interactivity levels, while rack-scale architectures provide advantages in throughput. As the landscape continues to evolve, companies must weigh the implications of their hardware and software strategies for both performance and cost.

With advancements in AI software and hardware, the state of inference is a rapidly moving target, and failing to keep software stacks current leaves performance on the table. “The state of the art of AI is very much a moving target,” Salvator remarked, emphasizing the continuous optimization efforts underway.

As the industry shifts towards lower-precision models, the economics of inference will likely favor these approaches, since narrower number formats reduce memory needs and computational demand. OpenAI’s GPT-OSS is one model that already leverages these advancements.
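The memory side of that shift is simple arithmetic; the sketch below compares weight footprints for a hypothetical 120B-parameter model at 16-, 8-, and 4-bit formats.

```python
# Why lower precision moves the economics: weight memory for a
# hypothetical 120B-parameter model at different number formats.

params = 120e9
bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

for fmt, b in bytes_per_param.items():
    print(f"{fmt}: {params * b / 1e9:.0f} GB of weights")
# FP16: 240 GB, FP8: 120 GB, FP4: 60 GB. Less memory per parameter means
# fewer GPUs to hold the model and more bandwidth left for serving tokens.
```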

In this competitive landscape, inference providers must not only optimize their hardware and software stacks but also differentiate themselves in an increasingly commoditized market. Developing customized solutions will become essential for success.

As we look to the future, the integration of advanced architectures and optimized software will define the next phase of AI inference technology. The focus on cost-effective solutions coupled with high performance will determine which providers thrive in this rapidly evolving marketplace.

We encourage readers to share their thoughts on the future of AI inference and how these developments might shape the industry.
