In the rapidly evolving landscape of artificial intelligence, the efficiency of training large language models (LLMs) is a critical focus for researchers and developers. Pretraining a modern LLM, often containing around 100 billion parameters, typically involves thousands of accelerators and extensive token corpora. Such a run can last days or even months, and its success hinges on two outcomes: speed, measured in tokens per second, and learning, assessed as model progress against wall-clock time.
Understanding what constitutes “fast” in this context, and how it can be measured, is essential for improving AI training efficiency. Raw throughput in tokens per second is an important metric, but it is also context-sensitive: GPU count, network topology, and model architecture all influence it, making throughput an outcome rather than a normalized measure of efficiency. A more nuanced metric is needed, one that evaluates progress as a fraction of potential capacity rather than as an absolute rate.
From Throughput to Goodput
The term “goodput” has emerged to address these concerns, shifting the focus from how many tokens per second are processed to the actual fraction of a system’s potential that translates into useful training progress. This concept was formalized by Google as ML Productivity Goodput, which offers an API-driven approach to compute goodput and identify sources of lost productivity, known as badput.
Goodput emphasizes accounting for lost time and wasted compute resources, making it an actionable metric: it quantifies, in a structured way, how effectively training capacity is being utilized.
Understanding Training Goodput
Training goodput is defined as the fraction of theoretical training capacity that translates into effective training progress, represented as a value between 0 and 1. A score of 1.0 indicates continuous productivity without significant time lost to disruptions, while a score of 0.5 suggests that about half of the potential is being wasted, often due to downtime or inefficiencies.
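As an illustrative sketch (not Google's official Goodput API), the core ratio can be written as productive time divided by total scheduled time over a measurement window:

```python
def goodput(productive_seconds: float, total_seconds: float) -> float:
    """Fraction of the measurement window spent making training progress."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return productive_seconds / total_seconds

# Example: 18 productive hours in a 24-hour window
print(goodput(18 * 3600, 24 * 3600))  # 0.75
```

A value of 1.0 means every scheduled second advanced training; 0.5 means half the window was lost.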
To effectively measure goodput, it is essential to break down the training process into three distinct layers:
- Infra Layer: This layer focuses on system availability, capturing the fraction of time the training job is in a healthy state.
- Framework Layer: This layer assesses the efficiency of state-saving and recovery, measuring how much progress is lost when failures force a rollback to the last checkpoint.
- Model Layer: This evaluates how effectively the training program utilizes the computational resources available, particularly through metrics like Model FLOPs Utilization (MFU).
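One simple way to combine the three layers, assuming each is measured as an independent fraction, is to treat overall goodput as their product. This is a simplification (the layers can interact in practice), but it makes the multiplicative nature of the losses concrete:

```python
def overall_goodput(infra: float, framework: float, model: float) -> float:
    """Combine per-layer fractions multiplicatively; each must lie in [0, 1]."""
    for name, value in (("infra", infra), ("framework", framework), ("model", model)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} goodput must be in [0, 1], got {value}")
    return infra * framework * model

# Hypothetical example: 95% availability, 98% framework efficiency, 40% MFU
print(overall_goodput(0.95, 0.98, 0.40))
```

Note how a single weak layer dominates: even with excellent availability and checkpointing, a low model-layer fraction caps the whole stack.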
Layers of Goodput
Each layer of goodput offers valuable insights into different aspects of the training process:
Infra Goodput
Infra goodput measures the availability of the training infrastructure. It quantifies the time spent in a productive state versus downtime due to faults or orchestration delays. This aspect is crucial for identifying reliability issues that can disrupt training workflows.
Framework Goodput
Framework goodput evaluates the efficiency of checkpointing and recovery processes. It highlights the time lost to overheads during these operations, underscoring the importance of balancing checkpoint frequency to minimize system tax while avoiding excessive rollback losses.
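The checkpoint-frequency trade-off can be reasoned about with the classic Young/Daly approximation, which balances checkpoint overhead against expected rollback loss by setting the interval to the square root of twice the checkpoint cost times the mean time between failures. The numbers below are hypothetical:

```python
import math

def young_daly_interval(checkpoint_seconds: float, mtbf_seconds: float) -> float:
    """Young's approximation: interval that balances checkpoint overhead
    against the expected work lost to rollback on failure."""
    return math.sqrt(2 * checkpoint_seconds * mtbf_seconds)

# Hypothetical: 2-minute checkpoints, mean time between failures of 8 hours
interval = young_daly_interval(120, 8 * 3600)
print(f"checkpoint roughly every {interval / 60:.0f} minutes")  # ~44 minutes
```

Checkpointing much more often than this pays an unnecessary system tax; much less often, and each failure wipes out too much progress.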
Model Goodput
Model goodput assesses how well the system converts computational power into effective training. A low MFU can indicate various inefficiencies, including communication overhead and poor parallelism configurations, which can degrade overall performance.
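MFU can be estimated as achieved FLOPs per second divided by the hardware's aggregate peak; the sketch below uses the common approximation of about 6 FLOPs per parameter per training token. All numbers in the example are hypothetical:

```python
def mfu(n_params: float, tokens_per_second: float,
        n_accelerators: int, peak_flops_per_accelerator: float) -> float:
    """Estimate Model FLOPs Utilization via the ~6 * params * tokens rule."""
    achieved_flops_per_sec = 6 * n_params * tokens_per_second
    peak_flops_per_sec = n_accelerators * peak_flops_per_accelerator
    return achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical: 100B params, 200k tokens/s on 1,024 accelerators at 300 TFLOP/s each
print(f"MFU = {mfu(100e9, 200_000, 1024, 300e12):.2f}")  # MFU = 0.39
```

A drop in this number, with throughput otherwise stable, points at communication overhead or a poor parallelism configuration rather than at hardware availability.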
Calculating Goodput
To compute training goodput, establish a measurement window, typically 24 hours, and consistently log productive training time, recording every transition between the training-active state and non-productive intervals. Tying each disruption to a specific fault event aids accountability and clarifies how failures affect total training time.
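A minimal sketch of that logging approach, assuming a sorted list of timestamped state transitions (the state names here are hypothetical labels, not a standard schema):

```python
from datetime import datetime, timedelta

def goodput_over_window(events, window_start, window_end):
    """Sum productive time from (timestamp, state) transitions.

    `events` is a time-sorted list of (datetime, state) pairs, where state is
    "training" or a non-productive label such as "checkpointing" or "down".
    """
    productive = timedelta()
    for (t0, state), (t1, _) in zip(events, events[1:]):
        if state == "training":
            productive += min(t1, window_end) - max(t0, window_start)
    # The final state runs until the end of the window
    last_t, last_state = events[-1]
    if last_state == "training" and last_t < window_end:
        productive += window_end - max(last_t, window_start)
    return productive / (window_end - window_start)

start = datetime(2024, 1, 1)
end = start + timedelta(hours=24)
log = [
    (start, "training"),
    (start + timedelta(hours=10), "down"),      # fault event
    (start + timedelta(hours=12), "training"),  # recovered
]
print(goodput_over_window(log, start, end))  # ≈ 0.917 (22 of 24 hours)
```

Attaching a fault identifier to each non-productive entry would let the same log attribute lost time to specific failures, as described above.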
Once the goodput is calculated, it can provide a single, stack-aware metric representing overall training efficiency. This approach not only clarifies where productivity losses occur but also directs attention to the layers that can be optimized for better performance.
Conclusion
As the complexity of training LLMs continues to grow, understanding and improving training efficiency will be paramount. While traditional metrics like throughput are useful, they do not capture the full picture. Transitioning to a goodput framework allows for a more comprehensive analysis of training efficiency, focusing on how effectively resources are utilized throughout the training process.
Looking ahead, organizations running large-scale machine learning systems should make stack-level goodput a first-class topic, as a means of improving both productivity and reliability in their AI training efforts. Staying current with these methodologies and tools will be critical to maintaining a competitive edge in the AI landscape.