
NVIDIA and Partners Propel Development of Next-Generation Gigawatt AI Factories for Vera Rubin Telescope Project

by Sophie Lin - Technology Editor

At the OCP Global Summit, NVIDIA is offering a glimpse into the future of gigawatt AI factories.

NVIDIA will unveil specifications of the NVIDIA Vera Rubin NVL144 MGX-generation open-architecture rack servers, which more than 50 MGX partners are gearing up to support, along with ecosystem support for NVIDIA Kyber, which connects 576 Rubin Ultra GPUs and is built to meet rising inference demands.

More than 20 industry partners are showcasing new silicon, components, power systems and support for the next-generation 800-volt direct current (VDC) data centers of the gigawatt era, which will support the NVIDIA Kyber rack architecture.

Foxconn provided details on its 40-megawatt Taiwan data center, Kaohsiung-1, being built for 800 VDC. CoreWeave, Lambda, Nebius, Oracle Cloud Infrastructure and Together AI are among the other industry pioneers designing for 800-volt data centers. In addition, Vertiv unveiled its space-, cost- and energy-efficient 800 VDC MGX reference architecture, a complete power and cooling infrastructure design. HPE is announcing product support for NVIDIA Kyber as well as NVIDIA Spectrum-XGS Ethernet scale-across technology, part of the Spectrum-X Ethernet platform.

Moving from traditional 415 or 480 VAC three-phase systems to 800 VDC infrastructure offers increased scalability, improved energy efficiency, reduced materials usage and higher performance capacity in data centers. The electric vehicle and solar industries have already adopted 800 VDC infrastructure for similar reasons.

The Open Compute Project, founded by Meta, is an industry consortium of hundreds of computing and networking providers focused on redesigning hardware technology to efficiently support the growing demands on compute infrastructure.

Vera Rubin NVL144: Designed to Scale for AI Factories

The Vera Rubin NVL144 MGX compute tray offers an energy-efficient, 100% liquid-cooled, modular design. Its central printed circuit board midplane replaces traditional cable-based connections for faster assembly and serviceability, with modular expansion bays for NVIDIA ConnectX-9 800 Gb/s networking and NVIDIA Rubin CPX for massive-context inference.

The NVIDIA Vera Rubin NVL144 offers a major leap in accelerated computing architecture and AI performance. It’s built for advanced reasoning engines and the demands of AI agents.

Its design is grounded in the MGX rack architecture and will be supported by more than 50 MGX system and component partners. NVIDIA plans to contribute the upgraded rack, as well as the compute tray innovations, as an open standard to the OCP consortium.

These standards for compute trays and racks let partners mix and match components in modular fashion and scale faster with the architecture. The Vera Rubin NVL144 rack design features energy-efficient 45°C liquid cooling, a new liquid-cooled busbar for higher performance and 20x more energy storage to keep power delivery steady.

The MGX upgrades to compute tray and rack architecture boost AI factory performance while simplifying assembly, enabling a rapid ramp-up to gigawatt-scale AI infrastructure.

NVIDIA is a leading contributor to OCP standards across multiple hardware generations, including key portions of the NVIDIA GB200 NVL72 system electro-mechanical design. The same MGX rack footprint supports GB300 NVL72 and will support Vera Rubin NVL144, Vera Rubin NVL144 CPX and Vera Rubin CPX for higher performance and fast deployments.

If You Build It, They Will Come: NVIDIA Kyber Rack Server Generation

The OCP ecosystem is also preparing for NVIDIA Kyber, featuring innovations in 800 VDC power delivery, liquid cooling and mechanical design.

These innovations will support the move to rack server generation NVIDIA Kyber — the successor to NVIDIA Oberon — which will house a high-density platform of 576 NVIDIA Rubin Ultra GPUs by 2027.

The most effective way to counter the challenges of high-power distribution is to increase the voltage: for a given power, higher voltage means proportionally lower current, allowing thinner conductors and lower resistive losses. Transitioning from a traditional 415 or 480 VAC three-phase system to an 800 VDC architecture captures these benefits.

The transition underway enables rack server partners to move from 54 VDC in-rack components to 800 VDC, sharply cutting in-rack currents and conversion losses. An ecosystem of direct current infrastructure providers, power system and cooling partners, and silicon makers, all aligned on open standards for the MGX rack server reference architecture, attended the event.

NVIDIA Kyber is engineered to boost rack GPU density, scale up network size and maximize performance for large-scale AI infrastructure. By rotating compute blades vertically, like books on a shelf, Kyber enables up to 18 compute blades per chassis, while purpose-built NVIDIA NVLink switch blades are integrated at the back via a cable-free midplane for seamless scale-up networking.

With 800 VDC, over 150% more power can be transmitted through the same copper, eliminating the need for 200-kg copper busbars to feed a single rack.
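
The physics behind that claim is simple: at a fixed conductor rating, deliverable power scales linearly with voltage, while resistive losses at a fixed power fall with the square of the voltage. Here is a minimal back-of-envelope sketch in Python, using an assumed 1 MW rack load for illustration (not an NVIDIA specification):

```python
# Back-of-envelope: current and relative I^2*R loss when delivering
# 1 MW over a fixed conductor at different distribution voltages.
# The 1 MW figure is an assumption for illustration, and the AC feed
# is treated as a simple DC feed for a rough comparison.
RACK_POWER_W = 1_000_000

baseline_amps = RACK_POWER_W / 800  # 800 VDC as the reference point
for volts in (54, 415, 800):
    amps = RACK_POWER_W / volts             # I = P / V
    rel_loss = (amps / baseline_amps) ** 2  # P_loss = I^2 * R, R fixed
    print(f"{volts:4d} V -> {amps:8.0f} A, I^2R loss x{rel_loss:.1f} vs 800 VDC")
```

The 54 VDC row mirrors the in-rack components mentioned above; moving that distribution to 800 VDC cuts the required current by roughly 15x for the same power.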

Kyber will become a foundational element of hyperscale AI data centers, enabling superior performance, efficiency and reliability for state-of-the-art generative AI workloads in the coming years. NVIDIA Kyber racks offer customers a way to cut copper usage by tons, leading to millions of dollars in cost savings.

NVIDIA NVLink Fusion Ecosystem Expands

In addition to hardware, NVIDIA NVLink Fusion is gaining momentum, enabling companies to seamlessly integrate their semi-custom silicon into highly optimized and widely deployed data center architecture, reducing complexity and accelerating time to market.

Intel and Samsung Foundry are joining the NVLink Fusion ecosystem that includes custom silicon designers, CPU and IP partners, so that AI factories can scale up quickly to handle demanding workloads for model training and agentic AI inference.

  • As part of the recently announced NVIDIA and Intel collaboration, Intel will build x86 CPUs that integrate into NVIDIA infrastructure platforms using NVLink Fusion.
  • Samsung Foundry has partnered with NVIDIA to meet growing demand for custom CPUs and custom XPUs, offering design-to-manufacturing experience for custom silicon.

It Takes an Open Ecosystem: Scaling the Next Generation of AI Factories

More than 20 NVIDIA partners are helping deliver rack servers built on open standards, enabling the gigawatt AI factories of the future.

Learn more about NVIDIA and the Open Compute Project at the OCP Global Summit, taking place at the San Jose Convention Center from Oct. 13-16.

How will the Gigawatt AI Factories specifically enhance the processing of the 20 terabytes of data generated nightly by the Vera Rubin Observatory?


The Vera Rubin Observatory and the Data Deluge

The Vera C. Rubin Observatory, currently under construction in Chile, promises to revolutionize our understanding of the universe. This ambitious project, formerly known as the Large Synoptic Survey Telescope (LSST), will generate an unprecedented volume of astronomical data, estimated at 20 terabytes per night. Managing, processing, and analyzing this massive dataset requires a paradigm shift in computational infrastructure, leading to the development of specialized “Gigawatt AI Factories.” These aren’t just about raw processing power; they’re about intelligent data handling powered by artificial intelligence (AI) and machine learning (ML).
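
For context, a quick calculation shows the sustained throughput that 20 terabytes per night implies; the 10-hour observing window below is an assumption for illustration, not an official figure:

```python
# Sustained data rate implied by ~20 TB per observing night.
# The 10-hour observing window is an assumed figure for illustration.
TB_PER_NIGHT = 20
OBSERVING_HOURS = 10

bytes_per_second = TB_PER_NIGHT * 1e12 / (OBSERVING_HOURS * 3600)
print(f"~{bytes_per_second / 1e6:.0f} MB/s sustained")  # roughly 556 MB/s
```

Every byte of that stream must be ingested, calibrated and searched for changes before the next night's observations begin.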

NVIDIA’s Role: Accelerating Astronomical Discovery

NVIDIA is at the forefront of this technological leap, collaborating with key partners to build these next-generation facilities. The core of these AI Factories revolves around NVIDIA’s high-performance computing (HPC) platforms, specifically leveraging GPU acceleration for both conventional high-throughput computing and cutting-edge AI workloads.

Here’s a breakdown of NVIDIA’s contributions:

* GPU Technology: Utilizing the latest NVIDIA H100 and future generations of GPUs to accelerate data processing pipelines. These GPUs are optimized for both deep learning and scientific computing.

* NVIDIA DGX Systems: Deploying NVIDIA DGX systems – integrated platforms combining GPUs, CPUs, and high-bandwidth networking – to provide a scalable and efficient foundation for the AI Factories.

* Software Stack: Providing a comprehensive software stack, including CUDA, cuDNN, and TensorRT, to optimize AI models and accelerate inference. This is crucial for real-time event detection and anomaly identification within the Rubin Observatory’s data stream (a minimal inference sketch follows this list).

* Networking Solutions: Implementing NVIDIA’s networking technologies, like InfiniBand, to ensure high-speed data transfer between compute nodes and storage systems. This minimizes bottlenecks and maximizes overall performance.
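
To give a flavor of what GPU-accelerated inference on image cutouts looks like in practice, here is a minimal PyTorch sketch. The model architecture, tensor shapes and class labels are hypothetical stand-ins, not the Rubin project’s actual pipeline code:

```python
import torch
import torch.nn as nn

# Hypothetical classifier for difference-image cutouts:
# "real transient" vs. "bogus artifact". Sizes are illustrative only.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 2),
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# A batch of 64x64 cutouts; random tensors stand in for real data.
cutouts = torch.randn(32, 1, 64, 64, device=device)
with torch.inference_mode():
    probs = model(cutouts).softmax(dim=1)  # per-cutout class probabilities
print(probs[:3])
```

In production, a model like this would typically be exported and optimized with TensorRT to reduce inference latency further.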

Partner Ecosystem: Building the Infrastructure

NVIDIA isn’t working in isolation. A robust partner ecosystem is critical to realizing the vision of Gigawatt AI Factories. Key collaborators include:

* Dell Technologies: Providing the overall infrastructure, including servers, storage, and networking solutions, optimized for NVIDIA’s GPUs. Dell’s expertise in large-scale deployments is invaluable.

* Microsoft Azure: Offering cloud-based resources and services to supplement on-premise infrastructure, providing scalability and versatility for data storage and analysis. Cloud computing plays a vital role in handling peak workloads.

* SLAC National Accelerator Laboratory: Leading the development of the Rubin Observatory’s data management system and the associated software pipelines. SLAC’s expertise in scientific data processing is essential.

* Broadcom: Supplying high-speed networking components to enable seamless data transfer within the AI Factories.

The Gigawatt Scale: Powering the Future of Astronomy

The term “Gigawatt” isn’t hyperbole. These AI Factories will consume significant amounts of power, on the order of several megawatts and potentially scaling toward a gigawatt, to operate at the required performance levels. This necessitates innovative approaches to power delivery and cooling (a quick scale check follows the list below).

* Liquid Cooling: Implementing advanced liquid cooling systems to efficiently dissipate heat generated by the high-density GPU clusters. This reduces energy consumption and improves reliability.

* Power Management: Utilizing intelligent power management techniques to optimize energy usage and minimize waste.

* Sustainable Energy Sources: Exploring the use of renewable energy sources, such as solar and wind power, to reduce the carbon footprint of the AI Factories.
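
A quick scale check on that “gigawatt” figure, pure arithmetic:

```python
# Scale check: a steady 1 GW draw, converted to annual energy.
POWER_GW = 1.0
HOURS_PER_YEAR = 24 * 365

annual_twh = POWER_GW * HOURS_PER_YEAR / 1000  # GW * h = GWh; /1000 -> TWh
print(f"~{annual_twh:.2f} TWh per year")       # about 8.76 TWh
```

That is, very roughly, the annual electricity use of a million average U.S. households, which is why the efficiency measures above matter.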

AI Applications in the Vera Rubin Observatory Project

The AI Factories aren’t just about processing speed; they’re about unlocking new scientific discoveries. Here are some key applications of AI and ML in the Rubin Observatory project:

* Real-time Transient Detection: Identifying rapidly changing astronomical events, such as supernovae and gamma-ray bursts, in real-time. This allows for immediate follow-up observations with other telescopes. Time-domain astronomy is a major focus.

* Anomaly Detection: Identifying unusual or unexpected patterns in the data that may indicate new phenomena or errors in the data pipeline (see the isolation-forest sketch after this list).

* Object Classification: Automatically classifying astronomical objects, such as galaxies, stars, and asteroids, based on their characteristics. Image recognition is a core component.

* Weak Gravitational Lensing Analysis: Measuring the subtle distortions of galaxy shapes caused by the gravity of intervening matter. This provides insights into the distribution of dark matter.

* Large-Scale Structure Studies: Analyzing the Rubin Observatory’s survey data to map the distribution of matter across cosmic time, complementing cosmic microwave background (CMB) measurements of the early universe.
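
As a concrete illustration of the anomaly-detection item above, here is a small scikit-learn sketch using an isolation forest over per-source feature vectors; the features and numbers are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-ins for per-source features (e.g., flux change,
# shape moments, color indices); a real pipeline would extract these.
features = rng.normal(size=(10_000, 5))
features[:10] += 6.0  # inject a handful of obvious outliers

detector = IsolationForest(contamination=0.001, random_state=0)
labels = detector.fit_predict(features)  # -1 flags anomalies
print("flagged sources:", int((labels == -1).sum()))
```

The flagged sources would then be queued for human review or automated follow-up, rather than discarded.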

Benefits of AI-Powered Data Processing

The adoption of AI and GPU acceleration offers significant benefits for the Vera Rubin Observatory project:

* Faster Scientific Discovery: Accelerating the pace of astronomical research by enabling faster data processing and analysis.

* Improved Data Quality: Enhancing the accuracy and reliability of the data through AI-powered anomaly detection and error correction.

* New Scientific Insights: Uncovering new patterns and relationships in the data that would be impossible to detect with traditional methods.

* Scalability and Flexibility: Providing a scalable and flexible infrastructure that can adapt to the evolving needs of the project.

* Cost Efficiency: Optimizing energy usage and reducing operational costs through intelligent power management and efficient cooling systems.

