Hot Chips Back in 2023, Nvidia’s superchip architecture introduced a new programming model for accelerated workloads by coupling the CPU to the GPU via a high-speed NVLink fabric that makes PCIe feel positively glacial.
The only problem? Outside of the datacenter or cloud, there weren’t a lot of ways for developers to take advantage of it.
Nvidia’s Project Digits — since rebranded as DGX Spark — aims to change that by bringing a miniaturized version of Nvidia’s superchip architecture called the GB10 to the masses — or at least to devs with north of $2,999 burning a hole in their pockets.
At Hot Chips this week, GB10 lead architect Andi Skende offered a closer look at its architecture.
Fabbed on TSMC’s 3nm manufacturing tech, the GB10 is composed of two distinct compute dies: a CPU tile designed by MediaTek, and a GPU tile designed by Nvidia. These two dies are stitched together using TSMC’s 2.5D advanced packaging tech and connected via Nvidia’s proprietary NVLink Chip-to-Chip interconnect, which provides 600GB/s of bidirectional bandwidth.

Here’s a breakdown of the IP making up the GB10. Everything in orange was developed by MediaTek, while the green shows elements built by Nvidia
The CPU die, or S-die, houses 20 Armv9.2 cores across two clusters in a big.LITTLE-style arrangement, with an equal split of Cortex-X925 performance cores and Cortex-A725 efficiency cores. These compute clusters are fed by 32MB of L3 cache (16MB per cluster) along with an additional 16MB of L4 cache designed to smooth out communication between the GB10’s compute engines.
Details on the GB10’s graphics die, or G-die, unfortunately remain rather thin. Nvidia tells us that the chip will deliver roughly 1 petaFLOP of peak FP4 performance with sparsity, or about 31 teraFLOPS of single-precision (FP32) compute.
That puts the GB10’s, and by extension the Spark’s, AI performance roughly on par with an RTX 5070, which we’ll note has an MSRP of about $550. However, floating point performance doesn’t tell the full story.
For one, the GB10 is a lot more power-efficient. While the RTX 5070 has a TDP of 250 watts, the GB10 is rated for just 140 watts.
The GB10 also packs 128GB of coherent unified memory, compared to the 5070’s 12GB of VRAM. Ample memory capacity is essential for the kinds of workloads the DGX Spark is designed for, as even at FP4 precision, model weights still require about 500MB for every billion parameters.
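To put that in concrete terms, here's an illustrative back-of-the-envelope calculation (weights only; the KV cache and activations add overhead on top of this):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold model weights, in GB (1GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# At FP4, each weight occupies half a byte: ~500MB per billion parameters
print(weight_memory_gb(1, 4))    # 0.5 GB
print(weight_memory_gb(12, 16))  # 24.0 GB: a 12B model at FP16 already overflows a 5070's 12GB
print(weight_memory_gb(200, 4))  # 100.0 GB: fits within the GB10's 128GB
```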
Unlike its bigger siblings, the GB200 and GB300, the GB10 doesn’t use ultra-fast HBM. Instead, due to power and, no doubt, cost constraints, Nvidia has opted for LPDDR5x memory clocked at a relatively speedy 9,400MT/s.
Paired with the CPU die’s 256-bit memory bus, that works out to somewhere between 273GB/s and 301GB/s of bandwidth. As a reminder, memory bandwidth is a key indicator of inference performance: the faster the memory, the faster the chip can churn out tokens. The choice of LPDDR shows Nvidia has clearly had to compromise between memory capacity and bandwidth here.
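For the curious, the top of that range is simply the interface's theoretical peak; a quick sanity check:

```python
# Theoretical peak bandwidth of a 256-bit LPDDR5x interface at 9,400MT/s
bus_width_bytes = 256 / 8     # 32 bytes moved per transfer
transfers_per_sec = 9_400e6   # 9,400 mega-transfers per second

peak = bus_width_bytes * transfers_per_sec / 1e9
print(f"{peak:.1f} GB/s")     # 300.8 GB/s; sustained real-world figures sit nearer 273GB/s
```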
Having said that, the DGX Spark is designed for a lot more than just running local models. Nvidia is positioning the miniature AI workstation as a development platform for prototyping and model fine-tuning, in addition to local inference.
Fine-tuning, as we’ve previously explored, is a particularly compute- and memory-intensive task, even when using Low-Rank Adaptation (LoRA) and quantization to keep resource requirements in check. In this scenario, compute and memory capacity matter more than bandwidth.
According to Nvidia, the Spark’s 128GB of LPDDR5x is enough to fine-tune a 70-billion-parameter model and run inference on models up to 200 billion parameters.
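For a sense of what that workflow looks like in code, here's a minimal QLoRA-style sketch using Hugging Face's transformers and peft libraries. The model name, target modules, and hyperparameters are illustrative placeholders, not a configuration Nvidia has published:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit quantized weights, keeping the footprint
# near the ~0.5 bytes-per-parameter mark discussed above
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",  # placeholder: any ~70B causal LM
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

# Train small low-rank adapters instead of the full weight matrices
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total params
```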
If you do need more capacity, the GB10 comes paired with a ConnectX-7 NIC sporting a pair of 200GbE ports, which allow workloads to be distributed across two DGX Sparks, effectively doubling the memory and compute available for fine-tuning and inference.
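Under the hood, that pairing behaves like any other two-node job. A minimal sketch, assuming the two Sparks' 200GbE ports are linked and each machine launches the script with PyTorch's torchrun:

```python
import torch.distributed as dist

# Hypothetical launch, run on each Spark:
#   torchrun --nnodes=2 --nproc-per-node=1 --node-rank=<0|1> \
#            --master-addr=<first Spark's IP> --master-port=29500 train.py
# torchrun supplies the rank/world-size environment variables; NCCL carries
# GPU-to-GPU traffic over the ConnectX-7 link.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} ready")

# ...from here, shard the model or data across both nodes (FSDP,
# tensor parallelism, and so on)...

dist.destroy_process_group()
```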
Perhaps more importantly, because the GB10 is based on the same technologies as its datacenter siblings, workloads developed on the miniaturized workstation don’t need to be refactored for production deployment. ®