The second big machine learning accelerator talk of the afternoon belongs to AMD. The company’s chip architects are at this year’s show to tell the audience all about the CDNA 4 architecture, which is powering AMD’s new MI350 family of accelerators.
Like its MI300 predecessor, AMD is using 3D die stacking to build up a formidable chip, layering up to 8 accelerator complex dies (XCDs) on top of a pair of I/O dies, making for a 185 billion transistor behemoth.

Large Language Model usage is booming. And AMD is here to ride the wave of demand for hardware.
Models are getting more and more complex. LLMs are getting larger, and reasoning models in particular require longer context lengths.

Keeping these models performant requires even more memory bandwidth and capacity, all while remaining power efficient. And, of course, being able to cluster multiple GPUs to house the largest models.

MI350 was delivered this year, with AMD noting how they’re right on schedule for their roadmap.

MI350 is used in two platforms: MI350X for air-cooled systems, and MI355X for liquid-cooled systems.

MI350 uses 185B transistors, with AMD continuing to use chiplets and die stacking. Like MI300, the compute dies sit on top of the base dies, with 4 compute dies per base die.
The total board power for a liquid-cooled system is 1.4kW.
The I/O die is still on 6nm and AMD says there are few benefits to trying to build the base dies on a smaller process.
Meanwhile the compute dies are built using TSMC’s latest-generation 3nm N3P node, in order to optimize performance-per-watt.

Diving into the I/O die, the Infinity Fabric has been reworked to accommodate the fewer base dies used in MI350. Moving to just 2 base dies has reduced the number of chip-to-chip die crossings. And it allows for wider, lower-clocked D2D connections, improving power efficiency (sketched below).
There are 7 IF links per socket.
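
AMD didn’t put numbers on the D2D links here, but the wider-and-slower rationale is straightforward: bandwidth scales with width times clock, while dynamic power scales with clock times voltage squared, and a slower link can run at a lower voltage. A back-of-the-envelope sketch with purely hypothetical figures:

```python
# Back-of-the-envelope model of a die-to-die (D2D) link.
# All numbers are hypothetical; dynamic power scales roughly
# with frequency x voltage^2, and lower clocks permit lower voltage.

def link_bandwidth(width_bits: int, clock_ghz: float) -> float:
    """Raw bandwidth in GB/s (single data rate, for simplicity)."""
    return width_bits * clock_ghz / 8

def relative_power(clock_ghz: float, voltage: float, width_bits: int) -> float:
    """Relative dynamic power: proportional to width x f x V^2."""
    return width_bits * clock_ghz * voltage ** 2

# A narrow, fast link vs. a 2x-wider link at half the clock.
narrow = dict(width_bits=512,  clock_ghz=4.0, voltage=0.80)
wide   = dict(width_bits=1024, clock_ghz=2.0, voltage=0.70)  # slower -> lower V

for name, link in [("narrow/fast", narrow), ("wide/slow", wide)]:
    bw = link_bandwidth(link["width_bits"], link["clock_ghz"])
    p = relative_power(**link)
    print(f"{name}: {bw:.0f} GB/s, relative power {p:.0f}")
# Same 256 GB/s either way, but the wide/slow link burns roughly 23%
# less power thanks to the lower operating voltage.
```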

Overall, IF 4 offers 2TB/sec more bandwidth than IF 3 used in MI300. And the large memory capacity allows for fewer GPUs overall, cutting down on the amount of synchronization required.

Looking at the cache and memory hierarchy, the LDS (Local Data Share) has been doubled versus MI300.

4 compute dies go on each of the new, larger I/O dies, for 8 compute dies in each MI350. The peak engine clock is 2.4GHz. And each XCD has a 4MB L2 cache that is coherent with the other XCDs.

CDNA 4 architecture nearly doubles throughput for many data types. And it introduces hardware support for FP6 and FP4 data types.
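
The talk doesn’t detail the encodings, but FP4 as used across the industry (the OCP microscaling formats) is an E2M1 format: 1 sign bit, 2 exponent bits, 1 mantissa bit. A minimal sketch of round-to-nearest FP4 quantization, assuming that OCP-style value set:

```python
# Minimal sketch of FP4 (E2M1) quantization, assuming the OCP
# microscaling-style value set: 1 sign, 2 exponent, 1 mantissa bit.
# The 8 representable positive magnitudes:
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x: float) -> float:
    """Round x to the nearest representable FP4 (E2M1) value."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # saturate at the maximum magnitude
    nearest = min(FP4_VALUES, key=lambda v: abs(v - mag))
    return sign * nearest

for x in [0.3, 0.7, 2.4, 5.0, 100.0, -1.2]:
    print(f"{x:7.1f} -> {quantize_fp4(x):4.1f}")
# In practice a per-block scale factor (the "microscaling" part) is
# applied first, so a block of tensor values shares one scale and each
# element only needs these 4 bits.
```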

By nearly doubling the math throughput for AI datatypes, AMD reckons they’re upwards of 2x faster than competitive accelerators.

And here’s an SoC logical block diagram, illustrating how the infinity fabric, infinity cache, memory, and XCDs come together.

Shifting gears, AMD moves to a platform-level view of the hardware, and how those GPUs are used to build up complete systems.
An MI350 can be configured as a single NUMA domain, or as two NUMA domains.
There is a latency hit when going to HBM that’s attached to the other base die, which is where the two-NUMA-domain mode comes in, restricting each XCD’s access to just its local memory.

Separate from the memory partitioning options, the XCDs can also be split into multiple compute partitions, anywhere from a single domain to making each XCD its own GPU. A sketch of how the two partitioning axes combine follows below.
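
To make the two axes concrete, here’s a toy sketch of how the 8 XCDs could map onto memory (NUMA) domains and compute partitions. The NPS-style naming follows AMD’s convention from earlier Instinct parts; the exact mode set MI350 exposes is an assumption here.

```python
# Toy sketch of MI350 partitioning: 8 XCDs sit on 2 base dies
# (4 per die), and can be grouped along two independent axes.
# Mode names follow AMD's NPS* convention from earlier Instinct
# parts; the exact modes MI350 exposes are an assumption.

XCDS_PER_BASE_DIE = 4
BASE_DIES = 2
NUM_XCDS = XCDS_PER_BASE_DIE * BASE_DIES  # 8

def numa_domains(mode: str) -> list[list[int]]:
    """NPS1: all XCDs share one memory pool; NPS2: each base die's
    XCDs are restricted to the HBM attached to that die."""
    if mode == "NPS1":
        return [list(range(NUM_XCDS))]
    if mode == "NPS2":
        return [list(range(d * XCDS_PER_BASE_DIE, (d + 1) * XCDS_PER_BASE_DIE))
                for d in range(BASE_DIES)]
    raise ValueError(mode)

def compute_partitions(n: int) -> list[list[int]]:
    """Split the 8 XCDs into n equal compute partitions
    (n=1 is one big GPU, n=8 makes each XCD its own GPU)."""
    assert NUM_XCDS % n == 0
    size = NUM_XCDS // n
    return [list(range(i * size, (i + 1) * size)) for i in range(n)]

print("NPS2 memory domains:", numa_domains("NPS2"))
print("Finest-grained compute partitions:", compute_partitions(8))
```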

And going bigger still, a multi-socket system allows for up to 8 GPUs on a single baseboard. Infinity Fabric is used to link up the GPUs in an all-to-all topology, while PCIe is used to connect to the host CPU and NICs.
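
This is also where the 7 Infinity Fabric links per socket noted earlier come in: with 8 GPUs on a baseboard, each GPU can dedicate exactly one link to each of its 7 peers. A quick sketch:

```python
# Sketch of the 8-GPU all-to-all Infinity Fabric topology:
# with 7 IF links per socket, each GPU dedicates one link to
# each of its 7 peers, so any two GPUs are a single hop apart.

NUM_GPUS = 8

# Every unordered pair of GPUs gets a direct link.
links = [(a, b) for a in range(NUM_GPUS) for b in range(a + 1, NUM_GPUS)]

links_per_gpu = {g: sum(g in pair for pair in links) for g in range(NUM_GPUS)}

print(f"total links on the baseboard: {len(links)}")  # 28
print(f"links used per GPU: {links_per_gpu}")         # 7 each
assert all(n == 7 for n in links_per_gpu.values())
```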

AMD uses standard OAM modules to house MI350 GPUs.

Up to 8 of these modules go on a universal baseboard (UBB).

MI350X is a drop-in upgrade for existing air-cooled MI300 and MI325 systems.

Meanwhile the liquid-cooled MI355X platform offers even higher performance, at the cost of a 1.4kW TDP per GPU. This is still using OAM modules, but with smaller direct liquid cooling cold plates in place of large air heatsinks.

Both MI350 platforms have the same memory capacity and bandwidth. But they differ in compute performance, reflecting the difference in clockspeeds.
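
Since both boards use the same silicon and memory, peak compute scales essentially linearly with engine clock. A rough sketch; the 2.4GHz figure is from the talk, while the air-cooled clock below is an assumption:

```python
# Rough sketch: with identical silicon and memory, peak compute on the
# two boards differs only by engine clock. The 2.4 GHz figure is from
# the talk (liquid-cooled MI355X); the air-cooled clock is an assumption.

MI355X_CLOCK_GHZ = 2.4  # peak engine clock from the presentation
MI350X_CLOCK_GHZ = 2.2  # hypothetical lower clock for the air-cooled part

ratio = MI350X_CLOCK_GHZ / MI355X_CLOCK_GHZ
print(f"MI350X peak compute ~ {ratio:.0%} of MI355X at every precision")
# Memory capacity and bandwidth are unchanged, so bandwidth-bound
# workloads should see much less than the ~8% compute gap.
```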

And for hyperscalers, liquid-cooled racks can be configured with up to 96 or 128 GPUs per rack, while air-cooled options will support 64 GPUs per rack.
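
That large per-GPU memory is what makes those rack counts interesting: fewer GPUs are needed to house a given model. A quick sizing sketch; the 288GB-per-GPU figure is AMD’s published HBM3E capacity for MI350 (not stated in this section), and the model sizes and overhead factor are hypothetical:

```python
import math

# Quick sizing sketch: GPUs needed to hold a model's weights, and racks
# needed at each density. 288 GB/GPU is AMD's published MI350 HBM3E
# capacity (not stated in this talk); the model sizes are hypothetical.

GPU_MEMORY_GB = 288
RACK_DENSITIES = {"air-cooled": 64, "liquid-cooled": 96, "liquid-cooled (dense)": 128}

def gpus_needed(model_gb: float, overhead: float = 1.3) -> int:
    """GPUs to hold weights plus ~30% headroom for KV cache/activations."""
    return math.ceil(model_gb * overhead / GPU_MEMORY_GB)

for params_b, bytes_per_param in [(405, 2), (1800, 1)]:  # FP16 and FP8 examples
    model_gb = params_b * bytes_per_param
    n = gpus_needed(model_gb)
    racks = {name: math.ceil(n / d) for name, d in RACK_DENSITIES.items()}
    print(f"{params_b}B params @ {bytes_per_param}B/param -> {n} GPUs, racks: {racks}")
```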

And when you need a whole rack, AMD offers a reference rack solution where all of the major chips are from AMD: GPU, CPU, and scale-out NICs.

AMD’s ROCm software has slowly been coming into its own. And software-based gains are just as important as hardware gains in boosting overall performance.



And here are a few slides looking at performance for both inference and training.

Once again, AMD reiterates its roadmap, and its ability to reliably deliver on it. And that will extend to MI400 next year.


MI400 will deliver up to 10x more performance for frontier AI models next year.
And that’s the MI350/CDNA 4 recap at Hot Chips 2025. MI350 has started shipping to AMD’s partners, so the company is very eager to see what it can do as manufacturing ramps up over the next few quarters.