AMD Dives Deep on CDNA 4 Architecture and MI350 Accelerator at Hot Chips 2025

AMD MI350 Accelerator

The second big machine learning accelerator talk of the afternoon belongs to AMD. The company’s chip architects are at this year’s show to tell the audience all about the CDNA 4 architecture, which is powering AMD’s new MI350 family of accelerators.

Like its MI300 predecessor, AMD is using 3D die stacking to build up a formidable chip, layering 8 accelerator complex dies (XCDs) on top of a pair of I/O dies, making for a 185 billion transistor behemoth.

Large Language Models: Explosive Growth

Large Language Model usage is booming. And AMD is here to ride the wave of demand for hardware.

Models are getting more and more complex. LLMs are growing in size, and reasoning models also require longer context lengths.

GenAI Needs

Keeping these models performant requires even more memory bandwidth and capacity, not to mention remaining power efficient. And, of course, being able to cluster multiple GPUs to house the largest models.

Instinct MI350 Series

MI350 was delivered this year, with AMD noting how they’re right on schedule for their roadmap.

MI350 Architecture Enhancements

MI350 is used in two platforms: MI350X for air cooled systems, and MI355X for liquid cooled systems.

MI350 GPU

MI350 uses 185B transistors, with AMD continuing to use chiplets and die stacking. Like MI300, the compute dies sit on top of the base dies, with 4 compute dies per base die.

The total board power for a liquid cooled system is 1.4kW.

The I/O die is still on 6nm and AMD says there are few benefits to trying to build the base dies on a smaller process.

Meanwhile the compute dies are built using TSMC’s latest-generation 3nm N3P node, in order to optimize performance-per-watt.

MI350 GPU Chiplets

Diving into the I/O die, the Infinity Fabric has been reworked to accommodate the smaller number of base dies used in MI350. Going from four base dies to two reduces the number of chip-to-chip die crossings, and it allows for wider, lower-clocked die-to-die (D2D) connections, improving power efficiency.
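
To make the wide-and-slow tradeoff concrete: link bandwidth is simply width times transfer rate, so a wider link can deliver the same bandwidth at a lower clock, where switching power is lower. A minimal sketch, using made-up link parameters rather than MI350’s actual D2D figures:

```python
# Illustrative sketch of the wide-and-slow D2D tradeoff AMD describes.
# All numbers here are hypothetical, chosen only to show the relationship;
# they are not MI350's actual die-to-die link parameters.

def link_bandwidth_gbps(width_bits: int, clock_gtps: float) -> float:
    """Raw link bandwidth in GB/s: width (bits) x transfer rate (GT/s) / 8."""
    return width_bits * clock_gtps / 8

narrow_fast = link_bandwidth_gbps(width_bits=1024, clock_gtps=8.0)  # 1024 GB/s
wide_slow   = link_bandwidth_gbps(width_bits=2048, clock_gtps=4.0)  # 1024 GB/s

# Same bandwidth, but the wider link runs at half the clock. Since dynamic
# power scales roughly with frequency (and voltage can also drop at lower
# clocks), the wide/slow configuration is the more power-efficient one.
assert narrow_fast == wide_slow
print(f"narrow+fast: {narrow_fast:.0f} GB/s, wide+slow: {wide_slow:.0f} GB/s")
```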

There are 7 IF links per socket.

MI350 GPU Metrics

Overall, IF 4 offers 2TB/sec more bandwidth than IF 3 used in MI300. And the large memory capacity allows for fewer GPUs overall, cutting down on the amount of synchronization required.

MI350 GPU Cache & Hierarchy

Looking at the cache and memory hierarchy, the local data share (LDS) has been doubled versus MI300.

Accelerator Complex Die (XCD)

Four compute dies sit on each of the new, larger I/O dies, for 8 compute dies in each MI350. The peak engine clock is 2.4GHz, and each XCD has a 4MB L2 cache that is coherent with the other XCDs.
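
Those figures invite some back-of-the-envelope math. The sketch below multiplies out XCDs x CUs x ops-per-clock x clock to get a peak throughput figure; note that only the XCD count and 2.4GHz clock come from the talk, while the CU count and FP8 ops-per-CU-per-clock values are assumptions for illustration:

```python
# Back-of-the-envelope peak throughput. The XCD count and 2.4 GHz engine
# clock come from AMD's talk; the CUs-per-XCD and FP8 matrix ops per CU
# per clock below are ASSUMPTIONS for illustration, not published specs.

XCDS               = 8        # from the talk: 8 XCDs per MI350
CLOCK_GHZ          = 2.4      # from the talk: peak engine clock
CUS_PER_XCD        = 32       # assumption
FP8_OPS_PER_CU_CLK = 16384    # assumption

total_cus   = XCDS * CUS_PER_XCD
peak_pflops = total_cus * FP8_OPS_PER_CU_CLK * CLOCK_GHZ * 1e9 / 1e15
print(f"{total_cus} CUs -> ~{peak_pflops:.1f} PFLOPS FP8 (illustrative)")
```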

Supported Data Formats

The CDNA 4 architecture nearly doubles throughput for many data types. And it introduces hardware support for FP6 and FP4 data types.
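
To make FP4 concrete, here is a minimal encoder/decoder sketch for a 4-bit E2M1 format (1 sign, 2 exponent, 1 mantissa bit), following the OCP microscaling layout; whether CDNA 4’s FP4 path uses this exact bit layout is an assumption on our part:

```python
# Minimal sketch of an FP4 (E2M1) encoder/decoder: 1 sign bit, 2 exponent
# bits, 1 mantissa bit. This follows the OCP microscaling E2M1 layout;
# treating it as CDNA 4's exact FP4 encoding is an assumption.

def fp4_decode(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) to its real value."""
    sign = -1.0 if code & 0x8 else 1.0
    exp  = (code >> 1) & 0x3   # 2 exponent bits
    man  = code & 0x1          # 1 mantissa bit
    if exp == 0:               # subnormal: 0.0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

def fp4_encode(x: float) -> int:
    """Round x to the nearest representable E2M1 value, return its code."""
    return min(range(16), key=lambda c: abs(fp4_decode(c) - x))

# The entire positive range is just {0, 0.5, 1, 1.5, 2, 3, 4, 6}, which is
# why FP4 tensors are normally paired with per-block scale factors.
print(sorted(fp4_decode(c) for c in range(8)))
print(fp4_decode(fp4_encode(2.7)))   # rounds to 3.0
```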

Supported Data Formats Performance Comparison

By nearly doubling the math throughput for AI datatypes, AMD reckons they’re upwards of 2x faster than competitive accelerators.

SoC Block Diagram

And here’s an SoC logical block diagram, illustrating how the Infinity Fabric, Infinity Cache, memory, and XCDs come together.

Flexible GPU Partitioning

Shifting gears, AMD moves to a platform-level view of the hardware, and how these GPUs are used to build up complete systems.

An MI350 can be configured as a single NUMA domain, or as two NUMA domains.

There is a latency hit when going to HBM attached to the other base die, which is where the two NUMA domain mode comes in, restricting an XCD’s access to just its local memory.

Flexible GPU Partitioning, Cont

Separate from the memory partitioning options, the XCDs can also be split into multiple compute partitions. Anywhere from a single domain to making each XCD its own GPU.
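
A small sketch of what that range of compute partitioning looks like, splitting the 8 XCDs into 1, 2, 4, or 8 logical GPUs; the exact mode names AMD uses are not covered here, so this only illustrates the groupings:

```python
# Sketch of the compute-partitioning options described above: 8 XCDs can
# be exposed as one logical GPU, as several, or as eight one-XCD GPUs.
# This illustrates the groupings only; AMD's actual partition-mode names
# and configuration interface are not taken from the talk.

XCDS = 8

def partitions(group_size: int) -> list[list[int]]:
    """Split the 8 XCDs into equal groups of `group_size` XCDs each."""
    assert XCDS % group_size == 0
    return [list(range(i, i + group_size)) for i in range(0, XCDS, group_size)]

for size in (8, 4, 2, 1):   # 1, 2, 4, or 8 logical GPUs
    print(f"{XCDS // size} logical GPU(s):", partitions(size))
```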

Infinity Platform

And going bigger still, a multi-socket system allows for up to 8 GPUs on a single base board. Infinity Fabric is used to link up the GPUs in an all-to-all topology, while PCIe is used to connect to the host CPU and NICs.
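
Note that the 7 Infinity Fabric links per socket mentioned earlier are exactly what a fully connected 8-GPU board requires, since each GPU needs a direct link to the other 7. A trivial sketch of that topology:

```python
# An all-to-all topology among N GPUs needs a direct link from each GPU to
# the other N-1, which for an 8-GPU board matches the 7 Infinity Fabric
# links per socket cited in the talk.

N_GPUS = 8
links = {(a, b) for a in range(N_GPUS) for b in range(N_GPUS) if a < b}

links_per_gpu = {g: sum(g in pair for pair in links) for g in range(N_GPUS)}
assert all(n == N_GPUS - 1 for n in links_per_gpu.values())   # 7 links each
print(f"{len(links)} total GPU-to-GPU links, {N_GPUS - 1} per GPU")
```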

Air Cooled OAM

AMD uses standard OAM modules to house MI350 GPUs.

Air Cooled UBB

With up to 8 of these modules on a universal baseboard (UBB).

Leveraging Existing DC Infrastructure

MI350X is a drop-in upgrade for existing air-cooled MI300 and MI325 systems.

Liquid Cooling

Meanwhile the liquid cooled MI355X platform offers even higher performance, at a cost of a 1.4kW TDP per GPU. This is still using OAM modules, but with smaller direct liquid cooling cold plates in place of large air heatsinks.

MI350X and MI355X Platforms

Both MI350 platforms have the same memory capacity and bandwidth. But they differ in compute performance, reflecting the difference in clockspeeds.

Rack-Scale Solutions

And for hyperscalers, liquid cooled racks can be configured with up to 96 or 128 GPUs per rack, while air-cooled options will support 64 GPUs per rack.
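
Some quick arithmetic on what those liquid-cooled configurations imply for rack power, using the 1.4kW per-GPU figure from earlier (GPU board power only, ignoring CPUs, NICs, and other overhead):

```python
# Rough rack-power arithmetic using figures from the talk: 1.4 kW per
# liquid-cooled GPU, in rack configurations of 96 or 128 GPUs. This counts
# GPU board power only; CPUs, NICs, and cooling overhead come on top.

GPU_POWER_KW = 1.4   # liquid-cooled MI355X board power, per the talk

for gpus in (96, 128):
    print(f"{gpus:3d} GPUs -> ~{gpus * GPU_POWER_KW:.0f} kW of GPU power per rack")
```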

Rack Infrastructure

And when you need a whole rack, AMD offers a reference rack solution where all of the major chips are from AMD: GPU, CPU, and scale-out NICs.

ROCm 7

AMD’s ROCm software has slowly been coming into its own. And software-based gains are just as important as hardware gains in boosting overall performance.

Inference Performance
Large Inference Performance
GPU Training Performance

And here are a few slides looking at performance for both inference and training.

Annual Roadmap

Once again, AMD reiterates its roadmap, and its ability to reliably deliver on it. And that will extend to MI400 next year.

Accelerating AI Compute Performance
Instinct MI400

MI400 will deliver up to 10x more performance for AI frontier models next year.

And that’s the MI350/CDNA 4 recap at Hot Chips 2025. MI350 has started shipping to AMD’s partners, so the company is very eager to see what it can do as manufacturing ramps up over the next few quarters.
