d-Matrix Corsair In-Memory Computing For AI Inference at Hot Chips 2025

The second machine learning presentation of the afternoon comes from d-Matrix. The company specializes in hardware for AI inference, and of late has been tackling how to improve inference performance by using in-memory computing. Along those lines, the company is presenting its Corsair in-memory computing chiplet architecture at Hot Chips. As a quick note: we covered d-Matrix's Pavehawk, which brings 3DIMC to challenge HBM for AI inference, just a few days ago.

Not to be confused with the PC component maker of the same name, Corsair is what d-Matrix claims is the most efficient inference platform on the market, thanks to its combination of in-memory computing and low-latency interconnects.

“Rethinking” AI Inference

LLM Token-generation is Memory Bound

Generating each token in an LLM is memory bound: all of the model's weights need to be read for every token. Batching allows those weight fetches to be amortized across multiple requests.

d-Matrix’s goal is to reach saturation at moderate batch sizes in order to hit specific latency targets.
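
To make that tradeoff concrete, here is a minimal back-of-envelope sketch of memory-bound decoding; the model size and bandwidth figures are illustrative assumptions, not d-Matrix numbers:

```python
# Back-of-envelope model of memory-bound token generation. The model size
# and bandwidth below are illustrative assumptions, not d-Matrix figures.

WEIGHT_BYTES = 70e9  # hypothetical 70B-parameter model at 1 byte per weight
MEM_BW = 2e12        # hypothetical 2 TB/s of aggregate memory bandwidth

def tokens_per_second(batch_size: int) -> float:
    """Each decode step must stream all weights once regardless of batch
    size, so batching amortizes the weight reads across more tokens."""
    step_time_s = WEIGHT_BYTES / MEM_BW
    return batch_size / step_time_s

for b in (1, 8, 32):
    print(f"batch={b:2d}: {tokens_per_second(b):7.0f} tok/s, "
          f"step time {1e3 * WEIGHT_BYTES / MEM_BW:.0f} ms")
```

Step time, and hence per-token latency, stays flat while throughput scales with batch size, which is why saturating at moderate batch sizes is the lever d-Matrix pulls to meet latency targets.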

Voice is latency-critical and even more Memory Bound

Real-time voice requires very low latency, making it a good target for d-Matrix's technology.

AI Agents: SLM & Rise of Inference-Time Compute

AI agents fall into the same boat, with multiple small models being executed to accomplish the desired task.

d-Matrix Corsair: Chiplet-based Inference Acceleration Platform

And here is Corsair, d-Matrix's accelerator. The card carries two chips, each with 4 chiplets, built on TSMC's 6nm process, with 2GB of SRAM spread between all of the chiplets. This is a PCIe 5.0 x16 card, so it can be easily added to standard servers.

Meanwhile at the top of the card are bridge connectors to tie together multiple cards.

Each chiplet interfaces with LPDDR5X memory, with 256GB of LPDDR5X per card (32GB per chiplet).

Corsair Chiplet

And here is how the chiplets are organized into slices. Around the edge are the LPDDR and die-to-die (D2D) connections, as well as 16 lanes of PCIe.

Card Level Scaleup: 16 Chiplet Hierarchical All-to-All

Two cards can be passively bridged together, making for a 16-chiplet cluster with all-to-all connectivity.

System and Scaleup Architecture

8 cards in turn can go into a standard server, such as a Supermicro X14. In this example there are also 4 NICs to offer scale-out capabilities.

Corsair Key Pillars – Low Latency, Batched Throughput

Corsair was built for low-latency, batched-throughput inference.

It supports block floating point number formats, and energy efficiency comes in at 38 TOPS per Watt.

Corsair Chiplet Built Using Modular Hardware Blocks

The dispatch engine within each chiplet is based on RISC-V. Each chiplet is split up into 4 quads, and there is about 1TB/sec of D2D bandwidth.

Energy Efficient DIMC Architecture

Diving deeper, the matrix multiplier inside Corsair can perform a 64×64 matmul at INT8 precision, or a 64×128 matmul at INT4.
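
Functionally, one such tile computes something like the following; NumPy here is just a stand-in for the arithmetic, since the hardware performs the multiply-accumulate inside the SRAM array itself:

```python
import numpy as np

# Functional stand-in for one DIMC matmul: 64x64 at INT8 (at INT4, the same
# array presents as a 64x128 tile). NumPy models only the arithmetic.
rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)  # activations
W = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)  # resident weights

# Widen to int32 before multiplying: 64 products of int8*int8 need ~21 bits,
# so int32 accumulation cannot overflow.
C = A.astype(np.int32) @ W.astype(np.int32)
print(C.shape, C.dtype)  # (64, 64) int32
```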

Corsair supports 5x Weight Compression

Corsair also supports floating-point formats with scale factors, as well as structured sparsity, though the latter is only used for compression rather than to skip computation. Overall, this gets d-Matrix to 5x weight compression.
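
As a rough illustration of how a narrower format and sparsity multiply into an overall ratio, here is a toy accounting; the 2:4 sparsity pattern, bit widths, and index overhead are all assumptions on our part rather than d-Matrix's disclosed scheme:

```python
# Toy accounting of how a narrower format plus structured sparsity multiply
# into an overall compression ratio. The 2:4 pattern, bit widths, and index
# overhead below are assumptions; d-Matrix's disclosed overall figure is 5x.

BASELINE_BITS = 16  # FP16 weights as the uncompressed reference

def bits_per_weight(weight_bits=4, block=32, scale_bits=8, keep=2, group=4):
    """Average storage per original weight: scaled low-bit formats store one
    shared scale per block; 2:4-style sparsity stores only `keep` of every
    `group` weights plus a small positional index per kept weight."""
    index_bits = 1
    per_kept = weight_bits + scale_bits / block + index_bits
    return per_kept * keep / group

print(f"~{BASELINE_BITS / bits_per_weight():.1f}x vs FP16")  # ~6.1x here
```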

Core Architecture

All 8 matrix units can be tied together.

Dataflow with Block Floating-Point Numerical Formats

The design is dataflow-based: results are accumulated on the fly and then converted to the desired output format.
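
For reference, block floating point amounts to one shared exponent per small block of values, with a narrow integer mantissa per element. A minimal sketch, with block size and mantissa width as assumed parameters rather than d-Matrix's exact format:

```python
import numpy as np

# Minimal block floating-point sketch: each block of values shares one
# exponent, with a narrow integer mantissa per element. Block size and
# mantissa width are assumed parameters, not d-Matrix's exact format.
def bfp_quantize(x: np.ndarray, block: int = 16, mant_bits: int = 8):
    x = x.reshape(-1, block)
    # One shared exponent per block, sized to the largest magnitude.
    exp = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** (exp - (mant_bits - 1))
    lo, hi = -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1
    return np.clip(np.round(x / scale), lo, hi).astype(np.int8), scale

def bfp_dequantize(mant: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (mant * scale).reshape(-1)

x = np.random.default_rng(1).normal(size=64)
mant, scale = bfp_quantize(x)
print(f"max abs error: {np.abs(bfp_dequantize(mant, scale) - x).max():.4f}")
```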

Memory System: Global Memory, Stash, and LPDDR

As for memory, there is a stash memory that feeds the cores, with each stash being 6MB. There are 2 LPDDR channels per chiplet.

Scaling Challenges for Large-Model Inference

When memory bandwidth is high, the latency of collective operations becomes increasingly critical: with weights streaming quickly, there is less compute time available to hide cross-chip synchronization.

Corsair Scaleup – Hardware-Software Codesign

So in order to make a 16-chiplet all-to-all connection work, d-Matrix got D2D latency down to 115ns. Even going through PCIe switches, they can still hold latency to 650ns.
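
Those hop latencies make for an easy mental model of small-message exchanges. A toy budget, assuming single-hop routing over the all-to-all links and a hypothetical ~125GB/s per link:

```python
# Toy latency budget for exchanging a small message between chiplets, using
# the hop latencies quoted in the talk. Single-hop routing (thanks to the
# all-to-all links) and the per-link bandwidth are assumptions.

D2D_NS = 115   # chiplet-to-chiplet hop (from the presentation)
PCIE_NS = 650  # hop through a PCIe switch (from the presentation)
LINK_BYTES_PER_NS = 125  # assumed ~125 GB/s per link (1 GB/s = 1 byte/ns)

def exchange_ns(msg_bytes: int, hop_ns: float) -> float:
    """One point-to-point exchange: hop latency plus serialization time."""
    return hop_ns + msg_bytes / LINK_BYTES_PER_NS

for size in (4_096, 65_536):
    print(f"{size // 1024:3d} KiB: on-card ~{exchange_ns(size, D2D_NS):5.0f} ns, "
          f"via PCIe switch ~{exchange_ns(size, PCIE_NS):5.0f} ns")
```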

Package Level Scaleup: 4 Chiplet All-to-All

Another shot of Corsair chiplets on an organic package.

Transparent NIC – Ethernet Scale-Out

And here is the NIC that d-Matrix uses for scale-out fabrics, which adds 2µs of latency.

Rack Level Scale-Out: Multi-node and Multi-Rack

Using this, d-Matrix can rack and stack many servers.

Aviator Software: Easy to use and optimized for Corsair

Aviator Software: Codesigned for LLM Acceleration

And no inference accelerator would be complete without a matching software stack to enable the hardware and its features.

Power Efficiency (TOPS/W)

And here's a look at power consumption: 275W at 800MHz, while running at 1.2GHz chugs 550W. Higher clockspeeds are worse for overall efficiency, but not immensely so.
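
A quick check on what those two operating points mean for efficiency, assuming throughput scales linearly with clock (an assumption on our part):

```python
# Relative efficiency of the two quoted operating points, assuming
# throughput scales linearly with clock (an assumption on our part).
speedup = 1.2 / 0.8   # 800 MHz -> 1.2 GHz is 1.5x the clock
power = 550 / 275     # ...for 2x the power
print(f"relative TOPS/W at 1.2 GHz: {speedup / power:.2f}x")  # 0.75x
```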

Performance and Flexibility for Use Cases

And here are some Llama3 performance figures. The time per output token is just 2ms even for the larger Llama3-70B.
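
It's worth a quick sanity check on what 2ms per token implies. Assuming roughly 4-bit weights after compression (an assumption on our part), streaming the whole model each token works out as follows, which hints at why the on-chip SRAM and the scale-up fabric matter:

```python
# Sanity check on the 2 ms/token figure for Llama3-70B: what effective
# weight-read bandwidth does it imply? Bytes-per-weight is an assumption.

PARAMS = 70e9
BYTES_PER_WEIGHT = 0.5  # assumed ~4-bit weights after compression
TOKEN_S = 2e-3

weight_bytes = PARAMS * BYTES_PER_WEIGHT
print(f"~{weight_bytes / 1e9:.0f} GB of weights per token -> "
      f"~{weight_bytes / TOKEN_S / 1e12:.1f} TB/s effective read bandwidth")
```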

Stacking of logic on DRAM interposer in 3D

Underneath the chip, d-Matrix uses a silicon interposer with capacitors for power reliability reasons. d-Matrix goes one step further and 3D stacks DRAM beneath its Corsair chiplets, keeping the local memory very, very close.

3D DRAM Test Vehicle

And they have built a prototype 3D DRAM test vehicle with 36 micron die-to-die stacking. The logic die sits on top, while the DRAM sits underneath.

How does d-Matrix make stacked DRAM + logic work? By keeping the heat density under 0.3W/mm², which avoids heating up the DRAM too much.
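
That ceiling translates directly into a per-die power budget. A trivial example, with the die area being a hypothetical value rather than a disclosed one:

```python
# The 0.3 W/mm^2 ceiling is from the talk; the die area is a hypothetical
# example, not a disclosed Corsair/Pavehawk figure.
MAX_W_PER_MM2 = 0.3
die_area_mm2 = 100.0

print(f"a {die_area_mm2:.0f} mm^2 logic die over DRAM gets a budget of "
      f"~{MAX_W_PER_MM2 * die_area_mm2:.0f} W")  # ~30 W
```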
