The second machine learning presentation of the afternoon comes from d-Matrix. The company specializes in hardware for AI inference, and lately has been tackling how to improve inference performance using in-memory computing. Along those lines, the company is presenting its Corsair in-memory computing chiplet architecture at Hot Chips. As a quick note: we covered d-Matrix Pavehawk Brings 3DIMC to Challenge HBM for AI Inference a few days ago.
Not to be confused with the PC hardware company of the same name, d-Matrix claims that Corsair is the most efficient inference platform on the market, thanks to its combination of in-memory computing and low-latency interconnects.


Generating each token in an LLM is memory bound: all of the model's weights need to be read for every token. Batching allows those weight fetches to be amortized across multiple requests.
d-Matrix’s goal is to reach saturation at moderate batch sizes in order to hit specific latency targets.
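
As a rough illustration of why that works, here's a back-of-envelope sketch (with purely illustrative numbers, not d-Matrix's) of how batching amortizes the weight reads that dominate each decode step:

```python
# Back-of-envelope sketch (illustrative numbers, not d-Matrix's): in the
# memory-bound decode phase, every step has to stream all of the weights
# from memory once, no matter how many requests are batched together.
def decode_step_time_s(weight_bytes: float, mem_bw_bytes_per_s: float) -> float:
    """Time for one decode step, assuming it is purely weight-read bound."""
    return weight_bytes / mem_bw_bytes_per_s

weights = 70e9 * 1        # hypothetical 70B-parameter model at 1 byte per weight
bandwidth = 2e12          # hypothetical 2 TB/s of aggregate memory bandwidth
step = decode_step_time_s(weights, bandwidth)   # ~35 ms per step

for batch in (1, 4, 16, 64):
    # The same weight read is shared by every request in the batch, so
    # aggregate tokens/sec scales with batch size until compute (rather
    # than memory) becomes the bottleneck.
    print(f"batch {batch:3d}: ~{batch / step:6.0f} tokens/s aggregate")
```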

Real-time voice requires very low latency, making it a good target for d-Matrix's technology.

AI agents fall into the same boat, with multiple small models being executed to accomplish the desired task.

And here is Corsair, d-Matrix's accelerator. Two chips, each with 4 chiplets. Built on TSMC 6nm, with 2GB of SRAM between all of the chiplets. This is a PCIe 5.0 x16 card, so it can be easily added to standard servers.
Meanwhile at the top of the card are bridge connectors to tie together multiple cards.
Each chiplet interfaces with LPDDR5X, with 256GB of LPDDR5X in total per card.

And here is how the chiplets are organized into slices. Around the edge are LPDDR and D2D connections, as well as 16 lanes of PCIe.

Two cards can be passively bridged together, making for a 16-chiplet cluster with all-to-all connectivity.

8 cards in turn can go into a standard server, such as a Supermicro X14. In this example there are also 4 NICs to offer scale-out capabilities.

Corsair was built for low-latency batch throughput inference.
The chips support block floating point number formats. Energy efficiency is 38 TOPS per Watt.
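
For those unfamiliar with block floating point, the basic idea is that a group of values shares a single exponent while each value keeps only a small mantissa. Below is a minimal sketch of a generic shared-exponent scheme; the block size and mantissa width are illustrative, not d-Matrix's actual format:

```python
import numpy as np

# Minimal sketch of block floating point (BFP) quantization, assuming a
# generic "shared exponent per block" scheme. Block size and mantissa
# width are illustrative, not d-Matrix's actual format.
def bfp_quantize(x: np.ndarray, block: int = 16, mant_bits: int = 8):
    x = x.reshape(-1, block)
    # One shared exponent per block, derived from the largest magnitude value.
    exp = np.ceil(np.log2(np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-38)))
    scale = 2.0 ** (exp - (mant_bits - 1))
    mant = np.clip(np.round(x / scale),
                   -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1)
    return mant.astype(np.int32), scale      # integer mantissas + per-block scale

def bfp_dequantize(mant, scale):
    return mant * scale

vals = np.random.randn(64).astype(np.float32)
q, s = bfp_quantize(vals)
print(np.abs(bfp_dequantize(q, s).ravel() - vals).max())  # small quantization error
```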

The dispatch engine within each chiplet is based on RISC-V. Each chiplet is split up into 4 quads, with about 1TB/sec of D2D bandwidth.

Diving deeper, the matrix multiplier inside Corsair can perform a 64×64 matmul with INT8, or a 64×128 matmul with INT4.
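
As a rough mental model of what one of those tile operations does, here's a small emulation of a 64×64 INT8 matmul that accumulates into a wide integer type and then converts to the output format. The dimensions come from the slide; everything else is a generic sketch, not d-Matrix's actual pipeline:

```python
import numpy as np

# Rough emulation of a single 64x64 INT8 tile: multiply-accumulate into a
# wide integer accumulator (as a fixed-function MAC array would), then
# convert/rescale to the desired output format at the end.
TILE = 64
a = np.random.randint(-128, 128, size=(TILE, TILE), dtype=np.int8)
b = np.random.randint(-128, 128, size=(TILE, TILE), dtype=np.int8)

acc = a.astype(np.int32) @ b.astype(np.int32)   # accumulate on the fly in INT32
out = (acc * 0.01).astype(np.float32)           # hypothetical dequant scale -> FP output
print(out.shape, out.dtype)                     # (64, 64) float32
```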

Corsair also supports FP formats with scale factors, as well as structured sparsity, though sparsity is only used for compression. Overall, this gets d-Matrix to 5x compression.
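
To illustrate how structured sparsity acts purely as a compression scheme, here's a sketch assuming a common 2:4 pattern, where only the two largest of every four values are stored along with small position metadata. The talk doesn't spell out d-Matrix's exact pattern, so treat this as generic:

```python
import numpy as np

# Illustrative compression via structured sparsity, assuming a 2:4 pattern
# (keep the 2 largest-magnitude values out of every 4, plus 2-bit position
# metadata). Not necessarily d-Matrix's exact scheme.
def compress_2of4(w: np.ndarray):
    g = w.reshape(-1, 4)
    keep = np.argsort(-np.abs(g), axis=1)[:, :2]     # positions of the 2 largest per group
    keep.sort(axis=1)
    vals = np.take_along_axis(g, keep, axis=1)        # 2 stored values per group of 4
    return vals, keep.astype(np.uint8)                # values + index metadata

def decompress_2of4(vals, idx, shape):
    out = np.zeros((vals.shape[0], 4), dtype=vals.dtype)
    np.put_along_axis(out, idx.astype(np.int64), vals, axis=1)
    return out.reshape(shape)

w = np.random.randn(8, 16).astype(np.float32)
vals, idx = compress_2of4(w)
pruned = decompress_2of4(vals, idx, w.shape)
print((pruned != 0).mean())   # ~0.5: half the values survive the 2:4 pattern
```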

All 8 matrix units can be tied together.

As for dataflow, results are accumulated on the fly and then converted to the desired output format.

As for memory, there is a stash memory that feeds the cores. Each stash is 6MB. There are 2 LPDDR channels per chiplet.

When you have high memory bandwidth, the latency of collective operations becomes increasingly critical.

So in order to do a 16-chiplet all-to-all connection, d-Matrix got latency down to 115ns D2D. Even going through PCIe switches, they can still hold latency to 650ns.
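
To put those numbers in context, here's a back-of-envelope look at how per-hop latency adds up over a generated token, assuming a hypothetical couple of all-to-all exchanges per transformer layer; only the hop latencies themselves come from the slide:

```python
# Back-of-envelope on why those hop latencies matter, assuming a
# hypothetical two all-to-all exchanges per transformer layer for
# tensor-parallel inference. Only the per-hop latencies come from the slide.
d2d_hop_ns = 115           # chiplet-to-chiplet over the bridged cards
pcie_switch_hop_ns = 650   # when traffic has to cross a PCIe switch
layers = 80                # e.g. a Llama3-70B-class model
exchanges_per_layer = 2    # hypothetical: one after attention, one after the MLP

for name, hop_ns in (("D2D", d2d_hop_ns), ("PCIe switch", pcie_switch_hop_ns)):
    total_us = layers * exchanges_per_layer * hop_ns / 1000
    print(f"{name:12s}: ~{total_us:.0f} us of hop latency per generated token")
```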

Another shot of Corsair chiplets on an organic package.

And here is the NIC that d-Matrix uses for scale-out fabrics, which adds 2µs of latency.

Using this, d-Matrix can rack and stack many servers.


And no inference accelerator would be complete without a matching software stack to enable the hardware and its features.

And here's a look at power consumption: 275W at 800MHz, while 1.2GHz chugs 550W. Higher clockspeeds are worse for overall efficiency, but not immensely so.
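
Assuming throughput scales roughly linearly with clockspeed (a simplification), the math works out like so:

```python
# Quick check of the tradeoff from the slide, assuming throughput scales
# roughly linearly with clockspeed (a simplification).
low  = {"clk_ghz": 0.8, "watts": 275}
high = {"clk_ghz": 1.2, "watts": 550}

perf_ratio = high["clk_ghz"] / low["clk_ghz"]    # ~1.5x the throughput
power_ratio = high["watts"] / low["watts"]       # 2.0x the power
print(f"perf/W at 1.2GHz is ~{perf_ratio / power_ratio:.0%} of the 800MHz point")  # ~75%
```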

And here are some Llama3 performance figures. The time per output token is just 2ms even for the larger Llama3-70B, which works out to 500 tokens per second per user stream.

Underneath the chip, d-Matrix uses a silicon interposer with capacitors for power reliability reasons. d-Matrix goes one step further and 3D stacks DRAM beneath their Corsair chiplets, keeping the local memory very, very close.

And they have a prototype 3D DRAM test vehicle that's been built, with 36 micron D2D stacking. The logic die sits on top, while the DRAM sits underneath.
How does d-Matrix make stacked DRAM + logic work? Keep the heat density under 0.3W/mm², which keeps the logic from heating up the DRAM too much.
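
For a sense of scale, that constraint translates directly into a power budget per die; the die area in the sketch below is purely hypothetical, since it wasn't disclosed:

```python
# The thermal constraint translates directly into a per-die power budget:
# P_max = heat density limit x die area. The die area here is purely
# hypothetical, since it wasn't disclosed in the talk.
max_density_w_per_mm2 = 0.3
die_area_mm2 = 100                     # hypothetical logic die area
print(f"Power budget: <= {max_density_w_per_mm2 * die_area_mm2:.0f} W for that die")
```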