Arm Scalable Matrix Extension 2 Coming to Android to Accelerate On-Device AI

Available in the Armv9-A architecture, Arm Scalable Matrix Extension 2 (SME2) is a set of advanced CPU instructions designed to accelerate matrix-heavy computation. The new Arm technology aims to help mobile developers run advanced AI models directly on the CPU with improved performance and efficiency, without requiring any changes to their apps.

SME2 builds on the previously available SME extension, which introduced matrix operations and streaming vectors, by adding multi-vector data-processing instructions, multi-vector loads and stores, and a multi-vector predication mechanism.
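
Because SME2 availability varies across devices, code that depends on it typically queries the kernel's CPU-feature flags rather than assuming support. Here is a minimal sketch, assuming a Linux/Android target: it checks the HWCAP2_SME2 bit in the auxiliary vector, with a fallback #define (bit 37, matching recent Linux uapi headers) to guard against older sysroots where the constant is not yet defined.

    // Minimal sketch: query SME2 support at runtime on Linux/Android via
    // the auxiliary vector. HWCAP2_SME2 appears in recent Linux uapi
    // headers; the fallback #define covers older sysroots.
    #include <cstdio>
    #include <sys/auxv.h>

    #ifndef HWCAP2_SME2
    #define HWCAP2_SME2 (1UL << 37)
    #endif

    int main() {
        unsigned long hwcap2 = getauxval(AT_HWCAP2);
        if (hwcap2 & HWCAP2_SME2) {
            std::printf("SME2 reported by the kernel\n");
        } else {
            std::printf("No SME2; libraries fall back to NEON/SVE paths\n");
        }
        return 0;
    }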

While the performance benefits of SME2 are already available on the latest iOS devices and Apple M4-series chips, they will soon reach Android devices as well, says Alex Spinelli, Arm’s VP of AI and Developer Platforms and Services.

Matrix-heavy workloads are key for real-time mobile inference tasks such as image and language processing and voice generation. Comparisons between SME2-enabled and non-SME2-enabled workflows show a significant improvement, says Arm:

On SME2-enabled hardware, Google's Gemma 3 model delivers 6x faster chat responses and can start summarizing up to 800 words in under a second on a single CPU core.

Likewise, a 2.6x speed-up has been measured for prompt processing on a vivo X200 Pro flagship smartphone running the 3.8B-parameter Phi-3 Mini model.

To help developers take advantage of SME2, Arm provides a library called KleidiAI, which is integrated into Google's XNNPACK. XNNPACK powers several machine-learning and AI frameworks, including Alibaba's MNN, Google's LiteRT, Microsoft's ONNX Runtime, and llama.cpp.

When SME2 is available and enabled, XNNPACK automatically routes matrix-heavy operations to SME2-optimized code paths via KleidiAI, so developers benefit directly, with no changes needed in application logic or infrastructure.
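
To illustrate how transparent this is at the application level, the following sketch runs a LiteRT (TensorFlow Lite) model from C++ with the XNNPACK delegate attached; "model.tflite" is a placeholder path, and recent LiteRT builds apply XNNPACK by default, so creating the delegate explicitly here only makes the routing visible. Nothing in the code refers to SME2: on capable hardware, XNNPACK selects the KleidiAI-backed kernels on its own.

    // Sketch: run a LiteRT (TensorFlow Lite) model with the XNNPACK
    // delegate attached. "model.tflite" is a placeholder. No SME2-specific
    // code appears here: on SME2 hardware, XNNPACK picks KleidiAI-backed
    // micro-kernels automatically.
    #include <memory>
    #include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"
    #include "tensorflow/lite/interpreter.h"
    #include "tensorflow/lite/kernels/register.h"
    #include "tensorflow/lite/model.h"

    int main() {
        auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
        tflite::ops::builtin::BuiltinOpResolver resolver;
        std::unique_ptr<tflite::Interpreter> interpreter;
        tflite::InterpreterBuilder(*model, resolver)(&interpreter);

        // Attach the XNNPACK delegate (enabled by default in recent builds).
        TfLiteXNNPackDelegateOptions opts = TfLiteXNNPackDelegateOptionsDefault();
        TfLiteDelegate* xnnpack = TfLiteXNNPackDelegateCreate(&opts);
        interpreter->ModifyGraphWithDelegate(xnnpack);

        interpreter->AllocateTensors();
        // ... fill interpreter->typed_input_tensor<float>(0) here ...
        interpreter->Invoke();

        interpreter.reset();                   // destroy the interpreter first,
        TfLiteXNNPackDelegateDelete(xnnpack);  // then release the delegate
        return 0;
    }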

KleidiAI is designed to be integrated easily into C and C++ codebases thanks to its micro-kernel-based architecture.

A micro-kernel, in Arm’s parlance, is the “near-minimum amount of software to accelerate a given ML operator with high performance”, such as packing or matrix multiplication. A key detail explaining why a micro-kernel is not simply a function is that each micro-kernel processes only a portion of the output tensor, enabling the full operation to be dispatched across multiple threads.
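
A plain-C++ sketch of that dispatch idea follows. Note that run_matmul_tile is a hypothetical stand-in for a real KleidiAI micro-kernel, not its actual API; it only shows how computing a block of output rows per call lets a caller split one matrix multiplication across threads.

    // Conceptual sketch, not KleidiAI's API: run_matmul_tile() stands in
    // for a micro-kernel. Each call computes only rows [m0, m1) of the
    // output, which is what lets a caller dispatch one matmul across
    // several threads.
    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Hypothetical micro-kernel: C[m0..m1) = A[m0..m1) * B (row-major f32).
    void run_matmul_tile(const float* A, const float* B, float* C,
                         std::size_t N, std::size_t K,
                         std::size_t m0, std::size_t m1) {
        for (std::size_t m = m0; m < m1; ++m)
            for (std::size_t n = 0; n < N; ++n) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < K; ++k)
                    acc += A[m * K + k] * B[k * N + n];
                C[m * N + n] = acc;
            }
    }

    void parallel_matmul(const float* A, const float* B, float* C,
                         std::size_t M, std::size_t N, std::size_t K,
                         unsigned num_threads) {
        std::vector<std::thread> workers;
        const std::size_t rows = (M + num_threads - 1) / num_threads;
        for (unsigned t = 0; t < num_threads; ++t) {
            const std::size_t m0 = t * rows;
            const std::size_t m1 = std::min(M, m0 + rows);
            if (m0 >= m1) break;
            // Each worker runs the "micro-kernel" on its own output slice.
            workers.emplace_back(run_matmul_tile, A, B, C, N, K, m0, m1);
        }
        for (auto& w : workers) w.join();
    }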

In addition, KleidiAI has other features that developers will welcome: it has no external dependencies, does not use dynamic memory or require memory management, and follows a highly modular design in which each micro-kernel is a stand-alone library consisting only of .c and .h files.

Arm has also released additional resources to help developers take advantage of SME2, showcasing real-world examples of LLM-based apps using LiteRT, MNN, PyTorch, and other supported frameworks.

