How Arm & Google Achieve 5x Faster On-Device AI

The world of Artificial Intelligence is rapidly moving beyond simple text conversations, embracing rich, multimodal experiences. Imagine generating stunning images or immersive audio directly on your device, letting developers craft deeply personalized consumer experiences. This exciting evolution has, however, presented a challenge for edge computing.

Historically, running large, complex AI models on mobile devices forced a difficult choice: either endure high-latency execution on the CPU or navigate the complexities of fragmented, specialized accelerators. Fortunately, a new architectural leap is changing this landscape, bringing high-performance AI directly to the CPU.

Unleashing AI Performance with Arm SME2 and Google AI Edge

Arm’s innovative Scalable Matrix Extension 2 (SME2) directly addresses this dilemma by integrating a dedicated matrix-compute unit right into the CPU cluster. This ingenious design transforms the CPU into a powerful AI accelerator, capable of delivering up to 5x faster inference for the demanding matrix-heavy workloads that define generative AI models.

The journey to deploying sophisticated on-device AI on Arm hardware is now dramatically simpler, thanks to the integrated Google AI Edge stack. This powerful solution is meticulously designed to streamline every step of your development journey, from model conversion to optimized deployment. Together, Arm SME2 and Google AI Edge offer an unparalleled synergy for modern AI applications.

Within Google AI Edge, LiteRT plays a crucial role, automatically leveraging Arm SME2 at runtime through its deep integration with XNNPACK and Arm KleidiAI. It intelligently identifies math-intensive operations such as general matrix multiplication (GEMM) and indirect GEMM (IGEMM) and routes them to the specialized SME2 hardware. The AI Edge Quantizer handles model compression, while Model Explorer provides an intuitive visual map of the graph to quickly pinpoint and resolve performance bottlenecks.

Streamlining On-Device AI Development: The Convert-Optimize-Deploy Pipeline

The combined power of this integration is brilliantly demonstrated through the successful deployment of Stability AI’s Stable Audio Open Small model, running entirely on Arm CPUs. This achievement showcases a significant performance uplift, transforming a complex floating-point PyTorch model into a highly optimized, mixed-precision (FP16/INT8) implementation ready for high-performance acceleration.

Generating high-quality audio, such as intricate 11-second stereo clips from a single prompt, directly on a wide array of mobile devices usually demands a manageable model footprint, typically around one billion parameters. Even within this small-model range, developers often face challenging deployment hurdles, and this is precisely where the Google AI Edge software stack shines.

By optimizing a diffusion-based model, we illustrate a complete, end-to-end path with Google AI Edge, providing a seamless workflow:

  • Convert: Transform your PyTorch model into the AI Edge ecosystem.
  • Optimize: Enhance model efficiency through quantization and compression.
  • Deploy: Run your optimized model with high performance on Arm hardware.

Since KleidiAI optimizations are deeply embedded directly into XNNPACK, developers automatically gain access to specialized AI acceleration. This means there’s no need to write intricate low-level assembly or custom hardware code; the stack intelligently manages the “translation” from your high-level model to silicon-optimized execution.

Convert & Optimize Your AI Models

The first step involves converting your PyTorch version of the Stable Audio Open Small model into the AI Edge ecosystem. LiteRT-Torch offers a direct and friction-free conversion path, simplifying the transition from research to production mobile environments. Detailed code examples are available within the linked documentation.
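To make that path concrete, here is a minimal conversion sketch using the ai-edge-torch Python package, which provides this PyTorch-to-LiteRT path. The toy module and output path below are placeholders standing in for the actual Stable Audio Open Small submodules:

```python
# Minimal PyTorch-to-LiteRT conversion sketch (pip install ai-edge-torch).
# The tiny Sequential model below is a placeholder, not Stable Audio Open Small.
import torch
import ai_edge_torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.GELU(),
    torch.nn.Linear(128, 64),
).eval()

# Representative sample inputs drive tracing during conversion.
sample_inputs = (torch.randn(1, 64),)

# Convert to a LiteRT (TFLite) model and sanity-check it before exporting.
edge_model = ai_edge_torch.convert(model, sample_inputs)
print(edge_model(*sample_inputs).shape)

edge_model.export("model_fp32.tflite")
```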

Previously, identifying which specific layers of a model were suitable for quantization was a manual, often error-prone task. Now, with Google’s Model Explorer, developers can visually analyze the entire model graph, using its node data overlay plugin to see exactly which operators are most compute-intensive or “quantization-safe.”

This visual verification ensures you only target layers where moving to INT8 precision won’t degrade the crucial audio output quality. For instance, dynamic INT8 quantization was applied to the DiT (Diffusion Transformer) submodule to boost inference efficiency, and Model Explorer confirmed its layers were “green” for quantization, preserving FP32-comparable quality.
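As a rough sketch of that workflow, assuming the ai-edge-model-explorer Python package (the API calls follow its published examples, and the node-data overlay file name is hypothetical):

```python
# Inspect the converted graph locally (pip install ai-edge-model-explorer).
import model_explorer

# Simplest form: opens the model graph in a local browser UI.
model_explorer.visualize("model_fp32.tflite")

# With a per-node data overlay (e.g. per-op latency or quantization error)
# to highlight compute-heavy or quantization-sensitive layers.
# "dit_node_data.json" is a hypothetical overlay file.
config = (
    model_explorer.config()
    .add_model_from_path("model_fp32.tflite")
    .add_node_data_from_path("dit_node_data.json")
)
model_explorer.visualize_from_config(config)
```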

Once INT8 quantization suitability was confirmed, we used the AI Edge Quantizer to quantize the model from FP32 to INT8. This strategic decision led to a remarkable 3x performance improvement in the DiT submodule, accompanied by a significant 4x reduction in its memory usage. Code snippets demonstrating the quantizer’s use are available in the official documentation.
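The snippet below is a minimal sketch of that step with the ai-edge-quantizer Python package; the dynamic INT8 recipe helper follows the package’s published examples, and the file paths are placeholders:

```python
# Dynamic-range INT8 quantization sketch (pip install ai-edge-quantizer).
from ai_edge_quantizer import quantizer, recipe

qt = quantizer.Quantizer("model_fp32.tflite")

# Dynamic quantization recipe: INT8 weights, activations quantized at runtime.
qt.load_quantization_recipe(recipe.dynamic_wi8_afp32())

result = qt.quantize()
result.export_model("model_int8.tflite")
```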

Deploy: High-Performance Inference with LiteRT

The final and crucial step is the runtime deployment. When you execute this quantized model in LiteRT on an Android mobile device, it intelligently defaults to the XNNPACK delegate for CPU inference. Because XNNPACK integrates KleidiAI directly, developers automatically benefit from these advanced optimizations.

These specialized micro-kernels ensure that the core INT8 and FP16 matrix multiplications, essential for the audio model, run with maximum efficiency directly on the CPU. A representative snippet of how LiteRT inference is implemented in C++ using the CompiledModel API, suitable for both Android™ and macOS®, is provided in the documentation.
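As a lightweight stand-in for that C++ CompiledModel example, here is a sketch of the same inference step through LiteRT’s Python Interpreter API (the ai-edge-litert package); the model path and input shape are placeholders:

```python
# CPU inference sketch with the LiteRT Python interpreter
# (pip install ai-edge-litert). On Android, CPU execution routes through
# the XNNPACK delegate by default, which includes the KleidiAI kernels
# that target SME2 on supporting hardware.
import numpy as np
from ai_edge_litert.interpreter import Interpreter

interpreter = Interpreter(model_path="model_int8.tflite", num_threads=4)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Placeholder input shaped to whatever the model expects.
x = np.random.randn(*inp["shape"]).astype(np.float32)
interpreter.set_tensor(inp["index"], x)

interpreter.invoke()
audio = interpreter.get_tensor(out["index"])
print(audio.shape)
```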

Real-World Impact: Faster, Smaller, High-Quality Audio Generation

To fully appreciate the impact of these advancements, we benchmarked our optimized FP16 + INT8 model against the original FP32 Stable Audio Open Small model. Tests were conducted on an SME2-based Android device and an Apple MacBook with M4, analyzing both single-threaded and multi-threaded CPU performance.

The results are compelling: SME2 delivers more than a 2x performance improvement over Arm’s earlier NEON SIMD instructions on these matrix-heavy workloads. Even with a single core, the optimized model can generate 11 seconds of high-quality audio in under 8 seconds, faster than real time. This performance is well within acceptable limits for a smooth user experience, proving that powerful generative AI can indeed run efficiently at the edge.
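For readers who want to reproduce the single- versus multi-threaded comparison on their own hardware, here is a rough timing harness; it only illustrates the methodology, and the numbers quoted above come from the article’s benchmarks, not from this sketch:

```python
# Rough single- vs multi-threaded CPU latency harness for a LiteRT model.
import time
import numpy as np
from ai_edge_litert.interpreter import Interpreter

def mean_latency(model_path: str, num_threads: int, runs: int = 10) -> float:
    interpreter = Interpreter(model_path=model_path, num_threads=num_threads)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(
        inp["index"], np.random.randn(*inp["shape"]).astype(np.float32)
    )
    interpreter.invoke()  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) / runs

for threads in (1, 4):
    secs = mean_latency("model_int8.tflite", threads)
    print(f"{threads} thread(s): {secs:.3f} s per inference")
```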

These groundbreaking optimizations are readily available for developers today. You can begin experimenting immediately using Google AI Edge tools and the KleidiAI-accelerated LiteRT. Explore Arm’s sample repository to access the complete end-to-end journey for Stable Audio Open and start building the future of on-device AI.

Source: Google Developers Blog

