Accelerating on-device AI: A look at Arm and Google AI Edge optimization

Arm's Scalable Matrix Extension 2 (SME2) integrates a matrix-compute unit into the CPU, enabling up to 5x faster inference for generative AI workloads on mobile devices. This post highlights how Google AI Edge, through tools like LiteRT and XNNPACK, automatically leverages Arm SME2 to streamline development, allowing developers to convert, optimize, and deploy models like Stability AI's stable-audio-open-small onto Arm CPUs without writing low-level code.

AI is evolving beyond simple text interactions toward rich multimodal capabilities, such as on-device image and audio generation, enabling developers to create highly personalized consumer experiences. While the CPU has always been the ubiquitous option for inference, running large, complex models at the edge has historically required choosing between high-latency CPU execution and fragmented, specialized accelerators.

Arm Scalable Matrix Extension 2 (SME2) eliminates this tradeoff by integrating a dedicated matrix-compute unit directly into the CPU cluster. This architecture enables the CPU to function as a high-performance AI accelerator, delivering up to 5x faster inference for the matrix-heavy workloads at the heart of generative AI.

Running on-device AI on Arm hardware is dramatically streamlined with Google AI Edge, an integrated stack designed to simplify your development journey. LiteRT automatically leverages Arm SME2 at runtime through its XNNPACK and Arm KleidiAI integration. It identifies and selects math-intensive kernels like iGeMM and GeMM, delivering specialized hardware acceleration. To further ease deployment, AI Edge Quantizer handles complex model compression, and Model Explorer provides a visual map to quickly identify and resolve performance hotspots.

This integration is demonstrated by deploying Stability AI’s stable-audio-open-small model entirely on Arm CPUs, with a major performance uplift.
In this blog post, we’ll walk you through transforming the original floating-point PyTorch stable-audio-open-small model into a highly optimized, mixed-precision FP16/INT8 implementation ready for high-performance acceleration on Arm CPUs.

To generate high-quality audio, such as 11-second stereo clips from a single prompt, directly on a wide range of mobile devices, practical considerations usually require a manageable model footprint, typically around 1 billion parameters. Even within this Small Language Model (SLM) range, developers face challenging deployment hurdles.

By using a diffusion-based model as the optimization target, we demonstrate a complete end-to-end path with the Google AI Edge software stack. As shown below, this synergy provides a streamlined Convert → Optimize → Deploy pipeline.

Because the KleidiAI optimizations are embedded directly into XNNPACK, developers gain specialized AI acceleration automatically. There is no need to write low-level assembly or custom hardware code; the stack handles the "translation" from high-level model to silicon-optimized execution.

Start by converting the PyTorch version of the stable-audio-open-small model into the AI Edge ecosystem. LiteRT-Torch provides a direct conversion path for PyTorch models, minimizing the friction of moving from a research environment to a production mobile environment.
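The conversion and optimization steps in this walkthrough both rely on dynamic INT8 quantization of the model weights. As a rough, library-independent illustration of what symmetric, channel-wise min/max quantization does to a weight matrix, here is a minimal plain-Python sketch. It is not the AI Edge Quantizer or LiteRT-Torch implementation: the function names and the tiny weight matrix are hypothetical, and the rounding and clamping choices are assumptions made for illustration.

```python
# Illustrative only: symmetric, channel-wise INT8 weight quantization in plain
# Python. Names and values are hypothetical, not part of any Google AI Edge API.

def quantize_channelwise(weights, num_bits=8):
    """Quantize each row (one output channel) with its own symmetric scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # One scale per channel, derived from that channel's largest magnitude.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q_row = [max(-qmax, min(qmax, round(w / scale))) for w in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize_channelwise(quantized, scales):
    """Map the integers back to approximate floating-point weights."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Hypothetical 2x4 weight matrix: two output channels with very different ranges.
weights = [
    [0.50, -1.00, 0.25, 0.75],
    [0.01, 0.02, -0.04, 0.03],
]
q_weights, scales = quantize_channelwise(weights)
restored = dequantize_channelwise(q_weights, scales)

# The rounding error per element is at most half a quantization step.
max_error = max(
    abs(w - r)
    for w_row, r_row in zip(weights, restored)
    for w, r in zip(w_row, r_row)
)

# Storage: 4 bytes per FP32 weight vs. 1 byte per INT8 weight
# (ignoring the small per-channel scale overhead).
num_weights = sum(len(row) for row in weights)
fp32_bytes = 4 * num_weights
int8_bytes = 1 * num_weights

print(q_weights)
print(f"max abs error: {max_error:.6f}, size ratio: {fp32_bytes / int8_bytes:.0f}x")
```

Because every channel gets its own scale, the small-valued second channel still uses the full integer range instead of collapsing toward zero, which is the usual motivation for channel-wise over per-tensor granularity. The LiteRT-Torch conversion itself follows.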
    import litert_torch
    from litert_torch.quantize import quant_config
    from litert_torch.generative.quantize import quant_recipe, quant_recipe_utils

    # `model` is the PyTorch module to convert and `example_inputs` is a tuple
    # of sample input tensors; both are defined elsewhere.

    # Specify the quantization format
    quant_config_int8 = quant_config.QuantConfig(
        generative_recipe=quant_recipe.GenerativeQuantRecipe(
            default=quant_recipe_utils.create_layer_quant_dynamic(),
        )
    )

    # Initiate the conversion
    edge_model = litert_torch.convert(
        model, example_inputs, quant_config=quant_config_int8
    )

Find the code snippet that illustrates how LiteRT-Torch works in practice here.

Previously, identifying which layers of a model were suitable for quantization was a manual, error-prone process of inspecting individual layers. With Google’s Model Explorer, developers can now visualize the entire model graph. The new node data overlay plugin lets us see exactly which operators are most compute-intensive or, as shown below, which are "quantization-safe". This visual verification ensures we only target layers where moving to INT8 won't degrade audio output quality.

For example, to improve the inference efficiency of the diffusion step, we applied dynamic INT8 quantization to the DiT (Diffusion Transformer) submodule:

As shown in the screenshot above, all layers in the DiT submodule are green, indicating low error values within the DiT transformer (FP32 vs. FP32+INT8). We therefore expect the dynamically quantized INT8 DiT submodule to achieve quality comparable to FP32.

Once the suitability of INT8 quantization was confirmed, we used the AI Edge Quantizer to optimize the model from FP32 to INT8. This decision resulted in a 3x performance improvement in the DiT submodule, along with a 4x reduction in its memory usage.

    from ai_edge_quantizer import quantizer

    fp32_model_path = "./dit_model_fp32.tflite"
    dynamic_quant_model_path = "./dit_model_int8+fp32.tflite"

    # Apply 8-bit, symmetric, channel-wise weight quantization to every
    # supported operation in the model.
    the_recipe = [
        dict({
            'regex': '.*',
            'operation': '*',
            'algorithm_key': 'min_max_uniform_quantize',
            'op_config': {
                'weight_tensor_config': {
                    'num_bits': 8,
                    'symmetric': True,
                    'granularity': 'CHANNELWISE',
                    'dtype': 'INT',
                    'block_size': 0,
                },
                'compute_precision': 'INTEGER',
                'explicit_dequantize': False,
                'skip_checks': False,
                'min_weight_elements': 0
            },
        })
    ]

    # Define the quantizer, with the FP32 tflite model and the recipe.
    qt = quantizer.Quantizer(fp32_model_path, the_recipe)
    quant_result = qt.quantize().export_model(dynamic_quant_model_path, overwrite=True)

The final step is the runtime. When you run this quantized model in LiteRT on an Android mobile device, it defaults to the XNNPACK delegate for CPU inference. Because XNNPACK integrates KleidiAI directly within the latest LiteRT API, developers get these optimizations automatically. The KleidiAI micro-kernels ensure that the core INT8 and FP16 matrix multiplications of the audio model run with maximum efficiency on the CPU.

Below is a representative snippet of how LiteRT inference is implemented in C++ using the CompiledModel API. Instructions in this guide are provided for running the audiogen app with LiteRT either on an Android™ device or macOS®.

    #include "litert/cc/litert_compiled_model.h"
    #include "litert/cc/litert_environment.h"
    #include "litert/cc/litert_tensor_buffer.h"

    // 1. Initialize the LiteRT Environment
    auto env = litert::Environment::Create({}).value();

    // 2. Create the CompiledModel from the .tflite file
    // Hardware acceleration (e.g., SME2 via KleidiAI) is handled automatically
    auto compiled_model = litert::CompiledModel::Create(
        env, "autoencoder_model.tflite", litert::HwAccelerators::kCpu).value();

    // 3. Prepare input and output buffers
    auto autoencoder_inputs = compiled_model.CreateInputBuffers().value();
    auto autoencoder_outputs = compiled_model.CreateOutputBuffers().value();

    // 4. Write input data (e.g., random noise or conditioned embeddings)
    auto auto in lock and ptr = scoped lock