{"slug": "accelerating-on-device-ai-a-look-at-arm-and-google-ai-edge-optimization", "title": "Accelerating on-device AI: A look at Arm and Google AI Edge optimization", "summary": "Arm's Scalable Matrix Extension 2 (SME2) integrates a matrix-compute unit into the CPU, enabling up to 5x faster inference for generative AI workloads on mobile devices. It highlights how Google AI Edge, through tools like LiteRT and XNNPACK, automatically leverages Arm SME2 to streamline development, allowing developers to convert, optimize, and deploy models like Stability AI's stable-audio-open-small onto Arm CPUs without writing low-level code.", "body_md": "AI is evolving beyond simple text interactions toward rich multimodal capabilities, such as on-device image and audio generation, enabling developers to create highly personalized consumer experiences. While the CPU has always been the ubiquitous option for inference, running large complex models at the edge has historically required choosing between high-latency CPU execution and fragmented, specialized accelerators.\nArm Scalable Matrix Extension 2 (SME2) eliminates this tradeoff by integrating a dedicated matrix-compute unit directly into the CPU cluster. This architecture enables the CPU to function as a high-performance AI accelerator, delivering up to 5x faster inference for the matrix-heavy workloads at the heart of generative AI.\nRunning on-device AI on Arm hardware is dramatically streamlined with Google AI Edge, an integrated stack designed to simplify your development journey. LiteRT automatically leverages Arm SME2 at runtime through XNNPACK and Arm KleidiAI integration. It identifies and selects math-intensive kernels like iGeMM and GeMM, delivering specialized hardware acceleration. 
To further ease deployment, AI Edge Quantizer handles complex model compression, and Model Explorer provides a visual map to quickly identify and resolve performance hotspots.\nThe power of this integration is demonstrated by deploying Stability AI’s stable-audio-open-small model entirely on Arm CPUs, with a major performance uplift. In this blog post, we’ll walk you through transforming the original floating-point PyTorch stable-audio-open-small model into a highly optimized, mixed-precision (FP16/INT8) implementation ready for high-performance acceleration on Arm CPUs.\nTo generate high-quality audio, such as 11-second stereo clips from a single prompt, directly on a wide range of mobile devices, practical considerations usually require a manageable model footprint, typically around 1 billion parameters. Even within this Small Language Model (SLM) range, developers face challenging deployment hurdles.\nBy using a diffusion-based model as the optimization target, we demonstrate a complete end-to-end path with the Google AI Edge software stack. As shown below, this synergy provides a streamlined Convert → Optimize → Deploy pipeline.\nBecause the KleidiAI optimizations are embedded directly into XNNPACK, developers gain specialized AI acceleration automatically. There is no need to write low-level assembly or custom hardware code; the stack handles the \"translation\" from high-level model to silicon-optimized execution.\nStart by converting the PyTorch version of the stable-audio-open-small model into the AI Edge ecosystem. 
LiteRT-Torch provides a direct conversion path for PyTorch models, minimizing the friction of moving from a research environment to a production mobile environment.\nimport litert_torch\nfrom litert_torch.quantize import quant_config\nfrom litert_torch.generative.quantize import quant_recipe, quant_recipe_utils\n# Specify the quantization format\nquant_config_int8 = quant_config.QuantConfig(\ngenerative_recipe=quant_recipe.GenerativeQuantRecipe(\ndefault=quant_recipe_utils.create_layer_quant_dynamic(),\n)\n)\n# Initiate the conversion\nedge_model = litert_torch.convert(\nmodel, example_inputs, quant_config=quant_config_int8\n)\nFind the code snippet to illustrate how LiteRT-Torch works in practice here\nPreviously, identifying which layers of a model were suitable for quantization was a manual, error-prone process of inspecting individual layers.\nWith Google’s Model Explorer, developers can now visualize the entire model graph. The new node data overlay plugin allows us to see exactly which operators are most compute-intensive or, as shown below, which are \"quantization-safe\". This visual verification ensures we only target layers where moving to INT8 won't degrade audio output quality.\nFor example, to improve the inference efficiency of the diffusion step, we applied dynamic INT8 quantization to the DiT (Diffusion Transformer) submodule:\nAs shown in the screenshot above, all layers in the DiT submodule are green, indicating low error values within the DiT transformer (FP32 vs. FP32+INT8). 
Therefore, we expect the dynamically quantized INT8 DiT submodule to achieve quality comparable to FP32.\nOnce the suitability of INT8 quantization was confirmed, we used the AI Edge Quantizer to optimize the model from FP32 to INT8.\nThis decision resulted in a 3x performance improvement in the DiT submodule, along with a 4x reduction in its memory usage.\nfrom ai_edge_quantizer import quantizer\nfp32_model_path = \"./dit_model_fp32.tflite\"\ndynamic_quant_model_path = \"./dit_model_int8+fp32.tflite\"\nthe_recipe = [\ndict({\n'regex': '.*',\n'operation': '*',\n'algorithm_key': 'min_max_uniform_quantize',\n'op_config': {\n'weight_tensor_config': {\n'num_bits': 8,\n'symmetric': True,\n'granularity': 'CHANNELWISE',\n'dtype': 'INT',\n'block_size': 0,\n},\n'compute_precision': 'INTEGER',\n'explicit_dequantize': False,\n'skip_checks': False,\n'min_weight_elements': 0\n},\n})\n]\n# Define the quantizer, with fp32 tflite model, and the recipe.\nqt = quantizer.Quantizer(fp32_model_path, the_recipe)\nquant_result = qt.quantize().export_model(dynamic_quant_model_path, overwrite=True)\nThe final step is the runtime.\nWhen you run this quantized model in LiteRT on an Android mobile device, it defaults to the XNNPACK delegate for CPU inference. Because XNNPACK integrates KleidiAI directly within the latest LiteRT API, developers get these optimizations automatically. These micro-kernels ensure that the core INT8 and FP16 matrix multiplications of the audio model run with maximum efficiency on the CPU.\nBelow is a representative snippet of how LiteRT inference is implemented in C++ using the CompiledModel API. Instructions in this guide are provided for running the audiogen app with LiteRT either on an Android™ device or macOS®.\n#include \"litert/cc/litert_compiled_model.h\"\n#include \"litert/cc/litert_environment.h\"\n#include \"litert/cc/litert_tensor_buffer.h\"\n// 1. Initialize the LiteRT Environment\nauto env = litert::Environment::Create({}).value();\n// 2. 
Create the CompiledModel from the .tflite file\n// Hardware acceleration (e.g., SME2 via KleidiAI) is handled automatically\nauto compiled_model = litert::CompiledModel::Create(\nenv, \"autoencoder_model.tflite\", litert::HwAccelerators::kCpu).value();\n// 3. Prepare input and output buffers\nauto autoencoder_inputs = compiled_model.CreateInputBuffers().value();\nauto autoencoder_outputs = compiled_model.CreateOutputBuffers().value();\n// 4. Write input data (e.g., random noise or conditioned embeddings)\nauto auto_in_lock_and_ptr = scoped_lock<float>(autoencoder_inputs[0],\nlitert::TensorBuffer::LockMode::kWrite);\n// Fill the input\n// 5. Execute inference\ncompiled_model.Run(autoencoder_inputs, autoencoder_outputs);\n// 6. Access and read the generated audio waveform from the output buffer\nauto auto_out_lock_and_ptr = scoped_lock<const float>(autoencoder_outputs[0], litert::TensorBuffer::LockMode::kRead);\n// Read the output\nWe now benchmark single-threaded and multi-threaded (MT) CPU performance, comparing the original FP32 Stable Audio Open Small model against our KleidiAI-optimized FP16 + INT8 model from the prior section, on an SME2-based Android device and on an Apple MacBook with M4.\nAs shown in the bar chart above, SME2 delivers more than a 2x performance improvement over NEON, the Arm SIMD instruction set specialized for signal-processing tasks. Even on a single core, it can generate 11 seconds of audio in under 8 seconds, which is acceptable from a user-experience perspective.\nThese optimizations are available for developers today. 
Start experimenting immediately using Google AI Edge tools and KleidiAI-accelerated LiteRT.\nExplore Arm’s sample repository to access the complete end-to-end journey for Stable Audio Open:\nAcknowledgements\nArm: Adnan Alsinan, Anitha Raj, Aude Vuilliomenet, Bala Gattu, Declan Cox, and Gian Marco Iodice\nStability AI credit: This post uses the Stable Audio Open Small model by Stability AI, released under the Stability AI Community License. Audio samples were generated using the model running on test devices via LiteRT & Arm KleidiAI.\nGoogle: Advait Jain, Andrei Kulik, Changmin Sun, Cormac Brick, Dillon Sharlet, Eric Yang, Jinjiang Li, Jing Jin, Lu Wang, Maria Lyubimtsev, Meghna Johar, Pedro Gonnet, Ram Iyengar, Sachin Kotwani, Terry (Woncheol) Heo, Vitalii Dziuba", "url": "https://wpnews.pro/news/accelerating-on-device-ai-a-look-at-arm-and-google-ai-edge-optimization", "canonical_source": "https://developers.googleblog.com/accelerating-on-device-ai-a-look-at-arm-and-google-ai-edge-optimization/", "published_at": "2026-05-20 03:09:51.164356+00:00", "updated_at": "2026-05-20 03:09:54.832831+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "hardware", "semiconductor", "developer-tools"], "entities": ["Arm", "Google AI Edge", "LiteRT", "XNNPACK", "Arm KleidiAI", "AI Edge Quantizer", "Model Explorer", "Stability AI"], "alternates": {"html": "https://wpnews.pro/news/accelerating-on-device-ai-a-look-at-arm-and-google-ai-edge-optimization", "markdown": "https://wpnews.pro/news/accelerating-on-device-ai-a-look-at-arm-and-google-ai-edge-optimization.md", "text": "https://wpnews.pro/news/accelerating-on-device-ai-a-look-at-arm-and-google-ai-edge-optimization.txt", "jsonld": "https://wpnews.pro/news/accelerating-on-device-ai-a-look-at-arm-and-google-ai-edge-optimization.jsonld"}}