Automating model design for edge AI

DeepGate's automated model design system, combining neural architecture search, a compiler, and real-hardware measurements, produces models for microcontrollers that run up to 45× faster and use up to 11× less RAM than reference models on MLPerf Tiny benchmarks, enabling advanced AI on resource-constrained devices.

DeepGate's model search and compiler produce models that run up to 45× faster and use up to 11× less RAM than reference models on the same microcontroller.

Building models for microcontrollers is still largely a manual process. Teams either design models from scratch or adapt existing architectures, iteratively modifying them to fit the target hardware. On resource-constrained devices, they often face a trade-off between models that are too large or slow to run and models that fit on the device but make too many mistakes to be useful.

We’ve built the foundations of an automated model design system. By combining neural architecture search, the DeepGate compiler (/blog/compiler), and real-hardware measurements obtained through our development platform (https://bitweaver.deepgate.ai), we can automatically search for models tailored to a target microcontroller.

Across the four standard MLPerf Tiny benchmark tasks, ranging from detecting spoken words in audio to identifying the presence of a person in an image, the resulting models ran up to 45× faster and used up to 11× less RAM than the reference models. For example, on the MLPerf Tiny keyword spotting benchmark running on the Analog Devices MAX32655, our search reduced inference latency from 104.3 ms to 2.3 ms and RAM usage from 23.7 KB to 2.1 KB, while maintaining over 90% classification accuracy.

Such gains can enable machine learning models to run on cheaper hardware, extend battery life, and free up memory and compute for other tasks. By pushing the efficiency frontier, we move more advanced AI workloads within reach of microcontrollers, bringing increasingly capable intelligence to billions of devices.

Outperforming the reference models on the same hardware

We evaluated our search on MLPerf Tiny v1.4 (https://mlcommons.org/benchmarks/inference-tiny/), the standard benchmark suite for machine learning on microcontrollers.
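As a quick sanity check, the keyword spotting measurements quoted for the MAX32655 (104.3 ms to 2.3 ms, 23.7 KB to 2.1 KB) reproduce the headline ratios. The snippet below is only an arithmetic sketch; the helper name `improvement_factor` is illustrative and not part of any DeepGate tooling.

```python
# Measurements quoted in the text: MLPerf Tiny keyword spotting on the MAX32655.
REFERENCE = {"latency_ms": 104.3, "ram_kb": 23.7}
SEARCHED = {"latency_ms": 2.3, "ram_kb": 2.1}


def improvement_factor(reference: float, searched: float) -> float:
    """Return how many times smaller the searched value is than the reference."""
    if searched <= 0:
        raise ValueError("searched value must be positive")
    return reference / searched


speedup = improvement_factor(REFERENCE["latency_ms"], SEARCHED["latency_ms"])
ram_reduction = improvement_factor(REFERENCE["ram_kb"], SEARCHED["ram_kb"])

print(f"latency: {speedup:.1f}x faster")   # latency: 45.3x faster
print(f"RAM: {ram_reduction:.1f}x less")   # RAM: 11.3x less
```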
The benchmark covers four representative edge workloads: keyword spotting, visual wake words, CIFAR-10 image classification, and anomaly detection. Each task has a predefined quality target, from 90% top-1 accuracy for keyword spotting to 0.85 AUC for anomaly detection. For each workload, the goal was to meet the target while producing the smallest and fastest model possible, with input dimensions kept fixed to ensure a fair comparison against the reference models.

Across the evaluated boards, our search system and compiler delivered up to 45× faster inference and up to 11× lower RAM usage. Because memory is often the primary constraint on microcontrollers, these memory reductions can be especially important: in some cases, models that exceeded memory limits under the vendor toolchain were able to fit and run successfully after search and compilation.

The results compare the MLPerf Tiny reference model compiled with each vendor’s toolchain against architectures automatically discovered by our search system and deployed with the DeepGate compiler, with all results measured on the same hardware. RAM is measured as the tensor arena plus peak stack size.

[Interactive chart: latency and RAM usage per board, reference model vs. DeepGate. Chart title: "DeepGate runs up to 36.1× faster"]

How we did it: two complementary search methods

We ran two search systems side by side and used whichever performed best for a given task. On the MLPerf Tiny workloads, three of the four final models came from our neural architecture search (NAS) system, while the anomaly detection model came from our agentic search.

Agentic architecture search uses an LLM agent that proposes one change at a time – either to the architecture or the training recipe – trains the resulting model, benchmarks it on real hardware, and keeps the change only if the target metric improves.
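The accept-or-reject loop described above can be sketched roughly as follows. This is a toy illustration, not DeepGate's implementation: `Candidate`, `propose_change`, and `train_and_benchmark` are hypothetical names, the random proposer stands in for the LLM agent, and the formula-based evaluator stands in for training plus an on-device measurement. It also assumes that the target metric is latency and that an accepted change must still meet the task's quality target.

```python
import random
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for an architecture plus its training recipe."""
    width: int
    depth: int


@dataclass(frozen=True)
class Measurement:
    meets_quality_target: bool
    latency_ms: float


def propose_change(current: Candidate, rng: random.Random) -> Candidate:
    """Toy proposer (stands in for the LLM agent): change exactly one field by one step."""
    field = rng.choice(["width", "depth"])
    step = rng.choice([-1, 1])
    new_value = max(1, getattr(current, field) + step)
    return replace(current, **{field: new_value})


def train_and_benchmark(candidate: Candidate) -> Measurement:
    """Toy evaluator (stands in for training and an on-device benchmark)."""
    capacity = candidate.width * candidate.depth
    return Measurement(meets_quality_target=capacity >= 24,
                       latency_ms=0.1 * capacity)


def greedy_search(start: Candidate, steps: int,
                  seed: int = 0) -> Tuple[Candidate, Measurement]:
    """Keep a proposed change only if it meets the quality target and lowers latency."""
    rng = random.Random(seed)
    best, best_m = start, train_and_benchmark(start)
    for _ in range(steps):
        trial = propose_change(best, rng)
        trial_m = train_and_benchmark(trial)
        if trial_m.meets_quality_target and trial_m.latency_ms < best_m.latency_ms:
            best, best_m = trial, trial_m  # accept the change
        # otherwise the change is discarded and the search continues from `best`
    return best, best_m


model, measurement = greedy_search(Candidate(width=8, depth=6), steps=200)
print(model, measurement)
```

Because each step starts from the current best model only, the sketch improves a single model and never backtracks, which is the greedy behaviour the text attributes to agentic search.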
The approach is open-ended and can explore ideas outside any predefined search space, but it operates greedily, improving one model at a time.

Supernet NAS builds on and extends the Once-for-All (https://arxiv.org/abs/1908.09791) and MCUNet (https://arxiv.org/abs/2007.10319) approaches, adapted for microcontroller deployment using int8 quantization-aware training (https://arxiv.org/abs/1712.05877) while keeping input resolution fixed for fair comparison against the reference models. Rather than training every candidate architecture independently, a single supernet can be specialised into many different models with different size, speed, and accuracy trade-offs.

The two approaches offer complementary strengths:

| | Agentic search | Supernet NAS |
|---|---|---|
| What it can change | Anything in code – architecture and training recipe | A predefined architecture space (depth, kernel size, expansion ratio) |
| What you get out | One model, improved step by step | A family of models spanning different size, speed, and accuracy trade-offs |
| Best when | The problem is open-ended or the design space is poorly understood | The design space is well understood and you need optimised models for multiple hardware targets |

Both approaches run on the same in-house infrastructure. Each model is compiled into an efficient static binary by the DeepGate compiler and deployed to target microcontrollers through our development platform, which provides a unified benchmarking API across multiple boards. The resulting latency and memory usage are measured directly on the target hardware.

What’s next

Our long-term goal is to automate the design of highly efficient models, from defining a task to deploying an optimised model on an edge device. To achieve this, we are exploring how to combine our NAS and agentic search methods into a single optimisation loop that unifies the strengths of both approaches.
At the same time, we’re expanding the set of neural network layers available to the search system, including novel DeepGate layers designed to use less memory and run faster than conventional neural network layers. Incorporating these layers into the search space will unlock even greater efficiency on resource-constrained devices, enabling AI workloads once thought beyond the reach of microcontrollers – and ultimately bringing increasingly capable intelligence to billions of devices.

If you’re interested in shrinking your own models – or accessing our optimised vision and audio models – we’d love to hear from you.

References

- Jacob et al., Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, CVPR 2018.
- Cai, Gan, Wang, Zhang, Han, Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR 2020.
- Lin, Chen, Lin, Cohn, Gan, Han, MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020.