At WWDC 26, Apple announced the Core AI framework, the official successor to Core ML. It is designed to allow developers to run large language models and generative AI entirely on-device, supporting both custom-converted PyTorch models and pre-optimized open-source models.
Apple says the new Core AI framework provides a unified architecture for deploying models ranging from compact 3B-parameter vision models to large-scale LLMs, including reasoning models with up to 70B parameters, across the iPhone, iPad, Mac, and Apple Vision Pro.
Core AI is the technology underpinning Apple Intelligence, and with the next release of its OSes and toolchain, Apple is making it available to developers to build what it calls "custom intelligence". Core AI, which can only run on Apple Silicon, ensures user data privacy, zero server dependencies, and zero per-token cloud costs.
Key Core AI capabilities include unified hardware access, allowing workloads to seamlessly run across the CPU, GPU, and Neural Engine under one API; a memory-safe Swift API enabling zero-copy data paths and fine-grained control over inference memory; and ahead-of-time (AOT) compilation, which shifts work off the user's device, yielding near-instant load times.
As mentioned, you can convert a PyTorch model into a Core AI model using Core AI PyTorch. The simplest approach is exporting a PyTorch model as a torch.export.ExportedProgram and converting it to a Core AI AIProgram using TorchConverter().add_exported_program(ep).to_coreai().
Alternatively, you can author a new Core AI model from a PyTorch one by using built-in composite ops provided by the library, such as attention, RoPE embeddings, RMSNorm, and gather-matmul; by registering custom lowering functions to map new PyTorch ops to Core AI IR; or even by creating custom Metal kernels for lower-level optimization.
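To make one of the composite ops named above concrete, here is a minimal plain-Python reference for RMSNorm, following its usual mathematical definition (divide by the root mean square of the input, then apply a learned per-element weight). This is only an illustrative sketch: the function name `rms_norm` and the `eps` default are assumptions, and it does not show how Core AI implements or exposes the op.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm for one vector: x / sqrt(mean(x^2) + eps) * weight.

    Illustrative only; not the Core AI composite op.
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


# With unit weights, the output has a root mean square of roughly 1.
normalized = rms_norm([3.0, 4.0], [1.0, 1.0])
```

A framework-provided composite op computes the same quantity, but as a single fused operation rather than a chain of elementwise steps, which is presumably why the library offers these as built-ins.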
When converting a PyTorch model, a critical step is compressing it for deployment on Apple hardware. This process applies optimization techniques such as quantization and palettization, which are designed to align with the execution patterns of the Core AI runtime by default, ensuring efficient on-device performance.
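The two techniques mentioned here can be sketched in a few lines of plain Python. The example below is a toy illustration, not the Core AI compression API: all function names are hypothetical, the quantizer is a simple symmetric 8-bit scheme, and the palette is fixed by hand, whereas real tools typically learn it (for example with k-means).

```python
def quantize_int8(weights):
    """Symmetric linear quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


def palettize(weights, palette):
    """Replace each weight with the index of its nearest palette entry."""
    return [
        min(range(len(palette)), key=lambda i: abs(w - palette[i]))
        for w in weights
    ]


def depalettize(indices, palette):
    """Look each index back up in the palette."""
    return [palette[i] for i in indices]


weights = [-1.0, -0.6, 0.0, 0.3, 1.0]

codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

palette = [-1.0, 0.0, 0.5, 1.0]  # 4 entries, so each index fits in 2 bits
indices = palettize(weights, palette)
approximated = depalettize(indices, palette)
```

Quantization stores one small integer per weight plus a shared scale, while palettization stores a short lookup table plus a low-bit index per weight. Both trade a bounded approximation error for a smaller model.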
Model compression can reduce your model's memory footprint (both on disk and at runtime), inference latency, and power consumption, or optimize all of these at once.
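A back-of-the-envelope calculation shows why the footprint matters at the model sizes mentioned in this article. The sketch below estimates weight storage only, assuming every parameter is stored at the same bit width; it ignores quantization scales, palettes, activations, and any runtime overhead, and the helper name is hypothetical.

```python
def weight_footprint_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB for a uniform bit width."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Compare 16-bit, 8-bit, and 4-bit storage for 3B and 70B parameters.
for num_params, label in ((3e9, "3B"), (70e9, "70B")):
    for bits in (16, 8, 4):
        size = weight_footprint_gib(num_params, bits)
        print(f"{label} parameters at {bits}-bit: {size:.2f} GiB")
```

Under these assumptions, halving the bit width halves the weight storage, so going from 16-bit to 4-bit weights cuts it by a factor of four.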
One important aspect of running an AIModel is its automatic specialization to the current hardware and OS version, which is carried out when the model is first loaded into the model cache. As a result, the first attempt to use a model may take significantly longer than subsequent runs, once the model has already been cached. Developers can control how and when this process happens by customizing SpecializationOptions and accessing the AICacheModel.
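The first-load behavior described above can be modeled with a small cache keyed by model, hardware, and OS version. The following is a conceptual sketch in plain Python, not the Core AI API: `SpecializationCache`, `fake_specialize`, `prewarm`, and the identifiers passed to them are all hypothetical stand-ins.

```python
class SpecializationCache:
    """Toy cache keyed by (model, hardware, OS version); not the Core AI API."""

    def __init__(self, specialize):
        self._specialize = specialize  # expensive step, called only on a miss
        self._entries = {}
        self.misses = 0

    def load(self, model_id, hardware, os_version):
        key = (model_id, hardware, os_version)
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._specialize(model_id, hardware, os_version)
        return self._entries[key]

    def prewarm(self, model_id, hardware, os_version):
        """Pay the specialization cost before the first real request."""
        self.load(model_id, hardware, os_version)


def fake_specialize(model_id, hardware, os_version):
    # Stand-in for the slow, hardware-specific specialization step.
    return f"{model_id} specialized for {hardware} on OS {os_version}"


cache = SpecializationCache(fake_specialize)
first = cache.load("demo-model", "chip-a", "1.0")   # miss: specialization runs
second = cache.load("demo-model", "chip-a", "1.0")  # hit: cached result reused
third = cache.load("demo-model", "chip-a", "1.1")   # new OS version: runs again
```

The sketch also suggests why controlling when specialization happens is useful: triggering it ahead of time, for instance during onboarding, moves the slow first load away from the moment the user first needs the model.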
With the introduction of Core AI, Apple is providing support for three distinct approaches to running ML/AI on its operating systems: Core ML, Core AI, and MLX Swift. Based on developer discussions on Hacker News, Apple seems to suggest using Core ML for "classic, non-neural ML", such as decision trees or tabular feature engineering, Core AI for neural networks and transformers, and MLX for working with custom model weights, though potentially with lower performance. Community feedback also notes that while Core AI "makes it easier to incorporate high-performance LLMs", its long-term value will depend "on the future growth of the official Core AI/community".