ACE: A Shared Path to Faster Matrix Math on x86

AMD and Intel, through the x86 Ecosystem Advisory Group, are standardizing new matrix math instructions called ACE (AI Computation Extensions) to accelerate AI workloads on x86 processors. ACE introduces outer product operations and tile registers to improve compute density for matrix multiplication, supporting popular AI data formats including INT8, BF16, and OCP FP8. The collaboration aims to deliver consistent, portable AI performance across devices from laptops to data center servers.

Matrix multiplication, multiplying large grids of numbers, is the workhorse math behind neural networks and large language models (LLMs). As AI becomes more common across PCs, workstations, and servers, the x86 ecosystem benefits when these core operations are faster, more efficient, and easier for developers to target consistently. That's why AMD and Intel, along with partners across the ecosystem, are collaborating through the x86 Ecosystem Advisory Group (EAG) to standardize key architectural capabilities for x86, including matrix acceleration.

At a glance

- ACE (AI Computation Extensions) proposes new x86 instructions to accelerate matrix multiplication while integrating seamlessly with AVX10.
- ACE uses "outer product" operations and tile registers to do more matrix math per instruction and keep intermediate results close to compute.
- ACE supports popular AI data formats, including INT8, BF16, OCP FP8, and OCP MX formats (MXFP8 and MXINT8) with inline block scaling.
- The x86 EAG is aligning ACE as a standardized approach to matrix multiplication capabilities across devices, from laptops to data center servers.

Why CPUs need better "matrix math" building blocks

CPUs already accelerate math using SIMD (Single Instruction, Multiple Data) vector instructions. AVX10 is the next-generation direction for x86 vectors, and it can be used for matrix multiplication, but scaling and compute density can be limited for today's AI workloads.
ACE is designed to raise that ceiling while working seamlessly alongside AVX10.

ACE in plain English: build a 2D patch of results at once

Traditional vector approaches often compute matrix results in more "one-dimensional" chunks. ACE introduces an outer product operation that accumulates into a two-dimensional tile, effectively building a patch of the output matrix per instruction. That 2D accumulation is where compute density improves. ACE defines eight tile registers (each 512 bits × 16 rows) plus a block scale register to support block scaling with certain low-precision formats.

If you're familiar with Intel AMX, the tile concept will feel recognizable: AMX introduced a tile-based programming model for accelerating matrix operations. To reduce platform friction, ACE is designed to be exposed to software as a new "palette" under the AMX accelerator framework, helping reuse much of the system programming model and operating system support for tile state.

Why ACE pairs well with AVX10

ACE intentionally uses AVX10 vector registers as inputs. That means software can use AVX10 to prep and format data "just in time" before matrix operations, and then efficiently move data between vector and tile domains for surrounding work (layout transforms, conversions, and post-processing). ACE's eight tiles also enable blocked kernels that keep multiple output tiles live at once for better input reuse and reduced bandwidth pressure.

Supporting the formats AI uses today and the efficiency they demand

AI performance and efficiency increasingly depend on low-precision formats. ACE v1 supports INT8 and BF16 and, notably, also includes native support for OCP FP8 and OCP MX formats (MXFP8 and MXINT8) with inline block scaling. In broad terms, block scaling lets groups of small values share a scale factor, helping preserve useful numeric range while reducing memory and bandwidth demands.
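To make the two ideas above concrete, here is a small numerical sketch in plain Python. It is not ACE code and uses no ACE instructions or intrinsics: every function name, the tiny matrix sizes, and the choice of a power-of-two scale with signed 8-bit codes are assumptions made purely for illustration. It shows that a matrix product can be built as a running sum of outer products into a 2D accumulator, and how one shared scale can stand in for a whole block of values.

```python
import math


def quantize_block(values):
    """Map one block of floats to int8-range codes sharing one power-of-two scale.

    Hypothetical helper for illustration; real MX formats define their own
    block size and scale encoding.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, [0] * len(values)
    # Smallest power-of-two scale for which max_abs / scale fits in [-127, 127].
    exp = math.ceil(math.log2(max_abs / 127.0))
    scale = 2.0 ** exp
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def outer_product_accumulate(tile, a_col, b_row, scale=1.0):
    """Rank-1 update of a 2D accumulator: tile[i][j] += scale * a_col[i] * b_row[j].

    The `scale` argument models a shared block scale applied during
    accumulation (an assumption about how scaling composes, not ACE's
    defined behavior).
    """
    for i, a in enumerate(a_col):
        for j, b in enumerate(b_row):
            tile[i][j] += scale * a * b
    return tile


def matmul_by_outer_products(A, B):
    """Compute A @ B as a sum of K outer products (column k of A times row k of B)."""
    M, K, N = len(A), len(B), len(B[0])
    tile = [[0.0] * N for _ in range(M)]  # the 2D "patch" of the output
    for k in range(K):
        a_col = [A[i][k] for i in range(M)]
        outer_product_accumulate(tile, a_col, B[k])
    return tile


if __name__ == "__main__":
    print(matmul_by_outer_products([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(quantize_block([0.5, -1.0, 0.25, 2.0]))
```

Each pass through the loop in `matmul_by_outer_products` touches every element of the output patch, which is the sense in which an outer product does more work per step than a one-dimensional dot product. In `quantize_block`, the four values are stored as small integers plus a single scale, so the per-value storage cost drops while the block keeps its overall range.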
Standardizing through the x86 Ecosystem Advisory Group

The x86 EAG was launched in October 2024 to improve compatibility, predictability, and consistency across x86 processor-based products through standardized, developer-friendly features. In its first year, the EAG highlighted ACE as a standardized approach to matrix multiplication capabilities across devices ranging from laptops to data center servers.

What This Means for the Enterprise: Portability and Operational Simplicity

For CIOs and enterprise IT leaders, ACE opens a path to portable AI performance on CPUs across your fleet. By aligning matrix acceleration through the x86 Ecosystem Advisory Group, AMD and Intel are working toward consistent, standardized capabilities that software can depend on from laptops to data-center servers. That helps ISVs and internal teams reduce vendor-specific code forks, simplify validation, and keep a single deployment playbook across on-prem and cloud environments.

In practice, this can mean faster rollouts of AI-enabled features, fewer "special hardware" exceptions, and clearer lifecycle planning as platforms refresh. When combined with mainstream toolchains and libraries, standardized acceleration supports predictable performance improvements while preserving x86 compatibility and operational stability. This reduces risk and helps control cost as AI scales.

What comes next

ACE is a hardware + software story. Enablement work is underway across compilers, debuggers, and profilers, with planned integration into optimized kernels, libraries, and ML frameworks such as PyTorch and TensorFlow. The goal is straightforward: deliver faster, more efficient matrix math on CPUs, while keeping the x86 developer experience consistent across the ecosystem.