Modular: Translating to Mojo via AI Agents

Modular released AI agent skills for its Mojo language that enable coding assistants to translate existing GPU kernels from CUDA and Triton into Mojo code, addressing the challenge that large language models lack training data on the young programming language. The skills, which can be installed with a single command, provide a lightweight adapter that corrects misconceptions and non-idiomatic patterns, allowing developers to leverage Mojo's performance advantages across NVIDIA, AMD, and Apple silicon GPUs while maintaining a familiar syntax.

May 13, 2026 | Brad Larson, Modular Team | Product

At Modular, we’re always experimenting with the latest agentic programming tools, integrating the best ones into our workflows, and learning quite a few lessons along the way. One thing we realized is that the Mojo language is ideally suited to the needs of modern AI coding agents. Mojo has a familiar syntax with minimal boilerplate, so it’s token-efficient for agents to read and write. Its type system and constraint model catch many common errors at compile time.
Rather than having an agent chew through tons of tokens to build something that may or may not work, then spend hours debugging it when it doesn’t, Mojo catches problems early and provides clear error messages to the agent. This tighter feedback loop is one reason typed languages are increasingly favored for agentic workflows.

Mojo also doesn't trade ergonomics for performance. The same code that reads cleanly can target the full range of hardware Mojo supports, including NVIDIA, AMD, and Apple silicon GPUs.

The only challenge is that Mojo is still a young language, and LLMs haven’t been trained on much Mojo code. That’s why we’ve created tools to help agents produce correct, idiomatic Mojo from the start. This starts with open-sourcing all of the Mojo code we’ve written at Modular, along with all of our Python APIs for MAX.

Taking this to the next level, Mojo now has coding agent skills that plug into most AI coding assistants. These skills act as a lightweight adapter that corrects misconceptions, out-of-date patterns, and non-idiomatic code that AI models might produce. To get started with the skills, you can install them in any supported coding agent with a single command:

    npx skills add modular/skills

Mojo’s killer application today is the ability to write high-performance, portable GPU kernels. However, most of the world’s kernel code today is written in CUDA and Triton (primarily for NVIDIA GPUs). Using our AI agent skills, you can quickly translate these existing kernels to Mojo. Let’s see an example of this in action.

In this example, we’ll start from Szymon Ożóg’s exploration of Softmax optimizations in CUDA. Clone this repository locally as a reference to follow along at home:

    git clone git@github.com:SzymonOzog/FastSoftmax.git

In this tutorial, I’ll be using Claude Code with the Opus 4.7 model, but the same steps work with any frontier coding agent. First, let’s create a new Mojo project.
Our new-modular-project skill handles project setup and walks you through choosing the right configuration. Start a new session in your AI agent and begin by giving it this prompt:

    I'd like to start a new Mojo project named mojo_softmax using a quick uv environment and the nightly build of Mojo.

Your agent should activate the new-modular-project skill and handle the project setup for you:

    ⏺ Skill new-modular-project
      ⎿ Successfully loaded skill
    ⏺ All options specified. Creating a quick uv environment with nightly Mojo.
      Listed 1 directory ctrl+o to expand
    ⏺ Bash mkdir mojo_softmax && cd mojo_softmax && uv venv && uv pip install mojo --index https://whl.modular.com/nightly/simple/ --prerelease allow
      ⎿ Running…

and then report back when the new project is ready:

    ⏺ Bash source .venv/bin/activate && mojo --version
      ⎿ Mojo 0.26.3.0.dev2026042105 cfe16c5b
    ⏺ Project mojo_softmax is set up at ./mojo_softmax with a quick uv venv and nightly Mojo 0.26.3.0.dev2026042105 installed. To work in it:
        cd mojo_softmax
        source .venv/bin/activate

Once the project is configured with the Mojo compiler and tools, we can begin translating the CUDA reference code to Mojo. Use a prompt like the following, adjusting the path to match where you cloned FastSoftmax:

    In the directory ./FastSoftmax, there's a CUDA example of a fast softmax I'd like to port to Mojo. When done, build it and test it to verify that it works correctly.

Your output may vary by agent, but in a local run of this I saw:

    ⏺ I've read all the CUDA kernels. The file contains 10 progressive optimizations, with softmax_kernel8 as the default - it uses online normalization (single-pass max+sum), float4 vectorized loads, warp shuffles, and shared-memory cross-warp reduction. I'll port this to Mojo using shared-memory tree reductions (universal across GPU backends including Apple silicon) with coalesced strided access, matching the algorithm of kernels 3-8.
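To make the agent's summary more concrete, here is a minimal CPU-side sketch in Python of the two ideas it names: online normalization (tracking a running max and a rescaled sum in one pass) and a tree reduction that merges per-thread partial results. This is not the code the agent produced, nor the CUDA reference; the function names, the `num_threads` parameter, and the use of plain lists in place of GPU shared memory are all illustrative assumptions.

```python
import math

NEG_INF = float("-inf")


def thread_state(row, tid, num_threads):
    """One simulated thread: strided pass over the row, single loop.

    Returns (running_max, running_sum), where running_sum is the sum of
    exp(x - running_max) over the elements this thread visited.
    """
    m, s = NEG_INF, 0.0
    for x in row[tid::num_threads]:  # strided access, like tid, tid + N, ...
        new_m = max(m, x)
        # Rescale the sum gathered so far to the new max, then add this term.
        s = s * math.exp(m - new_m) + math.exp(x - new_m)
        m = new_m
    return m, s


def merge(a, b):
    """Combine two partial (max, sum) states into one."""
    (m1, s1), (m2, s2) = a, b
    if s1 == 0.0:  # empty state, avoids exp(-inf - -inf)
        return b
    if s2 == 0.0:
        return a
    m = max(m1, m2)
    return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)


def tree_reduce(states):
    """Stride-halving reduction, as is commonly done in shared memory.

    Assumes len(states) is a power of two (hypothetical simplification).
    """
    states = list(states)
    stride = len(states) // 2
    while stride > 0:
        for i in range(stride):
            states[i] = merge(states[i], states[i + stride])
        stride //= 2
    return states[0]


def softmax_row(row, num_threads=8):
    """Softmax of one row using the simulated threads above."""
    states = [thread_state(row, t, num_threads) for t in range(num_threads)]
    row_max, row_sum = tree_reduce(states)
    return [math.exp(x - row_max) / row_sum for x in row]


if __name__ == "__main__":
    print(softmax_row([1.0, 2.0, 3.0, 4.0], num_threads=2))
```

The point of the `merge` step is that partial (max, sum) pairs can be combined in any grouping, which is what lets a kernel split a row across threads and still compute max and sum together instead of in two separate passes.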
The agent will then activate the mojo-syntax and mojo-gpu-fundamentals skills to write a fast_softmax.mojo file:

    ⏺ Now I'll load the Mojo skills to guide the port.
    ⏺ Skill mojo-gpu-fundamentals
      ⎿ Successfully loaded skill
    ⏺ Skill mojo-syntax
      ⎿ Successfully loaded skill
    ⏺ Now I have enough understanding to write the Mojo port. I'll implement the softmax kernel, plus a test harness that verifies against a CPU reference.
    ⏺ Write fast_softmax.mojo

When complete, the agent will build and test the Mojo implementation to verify that it operates correctly:

    ⏺ Bash source .venv/bin/activate && mojo build fast_softmax.mojo 2>&1 | head -80
      ⎿ No output
    ⏺ Build succeeded. Running the test now.
    ⏺ Bash source .venv/bin/activate && ./fast_softmax 2>&1 | head -40
      ⎿ row 0 sum = 0.9999997 drift from 1.0 = 2.9802322e-07
        row 1 sum = 0.99999994 drift from 1.0 = 5.9604645e-08
        row 2 sum = 1.0000001 drift from 1.0 = 1.1920929e-07
        … +8 lines ctrl+o to expand
    ⏺ The port is complete and verified working on the local GPU.

Although your exact project format might differ from mine, your mojo_softmax project should now include a correct Mojo kernel and tests that exercise it (fast_softmax.mojo in my case here).

One key thing to note is that the original CUDA code for this example only worked on NVIDIA GPUs, but this new Mojo version runs on NVIDIA, AMD, and Apple silicon GPUs, as well as other future accelerators supported by Mojo. This is an incredibly fast way to bootstrap kernels, algorithms, and even models for new hardware.

This newly translated Mojo version should be functionally identical to the source, so you can read through the code to understand how concepts map from CUDA to Mojo. As a next step, you can also attempt further Mojo-specific optimizations to this code, or tune it for the exact hardware you are working with.
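The test output above reports each row's sum and its drift from 1.0, which is a natural sanity check for softmax. As a rough sketch of what a harness that verifies against a CPU reference might look like, here is a Python version. It is an assumption about the general shape of such a check, not the harness the agent wrote; `reference_softmax`, `check_rows`, and the `tol` threshold are hypothetical names and values, and the "candidate" rows stand in for results that would be copied back from the GPU in a real run.

```python
import math
import random


def reference_softmax(row):
    """Plain two-pass CPU softmax used as the ground truth."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def check_rows(candidate, inputs, tol=1e-5):
    """Compare candidate softmax rows against the CPU reference.

    Returns one (row_index, row_sum, drift, max_abs_err) tuple per row and
    raises AssertionError if any row exceeds the tolerance.
    """
    report = []
    for i, (row_in, row_out) in enumerate(zip(inputs, candidate)):
        expected = reference_softmax(row_in)
        max_err = max(abs(a - b) for a, b in zip(row_out, expected))
        row_sum = math.fsum(row_out)
        drift = abs(row_sum - 1.0)  # a softmax row should sum to 1
        if max_err > tol or drift > tol:
            raise AssertionError(
                f"row {i}: max_abs_err={max_err:.3e}, drift={drift:.3e}"
            )
        report.append((i, row_sum, drift, max_err))
    return report


if __name__ == "__main__":
    rng = random.Random(0)
    inputs = [[rng.uniform(-5.0, 5.0) for _ in range(16)] for _ in range(3)]
    # Stand-in for kernel output; a real harness would use the GPU results.
    candidate = [reference_softmax(row) for row in inputs]
    for i, row_sum, drift, _ in check_rows(candidate, inputs):
        print(f"row {i} sum = {row_sum} drift from 1.0 = {drift}")
```

Checking both the element-wise error and the row sum is a deliberate choice: the row sum alone would pass for any normalized output, even one with the wrong distribution.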
This also isn’t limited to using CUDA as a reference; the same process will work with Triton kernels in Python or code in various other languages.

As you’ve seen, you can rapidly translate existing reference code you may have in Python, CUDA, or many other sources to Mojo. Out of the box, this can even lead to concrete improvements. For example, when Automatika Robotics translated some CUDA and SYCL kernels used for autonomous navigation to Mojo, they saw immediate performance gains. In their own words:

“Same workload we use in EMOS kompass-core: 5,001 trajectories × 1,000 points, 10 s horizon, 4 cost functions enabled. I should note that I have used Claude to translate my mojo kernels, using the official skills and no optimization work on the mojo side has been done yet. Hence the initial result is quite impressive.”

Mojo 1.0 beta 1 has just been released, and using a frontier AI coding agent with these skills is a great way to get your older Mojo projects up to date for the official 1.0 release later this year. We know that LLMs benefit from languages that don’t change much over time, which is one reason we’re stabilizing Mojo for 1.0. In fact, we took a random sample of five community projects, installed these skills, and prompted Claude Opus 4.7:

    I’d like to update this project to the latest version of Mojo.

In all five cases, the agent correctly updated the entire project to build on the latest Mojo 1.0 beta 1 release with no other assistance.

These Mojo coding skills are available now. Here are three ways to put them to use:

Speed up your Python. If you have a Python function that's become a bottleneck, an agent with the Mojo skills can translate it to Mojo. Point it at the slow code, and it will produce an initial Mojo port you can drop in, profile, and tune. If you want to go further, the same code can target a GPU with minimal changes.

Replace CUDA or Triton with Mojo.
As the softmax demo shows, the skills handle the structural translation from CUDA to Mojo. The same process works for Triton kernels, as well as other kernel domain-specific languages. You get a portable starting point that runs on NVIDIA, AMD, and Apple silicon GPUs, without rewriting from scratch.

Get involved. The skills themselves are open source. If you hit a pattern the current skills don't handle well, open an issue or contribute a fix. The more real-world Mojo code agents encounter in the wild, the better they get at writing it.

Install the skills with npx skills add modular/skills and let us know what you build in the Modular forum.