What about OpenCL and CUDA C++ alternatives?

wpnews.pro

Now serving MiniMax-M3! Request access today. Read More →

Inference Products

Shared Endpoints

Access frontier models via an API

Dedicated Endpoints

Mission critical reliability

Custom models

Your model, peak performance

Deployment Options

Our Cloud

Fully managed, pay by usage

Your Cloud

Modular stack in your VPC

Pricing

Flexible plans for every team

Models

DeepSeek V4 Pro

FLUX.2 Klein 9B

Kimi K2.6

MiniMax M3

Wan 2.2 T2V A14B

View All

Text to audio

Turn text into natural speech

Image generation

Generate images from text prompts

Code generation

Generate production-ready code

Video generation

Generate video from text + image

Agentic

Deploy AI agents anywhere

Custom Models

Kernel-level model control

Case Studies Proven results from real customers

MAX Framework

GenAI native modeling & serving

Mojo Language

The best GPU & CPU performance

Self-Hosted

MAX+Mojo self-hosted by you

Community

Build the future of AI together

Mojo Agent Skills

Official AI agent skills from Modular

Docs

Deploy GenAI models, our cloud or yours

Model Library

Latest supported open models

Mojo Docs

Write high-performance kernels for CPUs and GPUs

About

Build AI for anyone, anywhere.

Careers

👋 We’re currently hiring!

Culture

What we believe

Contact Us

Request a demo

March 5, 2025

Chris Lattner

Series

GenAI may be new, but GPUs aren’t! Over the years, many have tried to create portable GPU programming models using C++, from OpenCL to SYCL to OneAPI and beyond. These were the most plausible CUDA alternatives that aimed to democratize AI compute, but you may have never heard of them - because they failed to be relevant for AI.

These projects have all contributed meaningfully to compute, but if we are serious about unlocking AI compute for the future, we must critically examine the mistakes that held them back—not just celebrate the wins. At a high level, the problems stem from the challenges of "open coopetition"—where industry players both collaborate and compete—as well as specific management missteps along the way.

Let’s dive in. 🚀

There are many projects that aimed to unlock GPU programming, but the one I know best is OpenCL. Like CUDA, OpenCL aimed to give programmers a C++-like experience for writing code that ran on the GPU. The history is personal: in 2008, I was one of the lead engineers implementing OpenCL at Apple (it was the first production use of the Clang compiler I was building). After we shipped it, we made the pivotal decision to contribute it to the Khronos Group so it could get adopted and standardized across the industry.

That decision led to broad industry adoption of OpenCL (see the logos), particularly in mobile and embedded devices. Today, it remains hugely successful, powering GPU compute on platforms like Android, as well as in specialized applications such as DSPs. Unlike CUDA, OpenCL was designed for portability from the outset, aiming to support heterogeneous compute across CPUs, GPUs, and other accelerators. OpenCL also inspired other systems like SyCL, Vulkan, SPIR-V, oneAPI, WebCL and many others.

However, despite its technical strengths and broad adoption, OpenCL never became the dominant AI compute platform. There are several major reasons for this: the inherent tensions of open coopetition, technical problems that flowed from that, the evolving requirements of AI, and NVIDIA’s unified strategy with TensorFlow and PyTorch.

In 2008, Apple was a small player in the PC space, and thought that industry standardization would enable it to reach more developers. However, while OpenCL did gain broad adoption among hardware makers, its evolution quickly ran into a major obstacle: the speed of committee-driven development. For Apple, this slow-moving, consensus-driven process was a dealbreaker: we wanted to move the platform rapidly, add new features (e.g. add C++ templates), and express the differentiation of the Apple platform. We faced a stark reality - the downside of a committee standard is that things suddenly moved at committee consensus speed… which felt glacial.

Hardware vendors recognized the long-term benefits of a unified software ecosystem, but in the short term, they were fierce competitors. This led to subtle but significant problems: instead of telling the committee about the hardware features you’re working on (giving a competitor a head start), participants would keep innovations secret until after the hardware shipped, and only discuss it after these features became commoditized (using vendor-specific extensions instead).

This became a huge problem for Apple, a company that wanted to move fast in secret to make a big splash with product launches. As such, Apple decided to abandon OpenCL: it introduced Metal instead, never brought OpenCL to iOS, and deprecated it out of macOS later. Other companies stuck with OpenCL, but these structural challenges continued to limit its ability to evolve at the pace of cutting-edge AI and GPU innovation.

While Apple boldly decided to contribute the OpenCL standard to Kronos, it wasn’t all-in: it contributed OpenCL as a technical specification—but without a full reference implementation. Though parts of the compiler front-end (Clang) was open source, there was no shared OpenCL runtime, forcing vendors to develop their own custom forks and complete the compiler. Each vendor had to maintain its own implementation (a ”fork”), and without a shared, evolving reference, OpenCL became a patchwork of vendor-specific forks and extensions. This fragmentation ultimately weakened its portability—the very thing it was designed to enable. Furthermore, because vendors held back differentiated features or isolated them into vendor-specific extensions, which exploded in number and fragmented OpenCL (and the derivatives), eroding its ability to be a unifying vendor-agnostic platform. These problems were exacerbated by weaknesses in OpenCL’s compatibility and conformance tests. On top of that, it inherited all the “C++ problems” that we discussed before.

Developers want stable, well-supported tools—but OpenCL’s fragmentation, weak conformance tests, and inconsistent vendor support made it an exercise in frustration. One developer summed it up by saying that using OpenCL is “about as comfortable as hugging a cactus”! Ouch.

While OpenCL was struggling with fragmentation and slow committee-driven evolution, AI was rapidly advancing—both in software frameworks and hardware capabilities. This created an even bigger gap between what OpenCL offered and what modern AI workloads needed. The introduction of TensorFlow and PyTorch kicked off a revolution in AI research - powered by improved infrastructure and massive influx of BigCo funding. This posed a major challenge for OpenCL. While it enabled GPU compute, it lacked the high-level AI libraries and optimizations necessary for training and inference at scale. Unlike CUDA, it had no built-in support for key operations like matrix multiplication, Flash Attention, or datacenter-scale training.

Cross-industry efforts to expand TensorFlow and PyTorch to use OpenCL quickly ran into fundamental roadblocks (despite being obvious and with incredible demand). The developers who kept hugging the cactus soon discovered a harsh reality: portability to new hardware is meaningless if you can’t unlock its full performance. Without a way to express portable hardware-specific enhancements—and with coopetition crushing collaboration—progress stalled.

One glaring example? OpenCL still doesn’t provide standardized support for Tensor Cores—the specialized hardware units that power efficient matrix multiplications in modern GPUs and AI accelerators. This means that using OpenCL often means a 5x to 10x slowdown in performance compared to using CUDA or other fragmented vendor native software. For GenAI, where compute costs are already astronomical, a 5x to 10x slowdown isn’t just inconvenient—it’s a complete dealbreaker.

While OpenCL struggled under the weight of fragmented governance, NVIDIA took a radically different approach—one that was tightly controlled, highly strategic, and ruthlessly effective, as we discussed earlier. It actively co-designed CUDA’s high-level libraries alongside TensorFlow and PyTorch, ensuring they always ran best on NVIDIA hardware. Since these frameworks were natively built on CUDA, NVIDIA had a massive head start—and it doubled down by optimizing performance out of the box. NVIDIA maintained a token OpenCL implementation—but it was strategically hobbled (e.g., not being able to use TensorCores)—ensuring that a CUDA implementation would always be necessary. NVIDIA’s continued and rising dominance in the industry put it on the path to ensure that the CUDA implementations would always be the most heavily invested in. Over time, OpenCL support faded, then vanished—while CUDA cemented its position as the undisputed standard.

The history above is well understood by those of us who lived through it, but the real value comes from learning from the past. Based on this, I believe successful systems must:

These are the fundamental reasons why I don’t believe that committee efforts like OpenCL can ever succeed. It’s also why I’m even more skeptical of projects like Intel’s OneAPI (now UXL Foundation) that are notionally open, but in practice, controlled by a single hardware vendor competing with all the others.

At the same time that C++ approaches failed to unify AI compute for hardware makers, the AI industry faced a bigger challenge—even using CUDA on NVIDIA hardware. How can we scale AI compute if humans have to write all the code manually? There are too many chips, too many AI algorithms, and too many workload permutations to optimize by hand.

As AI’s dominance grew, it inevitably attracted interest from systems developers and compiler engineers—including myself. In the next post, we’ll dive into widely known “AI compiler” stacks like TVM, OpenXLA, and MLIR—examining what worked, what didn’t, and what lessons we can take forward. Unfortunately, the lessons are not wildly different than the ones above:

History may not repeat itself, but it does rhyme. - Mark Twain

See you next time—until then, may the FLOPS be with you! 👨💻

-Chris

Learn more about the MAX Platform and the Mojo programming language, and join us in building the next wave of AI innovation.

Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA

September 19, 2025

Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance

September 12, 2025

Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul

September 5, 2025

Build the future of AI with Modular

Sign up today

Signup to our Cloud Platform today to get started easily.

Browse open models

Browse our model catalog, or deploy your own custom model

Get all our latest news, announcements and updates delivered directly to your inbox. Unsubscribe at anytime.

⚠️ This form requires JavaScript to function. Please enable JavaScript in your browser to continue.

Thanks for signing up to our newsletter! 🚀

Thank you,

Modular Sales Team

source & further reading

modular.com — original article Modular: Qualcomm Completes Acquisition of Modular You can now run Max AI models on Apple Silicon Modular: Qualcomm to Acquire Modular

What about OpenCL and CUDA C++ alternatives?

Run your AI side-project on zahid.host