# OpenAI and Broadcom's Jalapeño, a Custom Inference ASIC: Inference ASIC vs GPU

> Source: <https://dev.to/pueding/openai-and-broadcoms-jalapeno-a-custom-inference-asic-inference-asic-vs-gpu-36jm>
> Published: 2026-06-27 11:21:30+00:00

**What:** The **OpenAI and Broadcom Jalapeño announcement** (June 24, 2026) is OpenAI's **first custom LLM-inference ASIC** — a reticle-sized compute chiplet paired with HBM, built to **run** models rather than train them. The idea it makes concrete is an **inference-optimized ASIC versus a general-purpose GPU**.

**Why:** At decode time the bottleneck is usually **moving data, not doing math**, so a chip co-designed around that movement can serve the same tokens using **far less power per token** — early testing reports substantially better performance-per-watt (final numbers still being measured), which at OpenAI's scale materially changes serving cost.

**vs prior:** A **general-purpose GPU** runs anything — training, graphics, every model — and pays in silicon and power for that flexibility; Jalapeño is **hard-wired for inference only**, trading the GPU's versatility for a shorter, faster path between memory and compute.

A kitchen rebuilt to cook one dish, with the pantry moved beside the stove.

```
                  THE ONE DISH: LLM inference
                            │
            ┌───────────────┴───────────────┐
            │                               │
     ┌──────▼───────┐                ┌──────▼───────┐
     │ Inference    │                │ General      │
     │ ASIC         │                │ GPU          │
     │ (one dish)   │                │ (whole menu) │
     └──────┬───────┘                └──────┬───────┘
            │                               │
   pantry beside the stove        pantry down the hall
   (HBM next to compute)          (data travels far)
            │                               │
            ▼                               ▼
   ✓ most plates per gas          ✗ pays power for
     (perf-per-watt)                flexibility unused
```

**ASIC** — An **Application-Specific Integrated Circuit** — silicon built for **one kind of job** rather than general-purpose computing. Giving up a general processor's flexibility buys speed and energy efficiency on that job. Jalapeño's job is LLM inference.

**HBM** — High-Bandwidth Memory — stacked DRAM placed **physically very close to the compute die** so data reaches the math units faster. It is the same fast memory used on high-end GPUs, and it is where the model actually lives during serving.

**Inference vs training** — Training **builds** a model's weights; inference **runs** the finished weights to generate tokens. They stress hardware differently, so a chip can be excellent at one and unable to do the other. Jalapeño is **inference-only**.

**Memory-bandwidth-bound** — When a computation spends most of its time **waiting for data to arrive from memory** rather than doing arithmetic. Single-token decode is the classic example: lots of bytes read, little math per byte.

**Tape-out** — The moment a chip design is finished and **sent to the fab to be manufactured**. Jalapeño went from first design to tape-out in **roughly nine months**, which OpenAI describes as one of the fastest such cycles to date.

**Reticle-sized chiplet** — The *reticle limit* is the largest area a lithography scanner can pattern in a single exposure (roughly 26 mm × 33 mm, about 858 mm²). A **reticle-sized compute chiplet** is about as large as one die can physically get; Jalapeño pairs one such tile with HBM.

**Performance-per-watt** — Useful work (tokens generated) divided by the **electrical power it costs**. At data-center scale this — not peak speed alone — sets the bill, which is why a custom inference chip targets it directly.
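To make the performance-per-watt definition concrete, here is a minimal sketch in Python. The two chips, their throughput and power figures, and the electricity price are all hypothetical values chosen for round arithmetic; none of them are published Jalapeño or GPU numbers.

```python
def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """Performance-per-watt: (tokens/s) / (J/s) = tokens per joule."""
    return tokens_per_second / power_watts


def energy_cost_per_million_tokens(
    tokens_per_second: float, power_watts: float, usd_per_kwh: float
) -> float:
    """Electricity cost (USD) of generating one million tokens."""
    seconds = 1_000_000 / tokens_per_second
    kwh = power_watts * seconds / 3.6e6  # 1 kWh = 3.6e6 joules
    return kwh * usd_per_kwh


# Hypothetical chips: same throughput, different power draw.
chips = {
    "general_gpu": {"tokens_per_second": 1000.0, "power_watts": 1000.0},
    "inference_asic": {"tokens_per_second": 1000.0, "power_watts": 500.0},
}
USD_PER_KWH = 0.10  # assumed electricity price

for name, chip in chips.items():
    tpj = tokens_per_joule(**chip)
    cost = energy_cost_per_million_tokens(**chip, usd_per_kwh=USD_PER_KWH)
    print(f"{name}: {tpj:.1f} tokens/J, ${cost:.4f} energy per 1M tokens")
```

With equal throughput, halving the power doubles tokens per joule and halves the energy bill per token, which is the lever a performance-per-watt design pulls.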

**The news.** On June 24, 2026, OpenAI and Broadcom unveiled Jalapeño, OpenAI's first "Intelligence Processor": a purpose-built ASIC for LLM inference, not a repurposed training accelerator or a general-purpose AI chip. It pairs a single reticle-sized compute chiplet with HBM (not commodity DRAM) to hold high throughput and low latency together, and was co-designed from first design to tape-out in roughly nine months. Engineering samples are already running production workloads in the lab, including GPT-5.3-Codex-Spark, with early testing reporting performance-per-watt "substantially better" than current state-of-the-art (final numbers still being measured). Initial deployment is targeted for end of 2026.

Picture a restaurant kitchen that can cook anything on the menu — pastry, grill, soup, all of it. That flexibility is wonderful, and it is exactly what a **general-purpose GPU** gives you: thousands of programmable cores that will run any parallel workload you throw at them, from training a model to rendering a game. **Jalapeño is that kitchen torn down and rebuilt to cook one dish — LLM inference — and nothing else.** The bet is that if you only ever cook one dish, a kitchen shaped around that single dish will cook it faster and far more cheaply than the do-everything kitchen ever could.

So what is the "one dish" actually limited by? Here is the part that surprises people: **at decode time, the thing slowing the kitchen down is not the chef's hands — it is the cooks walking ingredients in from a far pantry.** When a model generates a token, at small batch sizes it must stream the model's weights out of memory and through the compute units once, while doing comparatively little arithmetic per byte read. That makes single-token decode **memory-bandwidth-bound** — the roofline tips toward memory, and the math units sit mostly idle, waiting on data. The bottleneck the whole chip is fighting is *data movement*.
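The roofline argument in the paragraph above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the peak compute and bandwidth figures are hypothetical, weights are assumed to be 16-bit (2 bytes each), and traffic from the KV cache and activations is ignored.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Classic roofline: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


def decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Approximate FLOPs per byte for one decode step.

    Each weight is read once per step and used for about 2 FLOPs
    (multiply + add) per sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_weight


PEAK_FLOPS = 1e15  # hypothetical accelerator: 1 PFLOP/s
BANDWIDTH = 4e12   # hypothetical accelerator: 4 TB/s
ridge = PEAK_FLOPS / BANDWIDTH  # intensity where the two limits meet

for batch in (1, 8, 64, 512):
    ai = decode_intensity(batch)
    perf = attainable_flops(ai, PEAK_FLOPS, BANDWIDTH)
    bound = "memory" if ai < ridge else "compute"
    print(f"batch={batch:4d}  intensity={ai:6.1f} FLOP/B  "
          f"utilization={perf / PEAK_FLOPS:7.2%}  ({bound}-bound)")
```

Under these assumptions a batch of one sits far to the left of the ridge point, so the math units are almost entirely idle; only large batches push the workload toward the compute roof.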

```
Single-token decode — where the time goes:

moving data  ████████████████████████████████  dominates
computing    █                                  a sliver
```

The diagram makes the imbalance concrete: in the bandwidth-bound regime, the "moving data" bar dominates and the "computing" bar is a sliver. **Jalapeño's answer is the obvious one once you see the problem: move the pantry next to the stove.** It pairs that big compute chiplet with **HBM kept physically close**, so the costly trip between memory and compute is as short and as fast as the silicon allows. OpenAI says the design was derived from its *own* measurements of how its models behave at serving time, which is what "co-designed" really means here: the chip is shaped around the bottleneck the company actually observed, not a generic one.

Walk the decode math on a single token *(illustrative numbers — OpenAI has not published Jalapeño's figures)*. Say a model holds **100 GB of weights** and the accelerator reads them from memory at **4 TB/s**. Generating one token must stream those weights through compute roughly once, so the time is about **100 GB ÷ 4 TB/s = 25 ms** — and across that 25 ms the arithmetic units are mostly idle, waiting. Now **double the effective memory bandwidth and that 25 ms roughly halves**; double the raw compute instead and almost nothing changes. **That is the whole reason an inference chip is built around feeding the math units, not stacking more of them** — and why the headline metric is *performance-per-watt*, not peak FLOPs.
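The same walk-through as a small Python sketch, using the illustrative 100 GB and 4 TB/s from the text. The peak compute figure and the 2-bytes-per-weight assumption are hypothetical additions, included only so the "double the compute" case can be shown; the step time is modeled as whichever of memory time or compute time is larger.

```python
def decode_step_seconds(
    weight_bytes: float,
    bandwidth_bytes_per_s: float,
    peak_flops: float,
    bytes_per_weight: float = 2.0,  # assumed 16-bit weights
) -> float:
    """Rough lower bound on the time to generate one token at batch size 1."""
    memory_time = weight_bytes / bandwidth_bytes_per_s
    flops = 2.0 * weight_bytes / bytes_per_weight  # ~2 FLOPs per weight
    compute_time = flops / peak_flops
    return max(memory_time, compute_time)


WEIGHTS = 100e9    # 100 GB of weights (illustrative, from the text)
BANDWIDTH = 4e12   # 4 TB/s (illustrative, from the text)
PEAK_FLOPS = 1e15  # hypothetical 1 PFLOP/s

baseline = decode_step_seconds(WEIGHTS, BANDWIDTH, PEAK_FLOPS)
double_bandwidth = decode_step_seconds(WEIGHTS, 2 * BANDWIDTH, PEAK_FLOPS)
double_compute = decode_step_seconds(WEIGHTS, BANDWIDTH, 2 * PEAK_FLOPS)

print(f"baseline:         {baseline * 1e3:.1f} ms")
print(f"double bandwidth: {double_bandwidth * 1e3:.1f} ms")
print(f"double compute:   {double_compute * 1e3:.1f} ms")
```

In this model the compute time is a fraction of a millisecond, so doubling compute leaves the step time unchanged while doubling bandwidth halves it.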

None of this means GPUs are going away. The trade Jalapeño makes is real and one-directional: **you give up the GPU's ability to train, to switch to a very different kind of workload, to run the whole range of models and tasks a GPU handles.** A custom ASIC only pays off when you run one workload at enormous, sustained scale — which is precisely OpenAI's situation, and precisely why a startup serving a thousand requests a day would still reach for a GPU. The interesting signal is not "ASICs beat GPUs"; it is that LLM inference has become a large and stable enough workload to justify burning a chip for it.

| Chip | Built for | Flexibility | Where it wins |
|---|---|---|---|
| General-purpose GPU | training + inference + any parallel workload | Highest | The default — runs anything, backed by a mature software ecosystem |
| Repurposed training accelerator | training, also used to serve | High | Strong throughput, but carries training-only hardware that idles during inference |
| Inference ASIC (Jalapeño) | LLM inference only | Lowest | Built for top performance-per-watt on its one workload at scale (early results); inference-only, far less flexible |


An inference ASIC is an Application-Specific Integrated Circuit — silicon built for one kind of job rather than general-purpose computing — made to run (not train) large language models. OpenAI and Broadcom's Jalapeño, unveiled June 24, 2026, is OpenAI's first such chip: a reticle-sized compute chiplet paired with HBM, co-designed around the data-movement bottleneck of serving models at scale. It gives up a GPU's general-purpose flexibility in exchange for higher performance-per-watt on that single workload (early testing reports substantially better, with final numbers still being measured).

At decode time, generating a token is usually memory-bandwidth-bound — the chip spends most of its time moving the model's weights out of memory, not doing arithmetic. A general-purpose GPU pays in silicon and power for flexibility that inference never uses. A chip co-designed around the data-movement bottleneck — a large compute chiplet with HBM kept close — can serve the same tokens at substantially better performance-per-watt in early testing (final numbers still being measured), which at OpenAI's scale materially changes serving cost.

A GPU is general-purpose: thousands of programmable cores that run training, graphics, and any model. Jalapeño is an ASIC built for LLM inference only — it cannot train and is far less flexible than a general-purpose GPU. That is the trade: it loses the GPU's versatility and gains a shorter, faster path between memory and compute, which is what matters when the bottleneck is data movement rather than raw math. A custom ASIC pays off only when you run one workload at enormous, sustained scale.

Originally posted on [Learn AI Visually](https://learnaivisually.com/ai-explained/jalapeno-inference-asic-vs-gpu).
