# GateGPT: Running Transformers in Pure Digital Logic on FPGAs

> Source: <https://www.devclubhouse.com/a/gategpt-running-transformers-in-pure-digital-logic-on-fpgas>
> Published: 2026-06-20 04:23:18+00:00


By synthesizing a GPT model directly into hardware, GateGPT achieves massive throughput at a fraction of the clock speed.

[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza)

Edge AI is currently dominated by a software-on-hardware paradigm: we take massive general-purpose processors (GPUs, TPUs, or NPUs), load a software runtime, and stream model weights through memory. But what if the neural network *is* the hardware?

In June 2026, independent hardware engineer Fabio Guzman introduced **GateGPT**, an open-source Register Transfer Level (RTL) implementation of Andrej Karpathy's microGPT. Synthesized onto a 16-year-old [Xilinx](https://www.xilinx.com) Virtex-5 FPGA (XC5VLX110T) running at a modest 80 MHz, GateGPT generates up to 69,200 tokens per second, with a sustained average of approximately 60,600 tokens per second.

This is not just an impressive retro-hardware hack. GateGPT demonstrates a credible, GPU-free inference baseline for edge deployments. By compiling a transformer directly into digital logic, it avoids the instruction-fetch overhead of CPUs and the large power envelopes of GPUs, showing that application-specific logic can deliver tens of thousands of tokens per second on a small power and clock budget.

## The Architecture of GateGPT: Microcode ROM and Datapath Actuators

Instead of building a monolithic, rigid state machine to handle the transformer's operations, Guzman opted for a hybrid approach: a microcode-ROM sequencer architecture. This design is conceptually closer to a classic CPU than a traditional hardwired neural network accelerator.

At its core, a small program ROM contains macro-instructions. A micro-program counter fetches one macro-op at a time, triggers the corresponding modular datapath actuator, and stalls until it receives a "done" signal. This instruction schedule is encoded in a program ROM (`ucode.hex`), compiled by a custom assembler (`tools/ucode_asm.py`).
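The fetch, trigger, and wait loop can be modeled in a few lines of Python. This is a minimal behavioral sketch, not GateGPT's RTL: the opcode names reuse the actuator names from the article, but the latencies, the one-cycle issue cost, and the `Actuator` and `run_sequencer` helpers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Actuator:
    """A datapath block that is busy for `latency` cycles, then signals done."""
    name: str
    latency: int          # hypothetical cycle count, chosen for illustration
    remaining: int = 0

    def start(self) -> None:
        self.remaining = self.latency

    def tick(self) -> bool:
        """Advance one clock; return True on the cycle the work completes."""
        self.remaining -= 1
        return self.remaining <= 0


def run_sequencer(rom: List[str], actuators: Dict[str, Actuator]) -> Tuple[int, List[str]]:
    """Run every macro-op in `rom` in order; return (total cycles, issue trace)."""
    cycles = 0
    upc = 0                      # micro-program counter
    trace: List[str] = []
    while upc < len(rom):
        unit = actuators[rom[upc]]   # fetch and decode the macro-op
        unit.start()                 # trigger the matching actuator
        cycles += 1                  # assumed one cycle to fetch and issue
        done = False
        while not done:              # sequencer stalls until "done"
            done = unit.tick()
            cycles += 1
        trace.append(unit.name)
        upc += 1
    return cycles, trace


# Hypothetical program and latencies.
units = {
    "embed": Actuator("embed", 2),
    "norm": Actuator("norm", 8),
    "matvec": Actuator("matvec", 12),
}
total_cycles, issued = run_sequencer(["embed", "norm", "matvec", "norm"], units)
print(total_cycles, issued)
```

Because the sequencer only ever waits on one actuator at a time, the total cycle count in this model is simply the sum of each macro-op's issue cost and latency, which is what makes the schedule easy to reason about.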

The heavy lifting is distributed across specialized, modular hardware blocks (actuators) that share a true dual-port Block RAM (BRAM) scratchpad called `vmem`. This scratchpad stores both active activations and the persistent Key-Value (KV) cache.

The actuators include:

- `matvec`: A parallel multiply-accumulate tile designed for linear projections, capable of processing 24 lanes by 2 columns per cycle.
- `norm`: An RMSNorm unit utilizing hardware-based unsigned division (`udiv`) and inverse square root (`isqrt`) primitives, processing 2 elements per cycle.
- `attn`: A single-position multi-head causal attention block equipped with per-head parallel dividers.
- `exp_unit`: A fixed-point exponential calculator using a 17-entry lookup table combined with linear interpolation.
- `sampler`: A module that handles temperature-scaled softmax and Linear Congruential Generator (LCG) categorical sampling, or falls back to greedy argmax.
- `embed` and `vecop`: Handle embedding lookups, residual additions, and ReLU activations.
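To make the `exp_unit` idea concrete, here is a rough Python model of a 17-entry lookup table with linear interpolation in Q5.11. The table size comes from the article; the input domain of [-8.0, 0.0], the 0.5 knot spacing, the clamping behavior, and the name `exp_q511` are assumptions made for this sketch and may not match GateGPT's actual table.

```python
import math

FRAC_BITS = 11
ONE = 1 << FRAC_BITS            # 1.0 in Q5.11 (raw value 2048)
X_MIN = -8 * ONE                # assumed lower bound of the domain: -8.0
STEP_SHIFT = 10                 # assumed knot spacing: 2**10 raw units = 0.5

# 17 knots span 16 equal segments across [-8.0, 0.0].
LUT = [round(math.exp(-8.0 + 0.5 * i) * ONE) for i in range(17)]


def exp_q511(x: int) -> int:
    """Approximate exp(x) for a raw Q5.11 input; returns a raw Q5.11 value."""
    x = max(X_MIN, min(0, x))              # clamp to the table's domain
    offset = x - X_MIN                     # 0 .. 16384
    idx = min(offset >> STEP_SHIFT, 15)    # segment index 0 .. 15
    frac = offset - (idx << STEP_SHIFT)    # position inside the segment, 0 .. 1024
    lo, hi = LUT[idx], LUT[idx + 1]
    # Linear interpolation using only a multiply, a shift, and an add.
    return lo + (((hi - lo) * frac) >> STEP_SHIFT)


for value in (0.0, -0.25, -1.0, -4.0):
    raw = round(value * ONE)
    print(value, exp_q511(raw) / ONE, math.exp(value))
```

With knots this coarse, the approximation is weakest on the segment closest to zero, where the curve bends most. A non-positive domain is a natural fit for softmax after the maximum score has been subtracted, which is why it was picked for this sketch.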

## The Hardware KV Cache and Q5.11 Fixed-Point Math

To fit a transformer into the limited logic of a 2008-era FPGA—occupying just 8% of the Virtex-5's resources—GateGPT employs aggressive optimization and strict numerical constraints.

The model uses signed Q5.11 fixed-point arithmetic. This 16-bit format allocates 5 bits for the integer part (including the sign) and 11 bits for the fractional part. Fixed-point math completely eliminates the need for complex, area-heavy floating-point units (FPUs), allowing the arithmetic logic to be synthesized into simple, fast adder and multiplier trees.
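As a quick illustration of the format, the sketch below converts between floats and signed Q5.11 and multiplies two Q5.11 values. The representable range of -16.0 to just under 16.0 and the step of 2^-11 follow directly from the bit layout; the saturating behavior and the helper names are assumptions, since the article does not say whether GateGPT saturates or wraps on overflow.

```python
FRAC_BITS = 11
SCALE = 1 << FRAC_BITS                     # 2048 raw units per 1.0
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1   # 16-bit signed range


def saturate(raw: int) -> int:
    """Clamp to the 16-bit signed range (assumed overflow policy)."""
    return max(Q_MIN, min(Q_MAX, raw))


def to_q511(x: float) -> int:
    """Quantize a float to a raw Q5.11 integer."""
    return saturate(round(x * SCALE))


def from_q511(q: int) -> float:
    """Convert a raw Q5.11 integer back to a float."""
    return q / SCALE


def mul_q511(a: int, b: int) -> int:
    """Multiply two Q5.11 values: full-width product, then drop 11 fraction bits."""
    return saturate((a * b) >> FRAC_BITS)


print(to_q511(1.0))                                   # raw encoding of 1.0
print(from_q511(Q_MAX), from_q511(Q_MIN))             # largest and smallest values
print(from_q511(mul_q511(to_q511(1.5), to_q511(-2.0))))
```

The multiply is the part that maps well onto hardware: an integer multiplier followed by a fixed shift, with no exponent alignment or normalization logic.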

The architectural crown jewel of GateGPT's performance is its hardware-native KV cache. In software-based inference, managing the KV cache involves complex memory pointer manipulation and dynamic allocation. In GateGPT, the KV cache is baked directly into the `vmem` BRAM. Instead of recomputing the entire context window (up to 16 tokens) for every newly generated token, the `attn` actuator calculates only the K and V projections for the current token and appends them to the pre-allocated cache lines in `vmem`.
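The append-only pattern described above can be sketched in Python. This is an illustrative floating-point model for a single head, not the fixed-point RTL: the 16-slot limit comes from the article, while the vector width, the `KVCache` class, and the `attend` function are hypothetical.

```python
import math
from typing import List, Sequence


class KVCache:
    """Pre-allocated key/value storage with one slot per context position."""

    def __init__(self, max_tokens: int = 16, dim: int = 2) -> None:
        # dim is a hypothetical per-head width chosen for illustration.
        self.max_tokens = max_tokens
        self.k = [[0.0] * dim for _ in range(max_tokens)]
        self.v = [[0.0] * dim for _ in range(max_tokens)]
        self.length = 0              # number of valid slots

    def append(self, k_vec: Sequence[float], v_vec: Sequence[float]) -> None:
        """Store the current token's K and V; earlier slots are never recomputed."""
        if self.length >= self.max_tokens:
            raise IndexError("context window full")
        self.k[self.length] = list(k_vec)
        self.v[self.length] = list(v_vec)
        self.length += 1


def attend(cache: KVCache, q: Sequence[float]) -> List[float]:
    """Single-position attention over the cached slots (needs at least one)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, cache.k[t]))
              for t in range(cache.length)]
    peak = max(scores)                           # subtract max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(weights[t] * cache.v[t][d] for t in range(cache.length)) / total
            for d in range(len(q))]


cache = KVCache()
cache.append([1.0, 0.0], [2.0, 4.0])
cache.append([1.0, 0.0], [4.0, 8.0])
print(attend(cache, [1.0, 0.0]))
```

Causality falls out of the structure: the query only ever sees slots that were written before it, so no explicit mask is needed in this model.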

Through nine distinct optimization stages, Guzman increased the design's throughput by 28x—climbing from an initial 2,433 tokens/sec to the peak 69,200 tokens/sec. This massive speedup highlights the raw efficiency of hardware-level pipelining and memory-bandwidth matching.
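The reported figures are easier to compare when converted to clock cycles per token. The short calculation below uses only the numbers quoted in this article and assumes tokens are produced back to back at the stated 80 MHz clock; the function name is just for illustration.

```python
CLOCK_HZ = 80_000_000      # 80 MHz, as reported
PEAK_TPS = 69_200          # peak tokens per second
AVG_TPS = 60_600           # sustained average tokens per second
BASELINE_TPS = 2_433       # throughput before the optimization stages


def cycles_per_token(clock_hz: int, tokens_per_sec: int) -> float:
    """Average clock cycles spent on each generated token."""
    return clock_hz / tokens_per_sec


speedup = PEAK_TPS / BASELINE_TPS
peak_latency_us = 1_000_000 / PEAK_TPS

print(f"baseline: {cycles_per_token(CLOCK_HZ, BASELINE_TPS):,.0f} cycles/token")
print(f"average:  {cycles_per_token(CLOCK_HZ, AVG_TPS):,.0f} cycles/token")
print(f"peak:     {cycles_per_token(CLOCK_HZ, PEAK_TPS):,.0f} cycles/token")
print(f"speedup:  {speedup:.1f}x, peak latency {peak_latency_us:.2f} us/token")
```

Under that assumption, the optimizations cut the per-token budget from roughly 32,900 cycles to about 1,160 cycles at peak, or around 14 microseconds per token.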

## The Developer Angle: Compiling to Silicon vs. Edge NPUs

For developers building edge AI applications—such as robotics, IoT sensors, or embedded medical devices—GateGPT represents a fork in the road.

Currently, edge AI relies on microcontrollers or low-power NPUs running lightweight runtimes like [TensorFlow Lite](https://www.tensorflow.org/lite) or MicroTVM. While these platforms offer flexibility, they introduce layers of abstraction: compiler toolchains, runtime interpreters, and OS scheduling.

GateGPT offers an alternative: start from a [Python](https://www.python.org)-based reference model (like Karpathy's microGPT), describe the network directly at the Register Transfer Level (RTL) in Verilog or VHDL, and synthesize that RTL onto the FPGA.

| Dimension | Edge NPU / Microcontroller | RTL-Synthesized Transformer (GateGPT) |
|---|---|---|
| Latency | Milliseconds (variable due to OS/runtime overhead) | Microseconds (deterministic, clock-cycle accurate) |
| Power Consumption | Watts (typically 1W to 15W) | Milliwatts (fraction of a watt at low clock speeds) |
| Flexibility | High (swap models by loading a new binary) | Low (requires re-synthesis and FPGA flashing) |
| Hardware Cost | Medium to High (specialized silicon) | Low (can run on cheap, legacy, or radiation-hardened FPGAs) |

In practice, adopting a GateGPT-style workflow requires a shift in developer tooling. Instead of training in PyTorch and exporting to ONNX for a software runtime, the workflow looks like this:

1. **Train and Quantize**: Train a micro-model in PyTorch, quantizing weights to Q5.11 fixed-point.
2. **Generate Microcode**: Use a tool like `ucode_asm.py` to compile the model's execution graph into a sequence of macro-instructions for the ROM.
3. **Synthesize and Route**: Run the RTL through FPGA synthesis tools to map the actuators and memory blocks to the target silicon.
4. **Deploy**: Flash the bitstream to the FPGA.

The obvious caveat is scale. GateGPT runs a tiny model (4,192 parameters, 27-character vocabulary). Scaling this architecture to a 1-billion-parameter model is currently bottlenecked by FPGA on-chip memory (BRAM) capacity. However, for highly specialized, ultra-low-latency tasks such as wake-word detection, real-time signal filtering, or local character-level parsing, this approach is hard to match.
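A back-of-the-envelope calculation shows how wide that gap is. The sketch below counts weight storage only, at 16 bits per Q5.11 weight; activations and the KV cache would add to it. The on-chip budget passed to `fits_on_chip` is a round-number placeholder for illustration, not a datasheet figure for any specific FPGA.

```python
BITS_PER_WEIGHT = 16        # one Q5.11 value per parameter


def weight_bits(n_params: int) -> int:
    """Bits needed to hold every weight in Q5.11."""
    return n_params * BITS_PER_WEIGHT


def weight_bytes(n_params: int) -> int:
    """Bytes needed to hold every weight in Q5.11."""
    return weight_bits(n_params) // 8


def fits_on_chip(n_params: int, bram_bits: int) -> bool:
    """True if the weights alone fit in a given on-chip memory budget."""
    return weight_bits(n_params) <= bram_bits


PLACEHOLDER_BRAM_BITS = 5_000_000   # hypothetical budget, for illustration only

for name, n_params in (("GateGPT", 4_192), ("1B model", 1_000_000_000)):
    print(name, weight_bytes(n_params), "bytes,",
          "fits" if fits_on_chip(n_params, PLACEHOLDER_BRAM_BITS) else "does not fit")
```

GateGPT's weights come to 8,384 bytes, while a 1-billion-parameter model at the same precision needs about 2 GB, so a design at that scale would have to stream weights from external memory rather than keep them next to the logic.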

## A New Baseline for Edge AI

GateGPT is a compelling proof of concept that challenges the assumption that AI inference requires massive, power-hungry processors. By proving that a full transformer with a KV cache can run efficiently at just 80 MHz on legacy hardware, it opens the door for a new class of deterministic, ultra-low-power edge AI devices. For developers willing to venture into RTL and hardware synthesis, the reward is inference speed and efficiency that software runtimes simply cannot match.

