{"slug": "gategpt-running-transformers-in-pure-digital-logic-on-fpgas", "title": "GateGPT: Running Transformers in Pure Digital Logic on FPGAs", "summary": "Independent hardware engineer Fabio Guzman introduced GateGPT, an open-source RTL implementation of microGPT synthesized onto a 16-year-old Xilinx Virtex-5 FPGA. Running at 80 MHz, GateGPT generates up to 69,200 tokens per second, demonstrating a GPU-free inference baseline for edge deployments by compiling a transformer directly into digital logic gates.", "body_md": "# GateGPT: Running Transformers in Pure Digital Logic on FPGAs\n\nBy synthesizing a GPT model directly into hardware, GateGPT achieves massive throughput at a fraction of the clock speed.\n\n[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza)\n\nEdge AI is currently dominated by a software-on-hardware paradigm: we take massive general-purpose processors (GPUs, TPUs, or NPUs), load a software runtime, and stream model weights through memory. But what if the neural network *is* the hardware?\n\nIn June 2026, independent hardware engineer Fabio Guzman introduced **GateGPT**, an open-source Register Transfer Level (RTL) implementation of Andrej Karpathy's microGPT. Synthesized onto a 16-year-old [Xilinx](https://www.xilinx.com) Virtex-5 FPGA (XC5VLX110T) running at a modest 80 MHz, GateGPT generates up to 69,200 tokens per second, with a sustained average of approximately 60,600 tokens per second.\n\nThis is not just an impressive retro-hardware hack. GateGPT demonstrates a credible, GPU-free inference baseline for edge deployments. 
By compiling a transformer directly into digital logic gates, it avoids most of the instruction-fetch overhead of CPUs and the large power envelopes of GPUs, showing that highly optimized, application-specific logic can deliver fast inference on minimal power and clock budgets.\n\n## The Architecture of GateGPT: Microcode ROM and Datapath Actuators\n\nInstead of building a monolithic, rigid state machine to handle the transformer's operations, Guzman opted for a hybrid approach: a microcode-ROM sequencer architecture. This design is conceptually closer to a classic CPU than a traditional hardwired neural network accelerator.\n\nAt its core, a small program ROM contains macro-instructions. A micro-program counter fetches one macro-op in a single clock cycle, triggers the corresponding modular datapath actuator, and then halts until it receives a \"done\" signal. This instruction schedule is stored in the program ROM image (`ucode.hex`), which is produced by a custom assembler (`tools/ucode_asm.py`).\n\nThe heavy lifting is distributed across specialized, modular hardware blocks (actuators) that share a true dual-port Block RAM (BRAM) scratchpad called `vmem`. 
This scratchpad stores both the current activations and the persistent Key-Value (KV) cache.\n\nThe actuators include:\n\n- `matvec`: A parallel multiply-accumulate tile designed for linear projections, capable of processing 24 lanes by 2 columns per cycle.\n- `norm`: An RMSNorm unit utilizing hardware-based unsigned division (`udiv`) and inverse square root (`isqrt`) primitives, processing 2 elements per cycle.\n- `attn`: A single-position multi-head causal attention block equipped with per-head parallel dividers.\n- `exp_unit`: A fixed-point exponential calculator using a 17-entry lookup table combined with linear interpolation.\n- `sampler`: A module that handles temperature-scaled softmax and Linear Congruential Generator (LCG) categorical sampling, or falls back to greedy argmax.\n- `embed` and `vecop`: Handle embedding lookups, residual additions, and ReLU activations.\n\n## The Hardware KV Cache and Q5.11 Fixed-Point Math\n\nTo fit a transformer into the limited logic of a 2008-era FPGA, where it occupies just 8% of the Virtex-5's resources, GateGPT employs aggressive optimization and strict numerical constraints.\n\nThe model uses signed Q5.11 fixed-point arithmetic. This 16-bit format allocates 5 bits for the integer part (including the sign) and 11 bits for the fractional part, giving a representable range of [-16, 16) with a resolution of 2^-11 (about 0.00049). Fixed-point math eliminates the need for complex, area-heavy floating-point units (FPUs), allowing the arithmetic logic to be synthesized into simple, fast adder and multiplier trees.\n\nThe architectural crown jewel of GateGPT's performance is its hardware-native KV cache. In software-based inference, managing the KV cache involves complex memory pointer manipulation and dynamic allocation. In GateGPT, the KV cache is baked directly into the `vmem` BRAM. 
Instead of recomputing the entire context window (up to 16 tokens) for every newly generated token, the `attn` actuator calculates only the K and V projections for the current token and appends them to the pre-allocated cache lines in `vmem`.\n\nThrough nine distinct optimization stages, Guzman increased the design's throughput by roughly 28x, from an initial 2,433 tokens/sec to the peak of 69,200 tokens/sec. This speedup highlights the raw efficiency of hardware-level pipelining and memory-bandwidth matching.\n\n## The Developer Angle: Compiling to Silicon vs. Edge NPUs\n\nFor developers building edge AI applications such as robotics, IoT sensors, or embedded medical devices, GateGPT represents a fork in the road.\n\nCurrently, edge AI relies on microcontrollers or low-power NPUs running lightweight runtimes like [TensorFlow Lite](https://www.tensorflow.org/lite) or MicroTVM. While these platforms offer flexibility, they introduce layers of abstraction: compiler toolchains, runtime interpreters, and OS scheduling.\n\nGateGPT offers an alternative: describing the model directly at the Register Transfer Level (RTL) in Verilog or VHDL, working from a [Python](https://www.python.org)-based reference model (like Karpathy's microGPT), and synthesizing that RTL into hardware.\n\n| Dimension | Edge NPU / Microcontroller | RTL-Synthesized Transformer (GateGPT) |\n|---|---|---|\n| Latency | Milliseconds (variable due to OS/runtime overhead) | Microseconds (deterministic, clock-cycle accurate) |\n| Power Consumption | Watts (typically 1 W to 15 W) | Milliwatts (fraction of a watt at low clock speeds) |\n| Flexibility | High (swap models by loading a new binary) | Low (requires re-synthesis and FPGA flashing) |\n| Hardware Cost | Medium to High (specialized silicon) | Low (can run on cheap, legacy, or radiation-hardened FPGAs) |\n\nIn practice, adopting a GateGPT-style workflow requires a shift in developer tooling. 
Instead of training in PyTorch and exporting to ONNX for a software runtime, the workflow looks like this:\n\n1. **Train and Quantize**: Train a micro-model in PyTorch, quantizing weights to Q5.11 fixed-point.\n2. **Generate Microcode**: Use a tool like `ucode_asm.py` to compile the model's execution graph into a sequence of macro-instructions for the ROM.\n3. **Synthesize and Route**: Run the RTL through FPGA synthesis tools to map the actuators and memory blocks to the target silicon.\n4. **Deploy**: Flash the bitstream to the FPGA.\n\nThe obvious caveat is scale. GateGPT runs a tiny model (4,192 parameters, 27-character vocabulary). Scaling this architecture to a 1-billion parameter model is currently bottlenecked by FPGA on-chip memory (BRAM) capacity. However, for highly specialized, ultra-low-latency tasks such as wake-word detection, real-time signal filtering, or local character-level parsing, this approach is a compelling fit.\n\n## A New Baseline for Edge AI\n\nGateGPT is a compelling proof of concept that challenges the assumption that AI inference requires massive, power-hungry processors. By showing that a full transformer with a KV cache can run efficiently at just 80 MHz on legacy hardware, it opens the door for a new class of deterministic, ultra-low-power edge AI devices. For developers willing to venture into RTL and hardware synthesis, the reward is inference speed and efficiency that software runtimes simply cannot match.\n\n[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza) · Senior Editor\n\nMariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. 
When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.", "url": "https://wpnews.pro/news/gategpt-running-transformers-in-pure-digital-logic-on-fpgas", "canonical_source": "https://www.devclubhouse.com/a/gategpt-running-transformers-in-pure-digital-logic-on-fpgas", "published_at": "2026-06-20 04:23:18+00:00", "updated_at": "2026-06-20 04:39:40.079163+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-infrastructure"], "entities": ["Fabio Guzman", "GateGPT", "Xilinx", "Virtex-5", "Andrej Karpathy", "microGPT"], "alternates": {"html": "https://wpnews.pro/news/gategpt-running-transformers-in-pure-digital-logic-on-fpgas", "markdown": "https://wpnews.pro/news/gategpt-running-transformers-in-pure-digital-logic-on-fpgas.md", "text": "https://wpnews.pro/news/gategpt-running-transformers-in-pure-digital-logic-on-fpgas.txt", "jsonld": "https://wpnews.pro/news/gategpt-running-transformers-in-pure-digital-logic-on-fpgas.jsonld"}}