GateGPT: Running Transformers in Pure Digital Logic on FPGAs

Independent hardware engineer Fabio Guzman introduced GateGPT, an open-source RTL implementation of microGPT synthesized onto a 16-year-old Xilinx Virtex-5 FPGA. Running at 80 MHz, GateGPT generates up to 69,200 tokens per second, demonstrating a GPU-free inference baseline for edge deployments by compiling a transformer directly into digital logic gates.

AI https://www.devclubhouse.com/c/ai Article GateGPT: Running Transformers in Pure Digital Logic on FPGAs By synthesizing a GPT model directly into hardware, GateGPT achieves massive throughput at a fraction of the clock speed. Mariana Souza https://www.devclubhouse.com/u/mariana souza Edge AI is currently dominated by a software-on-hardware paradigm: we take massive general-purpose processors GPUs, TPUs, or NPUs , load a software runtime, and stream model weights through memory. But what if the neural network is the hardware? In June 2026, independent hardware engineer Fabio Guzman introduced GateGPT , an open-source Register Transfer Level RTL implementation of Andrej Karpathy's microGPT. Synthesized onto a 16-year-old Xilinx https://www.xilinx.com Virtex-5 FPGA XC5VLX110T running at a modest 80 MHz, GateGPT generates up to 69,200 tokens per second, with a sustained average of approximately 60,600 tokens per second. This is not just an impressive retro-hardware hack. GateGPT demonstrates a credible, GPU-free inference baseline for edge deployments. By compiling a transformer directly into digital logic gates, it bypasses the instruction-fetch overhead of CPUs and the massive power envelopes of GPUs, proving that highly optimized, application-specific logic can deliver blazing-fast inference on minimal power and clock budgets. The Architecture of GateGPT: Microcode ROM and Datapath Actuators Instead of building a monolithic, rigid state machine to handle the transformer's operations, Guzman opted for a hybrid approach: a microcode-ROM sequencer architecture. This design is conceptually closer to a classic CPU than a traditional hardwired neural network accelerator. At its core, a small program ROM contains macro-instructions. A micro-program counter fetches one macro-op per clock cycle, triggers a corresponding modular datapath actuator, and halts until it receives a "done" signal. This instruction schedule is encoded in a program ROM ucode.hex , compiled by a custom assembler tools/ucode asm.py . The heavy lifting is distributed across specialized, modular hardware blocks actuators that share a true dual-port Block RAM BRAM scratchpad called vmem . This scratchpad stores both active activations and the persistent Key-Value KV cache. The actuators include: : A parallel multiply-accumulate tile designed for linear projections, capable of processing 24 lanes by 2 columns per cycle. matvec : An RMSNorm unit utilizing hardware-based unsigned division norm udiv and inverse square root isqrt primitives, processing 2 elements per cycle.: A single-position multi-head causal attention block equipped with per-head parallel dividers. attn : A fixed-point exponential calculator using a 17-entry lookup table combined with linear interpolation. exp unit : A module that handles temperature-scaled softmax and Linear Congruential Generator LCG categorical sampling, or falls back to greedy argmax. sampler : Handle embedding lookups, residual additions, and ReLU activations. embed and vecop The Hardware KV Cache and Q5.11 Fixed-Point Math To fit a transformer into the limited logic of a 2008-era FPGA—occupying just 8% of the Virtex-5's resources—GateGPT employs aggressive optimization and strict numerical constraints. Serverless Inference by DigitalOcean 55+ models, every modality. One API key, one bill. https://www.devclubhouse.com/go/ad/13 The model uses signed Q5.11 fixed-point arithmetic. This 16-bit format allocates 5 bits for the integer part including the sign and 11 bits for the fractional part. Fixed-point math completely eliminates the need for complex, area-heavy floating-point units FPUs , allowing the arithmetic logic to be synthesized into simple, fast adder and multiplier trees. The architectural crown jewel of GateGPT's performance is its hardware-native KV cache. In software-based inference, managing the KV cache involves complex memory pointer manipulation and dynamic allocation. In GateGPT, the KV cache is baked directly into the vmem BRAM. Instead of recomputing the entire context window up to 16 tokens for every newly generated token, the attn actuator calculates only the K and V projections for the current token and appends them to the pre-allocated cache lines in vmem . Through nine distinct optimization stages, Guzman increased the design's throughput by 28x—climbing from an initial 2,433 tokens/sec to the peak 69,200 tokens/sec. This massive speedup highlights the raw efficiency of hardware-level pipelining and memory-bandwidth matching. The Developer Angle: Compiling to Silicon vs. Edge NPUs For developers building edge AI applications—such as robotics, IoT sensors, or embedded medical devices—GateGPT represents a fork in the road. Currently, edge AI relies on microcontrollers or low-power NPUs running lightweight runtimes like TensorFlow Lite https://www.tensorflow.org/lite or MicroTVM. While these platforms offer flexibility, they introduce layers of abstraction: compiler toolchains, runtime interpreters, and OS scheduling. GateGPT offers an alternative: compiling the model directly to Register Transfer Level RTL using Python https://www.python.org -based reference models like Karpathy's microGPT and synthesizing it into Verilog or VHDL. | Dimension | Edge NPU / Microcontroller | RTL-Synthesized Transformer GateGPT | |---|---|---| Latency | Milliseconds variable due to OS/runtime overhead | Microseconds deterministic, clock-cycle accurate | Power Consumption | Watts typically 1W to 15W | Milliwatts fraction of a watt at low clock speeds | Flexibility | High swap models by loading a new binary | Low requires re-synthesis and FPGA flashing | Hardware Cost | Medium to High specialized silicon | Low can run on cheap, legacy, or radiation-hardened FPGAs | In practice, adopting a GateGPT-style workflow requires a shift in developer tooling. Instead of writing PyTorch code and exporting to ONNX, the workflow looks like this: Train and Quantize : Train a micro-model in PyTorch, quantizing weights to Q5.11 fixed-point. Generate Microcode : Use a tool like ucode asm.py to compile the model's execution graph into a sequence of macro-instructions for the ROM. Synthesize and Route : Run the RTL through FPGA synthesis tools to map the actuators and memory blocks to the target silicon. Deploy : Flash the bitstream to the FPGA. The obvious caveat is scale. GateGPT runs a tiny model 4,192 parameters, 27-character vocabulary . Scaling this architecture to a 1-billion parameter model is currently bottlenecked by FPGA on-chip memory BRAM capacity. However, for highly specialized, ultra-low-latency tasks—such as wake-word detection, real-time signal filtering, or local character-level parsing—this approach is unmatched. A New Baseline for Edge AI GateGPT is a compelling proof of concept that challenges the assumption that AI inference requires massive, power-hungry processors. By proving that a full transformer with a KV cache can run efficiently at just 80 MHz on legacy hardware, it opens the door for a new class of deterministic, ultra-low-power edge AI devices. For developers willing to venture into RTL and hardware synthesis, the reward is inference speed and efficiency that software runtimes simply cannot match. Sources & further reading Mariana Souza https://www.devclubhouse.com/u/mariana souza · Senior Editor Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon. Discussion 0 No comments yet Be the first to weigh in.