Resurrecting Kepler: Getting Modern LLMs Running on a GTX 770 (Kernel 7.x)

An engineer patched NVIDIA's proprietary driver to run modern LLM inference on a GeForce GTX 770 (Kepler) GPU with a Linux 7.x kernel. The fix involves a five-byte binary patch to libcuda.so that bypasses a buffer-size negotiation bug causing cuInit to fail with error 802. The project demonstrates that Kepler GPUs, abandoned by NVIDIA after driver 470.256.02, remain viable for small-to-medium LLM workloads, reducing e-waste.

⚠️ Experimental hack: use on non-critical systems only and keep backups. This patches a proprietary binary at the instruction level: no warranty, no support.

Kepler GPUs (2012–2014) are e-waste by NVIDIA's timeline, but they are perfectly capable hardware for inference workloads. The GTX 770 has 1536 CUDA cores and 2 GB of GDDR5, enough for small-to-medium LLMs. This project shows that with a five-byte fix and some kernel backports, these GPUs can stay useful on modern Linux systems, reducing e-waste and teaching real systems engineering along the way.

The goal: keep an NVIDIA GeForce GTX 770 (GK104, sm_30), a Kepler GPU abandoned by NVIDIA's driver stack after driver 470.256.02 and CUDA 10.2, running CUDA workloads on a modern Linux kernel (6.15 → 7.x, Ubuntu 26.04).

Two problems made stock software a dead end:

- `cuInit` returns error 802: `nvidia-smi` works, but every CUDA program fails with `CUDA_ERROR_SYSTEM_NOT_READY` ("system not yet initialized").
- The proprietary 470.256.02 driver source does not build against kernels ≥ 6.15 because of removed or renamed APIs.

For the build failure, I used community-sourced patch sets, primarily from [Fedora/Debian packaging](https://src.fedoraproject.org/rpms/nvidia-kmod) by Joan Bruguera Mico and Andreas Beckmann, to resolve issues like:

- `screen_info` → `sysfb_primary_display.screen`
- `del_timer_sync` → `timer_delete_sync`
- `follow_pfn` → `unsafe_follow_pfn`
- `dma_fence_signal` now returns `void`
- `efi_enabled` cast and UBSAN mismatches

After these backports, `nvidia-smi` reports the GTX 770 correctly. But `cuInit` still fails.

cuInit Error 802

All `rm_ioctl` kernel calls return `NV_OK`, so the kernel module is fine. The failure lives in userspace. With `gdb`, I traced `cuInit` calling `rm_ioctl` 0x2a twice; both calls succeed at the kernel level, yet the library still returns 802.

Disassembly of the RM response handler in `libcuda.so.470.256.02`:

```
3436a0: mov 0xc(%rsp),%eax   ; load status from RM response
3436a4: cmp $0x2,%eax        ; status == 2?
3436a7: je  3436f0           ; → return 802
3436a9: jbe 3436e0           ; status <= 1?
3436e0: cmp $0x1,%eax
3436e3: jne 3436c5           ; status != 1 → return 999
3436e5: xor %eax,%eax        ; return 0 (success)
...
3436f0: add $0x18,%rsp
3436f4: mov $0x322,%eax      ; return 802
3436f9: pop; ret
```

Root cause: the Resource Manager firmware on Kepler returns status code 2 (`NV_ERR_BUFFER_TOO_SMALL`) for the second initialization `rm_ioctl`. The library treats 1 and 4 as success, but 2 is fatal → 802. This is likely a buffer-size negotiation mismatch between the GTX 770's VBIOS firmware and the final 470.x userspace library. NVIDIA never fixed it, presumably because Kepler was already on legacy support.

The fix: one instruction at offset `0x3436f4`. Instead of `mov $0x322, %eax` (return 802), return 0:

| | Bytes | Instruction |
|---|---|---|
| Before | `b8 22 03 00 00` | `mov $0x322, %eax` |
| After | `31 c0 90 90 90` | `xor %eax, %eax; nop; nop; nop` |

Subsequent `rm_ioctl` calls succeed; only this specific init ioctl is broken.

Patch script:

```python
#!/usr/bin/env python3
import os
import shutil
import sys

libpath = "/usr/lib/x86_64-linux-gnu/libcuda.so.470.256.02"
backup_path = libpath + ".bak"

# Keep a pristine copy the first time the script runs.
if not os.path.exists(backup_path):
    shutil.copy2(libpath, backup_path)

with open(libpath, "rb") as f:
    data = bytearray(f.read())

offset = 0x3436f4
expected = bytes([0xb8, 0x22, 0x03, 0x00, 0x00])  # mov $0x322,%eax
actual = data[offset:offset + 5]

if actual == expected:
    data[offset:offset + 2] = bytes([0x31, 0xc0])            # xor %eax,%eax
    data[offset + 2:offset + 5] = bytes([0x90, 0x90, 0x90])  # nop; nop; nop
    with open(libpath, "wb") as f:
        f.write(data)
    print(f"Patched: {actual.hex()} -> {data[offset:offset + 5].hex()}")
elif actual[:2] == bytes([0x31, 0xc0]):
    print("Already patched")
else:
    # Different build or layout: refuse to touch the file.
    print(f"UNEXPECTED at 0x{offset:x}: {actual.hex()}")
    sys.exit(1)
```

sm_30 support was dropped in CUDA 11, so we need CUDA 10.2's `ptxas`. But `nvcc` rejects GCC 15 (the Ubuntu 26.04 default). `clang++` bridges legacy CUDA 10.2 headers and modern system libraries. llama.cpp uses `cg::this_grid()` (CUDA 11+).
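One practical aside before the llama.cpp changes: the byte comparison in the patch script earlier in this section can also be run as a read-only check, which is useful for confirming what state a given copy of `libcuda.so` is in before modifying it. The sketch below is a hedged illustration, not part of the original tooling. The function name `classify_patch_site` and the command-line usage are hypothetical, and it assumes the offset and byte sequences quoted in this article apply to your copy of the library.

```python
#!/usr/bin/env python3
"""Read-only check of the patch site; this sketch never writes to the library."""
import sys

OFFSET = 0x3436F4                                  # offset quoted in this article
ORIGINAL = bytes([0xB8, 0x22, 0x03, 0x00, 0x00])   # mov $0x322,%eax
PATCHED = bytes([0x31, 0xC0, 0x90, 0x90, 0x90])    # xor %eax,%eax; nop; nop; nop


def classify_patch_site(data, offset=OFFSET):
    """Return 'original', 'patched', 'unexpected', or 'too-short'."""
    site = bytes(data[offset:offset + len(ORIGINAL)])
    if len(site) < len(ORIGINAL):
        return "too-short"      # file ends before the patch site
    if site == ORIGINAL:
        return "original"       # safe to apply the five-byte patch
    if site == PATCHED:
        return "patched"        # nothing left to do
    return "unexpected"         # different build: do not patch blindly


# Hypothetical usage: pass the path to libcuda.so as the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "rb") as f:
        print(classify_patch_site(f.read()))
```

Separating the classification from the write step means the same check can be run against the `.bak` copy and the live library to confirm that they differ only where expected.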
Patched `softmax.cu` for CUDA 10.2:

```cpp
// Before (CUDA >= 11.0):
const cg::grid_group g = cg::this_grid();
// After (CUDA < 11.0):
const cg::thread_block g = cg::this_thread_block();
```

Build flags:

```bash
cmake .. -DLLAMA_CUDA=ON \
  -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
  -DCUDAToolkit_ROOT=/usr/local/cuda-10.2 \
  -DCMAKE_CUDA_COMPILER=clang++ \
  -DCMAKE_CUDA_ARCHITECTURES=30 \
  -DGGML_CUDA_GRAPHS=OFF
```

`-DGGML_CUDA_GRAPHS=OFF` is critical: CUDA graph capture requires sm_35+ and crashes on sm_30.

Hardware: GTX 770 (2 GB VRAM), Ubuntu 26.04, kernel 7.0.0-27, llama.cpp c16c35b81.

With GPU offload:

| Quant | Test | t/s |
|---|---|---|
| Q4_K_M | pp64 | 69.50 ± 0.95 |
| Q4_K_M | tg512 | 25.84 ± 0.20 |

CPU only, same model:

| Quant | Test | t/s |
|---|---|---|
| Q4_K_M | pp64 | 39.03 ± 1.09 |

GPU offload gives a ~1.8× speedup on prompt processing for this model (69.50 vs. 39.03 t/s on pp64).

Qwen 3B, fully offloaded:

| Quant | Test | t/s |
|---|---|---|
| Q3_K_M | pp64 | 36.18 ± 0.33 |
| Q3_K_M | tg256 | 10.11 ± 0.11 |

Qwen 3B at Q4_K_M (1.95 GiB) exceeds 2 GB of VRAM, so Q3_K_M (1.60 GiB) is required for full offloading.

```bash
$ nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 770 (UUID: GPU-3a93c548-...)
$ /tmp/test_cuinit
cuInit=0
$ llama-bench --list-devices
CUDA0: NVIDIA GeForce GTX 770 (1998 MiB, ...)
```

Full working stack: kernel module → patched `libcuda.so` → CUDA 10.2 runtime → llama.cpp CUDA backend, all on Linux 7.x with a 2013 Kepler GPU.

Register the patched driver with DKMS so module rebuilds happen automatically:

```bash
sudo apt install dkms
sudo dkms add nvidia/470.256.02
sudo dkms build nvidia/470.256.02 -k $(uname -r)
sudo dkms install nvidia/470.256.02 -k $(uname -r)
```

For the complete debugging log, kernel patch table, patch scripts, and build instructions, see the [GitHub Gist](https://gist.github.com/skyne/fa150c6e4b025903a2dc0d34d1d9065f).
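The `/tmp/test_cuinit` program used in the verification output is not reproduced in this write-up. As a rough stand-in, the sketch below calls the CUDA driver API's `cuInit(0)` through Python's `ctypes` and prints the result in the same `cuInit=<code>` form. The function name `try_cuinit` and the default library name `libcuda.so.1` are assumptions for illustration; the author's actual test program may look quite different.

```python
#!/usr/bin/env python3
"""Minimal cuInit probe via ctypes (a sketch, not the author's test program)."""
import ctypes


def try_cuinit(libname="libcuda.so.1"):
    """Call cuInit(0) and return its CUresult code, or None if the
    driver library cannot be loaded at all."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    # CUresult cuInit(unsigned int Flags); Flags must currently be 0.
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


if __name__ == "__main__":
    result = try_cuinit()
    if result is None:
        print("libcuda not found")
    else:
        # 0 is CUDA_SUCCESS; 802 is the failure described in this article.
        print(f"cuInit={result}")
```

Running it before and after applying the `libcuda.so` patch should show the return code change from 802 to 0 on an affected system, under the assumptions above.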