{"slug": "resurrecting-kepler-getting-modern-llms-running-on-a-gtx-770-kernel-7-x", "title": "Resurrecting Kepler: Getting Modern LLMs Running on a GTX 770 (Kernel 7.x)", "summary": "An engineer patched NVIDIA's proprietary driver to run modern LLM inference on a GeForce GTX 770 (Kepler) GPU with a Linux 7.x kernel. The fix involves a five-byte binary patch to libcuda.so that bypasses a buffer-size negotiation bug causing cuInit to fail with error 802. The project demonstrates that Kepler GPUs, abandoned by NVIDIA after driver 470.256.02, remain viable for small-to-medium LLM workloads, reducing e-waste.", "body_md": "⚠️ Experimental hack: Use on non-critical systems. Ensure you have backups. This patches a proprietary binary at the instruction level, with no warranty and no support.\n\nKepler GPUs (2012–2014) are e-waste by NVIDIA's timeline, but they are perfectly capable hardware for inference workloads. The GTX 770 has 1536 CUDA cores and 2 GB GDDR5, enough for small-to-medium LLMs. This project shows that with a **five-byte fix** and some kernel backports, these GPUs can be kept useful on modern Linux systems, reducing e-waste and teaching real systems engineering along the way.\n\nThe goal: keep an **NVIDIA GeForce GTX 770 (GK104, sm_30)**, a Kepler GPU abandoned by NVIDIA's driver stack after driver 470.256.02 and CUDA 10.2, running CUDA workloads on a modern Linux kernel (6.15 → 7.x, Ubuntu 26.04).\n\nTwo problems made stock software a dead end:\n\n- The kernel module does not build on current kernels.\n- `cuInit` returns error 802: `nvidia-smi` works, but every CUDA program fails with `CUDA_ERROR_SYSTEM_NOT_YET_INITIALIZED`.\n\nThe proprietary 470.256.02 driver source does not build against kernels ≥6.15 due to removed/renamed APIs. 
I used community-sourced patch sets (primarily from [Fedora/Debian packaging](https://src.fedoraproject.org/rpms/nvidia-kmod) by Joan Bruguera Mico and Andreas Beckmann) to resolve issues like:\n\n- `screen_info` → `sysfb_primary_display.screen`\n- `del_timer_sync` → `timer_delete_sync`\n- `follow_pfn` → `unsafe_follow_pfn`\n- `dma_fence_signal` now returns void\n- `efi_enabled` cast and UBSAN mismatches\n\nAfter these backports, `nvidia-smi` reports the GTX 770 correctly. But `cuInit` still fails.\n\n`cuInit` Error 802\n\nAll `rm_ioctl` kernel calls return `NV_OK`, so the kernel module is fine. The failure lives in userspace. With `gdb`, I traced `cuInit` calling `rm_ioctl(0x2a)` twice; both calls succeed at the kernel level, yet the library still returns 802.\n\nDisassembly of the RM response handler in `libcuda.so.470.256.02`:\n\n```\n3436a0: mov   0xc(%rsp),%eax      ; load status from RM response\n3436a4: cmp   $0x2,%eax           ; status == 2?\n3436a7: je    3436f0              ; → return 802\n3436a9: jbe   3436e0              ; status <= 1?\n3436e0: cmp   $0x1,%eax\n3436e3: jne   3436c5              ; status != 1 → return 999\n3436e5: xor   %eax,%eax           ; return 0 (success)\n...\n3436f0: add   $0x18,%rsp\n3436f4: mov   $0x322,%eax         ; return 802\n3436f9: pop; ret\n```\n\n**Root cause:** The Resource Manager firmware on Kepler returns status code `2` (`NV_ERR_BUFFER_TOO_SMALL`) for the second initialization `rm_ioctl`. The library treats `1` and `4` as success, but `2` is fatal → 802. This is likely a buffer-size negotiation mismatch between the GTX 770's VBIOS firmware and the final 470.x userspace library. NVIDIA never fixed it because Kepler was already on legacy support.\n\n**The fix:** One instruction at offset `0x3436f4`. 
Instead of `mov $0x322, %eax` (return 802), return 0:\n\n| | Bytes | Instruction |\n|---|---|---|\n| Before | `b8 22 03 00 00` | `mov $0x322, %eax` |\n| After | `31 c0 90 90 90` | `xor %eax, %eax; nop; nop; nop` |\n\nSubsequent `rm_ioctl` calls succeed; only this specific init ioctl is broken. Patch script:\n\n```python\n#!/usr/bin/env python3\nimport os\nimport shutil\nimport sys\n\nlibpath = \"/usr/lib/x86_64-linux-gnu/libcuda.so.470.256.02\"\nbackup_path = libpath + \".bak\"\n\n# Keep a copy of the library before touching it.\nif not os.path.exists(backup_path):\n    shutil.copy2(libpath, backup_path)\n\nwith open(libpath, \"rb\") as f:\n    data = bytearray(f.read())\n\noffset = 0x3436f4\nexpected = bytes([0xb8, 0x22, 0x03, 0x00, 0x00])  # mov $0x322, %eax\npatched = bytes([0x31, 0xc0, 0x90, 0x90, 0x90])   # xor %eax, %eax; nop; nop; nop\nactual = bytes(data[offset:offset+5])\n\nif actual == patched:\n    print(\"Already patched!\")\n    sys.exit(0)\nif actual != expected:\n    # Different build of the library: refuse to write anything.\n    print(f\"UNEXPECTED at 0x{offset:x}: {actual.hex()}\")\n    sys.exit(1)\n\ndata[offset:offset+5] = patched\nprint(f\"Patched: {actual.hex()} -> {patched.hex()}\")\n\nwith open(libpath, \"wb\") as f:\n    f.write(data)\n```\n\nsm_30 support was dropped in CUDA 11, so we need CUDA 10.2's `ptxas`. But `nvcc` rejects GCC 15 (Ubuntu 26.04 default). **clang++** bridges legacy CUDA 10.2 headers and modern system libraries.\n\nllama.cpp uses `cg::this_grid()` (CUDA 11+). I patched `softmax.cu` for CUDA 10.2:\n\n```cpp\n// Before (CUDA >= 11.0):\nconst cg::grid_group g = cg::this_grid();\n\n// After (CUDA < 11.0):\nconst cg::thread_block g = cg::this_thread_block();\n```\n\nBuild flags:\n\n```\ncmake .. 
-DLLAMA_CUDA=ON \\\n  -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \\\n  -DCUDAToolkit_ROOT=/usr/local/cuda-10.2 \\\n  -DCMAKE_CUDA_COMPILER=clang++ \\\n  -DCMAKE_CUDA_ARCHITECTURES=30 \\\n  -DGGML_CUDA_GRAPHS=OFF\n```\n\n`-DGGML_CUDA_GRAPHS=OFF` is critical: CUDA graph capture requires sm_35+ and crashes on sm_30.\n\nHardware: **GTX 770 (2 GB VRAM)**, **Ubuntu 26.04**, **kernel 7.0.0-27**, **llama.cpp c16c35b81**.\n\n| Quant | Test | t/s |\n|---|---|---|\n| Q4_K_M | pp64 | 69.50±0.95 |\n| Q4_K_M | tg512 | 25.84±0.20 |\n\n| Quant | Test | t/s |\n|---|---|---|\n| Q4_K_M | pp64 | 39.03±1.09 |\n\nGPU offload gives ~1.8× speedup on prompt processing for this model.\n\n| Quant | Test | t/s |\n|---|---|---|\n| Q3_K_M | pp64 | 36.18±0.33 |\n| Q3_K_M | tg256 | 10.11±0.11 |\n\nQwen 3B at Q4_K_M (1.95 GiB) leaves no room in 2 GB of VRAM for the KV cache and CUDA context, so Q3_K_M (1.60 GiB) is required for full offloading.\n\n```bash\n$ nvidia-smi -L\nGPU 0: NVIDIA GeForce GTX 770 (UUID: GPU-3a93c548-...)\n\n$ /tmp/test_cuinit\ncuInit=0\n\n$ llama-bench --list-devices\nCUDA0: NVIDIA GeForce GTX 770 (1998 MiB, ...)\n```\n\nFull working stack: kernel module → patched `libcuda.so` → CUDA 10.2 runtime → llama.cpp CUDA backend, all on Linux 7.x with a 2013 Kepler GPU.\n\nRegister the patched driver with DKMS so module rebuilds happen automatically:\n\n```\nsudo apt install dkms\nsudo dkms add nvidia/470.256.02\nsudo dkms build nvidia/470.256.02 -k $(uname -r)\nsudo dkms install nvidia/470.256.02 -k $(uname -r)\n```\n\nFor the complete debugging log, kernel patch table, patch scripts, and build instructions, see the [GitHub Gist](https://gist.github.com/skyne/fa150c6e4b025903a2dc0d34d1d9065f).", "url": "https://wpnews.pro/news/resurrecting-kepler-getting-modern-llms-running-on-a-gtx-770-kernel-7-x", "canonical_source": "https://dev.to/skyne/resurrecting-kepler-getting-modern-llms-running-on-a-gtx-770-kernel-7x-4na", "published_at": "2026-06-27 13:29:09+00:00", "updated_at": "2026-06-27 14:04:05.675513+00:00", 
"lang": "en", "topics": ["large-language-models", "developer-tools", "ai-infrastructure"], "entities": ["NVIDIA", "GeForce GTX 770", "Kepler", "CUDA", "Linux", "Joan Bruguera Mico", "Andreas Beckmann", "Fedora"], "alternates": {"html": "https://wpnews.pro/news/resurrecting-kepler-getting-modern-llms-running-on-a-gtx-770-kernel-7-x", "markdown": "https://wpnews.pro/news/resurrecting-kepler-getting-modern-llms-running-on-a-gtx-770-kernel-7-x.md", "text": "https://wpnews.pro/news/resurrecting-kepler-getting-modern-llms-running-on-a-gtx-770-kernel-7-x.txt", "jsonld": "https://wpnews.pro/news/resurrecting-kepler-getting-modern-llms-running-on-a-gtx-770-kernel-7-x.jsonld"}}