⚠️ Experimental hack: Use on non-critical systems. Ensure you have backups. This patches a proprietary binary at the instruction level: no warranty, no support.
Kepler GPUs (2012–2014) are e-waste by NVIDIA's timeline, but they are perfectly capable hardware for inference workloads. The GTX 770 has 1536 CUDA cores and 2 GB GDDR5, enough for small-to-medium LLMs. This project shows that with a five-byte fix and some kernel backports, these GPUs can be kept useful on modern Linux systems, reducing e-waste and teaching real systems engineering along the way.
Keep an NVIDIA GeForce GTX 770 (GK104, sm_30), a Kepler GPU abandoned by NVIDIA's driver stack after driver 470.256.02 and CUDA 10.2, running CUDA workloads on a modern Linux kernel (6.15 to 7.x, Ubuntu 26.04).
Two problems made stock software a dead end:

1. `cuInit` returns error 802: `nvidia-smi` works, but every CUDA program fails with `CUDA_ERROR_SYSTEM_NOT_READY`.
2. The proprietary 470.256.02 driver source does not build against kernels ≥ 6.15 due to removed or renamed APIs.

I used community-sourced patch sets (primarily from Fedora/Debian packaging by Joan Bruguera Mico and Andreas Beckmann) to resolve issues like:

- `screen_info` → `sysfb_primary_display.screen`
- `del_timer_sync` → `timer_delete_sync`
- `follow_pfn` → `unsafe_follow_pfn`
- `dma_fence_signal` now returns void
- `efi_enabled` cast and UBSAN mismatches

After these backports, `nvidia-smi` reports the GTX 770 correctly. But `cuInit` still fails.
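To get a feel for how much of the driver source is affected before applying a patch set, the renames listed above can be turned into a small scanner. This is only a sketch: `find_removed_apis` and the `RENAMES` table are hypothetical helpers written for illustration, not part of the community patches, and a real backport usually needs version-guarded shims rather than a plain search and replace.

```python
import re

# Renames copied from the list above; the helper itself is hypothetical.
RENAMES = {
    "screen_info": "sysfb_primary_display.screen",
    "del_timer_sync": "timer_delete_sync",
    "follow_pfn": "unsafe_follow_pfn",
}

def find_removed_apis(source_text):
    """Return (line_number, old_name, new_name) for each use of a removed API."""
    hits = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for old, new in RENAMES.items():
            # \b keeps follow_pfn from matching inside unsafe_follow_pfn.
            if re.search(rf"\b{re.escape(old)}\b", line):
                hits.append((lineno, old, new))
    return hits

sample = "del_timer_sync(&t);\nunsafe_follow_pfn(vma, addr, &pfn);\n"
print(find_removed_apis(sample))
```

Only the first sample line is reported, because the second already uses the new name.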
`cuInit` Error 802
All `rm_ioctl` kernel calls return `NV_OK`, so the kernel module is fine. The failure lives in userspace. With `gdb`, I traced `cuInit` calling `rm_ioctl(0x2a)` twice; both calls succeed at the kernel level, yet the library still returns 802.
Disassembly of the RM response handler in `libcuda.so.470.256.02`:
3436a0: mov 0xc(%rsp),%eax ; load status from RM response
3436a4: cmp $0x2,%eax ; status == 2?
3436a7: je 3436f0 ; → return 802
3436a9: jbe 3436e0 ; status <= 1?
3436e0: cmp $0x1,%eax
3436e3: jne 3436c5 ; status != 1 → return 999
3436e5: xor %eax,%eax ; return 0 (success)
...
3436f0: add $0x18,%rsp
3436f4: mov $0x322,%eax ; return 802
3436f9: pop; ret
Root cause: The Resource Manager firmware on Kepler returns status code `2` (`NV_ERR_BUFFER_TOO_SMALL`) for the second initialization `rm_ioctl`. The library treats `1` and `4` as success, but `2` is fatal and maps to 802. This is likely a buffer-size negotiation mismatch between the GTX 770's VBIOS firmware and the final 470.x userspace library. NVIDIA presumably never fixed it because Kepler was already on legacy support.
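The branch structure in the listing can be restated as a tiny model, which makes it easier to see what the patch changes. This is a sketch under stated assumptions: it covers only the branches visible in the disassembly above, `handler_result` is a made-up name, and the path taken for status values above 2 is not shown in the listing, so the model returns `None` there instead of guessing.

```python
# Return codes as they appear in the listing (0x322 == 802).
CUDA_SUCCESS = 0
ERROR_802 = 0x322
ERROR_999 = 999

def handler_result(status, patched=False):
    """Model only the branches visible in the disassembly (hypothetical helper)."""
    if status == 2:                       # je 3436f0
        # Unpatched: mov $0x322, %eax. Patched: xor %eax, %eax.
        return CUDA_SUCCESS if patched else ERROR_802
    if status <= 1:                       # jbe 3436e0
        # cmp $0x1 / jne 3436c5: anything other than 1 returns 999.
        return CUDA_SUCCESS if status == 1 else ERROR_999
    return None                           # status > 2: code not shown in the listing

for s in (0, 1, 2):
    print(s, handler_result(s), handler_result(s, patched=True))
```

In this model the patch changes the outcome for status 2 only; every other visible path behaves as before.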
The fix: One instruction at offset `0x3436f4`. Instead of `mov $0x322, %eax` (return 802), return 0:
| | Bytes | Instruction |
|---|---|---|
| Before | `b8 22 03 00 00` | `mov $0x322, %eax` |
| After | `31 c0 90 90 90` | `xor %eax, %eax; nop; nop; nop` |
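As a sanity check on the byte sequences in the table, the immediate operand can be decoded by hand. The sketch below relies only on the standard x86 encoding, where opcode `0xb8` is `mov imm32, %eax` followed by a little-endian 32-bit immediate; the variable names are illustrative.

```python
before = bytes.fromhex("b8 22 03 00 00")   # mov $0x322, %eax
after = bytes.fromhex("31 c0 90 90 90")    # xor %eax, %eax; nop; nop; nop

# The four bytes after the 0xb8 opcode are the little-endian immediate.
imm = int.from_bytes(before[1:5], "little")
print(imm, hex(imm))                       # 802 0x322

# Both sequences are five bytes, so no later instruction changes address.
print(len(before) == len(after))           # True
```

The three `nop` bytes exist purely as padding: `xor %eax, %eax` is two bytes long, and the replacement has to occupy exactly the same five bytes as the original instruction.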
Subsequent `rm_ioctl` calls succeed; only this specific init ioctl is broken. Patch script:
#!/usr/bin/env python3
"""Patch libcuda.so.470.256.02 so the RM status-2 path returns 0 instead of 802."""
import os
import shutil
import sys

libpath = "/usr/lib/x86_64-linux-gnu/libcuda.so.470.256.02"
backup_path = libpath + ".bak"

offset = 0x3436f4
expected = bytes([0xb8, 0x22, 0x03, 0x00, 0x00])  # mov $0x322, %eax
patched = bytes([0x31, 0xc0, 0x90, 0x90, 0x90])   # xor %eax, %eax; nop; nop; nop

with open(libpath, "rb") as f:
    data = bytearray(f.read())

actual = bytes(data[offset:offset + 5])
if actual == patched:
    print("Already patched!")
    sys.exit(0)
if actual != expected:
    # Different build or wrong offset: refuse to touch the file.
    print(f"UNEXPECTED at 0x{offset:x}: {actual.hex()}")
    sys.exit(1)

# Keep a pristine copy before modifying the library.
if not os.path.exists(backup_path):
    shutil.copy2(libpath, backup_path)

data[offset:offset + 5] = patched
with open(libpath, "wb") as f:
    f.write(data)
print(f"Patched: {actual.hex()} -> {patched.hex()}")
sm_30 support was dropped in CUDA 11, so we need CUDA 10.2's `ptxas`. But `nvcc` rejects GCC 15 (the Ubuntu 26.04 default). Using clang++ as the CUDA compiler bridges the legacy CUDA 10.2 headers and the modern system libraries.
llama.cpp uses `cg::this_grid()` (CUDA 11+). I patched `softmax.cu` for CUDA 10.2:
// Before (CUDA >= 11.0):
const cg::grid_group g = cg::this_grid();
// After (CUDA < 11.0):
const cg::thread_block g = cg::this_thread_block();
Build flags:
cmake .. -DLLAMA_CUDA=ON \
-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
-DCUDAToolkit_ROOT=/usr/local/cuda-10.2 \
-DCMAKE_CUDA_COMPILER=clang++ \
-DCMAKE_CUDA_ARCHITECTURES=30 \
-DGGML_CUDA_GRAPHS=OFF
`-DGGML_CUDA_GRAPHS=OFF` is critical: CUDA graph capture requires sm_35+ and crashes on sm_30.
Hardware: GTX 770 (2 GB VRAM), Ubuntu 26.04, kernel 7.0.0-27, llama.cpp c16c35b81.
| Quant | Test | t/s |
|---|---|---|
| Q4_K_M | pp64 | 69.50 ± 0.95 |
| Q4_K_M | tg512 | 25.84 ± 0.20 |
| Quant | Test | t/s |
|---|---|---|
| Q4_K_M | pp64 | 39.03 ± 1.09 |
GPU offload gives ~1.8× speedup on prompt processing for this model.
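The speedup figure follows directly from the two pp64 rows. The snippet assumes, based on the sentence above, that the first Q4_K_M table is the GPU-offloaded run and the second is the CPU-only run; the tables themselves are not labeled.

```python
# Prompt-processing throughput (pp64, tokens/s) from the two Q4_K_M tables.
gpu_pp64 = 69.50   # assumed: GPU-offloaded run
cpu_pp64 = 39.03   # assumed: CPU-only run

speedup = gpu_pp64 / cpu_pp64
print(f"pp64 speedup: {speedup:.2f}x")   # pp64 speedup: 1.78x
```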
| Quant | Test | t/s |
|---|---|---|
| Q3_K_M | pp64 | 36.18 ± 0.33 |
| Q3_K_M | tg256 | 10.11 ± 0.11 |
Qwen 3B at Q4_K_M (1.95 GiB) exceeds 2 GB VRAM; Q3_K_M (1.60 GiB) is required for full offload.
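A quick headroom calculation shows why the smaller quantization is needed. It uses the 1998 MiB that `llama-bench --list-devices` reports for this card and the model sizes quoted above; the KV cache and compute buffer sizes are not given in this write-up, so the sketch only computes what is left after the weights. `headroom_mib` is an illustrative helper.

```python
MIB_PER_GIB = 1024
vram_mib = 1998                                # reported by llama-bench --list-devices

model_gib = {"Q4_K_M": 1.95, "Q3_K_M": 1.60}   # model sizes quoted in the text

def headroom_mib(size_gib, total_mib=vram_mib):
    """VRAM left after loading the weights, before KV cache and compute buffers."""
    return total_mib - size_gib * MIB_PER_GIB

for quant, size in model_gib.items():
    print(f"{quant}: {headroom_mib(size):.1f} MiB free after weights")
```

The Q4_K_M weights alone leave only about 1 MiB, which is not enough for any context, while Q3_K_M leaves roughly 360 MiB.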
$ nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 770 (UUID: GPU-3a93c548-...)
$ /tmp/test_cuinit
cuInit=0
$ llama-bench --list-devices
CUDA0: NVIDIA GeForce GTX 770 (1998 MiB, ...)
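The source of `/tmp/test_cuinit` is not shown here. As a stand-in, the same check can be sketched in Python with `ctypes`, calling the driver API's `cuInit(0)` directly. It assumes the patched library is reachable under the usual `libcuda.so.1` soname; `try_cuinit` is a hypothetical helper name.

```python
import ctypes

def try_cuinit(libname="libcuda.so.1"):
    """Call cuInit(0) via ctypes; return the CUresult, or None if the library is missing."""
    try:
        libcuda = ctypes.CDLL(libname)
    except OSError:
        return None                      # library not installed or not loadable
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)

if __name__ == "__main__":
    print(f"cuInit={try_cuinit()}")
```

A result of 0 means success, and 802 is the error this patch targets.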
Full working stack: kernel module → patched `libcuda.so` → CUDA 10.2 runtime → llama.cpp CUDA backend, all on Linux 7.x with a 2013 Kepler GPU.
Register the patched driver with DKMS so module rebuilds happen automatically:
sudo apt install dkms
sudo dkms add nvidia/470.256.02
sudo dkms build nvidia/470.256.02 -k $(uname -r)
sudo dkms install nvidia/470.256.02 -k $(uname -r)
For the complete debugging log, kernel patch table, patch scripts, and build instructions, see the GitHub Gist.