{"slug": "ollama-doesn-t-know-its-gpu-is-on-another-machine", "title": "Ollama Doesn't Know Its GPU Is on Another Machine", "summary": "Ollama, an AI model server, ran on a MacBook with no NVIDIA GPU by using GTAP software to intercept CUDA calls and forward them to a remote DGX Spark workstation with a 128 GB Blackwell GPU. The setup required no code changes or CUDA installation on the MacBook, and Ollama was unaware the GPU was on another machine. This demonstrates that GTAP can turn any GPU into a network resource, enabling devices without dedicated hardware to run GPU-dependent applications.", "body_md": "We started an Ollama container on a MacBook. There's no NVIDIA GPU, no CUDA toolkit, and macOS doesn't even have CUDA drivers. Ollama found an NVIDIA GPU anyway: a 128 GB Blackwell GPU on a DGX Spark across the network.\n\nGTAP enables this by intercepting CUDA calls and forwarding them to a remote server. It takes one command, requires no code changes, and the application has no idea.\n\nHere's what that looks like:\n\n[Seeing it in action](#seeing-it-in-action)\n\nRunning Ollama on a MacBook with no GPU via GTAP\n\nGTAP server TUI showing GPU utilization during remote Ollama inference\n\nThe setup is two machines: a **DGX Spark** (NVIDIA's Grace Blackwell workstation, 128 GB of unified GPU memory) running the GTAP server, and a **MacBook** running Docker Desktop with no NVIDIA hardware and no CUDA installation.\n\nWith GTAP installed on the MacBook, starting Ollama is one command:\n\nGTAP acquires a lease on the remote GPU, starts the container, and injects an interceptor.\nOllama initializes, discovers \"its\" GPU (actually the DGX Spark's GB10), and begins serving.\nThe first command starts Ollama's server inside the container.\nTo load a model and interact with it, you open a second terminal and run `ollama run`:\n\nRunning llama3.1:8b via GTAP. 
The model pulls and generates a response, all computed on the remote GPU.\n\nEvery token is computed on the DGX Spark's GPU. Only the generated text crosses the network back to the MacBook. Token generation is fast and interactive.\n\nWant to go bigger?\nThe DGX Spark has 128 GB of memory. This is enough for [mistral-large](https://ollama.com/library/mistral-large):\n\nRunning Mistral Large at 123 billion parameters. 73 GB of VRAM, streamed from the remote DGX Spark.\n\nMistral Large at 123 billion parameters requires roughly 73 GB of VRAM. The response streams back to the MacBook in real time, token by token.\n\nOn the server side, GTAP's terminal UI shows live GPU utilization, RPC call rate, network bandwidth, and scrollable logs. During inference, you can watch GPU utilization spike every time a response is generated, confirming that the compute is happening remotely.\n\nGTAP doesn't know or care what model Ollama is running. It just forwards CUDA calls. We've verified 48 models across 15 families, from SmolLM2 at 135M parameters to Qwen3.5 at 122B. All of them work without changes.\n\nYou may have noticed that the video uses a [custom container image](/blog/posts/ollama-remote-gpu/Dockerfile.txt) instead of the stock `ollama/ollama`.\nSince GTAP intercepts all CUDA libraries at the loader level, the container doesn't need a CUDA distribution at all.\nWe strip CUDA from the image entirely, bringing it from 8.7 GB (`ollama/ollama:0.17.7`) down to 1.2 GB.\nThe difference is just the toolkit you no longer have to ship.\nRemoving the NVIDIA Container Toolkit from the host also eliminates a [recurring](https://nvidia.custhelp.com/app/answers/detail/a_id/5582) [source](https://nvidia.custhelp.com/app/answers/detail/a_id/5659) of container escape vulnerabilities.\n\n[Why this matters](#why-this-matters)\n\nGPUs are expensive and hard to get. 
Most development machines don't have one, and cloud GPU instances bundle hardware you don't need with long-term commitments you can't avoid. GTAP turns a GPU into a network resource. Nothing about your application changes. The GPU just shows up wherever you need it: a fleet of laptops sharing a single GPU server, Kubernetes pods on nodes without NVIDIA drivers, or a CI runner that never had a GPU installed. None of this requires code changes, special drivers, or a CUDA installation on the client.\n\n[How this works](#how-this-works)\n\nGTAP decouples the GPU from the machine that runs the application.\nThe approach is called *API remoting*: GTAP intercepts CUDA API calls at the loader level and forwards them over the network to a server that has the actual GPU.\nThe application has no idea this is happening.\n\nCUDA calls are intercepted locally and forwarded to the remote GPU over the network.\n\nThis is not virtualization in the hypervisor sense, and it isn't the same as GPU sharing approaches like [MIG](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/), [vGPU](https://docs.nvidia.com/vgpu/latest/grid-licensing-user-guide/index.html), or PCIe passthrough, which all partition or assign a GPU to VMs on the *same physical machine*.\nThere's no virtual GPU device, no special driver, no modified application.\nGTAP operates below the application, at the boundary between the CUDA shared libraries and the process that calls them.\nWhen an application calls `cudaMalloc`, GTAP locally serializes the call and sends it over the network to the server.\nThe server executes the real `cudaMalloc` on the physical GPU and returns the result.\nThe application receives a valid device pointer and continues as normal.\nGetting that interception to work cleanly turns out to be the tricky part.\n\nThe common way to intercept library calls on Linux is `LD_PRELOAD`, which loads a wrapper library before the real one.\nThe problem is that `LD_PRELOAD` hides symbols but 
doesn't replace the underlying library.\nCUDA still has to be installed locally, and the real CUDA libraries still initialize, which can cause conflicts when there is no actual GPU.\nGTAP uses a different mechanism: Linux's `ld.so` audit interface (`LD_AUDIT`).\nWhen the loader loads a CUDA library, GTAP's audit module redirects it to its own implementation and replaces CUDA functions with its own stubs.\nThe original CUDA libraries never load locally at all.\n\nGTAP currently supports the CUDA Driver API, Runtime API, cuBLAS, cuSPARSE, cuDNN, cuFFT, nvJPEG, and NVML, covering the vast majority of the CUDA ecosystem.\n\n[Overhead](#overhead)\n\nThe obvious concern with API remoting is latency. Every CUDA call crosses the network, so how does this not hurt performance?\n\nThe key is that most CUDA calls are asynchronous.\nKernel launches, `cudaMemcpyAsync`, and other stream operations return immediately to the application.\nGTAP sends them to the server asynchronously, maintaining execution order without blocking the application.\nThe application keeps submitting work while previous calls are still in flight on the GPU.\n\nNetwork latency only becomes visible on *synchronization points*: calls where the application explicitly waits for the GPU to finish.\nThe main ones are `cudaStreamSynchronize`, `cudaDeviceSynchronize`, and the synchronous variants of `cudaMemcpy`.\nThese block until all in-flight work on the relevant stream completes, and that round trip is where you pay the network cost.\n\nWell-written CUDA applications minimize synchronization.\nThey submit large batches of work, overlap compute with data transfers using multiple streams, and only synchronize when they need results.\n[Research on CUDA API remoting](https://doi.org/10.1002/cpe.6474) has shown that these workloads see negligible overhead on a low-latency network.\n\nFor Ollama, each generated token requires at least one synchronization to read back the result, so each 
token pays at least one network round trip. On a low-latency network this is still very usable, but token rate is sensitive to network latency. Fewer synchronization points mean less overhead. vLLM, for instance, achieves higher token rates over GTAP for exactly this reason.\n\nBeyond latency, there's also bandwidth to consider. Loading a model means transferring its weights to GPU memory, and a naive implementation would send them over the network every time. For Llama 3.1 8B, that's roughly 4.7 GB. For Mistral Large, it's around 73 GB. Over a gigabit Ethernet link, that's 38 seconds and nearly 10 minutes respectively. GTAP avoids this entirely. The GTAP server caches model weights on local disk. When Ollama loads a model that's already cached, GTAP reads it directly from the server's disk instead of transferring it over the network. The weights never cross the wire.\n\nWe benchmarked DeepSeek-R1 32B generating 512 tokens on a DGX Spark, comparing native execution with GTAP over a local network:\n\nDeepSeek-R1 32B total runtime, native vs GTAP over LAN.\n\nThe total overhead splits roughly evenly between model loading (9s) and token generation (12s). Both are areas we're actively working on improving.\n\nOnce every CUDA call flows through a single point, there's more you can do with it. GPU sharing, workload migration, and more. That's for another post.\n\n[Try it](#try-it)\n\nBeyond Ollama, we've tested GTAP with vLLM, ComfyUI, Stable Diffusion, and more.\nIf you're interested in trying it with your own workloads, [get in touch](https://gtap.ai).\n\nAnd yes, it runs DOOM.\n\nDOOM running on a MacBook via GTAP. 
The GPU is on a DGX Spark across the network.", "url": "https://wpnews.pro/news/ollama-doesn-t-know-its-gpu-is-on-another-machine", "canonical_source": "https://loopholelabs.io/blog/ollama-remote-gpu", "published_at": "2026-05-11 00:00:00+00:00", "updated_at": "2026-05-30 12:09:17.858275+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-tools", "large-language-models", "artificial-intelligence"], "entities": ["Ollama", "NVIDIA", "DGX Spark", "GTAP", "MacBook", "CUDA", "Blackwell", "GB10"], "alternates": {"html": "https://wpnews.pro/news/ollama-doesn-t-know-its-gpu-is-on-another-machine", "markdown": "https://wpnews.pro/news/ollama-doesn-t-know-its-gpu-is-on-another-machine.md", "text": "https://wpnews.pro/news/ollama-doesn-t-know-its-gpu-is-on-another-machine.txt", "jsonld": "https://wpnews.pro/news/ollama-doesn-t-know-its-gpu-is-on-another-machine.jsonld"}}