{"slug": "show-hn-llmhop-a-tiny-stateless-router-for-llms-with-a-nixos-module", "title": "Show HN: LLMhop – A tiny, stateless router for LLMs with a NixOS module", "summary": "A developer has released LLMhop, a stateless HTTP router that directs OpenAI-compatible API requests to the appropriate LLM inference backend based on the model name in the request body. The single-binary tool, built in pure Go with no external dependencies, can route requests to multiple backends including vLLM, Ollama, and OpenAI itself, and ships with a NixOS module for hardened deployment. The project addresses the need for a lightweight, model-aware gateway when running multiple single-model inference servers behind a unified endpoint.", "body_md": "One port, many models: A tiny, stateless HTTP router for OpenAI-compatible LLM inference backends.\n\nLLMhop peeks at the `model` field of an incoming OpenAI-compatible request and reverse-proxies it to the matching backend.\nIt is primarily designed for single-model inference servers like [vLLM](https://github.com/vllm-project/vllm) and [sglang](https://github.com/sgl-project/sglang) that serve one model per process and need a thin model-aware gateway in front of them, but it works with any OpenAI-compatible backend (including multi-model servers and hosted providers) whenever you want to consolidate several upstreams behind a single endpoint.\n\n- OpenAI-compatible reverse proxy, model router and request dispatcher for self-hosted LLM inference.\n- Stateless single-binary HTTP service: no database, no cache, no background workers, safe behind any load balancer.\n- Zero external dependencies: pure Go, no third-party packages, no CGO.\n- Works with any OpenAI API-compatible backend, self-hosted or remote: vLLM, sglang, TabbyAPI, Aphrodite, Ollama, LocalAI, OpenRouter, together.ai, DeepInfra, etc.\n- Ships as a static binary, a minimal Docker image and a hardened NixOS module that can optionally spin up llama.cpp, sglang or vLLM workers 
alongside the router.\n\n- Client sends a request with a JSON body containing `{\"model\": \"...\"}`.\n- LLMhop reads the `model` field and looks it up in its config.\n- The request is forwarded verbatim to the configured backend URL.\n- Unknown models return `404`.\n\nLLMhop can optionally gate incoming requests with a list of bearer tokens and inject per-model `Authorization` (or any other) headers when forwarding to the backend.\nBoth sides are opt-in: leave `authTokens` and `models.*.headers` unset and headers are forwarded verbatim.\n\nWhen `authTokens` is set, the router validates the incoming `Authorization: Bearer <token>` header (constant-time compare) and then strips it before forwarding, so the client-facing token never leaks upstream.\nPer-model headers are applied last, so a configured `Authorization` always wins over whatever the client sent.\n\nCreate a `config.json`:\n\n```\n{\n  \"listen\": \":8080\",\n  \"authTokens\": [\"${file:client_token}\"],\n  \"models\": {\n    \"llama-3-8b\": {\n      \"url\": \"http://localhost:30000\"\n    },\n    \"openai-gpt-4o\": {\n      \"url\": \"https://api.openai.com\",\n      \"headers\": {\n        \"Authorization\": \"Bearer ${env:OPENAI_KEY}\"\n      }\n    }\n  }\n}\n```\n\nString values inside `authTokens` and `models.*.headers` are expanded at startup, so no plaintext secret ever has to live in the config file:\n\n- `${env:NAME}`: read from the `NAME` environment variable.\n- `${file:path}`: read from a file. Relative paths are resolved against `$CREDENTIALS_DIRECTORY` when set (e.g. when launched by systemd with `LoadCredential=`), otherwise against the current working directory. 
A single trailing newline is trimmed.\n- `$NAME`: shorthand for `${env:NAME}`.\n\nUnresolved references are a hard startup error.\n\nLLMhop buffers each request body in memory so it can peek at the `model` field before forwarding.\nTo keep a single request from exhausting memory, the body is capped at 100 MiB by default; bodies beyond the cap are rejected with `413 Request Entity Too Large`.\nOverride it when vision or other multimodal payloads need more:\n\n```\n{ \"maxBodyBytes\": 524288000 }\n```\n\nRun the router natively, through Nix, or with Docker:\n\n```\n# native\nllmhop --config config.json\n\n# nix\nnix run github:mirkolenz/llmhop -- --config config.json\n\n# docker\ndocker run --rm -p 8080:8080 -v ./config.json:/config.json ghcr.io/mirkolenz/llmhop --config /config.json\n```\n\nA hardened systemd service is provided out of the box. Add LLMhop to your flake inputs and import the module into your system configuration:\n\n```\n{\n  inputs = {\n    nixpkgs.url = \"github:nixos/nixpkgs/nixos-unstable\";\n    llmhop = {\n      url = \"github:mirkolenz/llmhop\";\n      inputs.nixpkgs.follows = \"nixpkgs\";\n    };\n  };\n  outputs =\n    { nixpkgs, llmhop, ... }:\n    {\n      nixosConfigurations.myhost = nixpkgs.lib.nixosSystem {\n        system = \"x86_64-linux\";\n        modules = [\n          llmhop.nixosModules.default\n          {\n            services.llmhop = {\n              enable = true;\n              settings = {\n                listen = \":8080\";\n                models = {\n                  \"llama-3-8b\".url = \"http://localhost:30000\";\n                  \"qwen-2.5-7b\".url = \"http://localhost:30001\";\n                };\n              };\n            };\n          }\n        ];\n      };\n    };\n}\n```\n\nThe unit runs under `DynamicUser` with aggressive sandboxing (`ProtectSystem`, `PrivateTmp`, restricted syscalls and address families, no new privileges, ...) 
and restarts on failure.\n\nThe module can also run the inference servers themselves, so you don't have to wire up llama.cpp, sglang or vLLM by hand.\nEach backend exposes a `models` attrset under `services.llmhop.<backend>` and every entry becomes one isolated worker bound to a loopback port, with the matching route registered automatically with LLMhop.\nAll three backends can be enabled side by side and mixed freely in the same configuration.\n\nllama.cpp runs as a native, hardened systemd system unit under `DynamicUser`.\nsglang and vLLM are launched as rootless Podman containers through [quadlet-nix](https://github.com/mirkolenz/quadlet-nix).\nEach Quadlet backend gets a dedicated, lingering system user (`sglang`, `vllm`) that owns its cache directory, sub-UID range and rootless container store.\nThe container units are installed under that user's per-UID search path and therefore run as **systemd user units**, not system units.\nThis is a deliberate workaround for [NVIDIA/nvidia-container-toolkit#648](https://github.com/NVIDIA/nvidia-container-toolkit/issues/648):\n`nvidia-cdi-hook` runs as an OCI `createContainer` hook inside the container's user namespace and fails to read the OCI bundle's `config.json` whenever Podman uses a UID-mapped namespace (e.g., `--userns auto` or `--userns nomap`), which is the mode you end up in when systemd's system manager launches a rootless container.\nRunning each Quadlet unit under a real, lingering system user's systemd instance keeps Podman in the `keep-id`-style mapping where the CDI hook can read the bundle and the GPU is correctly exposed.\nNo worker ever runs as root.\n\nFor convenience, the module injects a tiny per-backend helper into `environment.systemPackages` whenever the backend's default user is used:\n\n- `llama-cpp` workers are plain system units, so they are managed with the usual `systemctl status llama-cpp-<model>` and `journalctl -u 
llama-cpp-<model>`.\n- `sglang-shell` and `vllm-shell` are `writeShellApplication` wrappers around `machinectl shell` that drop you into the backend user's session, where `systemctl --user`, `journalctl --user` and `podman ps` see the worker units directly. Run them with no arguments for an interactive shell, or pass a command to execute it inside the session.\n\n```\nservices.llmhop = {\n  enable = true;\n  llama-cpp = {\n    enable = true;\n    models.\"qwen3-8b\" = {\n      port = 18001;\n      settings.hf-repo = \"unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL\";\n    };\n  };\n  sglang = {\n    enable = true;\n    models.\"qwen3-coder\" = {\n      port = 19001;\n      model = \"Qwen/Qwen3-8B\";\n      settings.reasoning-parser = \"qwen3\";\n    };\n  };\n  vllm = {\n    enable = true;\n    models.\"llama-3-8b\" = {\n      port = 20001;\n      model = \"meta-llama/Meta-Llama-3-8B-Instruct\";\n    };\n  };\n};\n```\n\nSee the [options reference](https://mirkolenz.github.io/llmhop/) for the full list of per-backend options.\n\nThe generated config file lives in the world-readable Nix store, so secrets should never be placed in `services.llmhop.settings` directly.\nInstead, reference them via `${file:...}` and hand the files to the service with systemd's `LoadCredential=`.\nThe right-hand side of each `LoadCredential` entry is just a file path, so anything that produces a file works: [agenix](https://github.com/ryantm/agenix) or [sops-nix](https://github.com/Mic92/sops-nix) outputs, a manually-managed file under `/etc/llmhop/`, or a path emitted by your own secret-provisioning tool.\n\n```\nservices.llmhop.settings = {\n  authTokens = [ \"\\${file:client_token}\" ];\n  models.\"openai-gpt-4o\" = {\n    url = \"https://api.openai.com\";\n    headers.Authorization = \"Bearer \\${env:OPENAI_KEY}\";\n  };\n};\n\nsystemd.services.llmhop.serviceConfig = {\n  LoadCredential = [ \"client_token:/etc/llmhop/client-token\" ];\n  EnvironmentFile = [ 
\"/etc/llmhop/openai.env\" ];\n};\n```\n\n`/etc/llmhop/openai.env` is a plain `KEY=VALUE` file:\n\n```\nOPENAI_KEY=sk-...\n```\n\n`${file:...}` references are resolved against `$CREDENTIALS_DIRECTORY`, which systemd exposes as a per-unit tmpfs accessible only to this service, compatible with `DynamicUser` and the rest of the sandbox.\n`${env:...}` picks up anything the unit inherits, typically via `EnvironmentFile=`.\nPick whichever matches how your secret tooling hands you the data; mixing both in one config is fine.", "url": "https://wpnews.pro/news/show-hn-llmhop-a-tiny-stateless-router-for-llms-with-a-nixos-module", "canonical_source": "https://github.com/mirkolenz/llmhop", "published_at": "2026-06-05 00:26:09+00:00", "updated_at": "2026-06-05 00:47:16.322100+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "ai-products", "mlops"], "entities": ["LLMhop", "vLLM", "sglang", "TabbyAPI", "Aphrodite", "Ollama", "LocalAI", "OpenRouter"], "alternates": {"html": "https://wpnews.pro/news/show-hn-llmhop-a-tiny-stateless-router-for-llms-with-a-nixos-module", "markdown": "https://wpnews.pro/news/show-hn-llmhop-a-tiny-stateless-router-for-llms-with-a-nixos-module.md", "text": "https://wpnews.pro/news/show-hn-llmhop-a-tiny-stateless-router-for-llms-with-a-nixos-module.txt", "jsonld": "https://wpnews.pro/news/show-hn-llmhop-a-tiny-stateless-router-for-llms-with-a-nixos-module.jsonld"}}