{"slug": "run-your-own-local-llm-with-rate-limits-via-api-keys", "title": "Run your own local LLM with rate limits via API-keys", "summary": "A developer released a small Ruby prototype for an OpenAI-compatible LLM proxy that enforces per-user rate limits using a refillable token bucket system. The proxy, built entirely with Ruby standard libraries and no external dependencies, assigns each bearer token its own bucket and returns an OpenAI-style limit message when tokens are exhausted. The tool allows users to control token refill rates and costs, making it suitable for managing access to local LLM instances.", "body_md": "Small Ruby prototype for an OpenAI-compatible LLM proxy with a refillable token bucket.\n\nIt uses only Ruby standard libraries: no gems, no Rack, no WEBrick.\n\n```\nBASE_API_URL=http://192.168.0.124:8888/v1 \\\nBASE_API_KEY=1mmer \\\nBASE_MODEL=gemma4 \\\nruby llm_proxy.rb\n```\n\nThe proxy listens on `0.0.0.0:8899` by default.\n\nFor your local LLM at `192.168.0.124:8888`, run the saved local setup:\n\n```\n./run_local_proxy.sh\n```\n\nThat starts the Ruby proxy at `http://127.0.0.1:8899/v1` and forwards to `http://192.168.0.124:8888/v1`.\n\nThe saved local curl check is:\n\n```\n./curl_local_proxy.sh\n```\n\nManual equivalent:\n\n```\ncurl -sS -i -m 60 http://127.0.0.1:8899/v1/chat/completions \\\n  -H 'Authorization: Bearer user-a' \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"model\": \"gemma4\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: proxy ok\"}],\n    \"max_tokens\": 16\n  }'\n```\n\nVerified result through the proxy: the upstream replied with `proxy ok` and the proxy returned `X-RateLimit-Remaining: 0` with the local test bucket.\n\nRun the smoke test:\n\n```\nruby test_llm_proxy.rb\n```\n\nThe token bucket is configured with environment variables:\n\n```\nMAX_TOKENS=10                 # max saved tokens per user\nREFILL_TOKENS=2               # tokens added each refill\nREFILL_INTERVAL_SECONDS=300   # 5 minutes\nREQUEST_TOKEN_COST=1          # cost per accepted completion request\n```\n\nEach bearer token gets its own bucket. Requests without a bearer token are bucketed by remote IP. Set `PROXY_API_KEYS=key1,key2` if the proxy should reject unknown client keys.\n\nWhen the bucket is empty, `/v1/chat/completions` and `/v1/completions` return a normal OpenAI-style assistant response:\n\n```\nlimit reached, wait 5 min\n```\n\nExample request:\n\n```\ncurl http://localhost:8888/v1/chat/completions \\\n  -H 'Authorization: Bearer user-a' \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"model\": \"anything\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"hello\"}]\n  }'\n```\n\nBy default, one completion request costs `REQUEST_TOKEN_COST` bucket tokens. To charge roughly by prompt size plus expected output:\n\n```\nTOKEN_COST_MODE=estimate RESPONSE_TOKEN_RESERVE=256 ruby llm_proxy.rb\n```\n\nThis is only an approximation for the prototype.", "url": "https://wpnews.pro/news/run-your-own-local-llm-with-rate-limits-via-api-keys", "canonical_source": "https://github.com/skorotkiewicz/llm-rt", "published_at": "2026-05-27 18:39:25+00:00", "updated_at": "2026-05-27 18:45:22.754116+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools"], "entities": ["Ruby", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/run-your-own-local-llm-with-rate-limits-via-api-keys", "markdown": "https://wpnews.pro/news/run-your-own-local-llm-with-rate-limits-via-api-keys.md", "text": "https://wpnews.pro/news/run-your-own-local-llm-with-rate-limits-via-api-keys.txt", "jsonld": "https://wpnews.pro/news/run-your-own-local-llm-with-rate-limits-via-api-keys.jsonld"}}