Run your own local LLM with rate limits via API keys

A developer released a small Ruby prototype for an OpenAI-compatible LLM proxy that enforces per-user rate limits using a refillable token bucket system. The proxy, built entirely with Ruby standard libraries and no external dependencies, assigns each bearer token its own bucket and returns an OpenAI-style limit message when tokens are exhausted. The tool allows users to control token refill rates and costs, making it suitable for managing access to local LLM instances.
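
To make the refill behavior concrete, here is a minimal sketch of a refillable token bucket in plain Ruby. It is not the prototype's code: the class name `TokenBucket`, its parameters, and the injectable `clock` are all hypothetical, and the example values (capacity 10, 2 tokens every 300 seconds) simply mirror the configuration values quoted in this post. The sketch assumes refills are credited in whole intervals and is not thread-safe.

```ruby
# Hypothetical sketch of a refillable token bucket; not the prototype's code.
class TokenBucket
  attr_reader :tokens

  # capacity:        maximum number of tokens the bucket can hold
  # refill_tokens:   tokens added for each elapsed interval
  # refill_interval: interval length in seconds
  # clock:           callable returning the current time in seconds
  #                  (injectable so the behavior can be tested without sleeping)
  def initialize(capacity:, refill_tokens:, refill_interval:, clock: nil)
    @capacity = capacity
    @refill_tokens = refill_tokens
    @refill_interval = refill_interval
    @clock = clock || -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
    @tokens = capacity
    @last_refill = @clock.call
  end

  # Try to spend `cost` tokens. Returns true if the request is allowed.
  def take(cost = 1)
    refill
    return false if @tokens < cost

    @tokens -= cost
    true
  end

  private

  # Credit whole elapsed intervals only, never exceeding the capacity.
  def refill
    elapsed = @clock.call - @last_refill
    intervals = (elapsed / @refill_interval).floor
    return if intervals <= 0

    @tokens = [@capacity, @tokens + intervals * @refill_tokens].min
    @last_refill += intervals * @refill_interval
  end
end

# One bucket per client key (for example a bearer token), created on first use.
buckets = Hash.new do |hash, key|
  hash[key] = TokenBucket.new(capacity: 10, refill_tokens: 2, refill_interval: 300)
end
```

With this sketch, `buckets["user-a"].take` returns `true` while that client still has tokens and `false` once the bucket is empty. A real proxy that serves requests from several threads would also need a `Mutex` around `take`.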

Small Ruby prototype for an OpenAI-compatible LLM proxy with a refillable token bucket. It uses only Ruby standard libraries: no gems, no Rack, no WEBrick.

    BASE_API_URL=http://192.168.0.124:8888/v1 \
    BASE_API_KEY=1mmer \
    BASE_MODEL=gemma4 \
    ruby llm_proxy.rb

The proxy listens on 0.0.0.0:8899 by default.

For your local LLM at 192.168.0.124:8888, run the saved local setup:

    ./run_local_proxy.sh

That starts the Ruby proxy at http://127.0.0.1:8899/v1 and forwards to http://192.168.0.124:8888/v1.

The saved local curl check is:

    ./curl_local_proxy.sh

Manual equivalent:

    curl -sS -i -m 60 http://127.0.0.1:8899/v1/chat/completions \
      -H 'Authorization: Bearer user-a' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "gemma4",
        "messages": [{"role": "user", "content": "Reply with exactly: proxy ok"}],
        "max_tokens": 16
      }'

Verified result through the proxy: the upstream replied with "proxy ok" and the proxy returned X-RateLimit-Remaining: 0 with the local test bucket.

Run the smoke test:

    ruby test_llm_proxy.rb

The token bucket is configured with environment variables:

    MAX_TOKENS=10                 # max saved tokens per user
    REFILL_TOKENS=2               # tokens added each refill
    REFILL_INTERVAL_SECONDS=300   # 5 minutes
    REQUEST_TOKEN_COST=1          # cost per accepted completion request

Each bearer token gets its own bucket. Requests without a bearer token are bucketed by remote IP. Set PROXY_API_KEYS=key1,key2 if the proxy should reject unknown client keys.

When the bucket is empty, /v1/chat/completions and /v1/completions return a normal OpenAI-style assistant response:

    limit reached, wait 5 min

To see it, send requests through the proxy until the bucket runs out:

    curl http://localhost:8899/v1/chat/completions \
      -H 'Authorization: Bearer user-a' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "anything",
        "messages": [{"role": "user", "content": "hello"}]
      }'

By default, one completion request costs REQUEST_TOKEN_COST bucket tokens. To charge roughly by prompt size plus expected output:

    TOKEN_COST_MODE=estimate RESPONSE_TOKEN_RESERVE=256 ruby llm_proxy.rb

This is only an approximation for the prototype.
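
The README does not say how the estimate is computed, so the following is only one plausible way to charge by prompt size plus expected output. The function name `estimate_cost`, the four-characters-per-token rule of thumb, and the choice to prefer the request's own `max_tokens` over the reserve are all assumptions made for illustration, not the prototype's actual logic.

```ruby
require "json"

# Assumption: roughly 4 characters per token. This is a common rule of thumb,
# not something the prototype is known to use.
CHARS_PER_TOKEN = 4

# Fallback output budget, read from the environment with 256 as the default.
RESPONSE_TOKEN_RESERVE = Integer(ENV.fetch("RESPONSE_TOKEN_RESERVE", "256"))

# Estimate the bucket cost of a chat completion request.
# body:    the raw JSON request body as a String
# reserve: output tokens to assume when the request sets no max_tokens
def estimate_cost(body, reserve: RESPONSE_TOKEN_RESERVE)
  request = JSON.parse(body)

  messages = request["messages"]
  messages = [] unless messages.is_a?(Array)

  # Count the characters of every message's content and convert to tokens.
  prompt_chars = messages.sum { |m| m.is_a?(Hash) ? m["content"].to_s.length : 0 }
  prompt_tokens = (prompt_chars / CHARS_PER_TOKEN.to_f).ceil

  # Assumption: an explicit max_tokens bounds the output; otherwise use the reserve.
  max_tokens = request["max_tokens"]
  output_tokens = max_tokens.is_a?(Integer) ? max_tokens : reserve

  prompt_tokens + output_tokens
end
```

Under these assumptions, a request whose only message is "hello" and that sets no `max_tokens` would cost 2 prompt tokens plus the 256-token reserve, or 258 bucket tokens. A bucket sized for such costs would need a much larger capacity than one that charges a flat cost per request.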