
Run your own local LLM with rate limits via API-keys

A developer released a small Ruby prototype for an OpenAI-compatible LLM proxy that enforces per-user rate limits using a refillable token bucket system. The proxy, built entirely with Ruby standard libraries and no external dependencies, assigns each bearer token its own bucket and returns an OpenAI-style limit message when tokens are exhausted. The tool allows users to control token refill rates and costs, making it suitable for managing access to local LLM instances.

Published May 27, 2026 · 1 min read

Small Ruby prototype for an OpenAI-compatible LLM proxy with a refillable token bucket.

It uses only Ruby standard libraries: no gems, no Rack, no WEBrick.

BASE_API_URL=http://192.168.0.124:8888/v1 \
BASE_API_KEY=1mmer \
BASE_MODEL=gemma4 \
ruby llm_proxy.rb

The proxy listens on 0.0.0.0:8899 by default.

For your local LLM at 192.168.0.124:8888, run the saved local setup:

./run_local_proxy.sh

That starts the Ruby proxy at http://127.0.0.1:8899/v1 and forwards to http://192.168.0.124:8888/v1.

The saved local curl check is:

./curl_local_proxy.sh

Manual equivalent:

curl -sS -i -m 60 http://127.0.0.1:8899/v1/chat/completions \
  -H 'Authorization: Bearer user-a' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "Reply with exactly: proxy ok"}],
    "max_tokens": 16
  }'

Verified result through the proxy: the upstream replied with proxy ok and the proxy returned X-RateLimit-Remaining: 0 with the local test bucket.
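
The same check can be scripted in Ruby with only the standard library. This is a minimal sketch, not part of the prototype: the helper names build_request and check_proxy are hypothetical, and the URL, client key, and model are the ones from the curl command above.

```ruby
require "net/http"
require "json"
require "uri"

# Build the same POST request as the curl command above.
def build_request(uri, client_key)
  req = Net::HTTP::Post.new(uri)
  req["Authorization"] = "Bearer #{client_key}"
  req["Content-Type"] = "application/json"
  req.body = JSON.generate(
    model: "gemma4",
    messages: [{ role: "user", content: "Reply with exactly: proxy ok" }],
    max_tokens: 16
  )
  req
end

# Send the request and return [assistant text, X-RateLimit-Remaining header].
def check_proxy(base_url = "http://127.0.0.1:8899/v1", client_key = "user-a")
  uri = URI("#{base_url}/chat/completions")
  res = Net::HTTP.start(uri.host, uri.port, read_timeout: 60) do |http|
    http.request(build_request(uri, client_key))
  end
  text = JSON.parse(res.body).dig("choices", 0, "message", "content")
  [text, res["X-RateLimit-Remaining"]]
end
```

Calling check_proxy with the proxy running should return the assistant text together with the remaining bucket tokens reported in the X-RateLimit-Remaining header.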

Run the smoke test:

ruby test_llm_proxy.rb

Bucket settings (environment variables):

MAX_TOKENS=10                 # max saved tokens per user
REFILL_TOKENS=2               # tokens added each refill
REFILL_INTERVAL_SECONDS=300   # 5 minutes
REQUEST_TOKEN_COST=1          # cost per accepted completion request
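
To illustrate how these four settings interact, here is a minimal sketch of a refillable token bucket. It is not the code from llm_proxy.rb: the class name TokenBucket and its methods are hypothetical, and it assumes a new bucket starts full and that refills are credited in whole intervals.

```ruby
# Hypothetical sketch of a refillable token bucket (not the prototype's code).
class TokenBucket
  attr_reader :tokens

  def initialize(max_tokens: 10, refill_tokens: 2, refill_interval: 300, now: Time.now.to_f)
    @max_tokens = max_tokens            # MAX_TOKENS
    @refill_tokens = refill_tokens      # REFILL_TOKENS
    @refill_interval = refill_interval  # REFILL_INTERVAL_SECONDS
    @tokens = max_tokens                # assumption: a new bucket starts full
    @last_refill = now
  end

  # Try to spend `cost` tokens (REQUEST_TOKEN_COST); true means the request is accepted.
  def take(cost = 1, now: Time.now.to_f)
    refill(now)
    return false if @tokens < cost

    @tokens -= cost
    true
  end

  private

  # Credit every whole interval that has elapsed, capped at max_tokens.
  def refill(now)
    intervals = ((now - @last_refill) / @refill_interval).floor
    return if intervals <= 0

    @tokens = [@tokens + intervals * @refill_tokens, @max_tokens].min
    @last_refill += intervals * @refill_interval
  end
end
```

With the values shown above, a user who drains all 10 tokens in this sketch gets 2 back after 5 minutes and needs 25 minutes of inactivity to return to a full bucket.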

Each bearer token gets its own bucket. Requests without a bearer token are bucketed by remote IP. Set PROXY_API_KEYS=key1,key2 if the proxy should reject unknown client keys.
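
The bucket selection described here could look roughly like the sketch below. The helper names bucket_key and parse_allowed_keys are hypothetical. The handling of requests with no bearer token while an allowlist is set is also an assumption, since the text does not specify it.

```ruby
# Hypothetical helper: split a PROXY_API_KEYS-style string into a list of keys.
def parse_allowed_keys(raw)
  raw.to_s.split(",").map(&:strip).reject(&:empty?)
end

# Hypothetical helper: choose which bucket a request belongs to.
# Returns the bucket key, or nil when the request should be rejected.
def bucket_key(authorization_header, remote_ip, allowed_keys = [])
  token = authorization_header.to_s[/\ABearer\s+(\S+)\z/i, 1]
  if token
    # With an allowlist configured, unknown client keys are rejected.
    return nil unless allowed_keys.empty? || allowed_keys.include?(token)

    "key:#{token}"
  else
    # Assumption: anonymous requests are only accepted when no allowlist is set.
    allowed_keys.empty? ? "ip:#{remote_ip}" : nil
  end
end
```

Prefixing the key with "key:" or "ip:" keeps a client key that happens to look like an IP address from sharing a bucket with anonymous traffic.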

When the bucket is empty, /v1/chat/completions and /v1/completions return a normal OpenAI-style assistant response:

limit reached, wait 5 min

Example request through the proxy:

curl http://localhost:8899/v1/chat/completions \
  -H 'Authorization: Bearer user-a' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "anything",
    "messages": [{"role": "user", "content": "hello"}]
  }'
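
For the chat endpoint, an empty-bucket reply of this kind could be built as in the sketch below. The helper name limit_response is hypothetical and the field values (generated id, zeroed usage) are assumptions. Only the message text comes from the README, and the shape follows the usual chat.completion response object.

```ruby
require "json"
require "securerandom"

# Hypothetical sketch: wrap the limit message in a chat.completion-shaped body.
def limit_response(model, message = "limit reached, wait 5 min")
  {
    id: "chatcmpl-#{SecureRandom.hex(12)}",
    object: "chat.completion",
    created: Time.now.to_i,
    model: model,
    choices: [
      {
        index: 0,
        message: { role: "assistant", content: message },
        finish_reason: "stop"
      }
    ],
    # Assumption: nothing was sent upstream, so usage is reported as zero.
    usage: { prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 }
  }
end

puts JSON.generate(limit_response("gemma4"))
```

Because the body has the same shape as a real completion, an OpenAI-compatible client would display the limit message as ordinary assistant text.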

By default, one completion request costs REQUEST_TOKEN_COST bucket tokens. To charge roughly by prompt size plus expected output:

TOKEN_COST_MODE=estimate RESPONSE_TOKEN_RESERVE=256 ruby llm_proxy.rb

This is only an approximation for the prototype.
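
One way such an estimate might be computed is sketched below. The 4-characters-per-token ratio is a common rule of thumb rather than a value taken from the prototype, and the helper name estimated_cost is hypothetical.

```ruby
# Assumption: roughly 4 characters per token; the prototype may use another ratio.
CHARS_PER_TOKEN = 4

# Hypothetical estimate: prompt size in tokens plus a fixed reserve for the reply
# (the reserve plays the role of RESPONSE_TOKEN_RESERVE).
def estimated_cost(messages, response_reserve = 256)
  prompt_chars = messages.sum { |m| m["content"].to_s.length }
  (prompt_chars.to_f / CHARS_PER_TOKEN).ceil + response_reserve
end

puts estimated_cost([{ "role" => "user", "content" => "hello" }])
```

Under this sketch even a one-word prompt costs more than 256 tokens, so MAX_TOKENS and REFILL_TOKENS would have to be sized in model-token units rather than request counts.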
