# Run your own local LLM with rate limits via API-keys

> Source: <https://github.com/skorotkiewicz/llm-rt>
> Published: 2026-05-27 18:39:25+00:00

Small Ruby prototype for an OpenAI-compatible LLM proxy with a refillable token bucket.

It uses only Ruby standard libraries: no gems, no Rack, no WEBrick.

```
BASE_API_URL=http://192.168.0.124:8888/v1 \
BASE_API_KEY=1mmer \
BASE_MODEL=gemma4 \
ruby llm_proxy.rb
```

The proxy listens on `0.0.0.0:8899` by default.

For your local LLM at `192.168.0.124:8888`, run the saved local setup:

```
./run_local_proxy.sh
```

That starts the Ruby proxy at `http://127.0.0.1:8899/v1` and forwards to `http://192.168.0.124:8888/v1`.

The saved local curl check is:

```
./curl_local_proxy.sh
```

Manual equivalent:

```
curl -sS -i -m 60 http://127.0.0.1:8899/v1/chat/completions \
  -H 'Authorization: Bearer user-a' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "Reply with exactly: proxy ok"}],
    "max_tokens": 16
  }'
```

Verified result through the proxy: the upstream replied with `proxy ok` and the proxy returned `X-RateLimit-Remaining: 0` with the local test bucket.
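The same check can be made from Ruby with only the standard library. This is a minimal sketch, assuming the proxy is reachable at `http://127.0.0.1:8899/v1` and accepts the key `user-a` as in the curl command; the helper names `build_chat_request` and `ask_proxy` are hypothetical and not part of the repository.

```ruby
require "json"
require "net/http"
require "uri"

# Builds the same POST request as the curl command above.
# (Hypothetical helper, not taken from llm_proxy.rb.)
def build_chat_request(uri, api_key, prompt, model: "gemma4", max_tokens: 16)
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{api_key}"
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(
    model: model,
    messages: [{ role: "user", content: prompt }],
    max_tokens: max_tokens
  )
  request
end

# Sends the request and returns [assistant reply, X-RateLimit-Remaining header].
# The base URL and key defaults are assumptions copied from the curl example.
def ask_proxy(prompt, base_url: "http://127.0.0.1:8899/v1", api_key: "user-a")
  uri = URI("#{base_url}/chat/completions")
  response = Net::HTTP.start(uri.host, uri.port, read_timeout: 60) do |http|
    http.request(build_chat_request(uri, api_key, prompt))
  end
  reply = JSON.parse(response.body).dig("choices", 0, "message", "content")
  [reply, response["X-RateLimit-Remaining"]]
end
```

With the proxy running, `reply, remaining = ask_proxy("Reply with exactly: proxy ok")` should give the assistant text and the remaining bucket tokens as reported by the header.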

Run the smoke test:

```
ruby test_llm_proxy.rb
```

The token bucket is configured with these environment variables:

```
MAX_TOKENS=10                 # max saved tokens per user
REFILL_TOKENS=2               # tokens added each refill
REFILL_INTERVAL_SECONDS=300   # 5 minutes
REQUEST_TOKEN_COST=1          # cost per accepted completion request
```
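To make the four settings concrete, here is a minimal sketch of a refillable token bucket. It is not the code from `llm_proxy.rb`: the class name `TokenBucket`, the choice to start a new bucket full, and the refill in whole intervals are all assumptions made for illustration.

```ruby
# Hypothetical sketch of a refillable token bucket; not the code from llm_proxy.rb.
class TokenBucket
  attr_reader :tokens

  def initialize(max_tokens:, refill_tokens:, refill_interval:, now: Time.now.to_f)
    @max_tokens = max_tokens            # MAX_TOKENS
    @refill_tokens = refill_tokens      # REFILL_TOKENS
    @refill_interval = refill_interval  # REFILL_INTERVAL_SECONDS
    @tokens = max_tokens                # assumption: a new bucket starts full
    @last_refill = now
  end

  # Adds refill_tokens for every whole interval elapsed, capped at max_tokens.
  def refill(now)
    intervals = ((now - @last_refill) / @refill_interval).floor
    return if intervals <= 0

    @tokens = [@tokens + intervals * @refill_tokens, @max_tokens].min
    @last_refill += intervals * @refill_interval
  end

  # Deducts cost and returns true when enough tokens are saved, else returns false.
  def take(cost, now: Time.now.to_f)
    refill(now)
    return false if @tokens < cost

    @tokens -= cost
    true
  end
end

bucket = TokenBucket.new(max_tokens: 10, refill_tokens: 2, refill_interval: 300, now: 0.0)
10.times { bucket.take(1, now: 0.0) }  # drain the bucket
bucket.take(1, now: 10.0)              # => false, no interval has elapsed yet
bucket.take(1, now: 300.0)             # => true, one refill added 2 tokens
```

Passing `now` explicitly keeps the sketch testable without sleeping; a real server would use the current time.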

Each bearer token gets its own bucket. Requests without a bearer token are bucketed by remote IP. Set `PROXY_API_KEYS=key1,key2` if the proxy should reject unknown client keys.
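The bucketing rule above could be sketched as a small lookup function. The name `bucket_key` and the `key:`/`ip:` prefixes are hypothetical. Treating a missing bearer token as an unknown key when an allowlist is set is also an assumption, since the README does not say how that case is handled.

```ruby
# Hypothetical helper; names are illustrative, not taken from llm_proxy.rb.
# Returns the bucket key for a request, or nil when the request should be rejected.
def bucket_key(authorization_header, remote_ip, allowed_keys)
  # Extract the token from "Bearer <token>"; nil when the header is absent or malformed.
  token = authorization_header.to_s[/\ABearer\s+(.+)\z/i, 1]&.strip

  if allowed_keys.empty?
    # No allowlist: every bearer token gets its own bucket, anonymous clients one per IP.
    return token ? "key:#{token}" : "ip:#{remote_ip}"
  end

  # Allowlist set: only known keys pass (assumption: a missing token is also rejected).
  allowed_keys.include?(token) ? "key:#{token}" : nil
end

# Parse PROXY_API_KEYS=key1,key2 into a list; empty when the variable is unset.
allowed = ENV.fetch("PROXY_API_KEYS", "").split(",").map(&:strip).reject(&:empty?)
```

A caller would then use `bucket_key(header, ip, allowed)` to pick the bucket, answering with an error when it returns `nil`.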

When the bucket is empty, `/v1/chat/completions` and `/v1/completions` return a normal OpenAI-style assistant response:

```
limit reached, wait 5 min
```

Example request:

```
curl http://localhost:8899/v1/chat/completions \
  -H 'Authorization: Bearer user-a' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "anything",
    "messages": [{"role": "user", "content": "hello"}]
  }'
```
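For illustration, the rate-limited reply to `/v1/chat/completions` might be shaped like the sketch below. Only the message text comes from this README. The function name `limit_reached_response`, the generated `id`, and the zeroed `usage` block are assumptions modelled on the OpenAI chat completion format.

```ruby
require "json"
require "securerandom"

# Hypothetical builder for the "bucket empty" reply; every field except the
# message text is an assumption based on the OpenAI chat completion format.
def limit_reached_response(model, message = "limit reached, wait 5 min")
  {
    id: "chatcmpl-#{SecureRandom.hex(12)}",
    object: "chat.completion",
    created: Time.now.to_i,
    model: model,
    choices: [
      {
        index: 0,
        message: { role: "assistant", content: message },
        finish_reason: "stop"
      }
    ],
    usage: { prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 }
  }
end

puts JSON.pretty_generate(limit_reached_response("gemma4"))
```

Answering in the usual response format, instead of with an HTTP error, lets chat clients display the limit message like any other assistant reply.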

By default, one completion request costs `REQUEST_TOKEN_COST` bucket tokens. To charge roughly by prompt size plus expected output:

```
TOKEN_COST_MODE=estimate RESPONSE_TOKEN_RESERVE=256 ruby llm_proxy.rb
```

This is only an approximation for the prototype.
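One way such an estimate could work is sketched below. The README does not describe the actual formula, so the rule of thumb of about 4 characters per token, the function name `estimated_cost`, and the handling of `messages` versus `prompt` are all assumptions.

```ruby
require "json"

# Assumption: about 4 characters per token, a common rough heuristic.
CHARS_PER_TOKEN = 4.0

# Hypothetical estimator, not taken from llm_proxy.rb.
# Cost = estimated prompt tokens + a fixed reserve for the response.
def estimated_cost(request_body, response_reserve: 256)
  payload = JSON.parse(request_body)
  text = if payload["messages"].is_a?(Array)
           # Chat request: join the content of all messages.
           payload["messages"].map { |m| m["content"].to_s }.join("\n")
         else
           # Plain completion request: use the prompt field.
           payload["prompt"].to_s
         end
  prompt_tokens = (text.length / CHARS_PER_TOKEN).ceil
  prompt_tokens + response_reserve
end

body = JSON.generate(model: "gemma4", messages: [{ role: "user", content: "hello" }])
estimated_cost(body)  # => 258 (2 prompt tokens + 256 reserved)
```

Under this sketch a single request costs far more than 1, so `MAX_TOKENS` and `REFILL_TOKENS` would presumably need to be raised to match.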
