# Show HN: Built AI-Gateway reverse proxy to reduce LLM API costs and token burn

> Source: <https://github.com/Arnab758/ai-gateway>
> Published: 2026-06-25 04:14:47+00:00

**Cut your LLM API costs by 40-70% with zero code changes.**

A semantic caching layer that sits between your app and AI providers (OpenAI, Groq, etc.). When you ask a similar question twice, it returns the cached answer instantly instead of calling the API again.

You're building an AI app and your API bill is $500/month. 40-70% of that is for **repeat questions**:

- "What is RAG?" asked 100 times = 100 API calls
- "How do I reset my password?" asked 50 times = 50 API calls

**With AI Gateway:** Those 150 calls become 2 calls (one for each unique question). You save $200-350/month.

**How was your deployment experience?**

*Takes 30 seconds. Helps us improve AI Gateway for everyone.*

**What we want to know:**

- ⭐ How did deployment go? (Excellent / Average / Bad)
- 🐛 Any problems you faced?
- 💡 What features would you like to see?
- 📊 How much are you saving on API costs?

**Your feedback directly shapes the roadmap.**

**Steps:**

- Click the button above
- Sign in with GitHub
- Enter your API key (Groq or OpenAI)
- Click "Deploy"
- Done! Your gateway is live at `https://your-app.up.railway.app`

**What you get:**

- ✅ Hosted gateway (no server management)
- ✅ Redis included (persistent cache)
- ✅ Auto-scaling
- ✅ HTTPS enabled
- ✅ $5/month free credit

**Steps:**

- Click the button
- Sign in with GitHub
- Add environment variable: `UPSTREAM_API_KEY=your_key`
- Click "Create Web Service"
- Done!

**Note:** You'll need to add a Redis addon separately in Render dashboard.

**Prerequisites:**

- Docker installed
- Docker Compose installed
- A Groq or OpenAI API key

**Steps:**

```
# 1. Clone the repo
git clone https://github.com/Arnab758/ai-gateway.git
cd ai-gateway

# 2. Set your API key
export UPSTREAM_API_KEY=gsk_your_groq_key_here

# 3. Start everything (gateway + Redis)
docker compose up -d

# 4. Verify it's running
curl http://localhost:8080/health

# Expected response: {"status":"ok"}
```

**That's it!** Your gateway is now running at `http://localhost:8080`

```
# Send a request through the gateway
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Gateway-Token: my-app" \
  -H "Authorization: Bearer sk-your-openai-or-groq-key" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "What is RAG?"}]
  }'

# Send the SAME request again
# Response headers will show: X-Gateway-Cache: HIT
# You just saved money! 💰
```

```python
import requests

# Your gateway URL (from Railway/Render/Docker)
GATEWAY_URL = "https://your-app.up.railway.app"
API_KEY = "sk-your-key"

response = requests.post(
    f"{GATEWAY_URL}/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "X-Gateway-Token": "my-app",
        "Authorization": f"Bearer {API_KEY}"
    },
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "What is RAG?"}]
    }
)

print(response.json())
```

```js
const response = await fetch('https://your-app.up.railway.app/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Gateway-Token': 'my-app',
    'Authorization': 'Bearer sk-your-key'
  },
  body: JSON.stringify({
    model: 'gpt-4',
    messages: [{ role: 'user', content: 'What is RAG?' }]
  })
});

const data = await response.json();
console.log(data);
```

- **Semantic Caching**: Matches similar questions, not just exact duplicates ("What is RAG?" = "Explain RAG" = "RAG definition")
- **Multi-Tenant**: Each customer gets their own isolated cache
- **4-Tier Matching:**
  - Exact match (100% identical)
  - Template match ("weather in London" = "weather in Paris")
  - Semantic match (similar meaning)
  - Word overlap (partial matches)
- **Redis + In-Memory Fallback**: Works with or without Redis
- **Request Deduplication**: 100 concurrent identical requests = 1 API call
- **Rate Limiting**: Prevent abuse per tenant
- **Circuit Breaker**: Automatically stops calling if provider is down
- **Cost Tracking**: See how much you saved
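To make the matching tiers more concrete, here is a minimal Python sketch of two of them: the exact-match tier and the word-overlap tier. It is an illustration under assumptions, not the gateway's actual code. The names `tokens`, `jaccard`, and `lookup` are hypothetical, word overlap is assumed to mean Jaccard similarity over word sets, and the 0.75 default simply mirrors the `jaccard_threshold` value in the sample `/stats` response.

```python
import re

def tokens(text):
    """Lowercase the prompt and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Word-set overlap: size of the intersection divided by size of the union."""
    wa, wb = tokens(a), tokens(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def lookup(prompt, cache, threshold=0.75):
    """Return (tier, response) for the first tier that matches, else ("miss", None).

    `cache` is a plain dict mapping a cached prompt to its response
    (a stand-in for Redis or the in-memory store).
    """
    # Exact tier: the prompt string is identical to a cached prompt.
    if prompt in cache:
        return "exact", cache[prompt]

    # Word-overlap tier: pick the cached prompt with the highest Jaccard score.
    best_response, best_score = None, 0.0
    for cached_prompt, response in cache.items():
        score = jaccard(prompt, cached_prompt)
        if score > best_score:
            best_response, best_score = response, score
    if best_score >= threshold:
        return "overlap", best_response
    return "miss", None
```

Under this sketch, "What is RAG?" and "Explain RAG" share only one word out of four, so word overlap alone would not match them. That is the kind of pair the semantic tier is meant to catch.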

**Scenario:** Customer support chatbot with 10,000 users

**Without AI Gateway:**

- 10,000 users ask 100 common questions each
- 1,000,000 API calls/month
- Cost: $500/month (at $0.0005/call)

**With AI Gateway:**

- First time each of the 100 questions is asked: 100 API calls (cache miss)
- Remaining 999,900 requests for the same questions: 0 API calls (cache hit)
- Total: 100 API calls/month
- Cost: $0.05/month

**Savings: $499.95/month (99.99%)**

**Even with 30% unique questions:**

- 300,000 API calls
- Cost: $150/month

**Savings: $350/month (70%)**
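The arithmetic in this scenario can be reproduced with a few lines of Python. This is only a worked calculation using the numbers above; `monthly_savings` is a hypothetical helper, not part of the gateway.

```python
def monthly_savings(total_calls, billed_calls, cost_per_call):
    """Compare the bill without caching to the bill when only cache misses are billed."""
    baseline = total_calls * cost_per_call      # every request hits the provider
    with_gateway = billed_calls * cost_per_call  # only cache misses hit the provider
    saved = baseline - with_gateway
    return {
        "baseline": baseline,
        "with_gateway": with_gateway,
        "saved": saved,
        "saved_percent": 100 * saved / baseline if baseline else 0.0,
    }

# Best case from the scenario: 100 unique questions out of 1,000,000 requests.
best_case = monthly_savings(1_000_000, 100, 0.0005)

# 30% of requests are unique, so 300,000 calls still reach the provider.
mixed_case = monthly_savings(1_000_000, 300_000, 0.0005)

print(round(best_case["saved"], 2), round(best_case["saved_percent"], 2))
print(round(mixed_case["saved"], 2), round(mixed_case["saved_percent"], 2))
```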

Edit `gateway.yaml` to customize:

```
cache:
  redis_url: "redis://localhost:6379"  # Or your Redis URL
  vector:
    enabled: true
    similarity_threshold: 0.85  # 85% similar = cache hit
  ttl_hours: 24  # Cache entries expire after 24 hours

rate_limiter:
  enabled: true
  max_requests: 60  # Per minute per tenant
```
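As a rough illustration of how `similarity_threshold` and `ttl_hours` could interact, the sketch below treats a cached entry as a hit only if it is still within its TTL and its embedding is close enough to the query. It assumes cosine similarity as the metric, which this README does not confirm, and `cosine_similarity` and `is_cache_hit` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_cache_hit(query_vec, cached_vec, cached_at, now,
                 similarity_threshold=0.85, ttl_hours=24):
    """A cached entry counts as a hit only if it is fresh AND similar enough.

    `cached_at` and `now` are timestamps in seconds.
    """
    fresh = (now - cached_at) < ttl_hours * 3600
    similar = cosine_similarity(query_vec, cached_vec) >= similarity_threshold
    return fresh and similar
```

Lowering the threshold admits looser matches and raises the hit rate, at the risk of returning an answer to a question that only resembles the one asked.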

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Main proxy endpoint with caching |
| `/health` | GET | Health check |
| `/stats` | GET | Cache statistics |
| `/metrics` | GET | Prometheus metrics |

```
curl http://localhost:8080/stats
```

Response:

```
{
  "uptime": 1234567890,
  "cache": {
    "local_index_entries": 150,
    "vector_dimensions": 128,
    "vector_threshold": 0.85,
    "jaccard_threshold": 0.75,
    "template_enabled": true,
    "dedup_enabled": true,
    "ttl_hours": 24
  }
}
```

Every response includes cache information:

```
X-Gateway-Cache: HIT          # or MISS
X-Gateway-Similarity: 0.95    # 95% similar (if HIT)
X-Gateway-Time-Saved: 1234ms  # Time saved (if HIT)
```
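If you want to log hit rates from your own app, these headers can be read off each response. The helper below is a small sketch: `parse_cache_headers` is a hypothetical name, and it assumes the header values look exactly like the examples above (a decimal similarity and a time ending in `ms`).

```python
def parse_cache_headers(headers):
    """Turn the X-Gateway-* response headers into typed values.

    `headers` is any mapping with a .get() method, for example
    `response.headers` from the `requests` library or a plain dict.
    """
    hit = headers.get("X-Gateway-Cache", "MISS").strip().upper() == "HIT"
    similarity = headers.get("X-Gateway-Similarity")
    time_saved = headers.get("X-Gateway-Time-Saved")
    # Strip the trailing unit, e.g. "1234ms" -> "1234".
    if time_saved is not None and time_saved.endswith("ms"):
        time_saved = time_saved[:-2]
    return {
        "hit": hit,
        "similarity": float(similarity) if similarity is not None else None,
        "time_saved_ms": int(time_saved) if time_saved is not None else None,
    }
```

With the Python example earlier in this README, `parse_cache_headers(response.headers)` would give you the same fields for a live request.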

**Solution:** Redis is optional! The gateway will fall back to in-memory cache automatically. For production, add Redis:

- **Railway:** Add Redis from the "New" button
- **Render:** Add Redis from "New" → "Database" → "Redis"
- **Docker:** Already included in `docker-compose.yml`

**Cause:** You're hitting rate limits on free tier (Groq/OpenAI)

**Solutions:**

- Wait 1-2 minutes and try again
- Upgrade to paid tier ($0.002/request vs free limits)
- Add your own API key with higher limits

**Cause:** Too many requests from one tenant

**Solution:** Increase rate limits in `gateway.yaml`:

```
rate_limiter:
  max_requests: 120  # Increase from 60
  window_minutes: 1
```
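For intuition about what `max_requests` and `window_minutes` control, here is a minimal per-tenant limiter in Python. It uses a fixed-window counter, which is an assumption: the README does not say which algorithm the gateway uses, and `FixedWindowRateLimiter` is a hypothetical class written only for illustration.

```python
import time

class FixedWindowRateLimiter:
    """Allow at most `max_requests` per tenant within each fixed time window."""

    def __init__(self, max_requests=60, window_minutes=1, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_minutes * 60
        self.clock = clock      # injectable so the limiter can be tested without sleeping
        self._windows = {}      # tenant -> (window_start, request_count)

    def allow(self, tenant):
        """Return True if the request may proceed, False if the tenant is over the limit."""
        now = self.clock()
        start, count = self._windows.get(tenant, (now, 0))
        # Start a fresh window once the current one has elapsed.
        if now - start >= self.window_seconds:
            start, count = now, 0
        if count >= self.max_requests:
            self._windows[tenant] = (start, count)
            return False
        self._windows[tenant] = (start, count + 1)
        return True
```

Each tenant has its own counter, so one noisy tenant hitting the limit does not block the others.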

**Cause:** Prompts are too different

**Solution:** Lower the similarity threshold in `gateway.yaml`:

```
cache:
  vector:
    similarity_threshold: 0.75  # Lower from 0.85
  jaccard:
    threshold: 0.65  # Lower from 0.75
```

```
Your App → AI Gateway → [Cache Check] → Redis
                ↓
            [Cache HIT] → Return cached response (instant, $0)
                ↓
            [Cache MISS] → Call LLM Provider → Cache response → Return
```
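The flow in the diagram is essentially a cache-aside pattern. The Python sketch below mirrors it with an in-memory dict standing in for Redis and exact-match lookup only. `GatewaySketch` and `call_provider` are hypothetical names, and keying the cache by tenant is an assumption about how per-tenant isolation might work.

```python
class GatewaySketch:
    """Check the cache first; call the provider only on a miss, then cache the result."""

    def __init__(self, call_provider):
        self.call_provider = call_provider  # function: prompt -> response
        self.cache = {}                     # (tenant, prompt) -> response
        self.provider_calls = 0

    def complete(self, tenant, prompt):
        """Return (response, "HIT" or "MISS") for this tenant's prompt."""
        key = (tenant, prompt)  # tenant in the key keeps each tenant's cache separate
        if key in self.cache:
            return self.cache[key], "HIT"
        # Cache miss: call the provider, store the response, then return it.
        response = self.call_provider(prompt)
        self.provider_calls += 1
        self.cache[key] = response
        return response, "MISS"

# Usage with a fake provider, so no real API call is made.
gateway = GatewaySketch(lambda prompt: f"answer to: {prompt}")
first = gateway.complete("my-app", "What is RAG?")
second = gateway.complete("my-app", "What is RAG?")
other_tenant = gateway.complete("other-app", "What is RAG?")
```

Here the second call from `my-app` is served from the cache, while `other-app` triggers its own provider call because its cache entries are separate.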

Contributions are welcome! Please:

- Fork the repo
- Create a feature branch
- Make your changes
- Submit a pull request

MIT License - feel free to use this commercially!

- **Issues:** [GitHub Issues](https://github.com/Arnab758/ai-gateway/issues)
- **Discussions:** [GitHub Discussions](https://github.com/Arnab758/ai-gateway/discussions)
- **Demo:** [Live Demo](https://ai-gateway-production-c86a.up.railway.app/demo)

If this project helps you, please give it a star! It helps others find it.

**Built with ❤️ for the AI community**

**Questions?** Open an issue and I'll respond within 24 hours.
