Show HN: Built AI-Gateway reverse proxy to reduce LLM API costs and token burn

Cut your LLM API costs by 40-70% with zero code changes.

A semantic caching layer that sits between your app and AI providers (OpenAI, Groq, etc.). When you ask a similar question twice, it returns the cached answer instantly instead of calling the API again.

You're building an AI app and your API bill is $500/month. 40-70% of that is for repeat questions:

"What is RAG?" asked 100 times = 100 API calls
"How do I reset my password?" asked 50 times = 50 API calls

With AI Gateway: Those 150 calls become 2 calls (one for each unique question). You save $200-350/month.

How was your deployment experience?

Takes 30 seconds. Helps us improve AI Gateway for everyone.

What we want to know:

⭐ How did deployment go? (Excellent / Average / Bad)
🐛 Any problems you faced?
💡 What features would you like to see?
📊 How much are you saving on API costs?

Your feedback directly shapes the roadmap.

Steps:

Click the button above
Sign in with GitHub
Enter your API key (Groq or OpenAI)
Click "Deploy"
Done! Your gateway is live at https://your-app.up.railway.app

What you get:

✅ Hosted gateway (no server management)
✅ Redis included (persistent cache)
✅ Auto-scaling
✅ HTTPS enabled
✅ $5/month free credit

Steps:

Click the button
Sign in with GitHub
Add environment variable: UPSTREAM_API_KEY=your_key
Click "Create Web Service"
Done!

Note: You'll need to add a Redis add-on separately in the Render dashboard.

Prerequisites:

Docker installed
Docker Compose installed
A Groq or OpenAI API key

Steps:

git clone https://github.com/Arnab758/ai-gateway.git
cd ai-gateway

export UPSTREAM_API_KEY=gsk_your_groq_key_here

docker compose up -d

curl http://localhost:8080/health

That's it! Your gateway is now running at http://localhost:8080

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Gateway-Token: my-app" \
  -H "Authorization: Bearer sk-your-openai-or-groq-key" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "What is RAG?"}]
  }'

Python:

import requests

GATEWAY_URL = "https://your-app.up.railway.app"
API_KEY = "sk-your-key"

response = requests.post(
    f"{GATEWAY_URL}/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "X-Gateway-Token": "my-app",
        "Authorization": f"Bearer {API_KEY}"
    },
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "What is RAG?"}]
    }
)

print(response.json())

JavaScript:

const response = await fetch('https://your-app.up.railway.app/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Gateway-Token': 'my-app',
    'Authorization': 'Bearer sk-your-key'
  },
  body: JSON.stringify({
    model: 'gpt-4',
    messages: [{ role: 'user', content: 'What is RAG?' }]
  })
});

const data = await response.json();
console.log(data);

Semantic Caching: Matches similar questions, not just exact duplicates. "What is RAG?" = "Explain RAG" = "RAG definition"

Multi-Tenant: Each customer gets their own isolated cache

4-Tier Matching:

Exact match (100% identical)
Template match ("weather in London" = "weather in Paris")
Semantic match (similar meaning)
Word overlap (partial matches)

Redis + In-Memory Fallback: Works with or without Redis
Request Deduplication: 100 concurrent identical requests = 1 API call
Rate Limiting: Prevent abuse per tenant
Circuit Breaker: Automatically stops calling if provider is down
Cost Tracking: See how much you saved
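
To make the tiered matching concrete, here is a minimal sketch that tries the cheap tiers first and falls through to the more expensive ones. It is an illustration, not the gateway's actual code: `TieredCache`, the digit-masking template rule, and the helper names are all hypothetical, and the semantic tier is left out because it needs an embedding model.

```python
import re
from typing import Dict, List, Optional, Tuple


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivially different prompts compare equal."""
    return " ".join(prompt.lower().split())


def template_key(prompt: str) -> str:
    """Hypothetical template rule: mask digit runs so 'order 123' and 'order 456' share a key."""
    return re.sub(r"\d+", "<num>", normalize(prompt))


def jaccard(a: str, b: str) -> float:
    """Word-overlap score: size of the shared word set divided by the combined word set."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


class TieredCache:
    def __init__(self, jaccard_threshold: float = 0.75) -> None:
        self.jaccard_threshold = jaccard_threshold
        self.exact: Dict[str, str] = {}
        self.templates: Dict[str, str] = {}
        self.entries: List[Tuple[str, str]] = []

    def put(self, prompt: str, answer: str) -> None:
        self.exact[normalize(prompt)] = answer
        self.templates[template_key(prompt)] = answer
        self.entries.append((prompt, answer))

    def get(self, prompt: str) -> Optional[Tuple[str, str]]:
        """Return (tier_name, answer) for the first tier that matches, or None on a miss."""
        key = normalize(prompt)
        if key in self.exact:                      # Tier 1: exact match
            return "exact", self.exact[key]
        tkey = template_key(prompt)
        if tkey in self.templates:                 # Tier 2: template match
            return "template", self.templates[tkey]
        # Tier 3 (semantic match) would compare embeddings here; omitted in this sketch.
        best_score, best_answer = 0.0, None
        for cached_prompt, answer in self.entries:  # Tier 4: word overlap
            score = jaccard(prompt, cached_prompt)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.jaccard_threshold:
            return "overlap", best_answer
        return None
```

Ordering the tiers by cost means the common case (an identical prompt) is a single dictionary lookup, and the linear scan only runs when everything cheaper has missed.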

Scenario: Customer support chatbot with 10,000 users

Without AI Gateway:

10,000 users ask 100 common questions each
1,000,000 API calls/month
Cost: $500/month (at $0.0005/call)

With AI Gateway:

First time each of the 100 questions is asked: 100 API calls (cache miss)
Remaining 9,999 users asking the same questions: 0 API calls (cache hit)
Total: 100 API calls/month
Cost: $0.05/month
Savings: $499.95/month (99.99%)

Even with 30% unique questions:

300,000 API calls
Cost: $150/month
Savings: $350/month (70%)
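
The arithmetic in this scenario can be reproduced with a few lines of Python. The function name `estimate_savings` is made up for this sketch; the request counts and the $0.0005 per-call price are the ones used in the scenario.

```python
def estimate_savings(total_requests: int, upstream_calls: int, price_per_call: float) -> dict:
    """Compare the bill with no cache against the bill when only cache misses reach the provider."""
    cost_without = round(total_requests * price_per_call, 2)
    cost_with = round(upstream_calls * price_per_call, 2)
    savings = round(cost_without - cost_with, 2)
    return {
        "cost_without": cost_without,
        "cost_with": cost_with,
        "savings": savings,
        "savings_pct": round(100 * savings / cost_without, 2),
    }


# Best case from the scenario: only the 100 unique questions reach the provider.
best_case = estimate_savings(1_000_000, 100, 0.0005)

# More conservative case: 30% of requests are unique and miss the cache.
conservative = estimate_savings(1_000_000, 300_000, 0.0005)
```

Plugging in your own request volume and measured hit rate gives a more realistic figure than either example.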

Edit gateway.yaml to customize:

cache:
  redis_url: "redis://localhost:6379"  # Or your Redis URL
  vector:
    enabled: true
    similarity_threshold: 0.85  # 85% similar = cache hit
  ttl_hours: 24  # Cache entries expire after 24 hours

rate_limiter:
  enabled: true
  max_requests: 60  # Per minute per tenant
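
For intuition on what `max_requests: 60` per minute per tenant means, here is a minimal fixed-window limiter. The gateway's real algorithm is not documented here, so the fixed-window approach, the class name `FixedWindowLimiter`, and the injectable `clock` are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most max_requests per tenant in each window_seconds window."""

    def __init__(self, max_requests: int = 60, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        # tenant -> (window start time, requests counted in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, tenant: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(tenant, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the old window expired, start a fresh one
        if count >= self.max_requests:
            self._windows[tenant] = (start, count)
            return False
        self._windows[tenant] = (start, count + 1)
        return True
```

Each tenant has its own counter, so one noisy tenant exhausting its budget does not affect the others.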

Endpoint	Method	Description
`/v1/chat/completions`	POST	Main proxy endpoint with caching
`/health`	GET	Health check
`/stats`	GET	Cache statistics
`/metrics`	GET	Prometheus metrics

curl http://localhost:8080/stats

Response:

{
  "uptime": 1234567890,
  "cache": {
    "local_index_entries": 150,
    "vector_dimensions": 128,
    "vector_threshold": 0.85,
    "jaccard_threshold": 0.75,
    "template_enabled": true,
    "dedup_enabled": true,
    "ttl_hours": 24
  }
}
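
If you want to poll this endpoint from a script, a small helper along these lines may be enough. It assumes the response has the shape shown above; `fetch_stats` and `summarize_stats` are names invented for this sketch.

```python
import json
import urllib.request


def fetch_stats(base_url: str = "http://localhost:8080") -> dict:
    """GET /stats from the gateway and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/stats", timeout=5) as resp:
        return json.load(resp)


def summarize_stats(payload: dict) -> str:
    """Build a one-line summary from the 'cache' section of a /stats payload."""
    cache = payload.get("cache", {})
    return (
        f"{cache.get('local_index_entries', 0)} cached entries, "
        f"vector threshold {cache.get('vector_threshold')}, "
        f"word-overlap threshold {cache.get('jaccard_threshold')}, "
        f"TTL {cache.get('ttl_hours')}h"
    )
```

Calling `summarize_stats(fetch_stats())` against a running gateway prints the current thresholds alongside the entry count, which is handy when tuning `gateway.yaml`.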

Every response includes cache information:

X-Gateway-Cache: HIT          # or MISS
X-Gateway-Similarity: 0.95    # 95% similar (if HIT)
X-Gateway-Time-Saved: 1234ms  # Time saved (if HIT)
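
A client can read these headers to track its own hit rate. The sketch below parses the three headers as they are shown above; the `CacheInfo` dataclass and `parse_cache_headers` are hypothetical helpers, and the parser assumes the similarity and time-saved headers may be absent on a MISS.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CacheInfo:
    hit: bool
    similarity: Optional[float]
    time_saved_ms: Optional[int]


def parse_cache_headers(headers: Mapping[str, str]) -> CacheInfo:
    """Extract the X-Gateway-* cache headers, tolerating missing values and any header casing."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    hit = lowered.get("x-gateway-cache", "").upper() == "HIT"
    similarity = lowered.get("x-gateway-similarity")
    saved = lowered.get("x-gateway-time-saved")
    if saved and saved.endswith("ms"):
        saved = saved[:-2]  # drop the unit suffix, e.g. "1234ms" -> "1234"
    return CacheInfo(
        hit=hit,
        similarity=float(similarity) if similarity else None,
        time_saved_ms=int(saved) if saved else None,
    )
```

With the `requests` example earlier, this would be called as `parse_cache_headers(response.headers)`.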

Solution: Redis is optional! The gateway will fall back to in-memory cache automatically. For production, add Redis:

Railway: Add Redis from the "New" button
Render: Add Redis via "New" → "Database" → "Redis"
Docker: Already included in docker-compose.yml

Cause: You're hitting the provider's free-tier rate limits (Groq/OpenAI)

Solutions:

Wait 1-2 minutes and try again
Upgrade to a paid tier ($0.002/request vs free limits)
Add your own API key with higher limits
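
Rather than waiting and retrying by hand, a client can retry with exponential backoff. This is a generic sketch, not part of the gateway: it assumes a rate-limited request comes back with HTTP status 429, and `call_with_backoff` and its `send` callback are hypothetical names.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(send: Callable[[], Tuple[int, str]],
                      max_attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call send() until it stops returning 429, doubling the wait after each rate-limited attempt."""
    status, body = 0, ""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:  # assumption: 429 signals a rate limit
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body  # still rate limited after the last attempt
```

Here `send` would wrap the actual HTTP call and return `(status_code, response_text)`. For free-tier limits that reset once a minute, a larger `base_delay` is likely more useful than more attempts.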

Cause: Too many requests from one tenant

Solution: Increase rate limits in gateway.yaml:

rate_limiter:
  max_requests: 120  # Increase from 60
  window_minutes: 1

Cause: Prompts are too different

Solution: Lower the similarity threshold in gateway.yaml:

cache:
  vector:
    similarity_threshold: 0.75  # Lower from 0.85
  jaccard:
    threshold: 0.65  # Lower from 0.75
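
To see why lowering `similarity_threshold` produces more hits, consider how a vector threshold is typically applied. The sketch assumes the gateway compares embeddings with cosine similarity, which this document does not confirm; the toy two-dimensional vectors stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec: Sequence[float], cached_vec: Sequence[float],
                 threshold: float = 0.85) -> bool:
    """A cached entry counts as a hit when its similarity reaches the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


# Toy vectors whose cosine similarity is about 0.8.
query, cached = [1.0, 0.0], [0.8, 0.6]
strict = is_cache_hit(query, cached, threshold=0.85)   # miss at the default threshold
relaxed = is_cache_hit(query, cached, threshold=0.75)  # hit once the threshold is lowered
```

The trade-off is that a lower threshold also raises the chance of returning a cached answer to a question that only looks similar, so it is worth lowering it in small steps.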

Your App → AI Gateway → [Cache Check] → Redis
                ↓
            [Cache HIT] → Return cached response (instant, $0)
                ↓
            [Cache MISS] → Call LLM Provider → Cache response → Return
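
The flow in the diagram is the familiar cache-aside pattern. Below is a minimal in-memory sketch of it with exact-match keys and a TTL; `CachingProxy`, `call_provider`, and the injectable `clock` are hypothetical names, and the real gateway adds Redis and the similarity tiers on top of this idea.

```python
import time
from typing import Callable, Dict, Tuple


class CachingProxy:
    """Check the cache first; on a miss, call the provider and store the answer with an expiry."""

    def __init__(self, call_provider: Callable[[str], str],
                 ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.call_provider = call_provider
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # normalized prompt -> (expiry time, cached answer)
        self._store: Dict[str, Tuple[float, str]] = {}

    def handle(self, prompt: str) -> Tuple[str, str]:
        key = " ".join(prompt.lower().split())
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[0]:
            return "HIT", entry[1]             # served from cache, no provider call
        answer = self.call_provider(prompt)     # cache miss: pay for one upstream call
        self._store[key] = (now + self.ttl_seconds, answer)
        return "MISS", answer
```

The default TTL of 24 hours mirrors `ttl_hours: 24` from the configuration above; an expired entry is treated as a miss and overwritten by the fresh answer.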

Contributions are welcome! Please:

Fork the repo
Create a feature branch
Make your changes
Submit a pull request

MIT License - feel free to use this commercially!

Issues: GitHub Issues
Discussions: GitHub Discussions
Demo: Live Demo

If this project helps you, please give it a star! It helps others find it.

Built with ❤️ for the AI community

Questions? Open an issue and I'll respond within 24 hours.

source & further reading

github.com — original article
