Show HN: Built AI-Gateway reverse proxy to reduce LLM API costs and token burn

A developer released AI-Gateway, an open-source reverse proxy that uses semantic caching to reduce LLM API costs by 40-70% with no code changes. The tool sits between apps and providers like OpenAI and Groq, caching responses to similar questions so repeated queries don't incur API calls. It supports multi-tenant isolation, Redis caching, and features like rate limiting and circuit breakers.

Cut your LLM API costs by 40-70% with zero code changes.

A semantic caching layer that sits between your app and AI providers (OpenAI, Groq, etc.). When you ask a similar question twice, it returns the cached answer instantly instead of calling the API again.

You're building an AI app and your API bill is $500/month. 40-70% of that is for repeat questions:

- "What is RAG?" asked 100 times = 100 API calls
- "How do I reset my password?" asked 50 times = 50 API calls

With AI Gateway: those 150 calls become 2 calls (one for each unique question). You save $200-350/month.

How was your deployment experience? Takes 30 seconds. Helps us improve AI Gateway for everyone.

What we want to know:

- ⭐ How did deployment go? (Excellent / Average / Bad)
- 🐛 Any problems you faced?
- 💡 What features would you like to see?
- 📊 How much are you saving on API costs?

Your feedback directly shapes the roadmap.

Steps (Railway):

- Click the button above
- Sign in with GitHub
- Enter your API key (Groq or OpenAI)
- Click "Deploy"
- Done

Your gateway is live at https://your-app.up.railway.app

What you get:

- ✅ Hosted gateway (no server management)
- ✅ Redis included (persistent cache)
- ✅ Auto-scaling
- ✅ HTTPS enabled
- ✅ $5/month free credit

Steps (Render):

- Click the button
- Sign in with GitHub
- Add environment variable: `UPSTREAM_API_KEY=your_key`
- Click "Create Web Service"
- Done

Note: You'll need to add a Redis addon separately in the Render dashboard.

Prerequisites (Docker):

- Docker installed
- Docker Compose installed
- A Groq or OpenAI API key

Steps:

```bash
# 1. Clone the repo
git clone https://github.com/Arnab758/ai-gateway.git
cd ai-gateway

# 2. Set your API key
export UPSTREAM_API_KEY=gsk_your_groq_key_here

# 3. Start everything (gateway + Redis)
docker compose up -d

# 4. Verify it's running
curl http://localhost:8080/health
# Expected response: {"status":"ok"}
```

That's it. Your gateway is now running at http://localhost:8080

```bash
# Send a request through the gateway
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Gateway-Token: my-app" \
  -H "Authorization: Bearer sk-your-openai-or-groq-key" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "What is RAG?"}]
  }'

# Send the SAME request again (add -i to curl to print response headers)
# Response headers will show: X-Gateway-Cache: HIT
# You just saved money 💰
```

```python
import requests

# Your gateway URL (from Railway/Render/Docker)
GATEWAY_URL = "https://your-app.up.railway.app"
API_KEY = "sk-your-key"

response = requests.post(
    f"{GATEWAY_URL}/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "X-Gateway-Token": "my-app",
        "Authorization": f"Bearer {API_KEY}",
    },
    json={
        "model": "gpt-4",
        # "messages" must be a list of message objects
        "messages": [{"role": "user", "content": "What is RAG?"}],
    },
)
print(response.json())
```

```js
const response = await fetch('https://your-app.up.railway.app/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Gateway-Token': 'my-app',
    'Authorization': 'Bearer sk-your-key'
  },
  body: JSON.stringify({
    model: 'gpt-4',
    // "messages" must be an array of message objects
    messages: [{ role: 'user', content: 'What is RAG?' }]
  })
});
const data = await response.json();
console.log(data);
```

- Semantic Caching - Matches similar questions, not just exact duplicates: "What is RAG?" = "Explain RAG" = "RAG definition"
- Multi-Tenant - Each customer gets their own isolated cache
- 4-Tier Matching:
  - Exact match (100% identical)
  - Template match ("weather in London" = "weather in Paris")
  - Semantic match (similar meaning)
  - Word overlap (partial matches)
- Redis + In-Memory Fallback - Works with or without Redis
- Request Deduplication - 100 concurrent identical requests = 1 API call
- Rate Limiting - Prevent abuse per tenant
- Circuit Breaker - Automatically stops calling the provider if it is down
- Cost Tracking - See how much you saved

Scenario: Customer support chatbot with 10,000 users

Without AI Gateway:

- 10,000 users ask 100 common questions each
- 1,000,000 API calls/month
- Cost: $500/month (at $0.0005/call)

With AI Gateway:

- First time each of the 100 questions is asked: 100 API calls (cache miss)
- Remaining 999,900 requests for the same questions: 0 API calls (cache hit)
- Total: 100 API calls/month
- Cost: $0.05/month

Savings: $499.95/month (99.99%)

Even with 30% unique questions:

- 300,000 API calls
- Cost: $150/month

Savings: $350/month (70%)

Edit gateway.yaml to customize:

```yaml
cache:
  redis_url: "redis://localhost:6379"  # Or your Redis URL
  vector:
    enabled: true
    similarity_threshold: 0.85  # 85% similar = cache hit
  ttl_hours: 24  # Cache entries expire after 24 hours

rate_limiter:
  enabled: true
  max_requests: 60  # Per minute per tenant
```

| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Main proxy endpoint with caching |
| /health | GET | Health check |
| /stats | GET | Cache statistics |
| /metrics | GET | Prometheus metrics |

```bash
curl http://localhost:8080/stats
```

Response:

```json
{
  "uptime": 1234567890,
  "cache": {
    "local_index_entries": 150,
    "vector_dimensions": 128,
    "vector_threshold": 0.85,
    "jaccard_threshold": 0.75,
    "template_enabled": true,
    "dedup_enabled": true,
    "ttl_hours": 24
  }
}
```

Every response includes cache information:

```
X-Gateway-Cache: HIT or MISS
X-Gateway-Similarity: 0.95       # 95% similar (if HIT)
X-Gateway-Time-Saved: 1234ms     # Time saved (if HIT)
```

Solution: Redis is optional. The gateway will fall back to
in-memory cache automatically. For production, add Redis:

- Railway: Add Redis from the "New" button
- Render: Add Redis from "New" → "Database" → "Redis"
- Docker: Already included in docker-compose.yml

Cause: You're hitting rate limits on the free tier of Groq/OpenAI.

Solutions:

- Wait 1-2 minutes and try again
- Upgrade to a paid tier ($0.002/request vs free limits)
- Add your own API key with higher limits

Cause: Too many requests from one tenant.

Solution: Increase the rate limits in gateway.yaml:

```yaml
rate_limiter:
  max_requests: 120  # Increase from 60
  window_minutes: 1
```

Cause: Prompts are too different.

Solution: Lower the similarity thresholds in gateway.yaml:

```yaml
cache:
  vector:
    similarity_threshold: 0.75  # Lower from 0.85
  jaccard:
    threshold: 0.65  # Lower from 0.75
```

```
Your App → AI Gateway → Cache Check → Redis
                            ↓
          Cache HIT  → Return cached response (instant, $0)
                            ↓
          Cache MISS → Call LLM Provider → Cache response → Return
```

Contributions are welcome. Please:

- Fork the repo
- Create a feature branch
- Make your changes
- Submit a pull request

MIT License - feel free to use this commercially.

- Issues: [GitHub Issues](https://github.com/Arnab758/ai-gateway/issues)
- Discussions: [GitHub Discussions](https://github.com/Arnab758/ai-gateway/discussions)
- Demo: [Live Demo](https://ai-gateway-production-c86a.up.railway.app/demo)

If this project helps you, please give it a star. It helps others find it.

Built with ❤️ for the AI community

Questions? Open an issue and I'll respond within 24 hours.
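Appendix: to make the tiered matching idea more concrete, here is a minimal Python sketch of two of the tiers mentioned in the feature list: exact match, then word overlap scored with Jaccard similarity. This is not the gateway's actual code, which this README does not show. The names `TieredCache`, `normalize`, `tokens`, and `jaccard` are hypothetical, and the 0.75 default simply mirrors the Jaccard threshold used elsewhere in this README.

```python
import re
from typing import Optional, Tuple


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivially different prompts compare equal."""
    return " ".join(prompt.lower().split())


def tokens(prompt: str) -> set:
    """Split a prompt into a set of lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", prompt.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class TieredCache:
    """Hypothetical two-tier cache: exact match first, then word overlap."""

    def __init__(self, jaccard_threshold: float = 0.75) -> None:
        self.jaccard_threshold = jaccard_threshold
        self.entries = {}  # normalized prompt -> cached response

    def put(self, prompt: str, response: str) -> None:
        self.entries[normalize(prompt)] = response

    def get(self, prompt: str) -> Optional[Tuple[str, str, float]]:
        """Return (tier, response, similarity) on a hit, or None on a miss."""
        key = normalize(prompt)
        if key in self.entries:
            return ("exact", self.entries[key], 1.0)

        # Fall through to the word-overlap tier: pick the best-scoring entry.
        query = tokens(prompt)
        best_key, best_score = None, 0.0
        for cached_key in self.entries:
            score = jaccard(query, tokens(cached_key))
            if score > best_score:
                best_key, best_score = cached_key, score
        if best_key is not None and best_score >= self.jaccard_threshold:
            return ("word_overlap", self.entries[best_key], best_score)
        return None


cache = TieredCache()
cache.put("How do I reset my password?", "Use the 'Forgot password' link.")
print(cache.get("how do I reset my password?"))      # exact tier (after normalization)
print(cache.get("How do I reset my password now?"))  # word-overlap tier
print(cache.get("What is RAG?"))                     # miss
```

Lowering `jaccard_threshold` in this sketch trades precision for hit rate, which is the same trade-off the troubleshooting advice about lowering thresholds describes: more prompts match, but a cached answer is more likely to be returned for a question it does not quite fit.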