{"slug": "show-hn-built-ai-gateway-reverse-proxy-to-reduce-llm-api-costs-and-token-burn", "title": "Show HN: Built AI-Gateway reverse proxy to reduce LLM API costs and token burn", "summary": "A developer released AI-Gateway, an open-source reverse proxy that uses semantic caching to reduce LLM API costs by 40-70% with no code changes. The tool sits between apps and providers like OpenAI and Groq, caching responses to similar questions so repeated queries don't incur API calls. It supports multi-tenant isolation, Redis caching, and features like rate limiting and circuit breakers.", "body_md": "**Cut your LLM API costs by 40-70% with zero code changes.**\n\nA semantic caching layer that sits between your app and AI providers (OpenAI, Groq, etc.). When you ask a similar question twice, it returns the cached answer instantly instead of calling the API again.\n\nYou're building an AI app and your API bill is $500/month. 40-70% of that is for **repeat questions**:\n\n- \"What is RAG?\" asked 100 times = 100 API calls\n- \"How do I reset my password?\" asked 50 times = 50 API calls\n\n**With AI Gateway:** Those 150 calls become 2 calls (one for each unique question). You save $200-350/month.\n\n**How was your deployment experience?**\n\n*Takes 30 seconds. Helps us improve AI Gateway for everyone.*\n\n**What we want to know:**\n\n- ⭐ How did deployment go? (Excellent / Average / Bad)\n- 🐛 Any problems you faced?\n- 💡 What features would you like to see?\n- 📊 How much are you saving on API costs?\n\n**Your feedback directly shapes the roadmap.**\n\n**Steps:**\n\n- Click the button above\n- Sign in with GitHub\n- Enter your API key (Groq or OpenAI)\n- Click \"Deploy\"\n- Done! 
Your gateway is live at\n`https://your-app.up.railway.app`\n\n**What you get:**\n\n- ✅ Hosted gateway (no server management)\n- ✅ Redis included (persistent cache)\n- ✅ Auto-scaling\n- ✅ HTTPS enabled\n- ✅ $5/month free credit\n\n**Steps:**\n\n- Click the button\n- Sign in with GitHub\n- Add the environment variable `UPSTREAM_API_KEY=your_key`\n- Click \"Create Web Service\"\n- Done!\n\n**Note:** You'll need to add a Redis addon separately in the Render dashboard.\n\n**Prerequisites:**\n\n- Docker installed\n- Docker Compose installed\n- A Groq or OpenAI API key\n\n**Steps:**\n\n```\n# 1. Clone the repo\ngit clone https://github.com/Arnab758/ai-gateway.git\ncd ai-gateway\n\n# 2. Set your API key\nexport UPSTREAM_API_KEY=gsk_your_groq_key_here\n\n# 3. Start everything (gateway + Redis)\ndocker compose up -d\n\n# 4. Verify it's running\ncurl http://localhost:8080/health\n\n# Expected response: {\"status\":\"ok\"}\n```\n\n**That's it!** Your gateway is now running at `http://localhost:8080`\n\n```\n# Send a request through the gateway\ncurl -X POST http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-Gateway-Token: my-app\" \\\n  -H \"Authorization: Bearer sk-your-openai-or-groq-key\" \\\n  -d '{\n    \"model\": \"gpt-4\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"What is RAG?\"}]\n  }'\n\n# Send the SAME request again\n# Response headers will show: X-Gateway-Cache: HIT\n# You just saved money! 
💰\n```\n\n```python\nimport requests\n\n# Your gateway URL (from Railway/Render/Docker)\nGATEWAY_URL = \"https://your-app.up.railway.app\"\nAPI_KEY = \"sk-your-key\"\n\nresponse = requests.post(\n    f\"{GATEWAY_URL}/v1/chat/completions\",\n    headers={\n        \"Content-Type\": \"application/json\",\n        \"X-Gateway-Token\": \"my-app\",\n        \"Authorization\": f\"Bearer {API_KEY}\"\n    },\n    json={\n        \"model\": \"gpt-4\",\n        \"messages\": [{\"role\": \"user\", \"content\": \"What is RAG?\"}]\n    }\n)\n\nprint(response.json())\n```\n\n```js\nconst response = await fetch('https://your-app.up.railway.app/v1/chat/completions', {\n  method: 'POST',\n  headers: {\n    'Content-Type': 'application/json',\n    'X-Gateway-Token': 'my-app',\n    'Authorization': 'Bearer sk-your-key'\n  },\n  body: JSON.stringify({\n    model: 'gpt-4',\n    messages: [{ role: 'user', content: 'What is RAG?' }]\n  })\n});\n\nconst data = await response.json();\nconsole.log(data);\n```\n\n**Semantic Caching**\n\n- Matches similar questions, not just exact duplicates\n- \"What is RAG?\" = \"Explain RAG\" = \"RAG definition\"\n\n**Multi-Tenant**\n\n- Each customer gets their own isolated cache\n\n**4-Tier Matching:**\n\n- Exact match (100% identical)\n- Template match (\"weather in London\" = \"weather in Paris\")\n- Semantic match (similar meaning)\n- Word overlap (partial matches)\n\n**Redis + In-Memory Fallback**\n\n- Works with or without Redis\n\n**Request Deduplication**\n\n- 100 concurrent identical requests = 1 API call\n\n**Rate Limiting**\n\n- Prevent abuse per tenant\n\n**Circuit Breaker**\n\n- Automatically stops calling the provider if it is down\n\n**Cost Tracking**\n\n- See how much you saved\n\n**Scenario:** Customer support chatbot with 10,000 users\n\n**Without AI Gateway:**\n\n- 10,000 users ask 100 common questions each\n- 1,000,000 API calls/month\n- Cost: $500/month (at $0.0005/call)\n\n**With AI Gateway:**\n\n- First 100 questions: 100 API calls (cache miss)\n- Remaining 999,900 requests for the same questions: 0 API calls (cache 
hit)\n- Total: 100 API calls/month\n- Cost: $0.05/month\n\n**Savings: $499.95/month (99.99%)**\n\n**Even with 30% unique questions:**\n\n- 300,000 API calls\n- Cost: $150/month\n\n**Savings: $350/month (70%)**\n\nEdit `gateway.yaml` to customize:\n\n```\ncache:\n  redis_url: \"redis://localhost:6379\"  # Or your Redis URL\n  vector:\n    enabled: true\n    similarity_threshold: 0.85  # 85% similar = cache hit\n  ttl_hours: 24  # Cache entries expire after 24 hours\n\nrate_limiter:\n  enabled: true\n  max_requests: 60  # Per minute per tenant\n```\n\n| Endpoint | Method | Description |\n|---|---|---|\n| `/v1/chat/completions` | POST | Main proxy endpoint with caching |\n| `/health` | GET | Health check |\n| `/stats` | GET | Cache statistics |\n| `/metrics` | GET | Prometheus metrics |\n\n```\ncurl http://localhost:8080/stats\n```\n\nResponse:\n\n```\n{\n  \"uptime\": 1234567890,\n  \"cache\": {\n    \"local_index_entries\": 150,\n    \"vector_dimensions\": 128,\n    \"vector_threshold\": 0.85,\n    \"jaccard_threshold\": 0.75,\n    \"template_enabled\": true,\n    \"dedup_enabled\": true,\n    \"ttl_hours\": 24\n  }\n}\n```\n\nEvery response includes cache information:\n\n```\nX-Gateway-Cache: HIT          # or MISS\nX-Gateway-Similarity: 0.95    # 95% similar (if HIT)\nX-Gateway-Time-Saved: 1234ms  # Time saved (if HIT)\n```\n\n**Solution:** Redis is optional! The gateway falls back to an in-memory cache automatically. 
For production, add Redis:\n\n- **Railway:** Add Redis from the \"New\" button\n- **Render:** Add Redis from \"New\" → \"Database\" → \"Redis\"\n- **Docker:** Already included in `docker-compose.yml`\n\n**Cause:** You're hitting rate limits on the free tier (Groq/OpenAI)\n\n**Solutions:**\n\n- Wait 1-2 minutes and try again\n- Upgrade to a paid tier ($0.002/request vs free limits)\n- Add your own API key with higher limits\n\n**Cause:** Too many requests from one tenant\n\n**Solution:** Increase rate limits in `gateway.yaml`:\n\n```\nrate_limiter:\n  max_requests: 120  # Increase from 60\n  window_minutes: 1\n```\n\n**Cause:** Prompts are too different\n\n**Solution:** Lower the similarity threshold in `gateway.yaml`:\n\n```\ncache:\n  vector:\n    similarity_threshold: 0.75  # Lower from 0.85\n  jaccard:\n    threshold: 0.65  # Lower from 0.75\n```\n\n```\nYour App → AI Gateway → [Cache Check] → Redis\n                ↓\n            [Cache HIT] → Return cached response (instant, $0)\n                ↓\n            [Cache MISS] → Call LLM Provider → Cache response → Return\n```\n\nContributions are welcome! Please:\n\n- Fork the repo\n- Create a feature branch\n- Make your changes\n- Submit a pull request\n\nMIT License - feel free to use this commercially!\n\n- **Issues:** [GitHub Issues](https://github.com/Arnab758/ai-gateway/issues)\n- **Discussions:** [GitHub Discussions](https://github.com/Arnab758/ai-gateway/discussions)\n- **Demo:** [Live Demo](https://ai-gateway-production-c86a.up.railway.app/demo)\n\nIf this project helps you, please give it a star! 
It helps others find it.\n\n**Built with ❤️ for the AI community**\n\n**Questions?** Open an issue and I'll respond within 24 hours.", "url": "https://wpnews.pro/news/show-hn-built-ai-gateway-reverse-proxy-to-reduce-llm-api-costs-and-token-burn", "canonical_source": "https://github.com/Arnab758/ai-gateway", "published_at": "2026-06-25 04:14:47+00:00", "updated_at": "2026-06-25 04:43:29.793806+00:00", "lang": "en", "topics": ["ai-tools", "ai-infrastructure", "large-language-models", "generative-ai"], "entities": ["OpenAI", "Groq", "Railway", "Render", "Redis", "Docker", "GitHub", "Arnab758"], "alternates": {"html": "https://wpnews.pro/news/show-hn-built-ai-gateway-reverse-proxy-to-reduce-llm-api-costs-and-token-burn", "markdown": "https://wpnews.pro/news/show-hn-built-ai-gateway-reverse-proxy-to-reduce-llm-api-costs-and-token-burn.md", "text": "https://wpnews.pro/news/show-hn-built-ai-gateway-reverse-proxy-to-reduce-llm-api-costs-and-token-burn.txt", "jsonld": "https://wpnews.pro/news/show-hn-built-ai-gateway-reverse-proxy-to-reduce-llm-api-costs-and-token-burn.jsonld"}}