Cut your LLM API costs by 40-70% with zero code changes.
A semantic caching layer that sits between your app and AI providers (OpenAI, Groq, etc.). When you ask a similar question twice, it returns the cached answer instantly instead of calling the API again.
You're building an AI app and your API bill is $500/month. 40-70% of that is for repeat questions:
- "What is RAG?" asked 100 times = 100 API calls
- "How do I reset my password?" asked 50 times = 50 API calls
With AI Gateway: Those 150 calls become 2 calls (one for each unique question). You save $200-350/month.
How was your deployment experience?
Takes 30 seconds. Helps us improve AI Gateway for everyone.
What we want to know:
- How did deployment go? (Excellent / Average / Bad)
- Any problems you faced?
- What features would you like to see?
- How much are you saving on API costs?
Your feedback directly shapes the roadmap.
Steps:
- Click the button above
- Sign in with GitHub
- Enter your API key (Groq or OpenAI)
- Click "Deploy"
- Done! Your gateway is live at https://your-app.up.railway.app
What you get:
- Hosted gateway (no server management)
- Redis included (persistent cache)
- Auto-scaling
- HTTPS enabled
- $5/month free credit
Steps:
- Click the button
- Sign in with GitHub
- Add environment variable: UPSTREAM_API_KEY=your_key
- Click "Create Web Service"
- Done!
Note: Render does not include Redis with the web service. Add a Redis instance separately in the Render dashboard.
Prerequisites:
- Docker installed
- Docker Compose installed
- A Groq or OpenAI API key
Steps:
git clone https://github.com/Arnab758/ai-gateway.git
cd ai-gateway
export UPSTREAM_API_KEY=gsk_your_groq_key_here
docker compose up -d
curl http://localhost:8080/health
That's it! Your gateway is now running at http://localhost:8080
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-Gateway-Token: my-app" \
-H "Authorization: Bearer sk-your-openai-or-groq-key" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "What is RAG?"}]
}'
Python:
import requests

GATEWAY_URL = "https://your-app.up.railway.app"
API_KEY = "sk-your-key"

response = requests.post(
    f"{GATEWAY_URL}/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "X-Gateway-Token": "my-app",
        "Authorization": f"Bearer {API_KEY}",
    },
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "What is RAG?"}],
    },
)
print(response.json())
JavaScript:
const response = await fetch('https://your-app.up.railway.app/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Gateway-Token': 'my-app',
    'Authorization': 'Bearer sk-your-key'
  },
  body: JSON.stringify({
    model: 'gpt-4',
    messages: [{ role: 'user', content: 'What is RAG?' }]
  })
});
const data = await response.json();
console.log(data);
- **Semantic Caching**: Matches similar questions, not just exact duplicates ("What is RAG?" = "Explain RAG" = "RAG definition")
- **Multi-Tenant**: Each customer gets their own isolated cache
- **4-Tier Matching**:
  - Exact match (100% identical)
  - Template match ("weather in London" = "weather in Paris")
  - Semantic match (similar meaning)
  - Word overlap (partial matches)
- **Redis + In-Memory Fallback**: Works with or without Redis
- **Request Deduplication**: 100 concurrent identical requests = 1 API call
- **Rate Limiting**: Prevent abuse per tenant
- **Circuit Breaker**: Automatically stops calling if provider is down
- **Cost Tracking**: See how much you saved
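To make the tier ordering concrete, here is a minimal Python sketch of the two cheapest tiers (exact, then template). It is an illustration only, not the gateway's actual code: the `WEATHER` pattern, `template_of`, and `first_matching_tier` are hypothetical names, and the semantic and word-overlap tiers are left out.

```python
import re
from typing import Optional, Sequence

# Hypothetical template: the city is treated as a slot (assumption for illustration).
WEATHER = re.compile(r"^weather in (?P<city>[A-Za-z ]+)$", re.IGNORECASE)


def template_of(prompt: str) -> Optional[str]:
    """Return a slot-free template string, or None if no template applies."""
    if WEATHER.match(prompt.strip()):
        return "weather in {city}"
    return None


def first_matching_tier(prompt: str, cached: Sequence[str]) -> Optional[str]:
    """Check the cheap tiers in order and report which one matched."""
    # Tier 1: exact match (100% identical string).
    if prompt in cached:
        return "exact"
    # Tier 2: template match (same shape, different slot value).
    wanted = template_of(prompt)
    if wanted is not None and any(template_of(c) == wanted for c in cached):
        return "template"
    # The semantic and word-overlap tiers would run here.
    return None


cached_prompts = ["What is RAG?", "weather in London"]
print(first_matching_tier("What is RAG?", cached_prompts))      # exact
print(first_matching_tier("weather in Paris", cached_prompts))  # template
print(first_matching_tier("Explain RAG", cached_prompts))       # None
```

Running the cheap checks first means the more expensive similarity comparison is only needed when nothing simpler matches.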
Scenario: Customer support chatbot with 10,000 users
Without AI Gateway:
- 10,000 users ask 100 common questions each
- 1,000,000 API calls/month
- Cost: $500/month (at $0.0005/call)
With AI Gateway:
- First 100 questions: 100 API calls (cache miss)
- Remaining 999,900 requests for the same questions: 0 API calls (cache hit)
- Total: 100 API calls/month
- Cost: $0.05/month
- Savings: $499.95/month (99.99%)
Even with 30% unique questions:
- 300,000 API calls
- Cost: $150/month
- Savings: $350/month (70%)
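The arithmetic in this scenario can be reproduced with a few lines of Python. The function name `monthly_cost` is made up for this sketch; the request volume and per-call price are the figures from the scenario above.

```python
def monthly_cost(total_requests: int, unique_fraction: float, price_per_call: float) -> float:
    """Cost when only the unique fraction of requests reaches the provider."""
    api_calls = round(total_requests * unique_fraction)
    return api_calls * price_per_call


TOTAL_REQUESTS = 10_000 * 100   # 10,000 users x 100 questions each
PRICE_PER_CALL = 0.0005         # dollars per call, from the scenario

baseline = monthly_cost(TOTAL_REQUESTS, 1.0, PRICE_PER_CALL)     # no caching
with_cache = monthly_cost(TOTAL_REQUESTS, 0.30, PRICE_PER_CALL)  # 30% unique
savings = baseline - with_cache

print(f"baseline ${baseline:.2f}, with cache ${with_cache:.2f}, "
      f"saved ${savings:.2f} ({savings / baseline:.0%})")
```

Swap in your own request volume, unique fraction, and price to estimate savings for your workload.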
Edit gateway.yaml to customize:
cache:
  redis_url: "redis://localhost:6379"  # Or your Redis URL
  vector:
    enabled: true
    similarity_threshold: 0.85  # 85% similar = cache hit
  ttl_hours: 24  # Cache entries expire after 24 hours
rate_limiter:
  enabled: true
  max_requests: 60  # Per minute per tenant
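As a rough illustration of how `similarity_threshold` and `ttl_hours` interact, the sketch below treats a cached entry as a hit only if it is both fresh and similar enough. It assumes cosine similarity over embedding vectors, which this README does not confirm; `cosine_similarity` and `is_cache_hit` are hypothetical names.

```python
import math
import time

SIMILARITY_THRESHOLD = 0.85  # mirrors similarity_threshold in the config
TTL_SECONDS = 24 * 3600      # mirrors ttl_hours in the config


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec, cached_vec, stored_at, now=None):
    """A hit requires an unexpired entry whose similarity meets the threshold."""
    now = time.time() if now is None else now
    if now - stored_at > TTL_SECONDS:
        return False  # entry is older than the TTL
    return cosine_similarity(query_vec, cached_vec) >= SIMILARITY_THRESHOLD


# Toy 2-dimensional vectors stand in for real embeddings.
print(is_cache_hit([1.0, 0.0], [0.9, 0.1], stored_at=0, now=60))          # True
print(is_cache_hit([1.0, 0.0], [1.0, 1.0], stored_at=0, now=60))          # False
print(is_cache_hit([1.0, 0.0], [0.9, 0.1], stored_at=0, now=25 * 3600))   # False
```

A higher threshold reduces the chance of returning an answer to a different question, at the cost of fewer hits.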
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Main proxy endpoint with caching |
| /health | GET | Health check |
| /stats | GET | Cache statistics |
| /metrics | GET | Prometheus metrics |
curl http://localhost:8080/stats
Response:
{
  "uptime": 1234567890,
  "cache": {
    "local_index_entries": 150,
    "vector_dimensions": 128,
    "vector_threshold": 0.85,
    "jaccard_threshold": 0.75,
    "template_enabled": true,
    "dedup_enabled": true,
    "ttl_hours": 24
  }
}
Every response includes cache information:
X-Gateway-Cache: HIT # or MISS
X-Gateway-Similarity: 0.95 # 95% similar (if HIT)
X-Gateway-Time-Saved: 1234ms # Time saved (if HIT)
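A small helper can turn these headers into values your app can log or chart. This is a sketch: `parse_cache_headers` is a hypothetical name, and it assumes the header values look exactly like the examples above (a decimal similarity and a millisecond count ending in `ms`).

```python
def parse_cache_headers(headers):
    """Extract cache details from a mapping of response headers."""
    status = headers.get("X-Gateway-Cache", "MISS").upper()
    info = {"hit": status == "HIT", "similarity": None, "time_saved_ms": None}
    if info["hit"]:
        similarity = headers.get("X-Gateway-Similarity")
        saved = headers.get("X-Gateway-Time-Saved")
        if similarity is not None:
            info["similarity"] = float(similarity)
        if saved is not None:
            # Drop the trailing "ms" unit before converting, e.g. "1234ms" -> 1234.
            digits = saved[:-2] if saved.endswith("ms") else saved
            info["time_saved_ms"] = int(digits)
    return info


example = {
    "X-Gateway-Cache": "HIT",
    "X-Gateway-Similarity": "0.95",
    "X-Gateway-Time-Saved": "1234ms",
}
print(parse_cache_headers(example))
```

With the `requests` library you can pass `response.headers` directly, since it behaves like a case-insensitive dictionary.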
Solution: Redis is optional! The gateway will fall back to in-memory cache automatically. For production, add Redis:
Railway: Add Redis from the "New" button
Render: Add Redis from "New" → "Database" → "Redis"
Docker: Already included in docker-compose.yml
Cause: You're hitting rate limits on free tier (Groq/OpenAI)
Solutions:
- Wait 1-2 minutes and try again
- Upgrade to paid tier ($0.002/request vs free limits)
- Add your own API key with higher limits
Cause: Too many requests from one tenant
Solution: Increase rate limits in gateway.yaml:
rate_limiter:
  max_requests: 120  # Increase from 60
  window_minutes: 1
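To show what `max_requests` and `window_minutes` mean in practice, here is a minimal per-tenant fixed-window limiter in Python. The fixed-window strategy and the `FixedWindowLimiter` class are assumptions for illustration; the gateway's actual algorithm may differ.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most max_requests per tenant in each fixed time window."""

    def __init__(self, max_requests=60, window_minutes=1):
        self.max_requests = max_requests
        self.window_seconds = window_minutes * 60
        # (tenant, window index) -> number of requests seen in that window
        self._counts = defaultdict(int)

    def allow(self, tenant, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window_seconds)
        key = (tenant, window)
        if self._counts[key] >= self.max_requests:
            return False  # this tenant has used up the current window
        self._counts[key] += 1
        return True


limiter = FixedWindowLimiter(max_requests=2, window_minutes=1)
print([limiter.allow("my-app", now=0) for _ in range(3)])  # [True, True, False]
print(limiter.allow("other-app", now=0))                   # True (separate tenant)
print(limiter.allow("my-app", now=60))                     # True (new window)
```

This sketch never evicts old windows, so a real implementation would also need to expire stale counters.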
Cause: Prompts are too different
Solution: Lower the similarity threshold in gateway.yaml:
cache:
  vector:
    similarity_threshold: 0.75  # Lower from 0.85
  jaccard:
    threshold: 0.65  # Lower from 0.75
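The effect of lowering the Jaccard threshold can be seen with a small word-overlap calculation. The tokenization here (lowercased word characters) is an assumption; the gateway may split prompts differently.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: shared words divided by all distinct words."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


score = jaccard("How do I reset my password?", "How can I reset my password?")
print(round(score, 3))  # 0.714 (5 shared words out of 7 distinct words)
print(score >= 0.75)    # False: a miss at the default threshold
print(score >= 0.65)    # True: a hit after lowering the threshold
```

Lower thresholds raise the hit rate but also the risk of serving an answer to a question that only looks similar, so change them gradually and watch the responses.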
Your App → AI Gateway → [Cache Check] → Redis
                ↓
        [Cache HIT] → Return cached response (instant, $0)
                ↓
        [Cache MISS] → Call LLM Provider → Cache response → Return
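The flow in the diagram can be sketched in Python as a check-then-call wrapper. This is a simplified illustration, not the gateway's implementation: a plain dictionary stands in for Redis, only exact matches count as hits, and `CachingProxy` and `fake_provider` are hypothetical names.

```python
from typing import Callable, Dict, Tuple


class CachingProxy:
    """Return cached answers when available, otherwise call the provider once."""

    def __init__(self, call_provider: Callable[[str], str]):
        self._call_provider = call_provider
        self._cache: Dict[str, str] = {}  # stands in for Redis
        self.provider_calls = 0

    def complete(self, prompt: str) -> Tuple[str, str]:
        # Cache HIT: return the stored response without calling the provider.
        if prompt in self._cache:
            return self._cache[prompt], "HIT"
        # Cache MISS: call the provider, store the response, then return it.
        self.provider_calls += 1
        answer = self._call_provider(prompt)
        self._cache[prompt] = answer
        return answer, "MISS"


def fake_provider(prompt: str) -> str:
    """Stub that stands in for a real LLM API call."""
    return f"answer to: {prompt}"


proxy = CachingProxy(fake_provider)
print(proxy.complete("What is RAG?")[1])  # MISS
print(proxy.complete("What is RAG?")[1])  # HIT
print(proxy.provider_calls)               # 1
```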
Contributions are welcome! Please:
- Fork the repo
- Create a feature branch
- Make your changes
- Submit a pull request
MIT License - feel free to use this commercially!
- **Issues:** GitHub Issues
- **Discussions:** GitHub Discussions
- **Demo:** Live Demo
If this project helps you, please give it a star! It helps others find it.
Built with ❤️ for the AI community
Questions? Open an issue and I'll respond within 24 hours.