I'm a pre-final year student. I built Socra (https://socra-production.up.railway.app/), a multi-agent LLM SaaS that interrogates your startup idea using 5 specialist AI personas before generating an architecture masterplan. It has paying users. It runs on Railway. And for the first two weeks of production, it was quietly broken in a way I didn't notice until real users hit it.
This is the story of how I built the 3-provider fallback chain (Anthropic → Google → Groq), what broke along the way, and the actual code that runs in production today.
When I first deployed Socra, the LLM routing was simple: one provider, one model, one API key. It worked fine in development.
Then real users started using it.
Groq's free tier is 6,000 tokens per minute. A single Socra masterplan pipeline (5 specialist agents running in parallel, each with ~1,500 input tokens plus ~400 output tokens) consumes roughly 9,500 tokens in one burst. The result: 3 out of 5 agents were returning Error code: 429 on every session with any real traffic.
The app was showing agent cards to users. Some said "Error" in amber text. I thought it was a race condition. It wasn't. It was me naively assuming one free-tier API could handle a multi-agent pipeline.
The fix wasn't to optimize. It was to add redundancy.
The final production routing order:
1. Anthropic Claude Haiku: used if ANTHROPIC_API_KEY is set
2. Google Gemini 2.0 Flash: used if GOOGLE_API_KEY is set (production default)
3. Groq LLaMA 3.1 8B: used if GROQ_API_KEY is set (fallback)
4. Stub mode: demo scenarios, no API key needed
Why this order? Cost and rate limits, not model quality:
| Provider | Model | Input $/MTok | Output $/MTok | Free tier TPM |
|---|---|---|---|---|
| Anthropic | claude-haiku-4-5 | $0.80 | $4.00 | None |
| Google | gemini-2.0-flash | $0.075 | $0.30 | 1,000,000 |
| Groq | llama-3.1-8b-instant | $0.06 | $0.06 | 6,000 |
Google's free tier offers roughly 167× the headroom of Groq's (1,000,000 vs. 6,000 TPM) for a pipeline that fires 5 LLM calls simultaneously. For a student-built SaaS where LLM cost needs to be near zero while testing, that's not a small difference: it's the difference between the app working and not working.
Every LLM call in the system goes through one of two entrypoints: _call_llm (non-streaming, for structured JSON) and _stream_llm_tokens (streaming, for conversation text). Both use the same routing logic:
async def _call_llm(system: str, messages: list[dict], max_tokens: int, json_mode: bool = False) -> str:
    if settings.anthropic_api_key:
        return await _call_anthropic(system, messages, max_tokens, json_mode)
    elif settings.google_api_key:
        return await _call_google(system, messages, max_tokens, json_mode)
    elif settings.groq_api_key:
        return await _call_groq(system, messages, max_tokens, json_mode)
    else:
        return _stub_response(messages)
Dead simple. The routing is just: which key is set? The first match wins.
Google AI Studio exposes an OpenAI-compatible endpoint. This means you don't need the Google SDK. Just point the OpenAI SDK at a different base URL:
async def _call_google(system: str, messages: list[dict], max_tokens: int, json_mode: bool = False) -> str:
    from openai import AsyncOpenAI
    client = AsyncOpenAI(
        api_key=settings.google_api_key,
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    )
    kwargs = {
        "model": "gemini-2.0-flash",
        "max_tokens": max_tokens,
        "messages": [{"role": "system", "content": system}, *messages],
    }
    if json_mode:
        kwargs["response_format"] = {"type": "json_object"}
    response = await client.chat.completions.create(**kwargs)
    return response.choices[0].message.content or ""
The same pattern works for streaming: pass stream=True and iterate with async for chunk in stream.
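A minimal sketch of that streaming variant, assuming a client object that exposes the same chat.completions.create interface as AsyncOpenAI. The name stream_tokens and its parameters are hypothetical and are not Socra's actual _stream_llm_tokens:

```python
from typing import Any, AsyncIterator


async def stream_tokens(
    client: Any,  # assumed: e.g. an openai.AsyncOpenAI pointed at the provider's base_url
    model: str,
    system: str,
    messages: list[dict],
    max_tokens: int,
) -> AsyncIterator[str]:
    """Yield text fragments as the provider streams them back."""
    stream = await client.chat.completions.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "system", "content": system}, *messages],
        stream=True,  # ask for incremental chunks instead of one response
    )
    async for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

Passing the client in as a parameter keeps the function provider-agnostic and makes it easy to exercise with a fake client in tests.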
This is a pattern worth knowing: Groq, Azure OpenAI, and Google AI Studio all support the OpenAI-compatible endpoint format. If you write against the OpenAI SDK with configurable base_url and api_key, you get multi-provider support with almost no extra code.
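One way to express the configurable base_url idea is a small registry of endpoints. This is a sketch, not Socra's code: ProviderConfig, PROVIDERS, and client_kwargs are hypothetical names. The Google URL is the one used in _call_google, and the Groq URL is Groq's OpenAI-compatible endpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderConfig:
    base_url: str
    model: str


# Hypothetical registry: each entry is an OpenAI-compatible endpoint.
PROVIDERS = {
    "google": ProviderConfig(
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        model="gemini-2.0-flash",
    ),
    "groq": ProviderConfig(
        base_url="https://api.groq.com/openai/v1",
        model="llama-3.1-8b-instant",
    ),
}


def client_kwargs(provider: str, api_key: str) -> dict:
    """Build the keyword arguments for openai.AsyncOpenAI(...)."""
    cfg = PROVIDERS[provider]
    return {"api_key": api_key, "base_url": cfg.base_url}
```

With this in place, switching providers would be a matter of calling AsyncOpenAI(**client_kwargs("groq", key)) and passing PROVIDERS["groq"].model as the model name.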
Here's where it got messy. After the multi-agent pipeline runs and generates a masterplan, Socra needs structured JSON back from the LLM: eval scores, assumption tracking, quick reply choices. The original approach was a separator in the stream:
Stream: "Here are my questions... ###JSON###{"eval_delta": {...}, "choices": [...]}"
This worked fine with Anthropic (Claude follows formatting instructions reliably). It broke completely with smaller models.
The 8B Groq model would occasionally include the separator, occasionally not, occasionally put it in the middle of a sentence. Parsing failed silently and choices came back empty, so users saw no quick reply options after the first message.
The fix: two separate calls.
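# Call 1: stream the conversational text to the user
full_message = ""
async for token in _stream_llm_tokens(system, messages):
    yield token
    full_message += token

# Call 2: separate non-streaming request for the structured JSON
eval_data = await _call_llm(
    system=eval_system_prompt,
    messages=messages + [{"role": "assistant", "content": full_message}],
    max_tokens=max_tokens,  # required by _call_llm; assumed to be supplied by the caller
    json_mode=True,
)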
The Anthropic path still uses the separator (it's reliable there and saves one API call). The Groq and Google paths use two calls. A bit more latency, zero parsing failures.
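For the separator path, a defensive parser can at least make the failure visible instead of silently returning empty choices. The sketch below assumes the ###JSON### separator shown earlier; split_stream_output is a hypothetical helper, not Socra's actual parser.

```python
import json
from typing import Optional

SEPARATOR = "###JSON###"


def split_stream_output(full_text: str) -> tuple[str, Optional[dict]]:
    """Split streamed output into (visible message, parsed JSON or None)."""
    message, found, tail = full_text.partition(SEPARATOR)
    if not found:
        # Separator missing: the caller can fall back to a second JSON call.
        return full_text.strip(), None
    try:
        data = json.loads(tail)
    except json.JSONDecodeError:
        # Separator present but the payload is not valid JSON.
        return message.strip(), None
    return message.strip(), data if isinstance(data, dict) else None
```

Returning None explicitly lets the caller decide what to do (for example, issue the separate JSON call) rather than treating a parse failure as "no choices".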
This one cost me 45 minutes.
After deploying to Railway, every LLM call was failing with Illegal header value. The API key was correct: I'd copied it straight from the Groq console. Except I hadn't. I'd pasted it into Railway's Variables tab and there was an invisible \n at the end.
The fix was two things: removing the stray newline from the variable in Railway, and adding .strip() defensively in config.py:
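# Assumes the pydantic v1 API; in v2, BaseSettings lives in the pydantic-settings package
from pydantic import BaseSettings, validator

class Settings(BaseSettings):
    groq_api_key: str = ""
    anthropic_api_key: str = ""
    google_api_key: str = ""

    # pre=True runs before type validation, so the raw env value is cleaned first
    @validator('groq_api_key', 'anthropic_api_key', 'google_api_key', pre=True)
    def strip_keys(cls, v):
        return v.strip() if v else v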
Now the app is defensive against copy-paste mistakes. The .strip() costs nothing and prevents a class of errors that are genuinely hard to debug.
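To illustrate what the stripping protects against, here is a stdlib-only sketch. Both clean_api_key and auth_header are hypothetical helpers, not part of Socra's config. The second one mimics the kind of check that makes an HTTP client reject a header value containing a line break.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def clean_api_key(name: str, raw: Optional[str]) -> str:
    """Return the key without surrounding whitespace, warning if any was removed."""
    if not raw:
        return ""
    cleaned = raw.strip()
    if cleaned != raw:
        # Log the variable name only, never the key itself.
        logger.warning("%s had leading/trailing whitespace; stripped it", name)
    return cleaned


def auth_header(api_key: str) -> dict:
    """Build a bearer header, refusing values that contain a line break."""
    if "\n" in api_key or "\r" in api_key:
        raise ValueError("API key contains a line break")
    return {"Authorization": f"Bearer {api_key}"}
```

The warning is optional, but it turns an invisible character into a visible log line the next time a key is pasted with trailing whitespace.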
After adding Google as the second provider, I pushed to Railway and checked the logs. They said:
Using Groq LLaMA for LLM calls
But I'd set GOOGLE_API_KEY. For two days I thought Google wasn't working. It was. The startup log was wrong.
The main.py lifespan check had a bug:
if settings.anthropic_api_key:
    logger.info("Using Anthropic Claude")
elif settings.groq_api_key:  # ← checked Groq before Google
    logger.info("Using Groq LLaMA")
The actual routing in _call_llm was correct (Google checked second, before Groq). But the log check had a different order, so if Groq was also set (it was), it logged "Using Groq" even though every actual call was going to Google.
Fix: mirror the routing logic exactly in the startup log.
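One way to keep the two from drifting apart is to derive both the routing decision and the log message from a single ordered list. This is a sketch under that assumption. PROVIDER_ORDER, active_provider, and startup_message are hypothetical names, and the keys dict stands in for the settings object.

```python
from typing import Optional

# Single source of truth for provider priority, highest first.
PROVIDER_ORDER = [
    ("anthropic", "Anthropic Claude"),
    ("google", "Google Gemini"),
    ("groq", "Groq LLaMA"),
]


def active_provider(keys: dict) -> Optional[str]:
    """Return the first provider whose key is set, or None for stub mode."""
    for name, _label in PROVIDER_ORDER:
        if keys.get(name):
            return name
    return None


def startup_message(keys: dict) -> str:
    """Describe the provider that active_provider() would actually pick."""
    chosen = active_provider(keys)
    labels = dict(PROVIDER_ORDER)
    return f"Using {labels[chosen]}" if chosen else "Using stub mode"
```

If the router also dispatches on active_provider(), the startup log cannot disagree with the real routing, because both read the same list.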
Running 5 parallel specialist agents against Groq's 6k TPM free tier: the math never worked and I was pretending it did.
Each agent gets ~1,500 input tokens + generates ~400 output tokens = ~1,900 tokens per call. 5 parallel calls = 9,500 tokens launched simultaneously. Groq's rate limiter sees all 9,500 in the same minute window and rejects the overflow.
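The arithmetic is simple enough to check in a few lines before deploying. The sketch below uses the per-agent figures quoted above. The function names are hypothetical, and it models the limit as a plain per-minute token budget, which is a simplification of how a real rate limiter counts.

```python
def burst_tokens(agents: int, input_tokens: int, output_tokens: int) -> int:
    """Total tokens requested when all agents fire at once."""
    return agents * (input_tokens + output_tokens)


def fits_in_window(burst: int, tpm_limit: int) -> bool:
    """True if the whole burst fits inside one minute's token budget."""
    return burst <= tpm_limit


burst = burst_tokens(agents=5, input_tokens=1_500, output_tokens=400)
groq_ok = fits_in_window(burst, tpm_limit=6_000)        # Groq free tier
google_ok = fits_in_window(burst, tpm_limit=1_000_000)  # Google free tier
overflow = max(0, burst - 6_000)                        # tokens over Groq's budget
```

Here burst comes out at 9,500 tokens, which exceeds Groq's 6,000 TPM by 3,500 and fits comfortably within Google's limit.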
Three approaches I tried, in order:
Approach 1: Retry with backoff. Added 3-attempt retry with 4s/8s exponential backoff on 429 errors. Helped slightly. Didn't fix the underlying math.
Approach 2: Sequential execution with delays. Switched from asyncio.gather() to sequential calls with 1.5s gaps between agents. This spread the token burst across multiple rate-limit windows. Worked on Groq, but added ~7.5s to the masterplan pipeline, which was noticeable.
Approach 3: Switch to Google. Google's free tier is 1,000,000 TPM. Problem disappeared entirely. Now Groq is the fallback, not the primary.
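The first two approaches can be sketched roughly as follows. This is an illustration, not the production code: RateLimitError stands in for whatever 429 exception the provider SDK raises, and call_with_backoff and run_sequentially are hypothetical names. The sleep function is injectable so the logic can be tested without waiting.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


class RateLimitError(Exception):
    """Stand-in for the provider SDK's 429 error (hypothetical)."""


async def call_with_backoff(
    call: Callable[[], Awaitable[str]],
    delays: Sequence[float] = (4.0, 8.0),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
    """Try once, then retry after each delay: 3 attempts with the defaults."""
    for delay in delays:
        try:
            return await call()
        except RateLimitError:
            await sleep(delay)
    return await call()  # final attempt: let any error propagate


async def run_sequentially(
    calls: Sequence[Callable[[], Awaitable[str]]],
    gap: float = 1.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> list[str]:
    """Run agent calls one at a time with a pause between them."""
    results = []
    for i, call in enumerate(calls):
        if i:
            await sleep(gap)  # spread the token burst over time
        results.append(await call())
    return results
```

Neither helper changes the total number of tokens requested. They only change when the requests arrive, which is why they eased the problem on Groq without removing the underlying limit.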
The real lesson: design for the rate limits of your fallback providers, not just your primary. Groq is fast and cheap but not meant for parallel multi-agent workloads on the free tier.
After switching to Google as the production default, I did a full token and cost breakdown per session:
| Stage | Input tokens | Output tokens |
|---|---|---|
| Conversation (7 turns avg) | ~16,700 | ~3,500 |
| 5 specialist agents | ~24,000 | ~3,500 |
| Synthesis | ~12,700 | ~2,500 |
| Devil's advocate | ~2,800 | ~600 |
| Total per session | ~56,200 | ~10,100 |
At Google Gemini Flash pricing ($0.075 input / $0.30 output per million tokens):
Input cost: 56,200 / 1,000,000 × $0.075 = $0.0042
Output cost: 10,100 / 1,000,000 × $0.30 = $0.0030
Total: ~$0.007 per session
Socra charges ₹499 (~$6) for a full masterplan session. LLM cost per session: $0.007. That's about 99.9% gross margin on the LLM cost alone.
Railway hosting is ~$30/month fixed. Break-even is roughly 6 paid sessions per month.
This math only works because of the provider choice. The same session on Anthropic Haiku costs ~$0.085, which is 12× more expensive and would put margins at ~98.6%. Still fine, but the point is: provider selection is a product decision, not just a technical one.
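The per-session figures can be reproduced with a short calculation. The token counts and prices below are the ones from the tables in this post. session_cost is a hypothetical helper name, and the $6 price is the approximate conversion of ₹499 used above.

```python
def session_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float,   # USD per million input tokens
    output_price: float,  # USD per million output tokens
) -> float:
    """LLM cost in USD for one session."""
    return (input_tokens / 1_000_000) * input_price + (
        output_tokens / 1_000_000
    ) * output_price


gemini = session_cost(56_200, 10_100, input_price=0.075, output_price=0.30)
haiku = session_cost(56_200, 10_100, input_price=0.80, output_price=4.00)

price_usd = 6.0  # approximate USD value of the session price
gemini_margin = 1 - gemini / price_usd
haiku_margin = 1 - haiku / price_usd
```

With these inputs, gemini is about $0.0072 and haiku about $0.0854 per session, a ratio of roughly 12×.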
1. Design for multi-provider from day one. I added the fallback chain in Phase 3 after production broke. It should have been in the architecture from the start. The routing abstraction (_call_llm with provider detection) is simple enough to add in 30 minutes, so there's no reason to start with a single provider.
2. Test the rate limit math before deploying parallel calls. 5 parallel agents × 1,900 tokens = 9,500 tokens in one burst. Groq's free tier is 6,000 TPM. This is elementary arithmetic that I didn't do until users were getting errors.
3. Strip API keys at the config layer. .strip() in your settings class is a 5-minute change that eliminates an entire class of deployment bugs.
4. Make your startup log mirror your routing logic exactly. A log that says "Using Groq" when you're actually using Google is worse than no log: it actively misleads debugging.
Socra is built on: FastAPI + React + PostgreSQL + Railway + LangGraph (for the multi-agent pipeline) + Langfuse v4 (for per-call LLM observability) + Clerk (auth) + Razorpay (payments). The LLM fallback chain described here handles all LLM calls across the entire system: conversation, agents, synthesis, pitch deck generation, and the tribunal verdict scoring.
The live app is at socra-production.up.railway.app. The approach described here (OpenAI-compatible endpoints, two-call structured output, provider detection at the config layer) is all running in production today.
I'm a pre-final year student at HBTU Kanpur building production ML systems. If you're working on something similar or have questions about the multi-agent architecture, I'm on LinkedIn and GitHub.