When we shipped the first version of AI-generated replies for HelperX, each reply cost us about $0.011 in API spend. That sounds tiny until you multiply by 30 replies per slot per day times 200 active slots: roughly $66 per day, or ~$2,000 per month. Not catastrophic, but enough to eat into margins on the smaller plans.
A year later, we're spending $0.0009 per reply — a 12x reduction. Same model providers, similar reply quality, same throughput. The savings came from four optimization layers stacked on top of each other.
This is exactly what each layer does, the order we applied them, and the cost reduction each one produced.
The naive implementation looked like this:
async function generateReply(tweet, persona) {
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 200,
messages: [{
role: 'user',
content: `You are a ${persona.role} with tone level ${persona.tone}.
Reply to this tweet in 2-3 sentences:
Tweet: "${tweet.text}"
Author: @${tweet.author} (${tweet.followers} followers)
Reply should add value without being promotional.`
}],
});
return response.content[0].text;
}
Sonnet, fresh request every time, full system prompt baked into every call. Cost breakdown per reply:
The "overhead" includes retries, occasional context bloat from longer tweets, and a few percent failure rate that ate budget without producing output.
The first realization: not every reply needs the smartest model.
A reply to "AI is changing everything" doesn't need Sonnet-level reasoning. A reply to a detailed technical thread arguing two specific points might. We built a router that picks the model based on the complexity of the input tweet:
function routeModel(tweet) {
const complexityScore =
(tweet.text.length > 200 ? 2 : 0) +
(tweet.hasNumbers ? 1 : 0) +
(tweet.questionCount > 0 ? 1 : 0) +
(tweet.technicalKeywords > 2 ? 2 : 0);
if (complexityScore >= 4) return 'claude-sonnet-4-6';
if (complexityScore >= 2) return 'claude-haiku-4-5-20251001';
return 'claude-haiku-4-5-20251001'; // simpler tweets always get Haiku
}
We then validated reply quality across both models with a human-evaluated A/B test on 500 reply pairs. The results:
87% pass rate at the lower price tier is a no-brainer trade. The 5% rated much worse — Haiku failures — were exactly the high-complexity tweets, which is what the router catches.
The routing distribution in production:
Haiku pricing: $0.80/MTok input, $4/MTok output.
Per-reply cost after routing:
Already a 4x reduction. But we were paying mostly for the same input tokens over and over.
Anthropic's prompt caching lets you mark a portion of your prompt as cacheable. The first request pays the full input cost; subsequent requests within the cache TTL pay 10% of the input cost for the cached portion.
Our prompts had a long, mostly-stable system section explaining the persona, the rules, and a few examples — call it 600 tokens. The variable portion was the actual tweet (~50 tokens) plus persona settings (~20 tokens).
The naive structure:
// BAD: persona is at the end, can't be cached effectively
const messages = [{
role: 'user',
content: `${LONG_SYSTEM_INSTRUCTIONS}
Persona: ${persona.role}, tone ${persona.tone}
Tweet: ${tweet.text}`
}];
Restructured for cache hits:
const response = await anthropic.messages.create({
model: 'claude-haiku-4-5-20251001',
max_tokens: 200,
system: [
{
type: 'text',
text: LONG_SYSTEM_INSTRUCTIONS,
cache_control: { type: 'ephemeral' }, // mark for caching
},
{
type: 'text',
text: PERSONA_TEMPLATES_BLOCK, // also cacheable across personas
cache_control: { type: 'ephemeral' },
},
],
messages: [{
role: 'user',
content: `Persona: ${persona.role}, tone ${persona.tone}.
Tweet from @${tweet.author}: "${tweet.text}"`,
}],
});
Two cache blocks: a system block (the rules) and a persona templates block (per-persona context). Both are stable across many requests; only the per-tweet user message varies.
Cache hit rate after structuring this way: 94%.
Cost math with caching on Haiku:
This was a 17% additional reduction. Smaller than I expected, because the output tokens dominate the cost on short replies — caching only reduces input.
The real value of caching showed up at scale: at 200 slots × 30 replies/day, the bursts of similar requests within a 5-minute window all share cache. Off-peak hours don't benefit much, but reply queue bursts can compress input cost to nearly zero.
Here's the optimization that surprised everyone on the team: a lot of the tweets we were generating replies for were near-duplicates of each other.
In an active niche, you'll see the same news event tweeted by 8 different accounts in the same hour. Same topic, slightly different framing. Different authors, different audiences, but the underlying point is similar enough that the reply doesn't need to be generated from scratch.
We added an embedding-based deduplication layer in front of the generation step:
async function generateReplyWithDedup(tweet, persona) {
const embedding = await embedTweet(tweet.text);
// Search recent generated replies for near-matches
const cached = await findSimilarReply(embedding, persona.id, {
similarityThreshold: 0.93,
maxAgeHours: 6,
});
if (cached) {
return adaptReply(cached.reply, tweet); // light rewrite
}
const reply = await llmGenerate(tweet, persona);
await storeReplyEmbedding(embedding, reply, persona.id);
return reply;
}
The flow:
The adaptReply
step uses Haiku for a tiny, cheap transformation — replacing author handles, adjusting tense, swapping specific words. It costs roughly 1/5 of a full generation.
Cache hit rate on similarity: 32%.
That means 32% of our generation requests are now resolved by adapt instead of generate. Cost math:
A 25% reduction on top of caching. The embedding spend is negligible — adding $0.00001 per request to save $0.00050 across many is an excellent trade.
The team was nervous about deduplication killing reply quality. We A/B tested it for 30 days. The results:
Turns out the platform doesn't care that two of your replies on similar topics share a stylistic skeleton — humans do this all the time. As long as each individual reply reads as natural and on-topic for its specific tweet, the audit metrics don't move.
The fourth layer is small but adds up.
4a. Streaming with early termination
Many replies are shorter than max_tokens=200
. By streaming and inspecting tokens as they come, we can terminate generation when the model produces a natural stopping point (period followed by silence, or an explicit "[end]" token if we instruct it):
const stream = await anthropic.messages.stream({ model, messages, max_tokens: 200 });
let reply = '';
let consecutiveSpaces = 0;
for await (const event of stream) {
if (event.type === 'content_block_delta') {
const delta = event.delta.text;
reply += delta;
// Stop if reply ends with sentence and next tokens are filler
if (reply.length > 40 && /[.!?]\s*$/.test(reply)) {
consecutiveSpaces++;
if (consecutiveSpaces > 2) {
await stream.controller.abort();
break;
}
} else {
consecutiveSpaces = 0;
}
}
}
Saves about 12% of output tokens on average across our reply distribution.
4b. Adaptive max_tokens
Setting max_tokens=200
for every request is wasteful. The model often produces 60-80 tokens for short tweets. We pre-estimate based on the input:
function estimateMaxTokens(tweet, persona) {
const base = 80;
const tweetBoost = tweet.text.length > 150 ? 40 : 0;
const personaBoost = persona.verbosity === 'high' ? 40 : 0;
return Math.min(220, base + tweetBoost + personaBoost);
}
For most requests this caps at 120 tokens instead of 200. It doesn't directly reduce cost (you only pay for tokens generated, not requested), but it slightly improves quality — the model is less likely to ramble when the budget is tighter.
Combined savings from Layer 4: ~15% on output cost = roughly 10% on total per-reply cost.
Final cost: $0.00050 × 0.90 ≈ $0.00045
Wait — that's not the $0.0009 we ended with. Let me reconcile.
The above math optimistically assumes every reply goes through every layer perfectly. In production, you eat:
The blended production cost lands at $0.00088 per reply — close enough to call it $0.0009. Down from $0.011 starting point, which is a 12x reduction.
| Layer | Action | Per-reply cost | Reduction |
|---|---|---|---|
| 0 | Naive Sonnet, no caching | $0.0110 | — |
| 1 | Model routing (Haiku for 78%) | $0.00081 | 13.6x |
| 2 | Prompt caching (94% hit rate) | $0.00067 | 16.4x |
| 3 | Embedding deduplication (32% hit) | $0.00050 | 22x |
| 4 | Streaming + adaptive max_tokens | $0.00045 | 24.4x |
| Production overhead | Retries, failures, edge cases | $0.00088 | |
| 12.5x |
A few attempted optimizations that didn't pan out:
1. Self-hosted open-source models.
We tried Llama 3 70B and a few other open models for the Haiku tier of requests. The throughput was unpredictable (cold start latency, batching issues), the quality on short-form replies was noticeably worse, and the total cost when factoring in our own infrastructure wasn't competitive with Haiku's pricing.
Verdict: open models make sense at much higher volume than we run. Below ~100M tokens/day, hosted APIs win on price + quality + reliability.
2. Pre-generating reply pools.
The idea: generate 100 generic replies for common topics in advance, then pick the closest one. Tried it. The replies sounded canned because they weren't responsive to the actual tweet. Detection went up, quality went down, savings weren't worth it.
3. Using GPT-4o-mini or Gemini Flash as cheaper alternatives.
We tested cross-provider routing. Pricing was comparable to Haiku. Quality differences across providers were noticeable to our human evaluators on the same prompts. Sticking with one provider (Anthropic) eliminated a class of integration bugs and made the persona engine consistent.
4. Aggressive temperature reduction.
Lower temperature = more predictable output = potentially more cacheable. We tested temperature 0.3 vs 0.7. Lower temp made replies feel mechanical and reduced engagement metrics by 18%. The savings didn't justify the quality drop.
In retrospect:
The optimization math gets attractive when your LLM spend is:
If you're spending $50/month on LLMs, none of this is worth the engineering time. If you're spending $5,000/month, every percentage point of optimization is worth a sprint.
max_tokens
12x cost reduction is what it looks like when four small wins compound. None of these layers alone would have justified the work; together they make the unit economics of an AI-heavy SaaS work.
HelperX uses all four layers in production. Bring your own LLM API key — we pass through your provider rate at our optimization stack. Free 30-day trial.