The first time I ran the LLM scoring pipeline against our full backlog of job listings, I watched the OpenAI API costs climb in real time. What worked beautifully for 100 test listings was economically impossible at 10,000 per day.
This wasn't a side project. This was a production job board platform I was building for a client, processing listings from major ATS sources. Users needed scores. The client needed the system to be profitable. And I needed to rethink everything about how I was calling LLMs at scale.
The naive approach was simple: take a listing, send it to GPT-4 with a prompt, get a score back. Simple, but expensive. At scale, that pattern would have made the product economically unviable.
So I rebuilt the pipeline from the ground up. Here's what the final architecture looks like.
Before scoring anything, you need clean data at scale. The platform pulls from five major ATS providers using their public APIs. Greenhouse, Lever, Ashby, Workable, and Recruitee all expose listing data without OAuth, which makes ingestion straightforward, but each one returns data in a different shape.
The ingestion layer normalizes everything into a standard schema: title, company, description, location, posted date, and metadata. Then it writes to a MongoDB collection that the scoring pipeline reads from.
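A minimal sketch of that normalization step. The two sources and their raw field names (`company_name`, `descriptionPlain`, and so on) are placeholders I made up for illustration, not the providers' actual response formats. Only the target schema comes from the description above.

```javascript
// Hypothetical per-source mappers. The raw field names are illustrative
// assumptions, not real ATS response fields.
const mappers = {
  sourceA: (raw) => ({
    title: raw.title,
    company: raw.company_name,
    description: raw.content,
    location: raw.location ? raw.location.name : null,
    postedAt: raw.updated_at,
  }),
  sourceB: (raw) => ({
    title: raw.text,
    company: raw.org,
    description: raw.descriptionPlain,
    location: raw.city || null,
    postedAt: raw.createdAt,
  }),
};

// Normalize one raw listing into the standard schema:
// title, company, description, location, posted date, metadata.
function normalizeListing(source, raw) {
  const map = mappers[source];
  if (!map) throw new Error(`Unknown source: ${source}`);
  const fields = map(raw);
  return {
    title: String(fields.title || '').trim(),
    company: String(fields.company || '').trim(),
    description: String(fields.description || '').trim(),
    location: fields.location ? String(fields.location).trim() : null,
    // Accepts an ISO string or epoch milliseconds; throws on an invalid date.
    postedAt: new Date(fields.postedAt).toISOString(),
    metadata: { source },
  };
}
```

Keeping one small mapper per source means a provider changing its payload touches a single function, and everything downstream only ever sees the standard shape.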
The first failure here was pagination. I was using MongoDB's skip() for offset-based pagination when reading from the collection. At 1M+ documents, deep skip calls caused Atlas CPU spikes because skip() doesn't skip computation. It scans every document up to the offset. The more listings we ingested, the worse it got.
The fix was cursor-based pagination using the _id field. Instead of skipping, the query says "give me the next 100 documents after this one." No scanning. No CPU spikes. The change took an afternoon to implement and permanently solved a problem that had been causing weekly incidents.
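Here is a minimal sketch of that pattern, not the production code. It assumes `collection` behaves like a MongoDB Node.js driver collection (`find`, `sort`, `limit`, `toArray`). The function name `pagesById` is my own.

```javascript
// Iterate a collection in _id order, one page at a time, without skip().
// `collection` is assumed to behave like a MongoDB Node.js driver Collection.
async function* pagesById(collection, pageSize = 100) {
  let lastId = null;
  while (true) {
    // After the first page, only ask for documents past the last one seen.
    const filter = lastId === null ? {} : { _id: { $gt: lastId } };
    const page = await collection
      .find(filter)
      .sort({ _id: 1 })
      .limit(pageSize)
      .toArray();
    if (page.length === 0) return; // no more documents
    yield page;
    lastId = page[page.length - 1]._id;
  }
}
```

Because _id is always indexed, each query can seek straight to the starting point, so page 10,000 costs about the same as page 1. Consume it with `for await (const page of pagesById(collection)) { ... }`.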
But pagination was just the warm-up. The real challenge was still ahead.
For the scoring pipeline, I needed structured, predictable output. Freeform prompts with "return a JSON object" instructions are fragile. One day the LLM decides to add commentary. The next day it renames a key. That breaks downstream systems.
Function calling fixed this. Here's the schema I use for scoring a job listing against a candidate profile:
const scoringFunctions = [
  {
    name: 'score_job_match',
    description: 'Score how well a job listing matches a candidate profile',
    parameters: {
      type: 'object',
      properties: {
        overall_score: {
          type: 'number',
          description: 'Match score from 0 to 100',
        },
        skill_match: {
          type: 'number',
          description: 'How well the candidate skills match requirements, 0-100',
        },
        experience_match: {
          type: 'number',
          description: 'How well candidate experience matches, 0-100',
        },
        location_match: {
          type: 'number',
          description: 'Location compatibility, 0-100',
        },
        reasons: {
          type: 'array',
          items: { type: 'string' },
          description: 'Top 3 reasons for this score',
        },
      },
      required: ['overall_score', 'skill_match', 'experience_match', 'location_match', 'reasons'],
    },
  },
];
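With function calling, the LLM returns the same JSON structure on nearly every call. No commentary to strip. No renamed keys. The arguments still arrive as a JSON string, and without strict schema enforcement the model can occasionally emit malformed or incomplete output, so it's worth validating before piping the data into the database.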
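Function-call arguments come back from the API as a JSON string, so a small validation step before the database write is a cheap safeguard. This is a minimal sketch against the schema above. The helper name `parseScoreArguments` is hypothetical.

```javascript
const SCORE_FIELDS = ['overall_score', 'skill_match', 'experience_match', 'location_match'];

// Parse and validate the arguments string from a score_job_match call.
// Throws if the JSON is malformed or a field is missing or out of range.
function parseScoreArguments(argumentsJson) {
  const args = JSON.parse(argumentsJson); // throws SyntaxError on bad JSON
  if (args === null || typeof args !== 'object' || Array.isArray(args)) {
    throw new Error('Expected a JSON object');
  }
  for (const field of SCORE_FIELDS) {
    const value = args[field];
    if (typeof value !== 'number' || !Number.isFinite(value) || value < 0 || value > 100) {
      throw new Error(`Invalid ${field}: ${JSON.stringify(value)}`);
    }
  }
  if (!Array.isArray(args.reasons) || !args.reasons.every((r) => typeof r === 'string')) {
    throw new Error('Invalid reasons: expected an array of strings');
  }
  return args;
}
```

A listing that fails validation can be retried or sent to a review queue instead of writing a bad score.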
But even with perfect output structure, the cost problem remained.
This is where most people give up on production AI. They see the OpenAI bill, panic, and either kill the feature or ship a broken version. I went through both phases before landing on a working approach.
Strategy 1: Batch everything possible.
OpenAI's Batch API gives you a 50% cost reduction in exchange for delayed processing. For scoring, that's fine. Listings don't need scores within seconds. They need scores within hours. Each line of the batch input file wraps the same request body you'd send to the real-time API. I queue up 500 scoring requests, submit them as a batch file, and collect the results 30 to 60 minutes later, well inside the API's 24-hour completion window. The per-listing cost drops immediately and the throughput stays the same.
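A sketch of building the batch input file, assuming each queued job carries a hypothetical `id` and a prebuilt `prompt` string. The line format (`custom_id`, `method`, `url`, `body`) follows the Batch API's JSONL input. The system prompt is a placeholder, and the function-calling schema is omitted from the body for brevity.

```javascript
// Build the JSONL content for a Batch API input file.
// Each line wraps one ordinary chat completion request body.
// `job.id` and `job.prompt` are hypothetical fields for this sketch.
function buildBatchFile(jobs, model = 'gpt-4o-mini') {
  return jobs
    .map((job) =>
      JSON.stringify({
        custom_id: job.id, // used to match results back to listings
        method: 'POST',
        url: '/v1/chat/completions',
        body: {
          model,
          messages: [
            { role: 'system', content: 'Score the job listing against the candidate profile.' },
            { role: 'user', content: job.prompt },
          ],
        },
      })
    )
    .join('\n');
}
```

The resulting string is uploaded as a file and referenced when the batch is created. Results can come back in a different order than they were submitted, which is why every line needs its own `custom_id`.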
Strategy 2: Tier your models.
Not every listing needs GPT-4-level analysis. Simple listings with clear skill requirements get scored with GPT-4o mini. Complex executive roles or ambiguous descriptions go to GPT-4. The routing logic is straightforward: if the description is under 500 words and the required skills are well-defined, use the cheap model. Otherwise, escalate.
This alone cut the average per-listing cost by about 70% without measurable accuracy loss. The key insight is that most data in most systems is simple. Only a fraction needs the heavy model. Design for the majority.
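The routing rule can be sketched in a few lines. How to decide that skills are "well-defined" isn't spelled out here, so this sketch assumes a hypothetical `requiredSkills` array on the normalized listing and treats a non-empty array as well-defined.

```javascript
const CHEAP_MODEL = 'gpt-4o-mini';
const HEAVY_MODEL = 'gpt-4';
const MAX_SIMPLE_WORDS = 500;

// Count whitespace-separated words in a string.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Route short listings with explicit skills to the cheap model,
// and everything else to the heavy one.
// `requiredSkills` is an assumed field, not part of the schema described earlier.
function pickModel(listing) {
  const hasClearSkills =
    Array.isArray(listing.requiredSkills) && listing.requiredSkills.length > 0;
  const isShort = countWords(listing.description) < MAX_SIMPLE_WORDS;
  return hasClearSkills && isShort ? CHEAP_MODEL : HEAVY_MODEL;
}
```

Defaulting to the heavy model when either check fails keeps the failure mode conservative: a misrouted listing costs more, but it doesn't get a worse score.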
Strategy 3: Cache aggressively.
If a listing has been scored before and nothing changed, don't pay to score it again. I built a cache layer keyed on a hash of the listing content plus the candidate profile ID. The pipeline checks the cache before making any LLM call. Hit rate runs around 40% on repeat listings. That's 40% of requests that cost nothing.
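A minimal sketch of that check, using an in-memory Map as a stand-in for whatever store backs the real cache. The names `cacheKey` and `scoreWithCache`, and the choice of which listing fields go into the hash, are assumptions.

```javascript
const crypto = require('node:crypto');

// Cache key: a hash over the candidate profile ID and the listing content.
// Any change to the content or the profile produces a different key.
function cacheKey(listing, profileId) {
  const content = JSON.stringify([
    profileId,
    listing.title,
    listing.company,
    listing.description,
    listing.location,
  ]);
  return crypto.createHash('sha256').update(content).digest('hex');
}

// Check the cache before paying for an LLM call.
// `cache` needs has/get/set (a Map works); `scoreFn` performs the real scoring.
async function scoreWithCache(cache, listing, profileId, scoreFn) {
  const key = cacheKey(listing, profileId);
  if (cache.has(key)) return cache.get(key);
  const score = await scoreFn(listing, profileId);
  cache.set(key, score);
  return score;
}
```

Hashing the content rather than keying on the listing ID means an edited listing gets rescored automatically, while an unchanged one never does.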
Even with all three strategies, the client's AI rewrite pipeline got shut down. The cost at 1M+ listing scale was still too high for the budget. That's the reality of production AI. You don't solve cost once. You keep optimizing, or you find a model that's cheap enough and try again. That's what I'm evaluating with DeepSeek V4 Flash right now.
The scored listings don't just sit in a database. They're served through a REST API that downstream consumers query.
The API accepts filters for score ranges, locations, skills, and posting dates. Each endpoint logs query patterns so I can optimize indexes and cache popular queries. The response format is plain JSON with the score fields grouped under a single score object, making it easy for frontend developers and integration partners to consume without transformation.
// API response shape for scored listings
{
  "id": "listing_abc123",
  "title": "Senior Software Engineer",
  "score": {
    "overall": 87,
    "skill": 92,
    "experience": 85,
    "location": 80,
    "reasons": [
      "Strong TypeScript and React experience matches requirements",
      "5 years of backend experience aligns with senior role",
      "Remote-first company, location not a barrier"
    ]
  },
  "posted_at": "2026-05-01T10:00:00Z"
}
The API layer also handles rate limiting, auth via API keys, and request validation. It's the part users and integrators see, so it has to be fast and reliable. Every endpoint returns scores within 50ms because all the LLM work happened hours ago during the batch window.
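As an illustration of the request-validation step, here is a sketch that turns raw query parameters into a MongoDB-style filter. The parameter names (`min_score`, `max_score`, `skills`, `posted_after`) and the stored field names are assumptions loosely based on the response shape above, not the actual API contract.

```javascript
// Turn raw query-string parameters into a validated MongoDB-style filter.
// Parameter and field names are assumptions for this sketch.
function buildListingFilter(query) {
  const filter = {};

  const minScore = query.min_score === undefined ? 0 : Number(query.min_score);
  const maxScore = query.max_score === undefined ? 100 : Number(query.max_score);
  const scoresValid =
    Number.isFinite(minScore) && Number.isFinite(maxScore) &&
    minScore >= 0 && maxScore <= 100 && minScore <= maxScore;
  if (!scoresValid) {
    throw new RangeError('Scores must satisfy 0 <= min_score <= max_score <= 100');
  }
  filter['score.overall'] = { $gte: minScore, $lte: maxScore };

  if (query.location) filter.location = String(query.location);

  if (query.skills) {
    // Comma-separated list; a listing must have every requested skill.
    const skills = String(query.skills).split(',').map((s) => s.trim()).filter(Boolean);
    if (skills.length > 0) filter.skills = { $all: skills };
  }

  if (query.posted_after) {
    const date = new Date(query.posted_after);
    if (Number.isNaN(date.getTime())) {
      throw new RangeError('posted_after must be a valid date');
    }
    filter.posted_at = { $gte: date }; // assumes posted_at is stored as a Date
  }

  return filter;
}
```

Rejecting bad input before it reaches the database keeps malformed queries from ever running, which helps protect that latency budget.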
If I were starting this system today, I'd skip the GPT-4-only phase entirely and start with model tiering from day one. The cost of "let's just get it working with the best model" is real, and it creates pressure to optimize under fire instead of by design.
I'd also build the cache layer before the scoring pipeline, not after. Adding caching retroactively meant replaying scores that were already paid for. Building it first would have saved thousands in the first month.
And I'd have the cost conversation with the client earlier. The rewrite pipeline that got shut down would have been designed differently if we'd agreed on cost constraints before writing code. But that's a lesson in communication, not engineering.
Production AI isn't about using the smartest model. It's about designing for cost, latency, and reliability from day one. The model choice matters, but your architecture matters more.
If your team is building AI features and hitting the wall between "it works on my machine" and "it works at scale without burning cash", that's the kind of problem I help with. You can see how I build production AI pipelines at primestrides.com. Happy to compare notes on your specific challenges.
Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.