We run Gemini at scale across billions of posts

Modash runs Gemini LLMs against billions of social media posts to extract structured meaning from messy, multilingual content, replacing error-prone regex rules. The company processes massive inference volumes across a multi-cloud setup using AWS Batch jobs, GCP buckets, and Gemini's Batch API to avoid prohibitive costs. This pipeline improves data quality for customers by correctly interpreting nuanced statements like "this is not a sponsored post" that traditional pattern-matching would misclassify.

Using LLMs with billions of inputs in a multi-cloud setup

At Modash (https://www.modash.io/) we sit on top of a creator-discovery dataset that grows by millions of posts every day. A growing slice of that pipeline now runs through LLMs. That volume of inference adds to both our cloud bills and our operational complexity. In this article you will learn how we actually run an LLM against billions of inputs without going broke.

Why We Use LLMs

Is the AI hype worth it? Do LLMs have any real use beyond being a 24/7 chatbot? We think so, and over the last year we’ve shipped several production pipelines where LLMs are visibly improving the data we deliver to our customers.

Several of those pipelines exist to extract structured meaning from messy, multilingual, multimodal social content. Historically these were patchworks of regex rules, keyword lists, and hand-coded extractors. They scaled in lines of code, not in coverage.

Take a caption like “This is not a sponsored post” or “I’m not being paid for this promotion”: it contains every keyword the rules were looking for, but means the exact opposite. The only way to handle it correctly is to actually understand the content. Those false positives are unacceptable because they erode the perceived quality of a product customers are paying for.

LLMs reframe these as language and vision tasks instead of pattern-matching ones. The tradeoff is cost, throughput, and validation, which is what the rest of this article is about.

Our solution

Our upstream data lives in Iceberg tables on S3. Each LLM use case has a corresponding Airflow DAG that triggers PySpark ETLs, which read our curated tables and extract the rows that need inference.
The AWS Batch jobs generate JSONL files containing the Gemini prompts and store them in GCP buckets, one per region, to use as much compute capacity as possible (more on that below). Pub/Sub detects the upload event and sends the JSONL file to Gemini Enterprise Agent Platform through the Batch API, as it’s 50% cheaper (https://ai.google.dev/gemini-api/docs/batch-api?batch=file). Gemini Enterprise (formerly Vertex) reads the model to be used from the path of the file and stores the output using the same partitioning strategy.

From there, we run a periodic sync job that pulls those output JSONLs into S3 and lands them as Parquet. Each input row is identified by a unique ID that is also present in the LLM output. Finally, the data is ready to be used, and our scheduled EMR jobs generate the data that we produce for our customers. Check this link if you want to learn more about how we optimize our EMR jobs: https://www.modash.io/engineering/lessons-from-spark-optimization

What each Batch job does

For each Parquet file, Airflow triggers one AWS Batch job that, at a high level, prepares our raw platform data so Gemini can digest it:

- Reads the necessary post data and handles heavy I/O tasks, like downloading and encoding media. This work has to run in parallel or the job’s resources will be underutilized.
- Encapsulates the data into Gemini requests: each post or batch of posts is packaged into a single, self-contained request payload along with its prompt instructions and structured output schemas.
- Aggregates these request payloads and writes them into a large JSONL file that rotates when it hits ~900 MB.
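The aggregation step in the last bullet can be sketched roughly as follows. This is a minimal illustration using only the standard library, not our production writer: the class name, the file-name prefix, and the byte threshold (900,000,000, taken as an interpretation of "~900 MB") are assumptions made for the example.

```python
import json
import os
import tempfile


class RotatingJsonlWriter:
    """Append request payloads to JSONL files, starting a new file
    before the current one would grow past max_bytes."""

    def __init__(self, out_dir, max_bytes=900_000_000, prefix="requests"):
        self.out_dir = out_dir
        self.max_bytes = max_bytes
        self.prefix = prefix  # hypothetical file-name prefix
        self.paths = []       # every file opened so far, in order
        self._file = None
        self._size = 0

    def _rotate(self):
        self.close()
        name = f"{self.prefix}-{len(self.paths):05d}.jsonl"
        path = os.path.join(self.out_dir, name)
        self._file = open(path, "wb")
        self._size = 0
        self.paths.append(path)

    def write(self, key, request):
        # One self-contained request per line; the key lets us join the
        # model output back to the input row later.
        payload = {"key": key, "request": request}
        line = (json.dumps(payload, ensure_ascii=False) + "\n").encode("utf-8")
        # Rotate when the line would not fit. An oversized single line
        # still gets written, alone, into a fresh file.
        too_big = self._size > 0 and self._size + len(line) > self.max_bytes
        if self._file is None or too_big:
            self._rotate()
        self._file.write(line)
        self._size += len(line)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None


def demo():
    # Tiny threshold so rotation is visible without writing 900 MB.
    with tempfile.TemporaryDirectory() as tmp:
        writer = RotatingJsonlWriter(tmp, max_bytes=200)
        for i in range(10):
            request = {"contents": [{"role": "user", "parts": [{"text": "x" * 40}]}]}
            writer.write(f"post-{i}", request)
        writer.close()
        sizes = [os.path.getsize(p) for p in writer.paths]
        return len(writer.paths), sizes
```

Measuring the encoded byte length, not the character count, matters here because multilingual captions can take several bytes per character.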
A single line of that JSONL, pretty-printed here for readability:

    {
      "key": "XYZ",
      "request": {
        "contents": [
          {
            "parts": [
              { "text": "<batch_request><post_entry>POST DATA</post_entry></batch_request>" }
            ],
            "role": "user"
          }
        ],
        "generationConfig": {
          "temperature": 0,
          "maxOutputTokens": 8192,
          "responseMimeType": "application/json",
          "responseSchema": {
            "type": "object",
            "title": "SponsoredPostBatchResponse",
            "properties": {
              "results": {
                "type": "array",
                "title": "Results",
                "description": "List of analysis results. Must contain exactly one result per input post.",
                "items": {
                  "type": "object",
                  "title": "SponsorIdentification",
                  "properties": { "PYDANTIC OBJECT SCHEMA" }
                }
              }
            }
          },
          "thinkingConfig": { "thinkingBudget": 0 }
        },
        "systemInstruction": {
          "parts": [
            { "text": "OUR SPONSORED POST DETECTION PROMPT" }
          ]
        }
      }
    }
    ...

Why JSONLs are big

The 900 MB number is deliberate: Gemini’s hard input cap is 1 GB (https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/capabilities/batch-prediction-gemini). But packing requests into large JSONLs does not save us token money. What large JSONLs do save:

- Concurrent-batch-job quota. Gemini Enterprise enforces a cap on how many batch jobs we can have running per region simultaneously. A 1 KB file and a 900 MB file each consume one quota slot.
- GCS object operations. Gemini Enterprise writes results back as files mirroring the input shape. Thousands of micro-JSONLs become thousands of micro-output-files, and those object operations cost far more on GCS at billion scale.
- Downstream simplicity. Fewer files for our ingestion job to enumerate, glob, and merge.

Spreading load across GCP regions

Gemini Enterprise Batch quotas are per-region, not per-project. If we naively upload all our JSONLs into one GCS bucket and run them in the same region as the bucket, we’ll hit the regional concurrency cap, queue up, and starve while other regions sit idle. To bypass this bottleneck, we architect our pipeline around GCP traffic routing tiers:

- Regional endpoints (europe-west4, us-central1, …).
  Request lands in exactly that region. Data stays there, quota is counted there, and if the region is full we wait.
- Multi-region endpoints (e.g. us, eu). Gemini Enterprise transparently shifts traffic between regions inside a single geography while keeping data within the geography’s residency boundary.
- The global endpoint. A single endpoint where Google routes requests to whichever region in the world currently has capacity. It’s the right decision when we are willing to trade latency and residency for availability.

The catch: not every model supports every endpoint type. So, for the models without multi-region support, we provisioned GCS buckets across European regions, one per supported region (https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/gemini/2-5-flash). The Batch jobs upload each JSONL to one of those buckets depending on how loaded each region is:

- First, a tracker polls each region’s Gemini Enterprise Batch API.
- Then it counts how many of our batch jobs in that region are currently in QUEUED state.
- When the job is ready to upload a new JSONL, the tracker picks the least-loaded region instead of making a blind random choice.

Prompt engineering

The cheapest performance lever we have is the prompt itself. Better prompts are cheaper than more thinking budget, more capable models, or more inference passes. After several painful iterations, the practices below are the ones that survived contact with production.

Batch multiple rows per prompt

When using the Gemini Enterprise Batch API, costs are determined strictly by token volume: input tokens + output tokens + thinking tokens. Because you are billed for every single token that passes through the model, executing tasks one by one forces you to pay the full price of your static system instructions over and over again.
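To make that repeated cost concrete, here is a small sketch of per-post cost as a function of how many posts share one copy of the instructions. The default values are the gemini-2.5-flash figures from the real case example later in this section; the function itself is an illustration that assumes zero thinking tokens, no extra wrapper tokens per request, and a constant output size per post.

```python
def cost_per_post(
    posts_per_request,
    system_tokens=1189,          # static instruction block, paid once per request
    input_tokens_per_post=300,
    output_tokens_per_post=100,
    input_rate=0.15,             # USD per 1M input tokens (batch)
    output_rate=1.25,            # USD per 1M output tokens (batch)
):
    """Return the USD cost of one post when the system prompt is shared
    by posts_per_request posts."""
    # The instruction block is split across the posts in the request;
    # the per-post input and output are paid in full every time.
    input_tokens = system_tokens / posts_per_request + input_tokens_per_post
    usd_per_million_posts = (
        input_tokens * input_rate + output_tokens_per_post * output_rate
    )
    return usd_per_million_posts / 1_000_000


for n in (1, 5, 10, 20, 50, 100):
    per_million = cost_per_post(n) * 1_000_000
    print(f"{n:>3} posts/request: ${per_million:,.2f} per 1M posts")
```

Under these assumptions the cost is about $348 per million posts with one post per request and about $179 at 20 posts per request. The output cost of $125 per million posts is a floor that no amount of batching removes.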
To avoid this, multiple rows can be packed into the same prompt, allowing us to “cache” the fixed instruction block. (For standard real-time calls there is also the option of context caching, https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/context-cache/context-cache-overview, but, sadly, that’s not applicable to Batch API calls.)

This only pays off when the prompt is the heavy half of the request. For our multimodal jobs the input is by far the dominant token consumer (simple instructions, huge input). There, batching multiple rows into one request would bring only marginal benefits. But for cases where the instruction block is much larger than the per-row input, batching amortizes the prompt cost across many rows. The rule of thumb is: batch when prompt tokens dominate per-row input tokens; otherwise prefer one request per row for simplicity and debuggability.

Real case example:

- Model: gemini-2.5-flash (https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash)
- Batch input rate: $0.15 per 1,000,000 tokens
- Batch output rate: $1.25 per 1,000,000 tokens
- System instructions prompt: 1,189 tokens
- Data size per post: 300 input tokens
- Output tokens per post: 100 output tokens

When we map these exact numbers out across different batch sizes, the law of diminishing returns (https://en.wikipedia.org/wiki/Diminishing_returns) becomes obvious, and it illustrates why 10 to 20 posts per prompt is our sweet spot. Moving from 1 to 20 posts saves us thousands of dollars over millions of records. However, scaling past 20 pushes us straight into a flat plateau. Jamming 50 or 100 posts into a line introduces substantial engineering risk for less than a 2% financial gain.

A few practical notes:

- Output token limit per request. Gemini models cap at ~65K output tokens.
  Input tokens also have a limit, but it is ~1M.
- Add instructions like “you must return a result object for EVERY input post ID” and “Process each account in the list independently; do not infer connections” to ensure that each item of the batch is processed independently.
- Always validate that result count == input count. Very large batches see more order-shuffling and occasional dropped result objects.

Force the model to think

LLMs are lazy by default. If we ask a model for a straightforward categorical label or a simple binary verdict, it will happily emit that answer instantly without performing any real internal reasoning (frequently leading to hallucinations or lazy, incorrect guesses). Even for non-reasoning models, forcing them to summarize evidence before labeling (structured Chain-of-Thought) prevents them from guessing. It allows the models to process the entire input before committing to a verdict.

To apply the structured Chain-of-Thought pattern, we force the model to populate an analysis trace field before it reaches the final classification (is a sponsored post?), which structurally prevents it from shooting from the hip. Those fields exist purely to force the model to verbalize the reasoning before it emits the classification. But, as nothing is more permanent than a temporary fix, they also turned out to be really useful while debugging and while trying to understand the LLM’s reasoning.

A few practical notes:

- Cap them tightly (20 words). Long-form chain-of-thought wastes tokens.
- Put them first in the schema definition. The model fills fields in the order they’re declared, so the reasoning has to land before the classification.
- Use them for debugging. When a result looks wrong, the analysis trace tells whether it was a rule-interpretation issue (fix the prompt) or a genuine ambiguity in the source data (fix the input filter).
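As a rough sketch of how the trace-first schema and the batch validation fit together, the example below declares the reasoning field before the verdict and then checks a batched response against the input IDs. All field names (post_id, analysis_trace, is_sponsored) and the check_batch helper are hypothetical, chosen for illustration. The propertyOrdering key is described in Gemini's structured-output documentation as a way to pin field order, but treat its use here as an assumption to verify against the current docs.

```python
import json

# Hypothetical per-post result schema. The trace is declared before the
# verdict so the evidence summary is generated first.
RESULT_SCHEMA = {
    "type": "object",
    "properties": {
        "post_id": {"type": "string"},
        "analysis_trace": {
            "type": "string",
            "description": "Evidence summary, at most 20 words.",
        },
        "is_sponsored": {"type": "boolean"},
    },
    "required": ["post_id", "analysis_trace", "is_sponsored"],
    # Assumption: pins the generation order explicitly.
    "propertyOrdering": ["post_id", "analysis_trace", "is_sponsored"],
}


def check_batch(input_ids, response_text, max_trace_words=20):
    """Return a list of problems found in one batched response."""
    results = json.loads(response_text)["results"]
    problems = []
    # Result count must equal input count.
    if len(results) != len(input_ids):
        problems.append(f"expected {len(input_ids)} results, got {len(results)}")
    # Match by ID, never by position, since order can be shuffled.
    returned = {r.get("post_id") for r in results}
    missing = [i for i in input_ids if i not in returned]
    if missing:
        problems.append(f"missing ids: {missing}")
    for r in results:
        words = len(str(r.get("analysis_trace", "")).split())
        if words > max_trace_words:
            problems.append(f"{r.get('post_id')}: trace has {words} words")
    return problems


good = json.dumps({"results": [
    {"post_id": "b", "analysis_trace": "Caption denies payment.", "is_sponsored": False},
    {"post_id": "a", "analysis_trace": "Discount code and brand tag.", "is_sponsored": True},
]})
dropped = json.dumps({"results": [
    {"post_id": "a", "analysis_trace": "Discount code and brand tag.", "is_sponsored": True},
]})
```

Here check_batch(["a", "b"], good) returns an empty list even though the results came back out of order, while the dropped response is reported both as a count mismatch and as a missing ID, which is enough to send that request back for a retry.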
Invest in prompts, not in budget

Gemini models expose a configuration lever called thinkingConfig that allows developers to adjust the model's internal reasoning budget (https://ai.google.dev/gemini-api/docs/thinking#thinking-levels). The intuitive engineering assumption is that a higher reasoning budget naturally yields more accurate answers.

We tested that assumption thoroughly with our prompts and models, and we found, repeatedly, that for our use cases the quality difference between MINIMAL and HIGH was nearly zero but the cost difference was massive: thinking tokens count toward our output bill, and at high thinking budgets they often outnumber the actual response tokens by 5–10×.

If a model requires maximum compute to classify a post correctly, it is almost always evidence that the prompt logic is ambiguous, not that the model is lacking intelligence.

A few practical notes:

- Explicit edge cases: actively document real-world exceptions.
- Generalized logic filters: create distinct, bulletproof classification guardrails with examples that minimize model ambiguity.
- Prompt iteration is several orders of magnitude cheaper than thinking-budget escalation: use thinkingLevel as a last-resort lever, not a first-pass dial.

Set an output schema

Every prompt in our pipeline ships with a strict Pydantic schema (https://ai.google.dev/gemini-api/docs/structured-output?example=recipe) passed alongside the request, and we keep the fields as restrictive as the use case allows. Doing so, we avoid wasting tokens on malformed outputs, we avoid data quality errors in later stages of our pipelines, and we reduce hallucinations because we limit what the model can output.

A few practical notes:

- Restrictive schemas don’t make the model dumber: they make its outputs operable.
- Make fields nullable when they don’t apply: this saves output tokens.
- Use enums or sets of values whenever possible.

XML-tagged sections

Large models respect XML tags as semantic boundaries far better than markdown headers.
Wrapping the prompt’s parts in named tags produces measurably more consistent behavior than the same content written as a single long passage. With defined tags, the model stops conflating “what counts as a positive case” with “what the output should look like”, because those two things now live in syntactically distinct regions.

There’s a useful side effect: prompt-injection resistance. User-supplied content is dangerous, but with this approach it only appears inside the input data tag, and the system instruction explicitly says that anything inside that tag is data, not instructions.

A few practical notes:

- Use semantic tag names, not generic ones: <classification_logic>, <output_rules>, <sanitization_step>, … carry signal; the model reads the tag name as part of the context for what’s inside.

Conclusion

In this article, we covered:

- How we use GCP to run our production LLM pipelines while our core infrastructure remains on AWS.
- How we optimize those pipelines for both cost and execution time at scale.
- How to get the most out of prompts without relying on larger models or additional reasoning capabilities.

This article summarizes just a few of the practices we use to run Gemini across billions of posts. We have many more to share, so let us know if you’d like to see a part 2.