We run Gemini at scale across billions of posts

Modash runs Gemini LLMs against billions of social media posts daily to extract structured meaning from messy, multilingual content, replacing error-prone regex rules. The company processes massive inference volumes across a multi-cloud setup using AWS Batch jobs, GCP buckets, and Gemini's Batch API to avoid prohibitive costs. This pipeline improves data quality for customers by correctly interpreting nuanced statements like "this is not a sponsored post" that traditional pattern-matching would misclassify.

Using LLMs with billions of inputs in a multi-cloud setup

At Modash (https://www.modash.io/) we sit on top of a creator-discovery dataset that grows by millions of posts every day. A growing slice of that pipeline now runs through LLMs. This volume of inference adds to both our cloud bills and our operational complexity. In this article you will learn how we actually run an LLM against billions of inputs without going broke.

Why We Use LLMs

Is the AI hype worth it? Do LLMs have any real use beyond being a 24/7 chatbot? We think so, and over the last year we've shipped several production pipelines where LLMs are visibly improving the data we deliver to our customers.

Several of those pipelines exist to extract structured meaning from messy, multilingual, multimodal social content. Historically these were patchworks of regex rules, keyword lists, and hand-coded extractors. They scaled in lines of code, not in coverage.

Take a caption like "This is not a sponsored post" or "I'm not being paid for this promotion": it contains every keyword the rules were looking for, but means the exact opposite. The only way to handle it correctly is to actually understand the content. Those false positives are unacceptable because they erode the perceived quality of a product customers are paying for. LLMs reframe these as language and vision tasks instead of pattern-matching ones.
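To make the failure mode concrete, here is a minimal sketch of a keyword rule tripping over negation. The pattern and the function name `looks_sponsored` are hypothetical stand-ins chosen for illustration, not the rules we actually ran in production.

```python
import re

# Hypothetical keyword rule, similar in spirit to a hand-written extractor.
SPONSORED_PATTERN = re.compile(r"\b(sponsored|paid|promotion)\b", re.IGNORECASE)


def looks_sponsored(caption: str) -> bool:
    """Flag a caption as sponsored if any keyword appears, ignoring context."""
    return SPONSORED_PATTERN.search(caption) is not None


captions = [
    "This is not a sponsored post",           # negated, but still matches "sponsored"
    "I'm not being paid for this promotion",  # negated, but still matches "paid"
    "Lovely sunset today",                    # no keyword, no match
]

for caption in captions:
    print(f"{caption!r}: flagged={looks_sponsored(caption)}")
```

Both negated captions are flagged, because the rule only sees keywords and never the "not" in front of them. Patching this with more rules is exactly the kind of growth in lines of code without growth in coverage described above.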
The tradeoff is cost, throughput, and validation, which is what the rest of this article is about.

Our solution

Our upstream data lives in Iceberg tables on S3. Each LLM use case has a corresponding Airflow DAG that triggers PySpark ETLs, which read our curated tables and extract the rows that need inference.

AWS Batch jobs generate the JSONL files with the Gemini prompts and store them in GCP buckets, one per region, to leverage as much compute capacity as possible (more on that below). A Pub/Sub trigger detects the event and sends the JSONL file to the Gemini Enterprise Agent Platform through the Batch API, which is 50% cheaper (https://ai.google.dev/gemini-api/docs/batch-api?batch=file). Gemini Enterprise (formerly Vertex) reads the model to use from the path of the file and stores the output using the same partitioning strategy.

From there, we run a periodic sync job that pulls those output JSONLs into S3 and lands them as Parquet. Each input row is identified by a unique ID that is also present in the LLM output.

Finally, the data is ready to be used, and our scheduled EMR jobs generate the data that we produce for our customers. Check this link if you want to learn more about how we optimize our EMR jobs: https://www.modash.io/engineering/lessons-from-spark-optimization

What each Batch job does

From there, for each Parquet file, Airflow triggers one AWS Batch job that, at a high level, prepares our raw platform data so Gemini can digest it:

- Reads the necessary post data and handles heavy I/O tasks, like downloading and encoding media. This work has to run in parallel, or the job's resources will be underutilized.
- Encapsulates the data into Gemini requests: each post or batch of posts is packaged into a single, self-contained request payload along with its prompt instructions and structured output schemas.
- Aggregates these request payloads and writes them into a large JSONL file that rotates when it hits ~900 MB.
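As a rough illustration of the last step, a size-based rotating JSONL writer could look like the sketch below. The class name `RotatingJsonlWriter`, the file naming scheme, and the exact byte threshold are assumptions made for this example and not our production code; the request payload is treated as an opaque dictionary.

```python
import json
import os

# Assumed threshold: roughly the ~900 MB rotation point mentioned above.
ROTATE_BYTES = 900 * 1024 * 1024


class RotatingJsonlWriter:
    """Append one JSON object per line, starting a new file near max_bytes."""

    def __init__(self, directory, prefix="requests", max_bytes=ROTATE_BYTES):
        self.directory = directory
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.paths = []   # every file created so far, in order
        self._fh = None   # currently open file handle, if any
        self._size = 0    # bytes written to the current file

    def _open_next(self):
        name = f"{self.prefix}-{len(self.paths):05d}.jsonl"
        path = os.path.join(self.directory, name)
        self._fh = open(path, "wb")
        self._size = 0
        self.paths.append(path)

    def write(self, key, request):
        """Write one request line; `key` is the unique row ID."""
        record = {"key": key, "request": request}
        line = (json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8")
        # Rotate before writing if this line would push the file past the limit.
        # A single oversized line still gets a file of its own.
        too_big = self._size > 0 and self._size + len(line) > self.max_bytes
        if self._fh is None or too_big:
            self.close()
            self._open_next()
        self._fh.write(line)
        self._size += len(line)

    def close(self):
        if self._fh is not None:
            self._fh.close()
            self._fh = None
```

Rotating on encoded byte length rather than on row count keeps file sizes predictable even when some requests carry large encoded media. Each line carries the row's unique ID as `key`, so outputs can be joined back to their inputs later.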
{ "key": "XYZ", "request": { "contents": { "parts": { "text": "