Building a Serverless AI Model Evaluation Platform on AWS

This post describes a serverless AI model evaluation platform built on AWS for a media company that wanted to automatically compare podcast-style summaries generated by different foundation models. The system uses AWS Step Functions to orchestrate six Lambda functions that validate inputs, invoke multiple Bedrock models in parallel, score outputs with a separate AI judge, and produce an HTML comparison report, all triggered by a single API call. Key design choices include running the model invocations in parallel within a single Lambda to reduce cost and wall-clock time, and using a separate scoring model to avoid self-evaluation bias.

## The Problem

A media company needed to evaluate which AI model produces the best podcast-style summaries from news articles. They wanted to:

- Send an article to multiple AI models simultaneously
- Compare the outputs side by side
- Score each output automatically
- Generate a visual comparison report

Doing this manually (copying articles into different model playgrounds, reading outputs, judging quality) doesn't scale. They needed an automated evaluation pipeline that could run experiments on demand and produce consistent, comparable results.

## What We Built

A fully serverless evaluation platform on AWS that accepts an article, runs it through multiple foundation models in parallel, scores each output using a separate AI judge, and produces an HTML comparison report, all triggered by a single API call.

The system handles the entire lifecycle:

- Prompt optimization: an AI agent refines the user's instructions into an effective prompt
- Parallel model invocation: multiple Bedrock models generate summaries simultaneously
- Automated scoring: a scoring agent evaluates each output against quality criteria
- Report generation: produces a formatted HTML comparison page

## Architecture Overview

### The 6-Step Workflow

The core of the system is a Step Functions state machine that orchestrates six Lambda functions in sequence. Here's what each step does and why it exists as a separate step.

#### Step 1: Validate

```python
import json
import boto3

s3 = boto3.client("s3")
REQUIRED_FIELDS = ("article", "models", "prompt")

def validate(event):
    """Read and validate the experiment definition from S3."""
    experiment_id = event["experiment_id"]  # assumed shape of the incoming event
    obj = s3.get_object(
        Bucket=BUCKET,  # bucket name is configured elsewhere in the module
        Key=f"definitions/{experiment_id}/definition.json",
    )
    definition = json.loads(obj["Body"].read())
    # Fail fast if required fields are missing or empty
    missing = [field for field in REQUIRED_FIELDS if not definition.get(field)]
    if missing:
        raise ValueError(f"Malformed definition, missing: {missing}")
    return definition
```

**Why a separate step?** Fail-fast validation before incurring any Bedrock costs. If the definition is malformed, we stop here, with no wasted model invocations.

#### Step 2: Invoke Models (Parallel)

This is where it gets interesting.
We invoke multiple Bedrock models simultaneously using Python's `ThreadPoolExecutor`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def invoke_models(definition):
    models = definition['models']  # e.g., ["meta.llama3-70b", "deepseek-r1", "amazon.nova-lite"]
    prompt = definition['prompt']
    article = definition['article']

    results = {}
    with ThreadPoolExecutor(max_workers=len(models)) as executor:
        futures = {
            executor.submit(invoke_bedrock, model_id, prompt, article): model_id
            for model_id in models
        }
        for future in as_completed(futures):
            model_id = futures[future]
            response = future.result()
            results[model_id] = {
                "output": response['output']['message']['content'][0]['text'],
                "usage": {
                    "input_tokens": response['usage']['inputTokens'],
                    "output_tokens": response['usage']['outputTokens']
                }
            }
    return results
```

**Why ThreadPoolExecutor inside Lambda?** Bedrock API calls are I/O-bound. Running them in parallel within a single Lambda invocation means we pay for one Lambda execution instead of three, and the total wall-clock time is roughly equal to the slowest model rather than the sum of all models.

#### Step 3: Store Outputs

Writes `comparison.json` to S3, containing all model outputs but no scores yet. This creates a checkpoint: if scoring fails, we don't lose the generated content.

#### Step 4: Score (Parallel)

The scoring agent (Claude Haiku) evaluates each model's output against quality criteria. Again, parallel execution via `ThreadPoolExecutor`:

```python
def score_outputs(outputs):
    scoring_prompt = """Rate this podcast summary on:
    - Accuracy (1-10): Does it faithfully represent the article?
    - Engagement (1-10): Would a listener find this compelling?
    - Structure (1-10): Is it well-organized for audio?
    Respond with JSON only."""

    with ThreadPoolExecutor(max_workers=len(outputs)) as executor:
        futures = {
            executor.submit(invoke_bedrock, SCORING_MODEL, scoring_prompt, output): model_id
            for model_id, output in outputs.items()
        }
        # ... collect scores
```

**Why a separate scoring model?**
Using a different model (or at minimum, a separate invocation with a scoring-specific prompt) as the judge avoids self-evaluation bias. The scoring agent doesn't know which model produced which output.

#### Step 5: Store Scores

Updates `comparison.json` with the scores attached to each model's output.

#### Step 6: Generate HTML

Produces a formatted `comparison.html` report that displays all outputs side by side with their scores. This is the final deliverable the user downloads.

## Why Amazon Bedrock's Converse API?

We use the [Converse API](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html) rather than the model-specific InvokeModel API. The key advantage: one unified interface across all models.

```python
def invoke_bedrock(model_id, system_prompt, user_message):
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_message}]}],
        system=[{"text": system_prompt}]
    )
    return response
```

Switching from Llama to Claude to Nova Lite requires changing only the `model_id` string. No code changes, no different request formats, no response parsing differences.

The Converse API also returns token usage in every response, which we pass through to the caller for billing:

```
{
  "results": [
    {
      "model_id": "meta.llama3-70b-instruct-v1:0",
      "summary": "...",
      "usage": {
        "input_tokens": 1523,
        "output_tokens": 847
      }
    }
  ],
  "total_usage": {
    "total_input_tokens": 4569,
    "total_output_tokens": 2541
  }
}
```

## Cost Control: The Hardest Part

Here's the reality of building on top of foundation models: every API call costs money, and costs scale with input size. A single `/run` request invoking 3 models on a long article can cost $0.10–0.50. That sounds small until someone writes a script that calls it in a loop.

### Billing Alarms (Day 1)

We set up CloudWatch billing alarms immediately:

```
CloudWatch Alarm ($10 threshold) → SNS → Email notification
CloudWatch Alarm ($25 threshold) → SNS → Email notification
```

This is the bare minimum.
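The alarm chain above could be provisioned roughly as follows. This is a minimal sketch, not the project's actual setup: it assumes boto3, an existing SNS topic with an email subscription, and billing alerts enabled on the account. The function names, the alarm naming scheme, and the `topic_arn` argument are hypothetical. Billing metrics are published only in us-east-1, so the CloudWatch client targets that region even if the workload runs elsewhere.

```python
def billing_alarm_params(threshold_usd, topic_arn):
    """Build put_metric_alarm arguments for an estimated-charges alarm."""
    return {
        "AlarmName": f"billing-over-{threshold_usd}-usd",  # hypothetical naming scheme
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours; billing data only updates a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic that forwards to email
    }


def create_billing_alarms(topic_arn, thresholds=(10, 25)):
    """Create one alarm per threshold. Requires AWS credentials."""
    import boto3  # imported lazily so the builder above has no dependencies

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    for threshold in thresholds:
        cloudwatch.put_metric_alarm(**billing_alarm_params(threshold, topic_arn))
```

Keeping the parameter builder separate from the AWS call makes the thresholds easy to review and unit-test without touching an account.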
With alarms in place, you'll know when costs are climbing, even if you can't stop them automatically.

### API Security (Critical for Any AI-Backed API)

An unprotected API that invokes foundation models is essentially a public credit card. We learned this the hard way and now treat API security as P0, to be done before any external access:

- API keys on every endpoint (immediate protection)
- Usage plans with per-key quotas (500 requests/day, 5000/month)
- Rate limiting (10 req/s throttle) to prevent burst abuse
- Request logging to attribute usage to specific callers

```
# Every request must include the API key
curl -X POST https://api.example.com/run \
  -H "x-api-key: btk_live_abc123def456" \
  -H "Content-Type: application/json" \
  -d '{"article": "...", "models": ["meta.llama3-70b"]}'
```

Without this, anyone who discovers your API URL can generate unbounded Bedrock charges.

## Lessons Learned

### 1. Separate validation from execution

Bedrock calls are expensive. Validate everything before invoking any model. Check that the article isn't empty, the model IDs are valid, and the prompt isn't too long. Fail at Step 1, not Step 2.

### 2. ThreadPoolExecutor over separate Lambda invocations for parallel model calls

We considered using Step Functions' native parallel states or invoking separate Lambdas per model. `ThreadPoolExecutor` within a single Lambda turned out simpler:

- One Lambda execution to pay for (not N)
- Shared memory for the article text (no repeated S3 reads)
- Simpler error handling
- Total time ≈ slowest model, not sum of all

The tradeoff: if one model times out, the entire Lambda times out. We mitigate this with per-future timeouts.

### 3. Store intermediate results

Each step writes to S3 before the next step begins. If Step 4 (scoring) fails, we still have the model outputs from Step 3. We can retry scoring without re-invoking the content models.

### 4. Token usage is free metadata, so always capture it

Bedrock returns `inputTokens` and `outputTokens` in every response.
Capturing and returning this costs nothing but enables:

- Per-customer billing
- Cost forecasting
- Identifying expensive prompts
- Detecting anomalies (a sudden spike in token usage may indicate abuse)

### 5. Start with S3, add a database when you need queries

For the POC, S3 handles all storage. It's simple, cheap, and sufficient for sequential read/write patterns. We're adding DynamoDB only now that we need to query experiment history by user, something S3 can't do efficiently.

## What's Next

The platform is functional but evolving:

- Selection History: DynamoDB-backed experiment sessions so users can revisit past comparisons and track which model they ultimately chose
- Frontend UI: visual interface for running experiments and browsing history
- Cognito Authentication: user-level access control when the UI ships

## Tech Stack Summary

| Layer | Service | Why |
|---|---|---|
| API | API Gateway (HTTP API) | Low latency, pay-per-request |
| Compute | AWS Lambda (Python) | Serverless, scales to zero |
| Orchestration | Step Functions | Visual workflow, built-in retries |
| AI Models | Amazon Bedrock (Converse API) | Multi-model, unified interface |
| Storage | Amazon S3 | Cheap, durable, simple |
| Monitoring | CloudWatch + SNS | Billing alarms, email alerts |
| Auth (planned) | API Keys + Cognito | Layered security |
| History (planned) | DynamoDB | Fast queries by user/session |

## Reach Out to Us

Interested in modernizing your cloud infrastructure and building enterprise-grade solutions? Storm Reply is driven by continuous learning and practical innovation. We specialize in designing and delivering scalable AWS architectures that support customers throughout their cloud journey, from early assessment to production-ready deployment. With deep experience in AWS architecture, data engineering, and security best practices, we help enterprises migrate with confidence and move faster on their cloud transformation goals.
Let's connect and explore how we can support your modernization initiatives.

🌐 Website: https://www.stormreply.cloud/

💼 LinkedIn: https://www.linkedin.com/company/storm-reply/posts/?feedView=all

Date: May 2026

The full system runs in eu-central-1 (Frankfurt), costs under $20/month excluding Bedrock usage, and handles the entire evaluation lifecycle in a single API call. Serverless means we pay nothing when nobody's running experiments, and scale automatically when they are.

If you're building something similar (any system where API calls trigger expensive downstream operations), lock down your API first, validate inputs aggressively, and always know what each request costs.

Built with AWS Lambda, Step Functions, and Amazon Bedrock.
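To make "know what each request costs" concrete, here is a minimal sketch that totals the per-model token usage from a response shaped like the JSON example earlier in this post and, optionally, converts it to an estimated cost. The function name `summarize_usage` and the `prices_per_1k` argument are hypothetical, and any prices passed in are the caller's own figures, not actual Bedrock pricing.

```python
def summarize_usage(results, prices_per_1k=None):
    """Total token usage across model results and optionally estimate cost.

    results: list of dicts, each with "model_id" and
             "usage" {"input_tokens": int, "output_tokens": int}.
    prices_per_1k: optional {model_id: (input_price, output_price)} in USD
                   per 1,000 tokens, supplied by the caller.
    """
    summary = {
        "total_input_tokens": sum(r["usage"]["input_tokens"] for r in results),
        "total_output_tokens": sum(r["usage"]["output_tokens"] for r in results),
    }

    if prices_per_1k is not None:
        cost = 0.0
        for r in results:
            in_price, out_price = prices_per_1k[r["model_id"]]
            cost += r["usage"]["input_tokens"] / 1000 * in_price
            cost += r["usage"]["output_tokens"] / 1000 * out_price
        summary["estimated_cost_usd"] = round(cost, 6)
    return summary
```

Called on three results that each report 1523 input and 847 output tokens, this returns totals of 4569 and 2541, matching the `total_usage` block shown earlier. Logging the summary per API key is one way to spot the usage spikes mentioned in the lessons above.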