{"slug": "building-a-serverless-ai-model-evaluation-platform-on-aws", "title": "Building a Serverless AI Model Evaluation Platform on AWS", "summary": "Serverless AI model evaluation platform built on AWS by a media company to automatically compare podcast-style summaries generated by different foundation models. The system uses AWS Step Functions to orchestrate six Lambda functions that validate inputs, invoke multiple Bedrock models in parallel, score outputs with a separate AI judge, and produce an HTML comparison report, all triggered by a single API call. Key design choices include parallel model invocations within a single Lambda to reduce costs and wall-clock time, and using a separate scoring model to avoid self-evaluation bias.", "body_md": "## The Problem\n\nA media company needed to evaluate which AI model produces the best podcast-style summaries from news articles. They wanted to:\n\n- Send an article to multiple AI models simultaneously\n- Compare the outputs side by side\n- Score each output automatically\n- Generate a visual comparison report\n\nDoing this manually (copying articles into different model playgrounds, reading outputs, judging quality) doesn't scale. They needed an automated evaluation pipeline that could run experiments on demand and produce consistent, comparable results.\n\n## What We Built\n\nA fully serverless evaluation platform on AWS that accepts an article, runs it through multiple foundation models in parallel, scores each output using a separate AI judge, and produces an HTML comparison report. 
Everything is triggered by a single API call.\n\nThe system handles the entire lifecycle:\n\n- **Prompt optimization**: an AI agent refines the user's instructions into an effective prompt\n- **Parallel model invocation**: multiple Bedrock models generate summaries simultaneously\n- **Automated scoring**: a scoring agent evaluates each output against quality criteria\n- **Report generation**: produces a formatted HTML comparison page\n\n## Architecture Overview\n\n## The 6-Step Workflow\n\nThe core of the system is a Step Functions state machine that orchestrates six Lambda functions in sequence. Here's what each step does and why it exists as a separate step.\n\n### Step 1: Validate\n\n``` python\nimport json\n# `s3` (a boto3 S3 client) and BUCKET are assumed to be module-level globals.\ndef validate(event):\n    \"\"\"Read and validate the experiment definition from S3.\"\"\"\n    experiment_id = event[\"experiment_id\"]  # assumed key in the state machine input\n    obj = s3.get_object(Bucket=BUCKET, Key=f\"definitions/{experiment_id}/definition.json\")\n    definition = json.loads(obj[\"Body\"].read())\n    # Fail fast if a required field is missing or empty\n    missing = [field for field in (\"article\", \"models\", \"prompt\") if not definition.get(field)]\n    if missing:\n        raise ValueError(f\"Malformed definition, missing: {missing}\")\n    return definition\n```\n\nWhy a separate step? Fail-fast validation before incurring any Bedrock costs. If the definition is malformed, we stop here, with no wasted model invocations.\n\n### Step 2: Invoke Models (Parallel)\n\nThis is where it gets interesting. 
We invoke multiple Bedrock models simultaneously using Python's `ThreadPoolExecutor`:\n\n``` python\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\n\ndef invoke_models(definition):\n    models = definition['models']  # e.g., [\"meta.llama3-70b\", \"deepseek-r1\", \"amazon.nova-lite\"]\n    prompt = definition['prompt']\n    article = definition['article']\n\n    results = {}\n\n    with ThreadPoolExecutor(max_workers=len(models)) as executor:\n        futures = {\n            executor.submit(invoke_bedrock, model_id, prompt, article): model_id\n            for model_id in models\n        }\n        for future in as_completed(futures):\n            model_id = futures[future]\n            response = future.result()\n            results[model_id] = {\n                \"output\": response['output']['message']['content'][0]['text'],\n                \"usage\": {\n                    \"input_tokens\": response['usage']['inputTokens'],\n                    \"output_tokens\": response['usage']['outputTokens']\n                }\n            }\n\n    return results\n```\n\nWhy ThreadPoolExecutor inside Lambda? Bedrock API calls are I/O-bound. Running them in parallel within a single Lambda invocation means we pay for one Lambda execution instead of three, and the total wall-clock time is roughly equal to the slowest model rather than the sum of all models.\n\n### Step 3: Store Outputs\n\nWrites `comparison.json` to S3, containing all model outputs but no scores yet. This creates a checkpoint: if scoring fails, we don't lose the generated content.\n\n### Step 4: Score (Parallel)\n\nThe scoring agent (Claude Haiku) evaluates each model's output against quality criteria. 
Again, parallel execution via ThreadPoolExecutor:\n\n``` python\ndef score(outputs):\n    scoring_prompt = \"\"\"Rate this podcast summary on:\n    - Accuracy (1-10): Does it faithfully represent the article?\n    - Engagement (1-10): Would a listener find this compelling?\n    - Structure (1-10): Is it well-organized for audio?\n    Respond with JSON only.\"\"\"\n\n    with ThreadPoolExecutor(max_workers=len(outputs)) as executor:\n        futures = {\n            executor.submit(invoke_bedrock, SCORING_MODEL, scoring_prompt, output): model_id\n            for model_id, output in outputs.items()\n        }\n        # ... collect scores\n```\n\nWhy a separate scoring model? Using a different model (or at minimum, a separate invocation with a scoring-specific prompt) as the judge avoids self-evaluation bias. The scoring agent doesn't know which model produced which output.\n\n### Step 5: Store Scores\n\nUpdates `comparison.json` with the scores attached to each model's output.\n\n### Step 6: Generate HTML\n\nProduces a formatted `comparison.html` report that displays all outputs side by side with their scores. This is the final deliverable the user downloads.\n\n## Why Amazon Bedrock's Converse API?\n\nWe use the [Converse API](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html) rather than the model-specific `InvokeModel` API. The key advantage: **one unified interface across all models**.\n\n``` python\nimport boto3\n\n# Bedrock runtime client, created once per Lambda container\nbedrock_runtime = boto3.client(\"bedrock-runtime\")\n\ndef invoke_bedrock(model_id, system_prompt, user_message):\n    response = bedrock_runtime.converse(\n        modelId=model_id,\n        messages=[{\"role\": \"user\", \"content\": [{\"text\": user_message}]}],\n        system=[{\"text\": system_prompt}]\n    )\n    return response\n```\n\nSwitching from Llama to Claude to Nova Lite requires changing only the `model_id` string. 
No code changes, no different request formats, no response parsing differences.\n\nThe Converse API also returns token usage in every response, which we pass through to the caller for billing:\n\n```\n{\n  \"results\": [\n    {\n      \"model_id\": \"meta.llama3-70b-instruct-v1:0\",\n      \"summary\": \"...\",\n      \"usage\": { \"input_tokens\": 1523, \"output_tokens\": 847 }\n    }\n  ],\n  \"total_usage\": { \"total_input_tokens\": 4569, \"total_output_tokens\": 2541 }\n}\n```\n\n## Cost Control: The Hardest Part\n\nHere's the reality of building on top of foundation models: **every API call costs money**, and costs scale with input size. A single `/run` request invoking 3 models on a long article can cost $0.10–0.50. That sounds small until someone writes a script that calls it in a loop.\n\n### Billing Alarms (Day 1)\n\nWe set up CloudWatch billing alarms immediately:\n\n```\nCloudWatch Alarm ($10 threshold) → SNS → Email notification\nCloudWatch Alarm ($25 threshold) → SNS → Email notification\n```\n\nThis is the bare minimum. You'll know when costs are climbing, even if you can't stop them automatically.\n\n### API Security (Critical for Any AI-Backed API)\n\nAn unprotected API that invokes foundation models is essentially a public credit card. We learned this the hard way and now treat API security as P0, to be in place before any external access:\n\n- **API Keys** on every endpoint (immediate protection)\n- **Usage plans** with per-key quotas (500 requests/day, 5000/month)\n- **Rate limiting** (10 req/s throttle) to prevent burst abuse\n- **Request logging** to attribute usage to specific callers\n\n```\n# Every request must include the API key\ncurl -X POST https://api.example.com/run \\\n  -H \"x-api-key: btk_live_abc123def456\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"article\": \"...\", \"models\": [\"meta.llama3-70b\"]}'\n```\n\nWithout this, anyone who discovers your API URL can generate unbounded Bedrock charges.\n\n## Lessons Learned\n\n### 1. 
Separate validation from execution\n\nBedrock calls are expensive. Validate everything before invoking any model. Check that the article isn't empty, the model IDs are valid, the prompt isn't too long. Fail at Step 1, not Step 2.\n\n### 2. ThreadPoolExecutor > separate Lambda invocations for parallel model calls\n\nWe considered using Step Functions' native parallel states or invoking separate Lambdas per model. ThreadPoolExecutor within a single Lambda turned out simpler:\n\n- One Lambda execution to pay for (not N)\n- Shared memory for the article text (no repeated S3 reads)\n- Simpler error handling\n- Total time ≈ slowest model, not sum of all\n\nThe tradeoff: if one model times out, the entire Lambda times out. We mitigate this with per-future timeouts.\n\n### 3. Store intermediate results\n\nEach step writes to S3 before the next step begins. If Step 4 (scoring) fails, we still have the model outputs from Step 3. We can retry scoring without re-invoking the content models.\n\n### 4. Token usage is free metadata: always capture it\n\nBedrock returns `inputTokens` and `outputTokens` in every response. Capturing and returning this costs nothing but enables:\n\n- Per-customer billing\n- Cost forecasting\n- Identifying expensive prompts\n- Detecting anomalies (sudden spike in token usage = possible abuse)\n\n### 5. Start with S3, add a database when you need queries\n\nFor the POC, S3 handles all storage. It's simple, cheap, and sufficient for sequential read/write patterns. 
We're adding DynamoDB only now that we need to query experiment history by user, something S3 can't do efficiently.\n\n## What's Next\n\nThe platform is functional but evolving:\n\n- **Selection History**: DynamoDB-backed experiment sessions so users can revisit past comparisons and track which model they ultimately chose\n- **Frontend UI**: Visual interface for running experiments and browsing history\n- **Cognito Authentication**: User-level access control when the UI ships\n\n## Tech Stack Summary\n\n| Layer | Service | Why |\n|---|---|---|\n| API | API Gateway (HTTP API) | Low latency, pay-per-request |\n| Compute | AWS Lambda (Python) | Serverless, scales to zero |\n| Orchestration | Step Functions | Visual workflow, built-in retries |\n| AI Models | Amazon Bedrock (Converse API) | Multi-model, unified interface |\n| Storage | Amazon S3 | Cheap, durable, simple |\n| Monitoring | CloudWatch + SNS | Billing alarms, email alerts |\n| Auth (planned) | API Keys + Cognito | Layered security |\n| History (planned) | DynamoDB | Fast queries by user/session |\n\n## Reach Out to Us\n\nInterested in modernizing your cloud infrastructure and building enterprise-grade solutions? **Storm Reply** is driven by continuous learning and practical innovation. 
We specialize in designing and delivering scalable AWS architectures that support customers throughout their cloud journey, from early assessment to production-ready deployment.\n\nWith deep experience in AWS architecture, data engineering, and security best practices, we help enterprises migrate with confidence and move faster on their cloud transformation goals.\n\nLet’s connect and explore how we can support your modernization initiatives.\n\n🌐 **Website:** [https://www.stormreply.cloud/](https://www.stormreply.cloud/)\n\n💼 **LinkedIn:** [https://www.linkedin.com/company/storm-reply/posts/?feedView=all](https://www.linkedin.com/company/storm-reply/posts/?feedView=all)\n\n**Date:** May 2026\n\nThe full system runs in `eu-central-1` (Frankfurt), costs under $20/month excluding Bedrock usage, and handles the entire evaluation lifecycle in a single API call. Serverless means we pay nothing when nobody's running experiments, and scale automatically when they are.\n\nIf you're building something similar (any system where API calls trigger expensive downstream operations), lock down your API first, validate inputs aggressively, and always know what each request costs.\n\n*Built with AWS Lambda, Step Functions, and Amazon Bedrock.*", "url": "https://wpnews.pro/news/building-a-serverless-ai-model-evaluation-platform-on-aws", "canonical_source": "https://dev.to/debapriya_dey_aada54b7766/building-a-serverless-ai-model-evaluation-platform-on-aws-4d47", "published_at": "2026-05-22 07:23:38+00:00", "updated_at": "2026-05-22 07:32:03.074864+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "cloud-computing", "developer-tools"], "entities": ["AWS", "Step Functions", "Lambda", "Bedrock", "ThreadPoolExecutor"], "alternates": {"html": "https://wpnews.pro/news/building-a-serverless-ai-model-evaluation-platform-on-aws", "markdown": "https://wpnews.pro/news/building-a-serverless-ai-model-evaluation-platform-on-aws.md", "text": 
"https://wpnews.pro/news/building-a-serverless-ai-model-evaluation-platform-on-aws.txt", "jsonld": "https://wpnews.pro/news/building-a-serverless-ai-model-evaluation-platform-on-aws.jsonld"}}