Building an AI Model Evaluation Pipeline on AWS for Audio Content Generation

A European digital media publisher built a serverless evaluation pipeline on AWS to determine which foundation model on Amazon Bedrock produces the highest-quality podcast-style summaries from news articles. The pipeline uses an LLM-as-Judge approach with a strict scoring rubric to compare multiple models in parallel, generating actionable reports for both technical and editorial teams without requiring AWS console access. The proof of concept focuses on creating a repeatable, data-driven framework for model selection rather than a production audio generation system, addressing challenges like hallucination risk and variable output quality.

A European digital media publisher needed to determine which foundation model on Amazon Bedrock produces the highest-quality podcast-style summaries from news articles. Rather than selecting a model based on general benchmarks, they built a serverless evaluation pipeline on AWS that runs structured experiments — comparing multiple models in parallel, scoring outputs with an LLM-as-Judge approach, and delivering actionable insights to both technical and editorial teams. This post describes the business drivers, architectural approach, evaluation methodology, and outcomes of the proof of concept PoC , built entirely on AWS-native services. The customer is a digital media publisher experiencing declining engagement as user consumption shifts toward flexible, audio-first formats. Their strategic objective is to evolve from traditional text delivery into personalized, AI-driven audio experiences — such as user-specific podcast-style summaries generated from their existing article library. This initiative is expected to: The technical challenge: foundation models produce highly variable output quality depending on the model, prompt strategy, and content type. Selecting the wrong model risks hallucinated facts unacceptable for a news publisher , poor audio readability, or unsustainable cost per article. The customer needed a data-driven approach to model selection — not a one-off playground test, but a repeatable evaluation framework that could inform decisions across formats, topics, and evolving model capabilities. The PoC focused on building an evaluation and experimentation pipeline — not the production audio generation system itself. The goal was to enable structured, repeatable testing of multiple foundation models and prompt strategies for summarization and script generation. The evaluation pipeline is fully serverless, deployed via Terraform, and designed around the principle of experiment-as-configuration — each evaluation run is defined by a JSON document specifying models, prompts, inputs, and scoring criteria. The user submits an article and selects a scenario type interview, monologue, debate, short summary . A Prompt Agent powered by Claude Haiku generates an optimized instruction prompt tailored to the article topic and requested format. The user reviews and optionally edits the prompt before execution. This human-in-the-loop step prevents wasted model invocations on suboptimal prompts. AWS Step Functions triggers the evaluation workflow. The Invoker Lambda uses Python's ThreadPoolExecutor to invoke 2-5 Bedrock models simultaneously via the Converse API — a unified interface that eliminates provider-specific request/response handling. Results are written to S3 progressively as each model completes, enabling the frontend to display partial results without waiting for the slowest model. A separate Claude Haiku instance evaluates each model's output against the source text using a strict rubric. Five dimensions are scored on a 0-100 scale: The rubric is deliberately strict: scores of 96-100 are "almost never given," and most solid outputs land in the 61-80 range. This forces meaningful differentiation between models. The pipeline generates a self-contained HTML comparison report with: Editorial stakeholders can view reports via presigned S3 URLs without requiring AWS console access. The approval workflow saves the selected output for downstream use. The Converse API provides a unified interface across all foundation models on Bedrock. Adding a new model to the evaluation requires only a configuration change — no code modifications for request formatting or response parsing. This is critical for an evaluation platform where the set of models under test changes frequently. Traditional NLP metrics ROUGE, BLEU measure surface-level text similarity. They cannot evaluate: An LLM-as-Judge captures these subjective quality dimensions that matter most to editorial teams. The strict rubric ensures scoring consistency across experiments. The cost of running 100 evaluation experiments $15-50 in Bedrock usage is negligible compared to the cost of building a production system on the wrong model and discovering quality issues after launch. The evaluation pipeline de-risks the model selection decision and creates a reusable framework for ongoing optimization. The platform supports two complementary evaluation approaches: A separate foundation model evaluates outputs against source text and prompt instructions. Scoring features are configurable per experiment — teams can define custom dimensions relevant to their use case. The scoring agent receives: It returns a JSON object with integer scores per dimension, which are validated against the defined scale ranges. Where available, the platform also leverages Bedrock's built-in evaluation API for standardized metrics: These provide a baseline that complements the more nuanced LLM-as-Judge scores. An area under exploration is the ability to define and register custom evaluation metrics programmatically — for example, an "audio readability" metric that specifically penalizes text patterns that sound unnatural in text-to-speech synthesis. After running structured experiments across multiple articles, models, and prompt strategies: Model quality varies significantly by format. A model that excels at short-form summaries may produce awkward multi-speaker scripts. Format-specific evaluation is essential — there is no single "best model" across all use cases. Prompt engineering impact often exceeds model selection impact. The quality difference between a well-crafted prompt and a generic one frequently exceeds the difference between models. The Prompt Agent + human review loop captures this value early. Hallucination rates correlate with topic complexity. Simple event reporting is handled well by all tested models. Complex topics with nuance scientific findings, policy debates show significantly higher hallucination variance. Scoring consistency requires explicit rubric design. Without strict guidelines, the AI judge assigns uniformly high scores. The calibrated rubric forces differentiation that maps to real editorial quality differences. The serverless architecture means infrastructure costs are near-zero when the platform is idle. The primary cost driver is Bedrock model invocations — directly proportional to experiment volume and controllable via API rate limiting and usage quotas. This PoC is phase one of a broader initiative: The customer is evaluating AWS as the long-term platform to support end-to-end content generation workflows — from article ingestion through summarization, text-to-speech synthesis, and media distribution. The customer's next ambition is to automatically generate daily short-form video content — 30-60 second clips summarizing top stories in a format optimized for TikTok, Instagram Reels, and YouTube Shorts. This requires chaining multiple AI capabilities: The evaluation pipeline built in Phase 1 directly informs this: the model and prompt strategy selected for summarization quality will power the script generation layer of the video pipeline. The same LLM-as-Judge framework can evaluate script quality for the shorter, punchier format that short-form video demands. The customer is seeking a continued partnership with AWS to build this end-to-end workflow — from article ingestion through AI summarization, speech synthesis, and automated video publishing. This represents a multi-phase engagement with growing AWS service consumption across Bedrock, Polly, S3, MediaConvert, and potentially SageMaker for fine-tuning. Foundation model selection for production use cases requires more than benchmark comparisons. It requires evaluation on your actual content, with your actual quality criteria, at sufficient scale to surface meaningful differences. By building the evaluation infrastructure first — using Amazon Bedrock for model access, AWS Step Functions for orchestration, and an LLM-as-Judge approach for quality scoring — this customer established a repeatable, cost-effective framework for making model decisions with confidence. The serverless architecture ensures the platform costs nothing at rest, scales automatically during experimentation, and deploys in minutes via Terraform. The evaluation framework will continue to serve as new models become available on Amazon Bedrock and as the customer expands into new content formats. Funding this initiative accelerates experimentation on Amazon Bedrock, reduces decision risk around model selection, and establishes a reusable evaluation framework that supports not just the current summarization use case — but the entire roadmap from podcast generation through TikTok-style video automation. The customer's growing ambition maps directly to expanding AWS service adoption. Amazon Bedrock · AWS Step Functions · AWS Lambda · Amazon API Gateway · Amazon S3 · Amazon Cognito · Amazon CloudWatch · Amazon SNS · Terraform Infrastructure as Code This solution was built by Storm Reply as part of an AWS-funded proof of concept for AI-driven content generation.