{"slug": "build-a-test-suite-that-grows-with-your-agent-with-dataset-management-in-amazon", "title": "Build a test suite that grows with your agent with dataset management in Amazon Bedrock AgentCore", "summary": "Amazon Bedrock AgentCore now supports dataset management for agent evaluation, enabling developers to create versioned test fixtures with inputs, expected outputs, assertions, and tool sequences. The feature allows teams to capture production failures as permanent test cases, iterate on mutable drafts, and publish immutable numbered versions for consistent offline baselines. This brings the discipline of versioned test suites to agent evaluation, addressing the challenge of measuring non-deterministic agent behavior across both developer inner loops and CI/CD pipelines.", "body_md": "[Artificial Intelligence](https://aws.amazon.com/blogs/machine-learning/)\n\n# Build a test suite that grows with your agent with dataset management in Amazon Bedrock AgentCore\n\nAgent evaluation is most powerful when you combine fast-moving online signals with stable offline baselines. To understand whether your agent is truly improving over time, you need a fixed benchmark alongside your changing real-world traffic.\n\nManaging test cases for evaluation baselines as a dataset in Amazon Bedrock AgentCore brings the discipline of versioned test fixtures to agent evaluation. You can author scenarios with inputs, expected outputs, assertions, and tool sequences, then publish them as immutable numbered versions that don’t shift beneath a run. You can iterate freely on a mutable draft until you’re ready to lock a checkpoint. And when something breaks in production, that failure becomes a permanent test case that every future change gets evaluated against.\n\nIn this post, we walk through the full workflow with a financial market-intelligence agent. We capture failures from production traces, build a versioned dataset, run an evaluation, fix the agent, and confirm the improvement against the same locked inputs.\n\n## Why datasets matter\n\nAgents are non-deterministic by design. The same input can produce different outputs across runs, which makes a single evaluation result nearly meaningless. You can’t tell if a score moved because the agent changed or because the model sampled differently. Consistent measurement across stable inputs is the only way to know whether a change actually helped.\n\nBut stable inputs alone aren’t enough. A large language model (LLM) judge can tell you whether a response sounds helpful. It cannot tell you whether the stock price is accurate, whether the broker workflow ran in the right order, or whether personally identifiable information (PII) leaked between sessions. For those checks you need ground truth: the expected response, the required tool sequence, and the assertions that must hold regardless of how the response is phrased. Ground truth is what turns a subjective score into a verifiable measurement. Without it, you’re measuring the appearance of correctness, not correctness itself.\n\nVersioned datasets give you both. They hold the inputs still so scores are comparable across runs, and they carry the ground truth that makes those scores mean something. This matters most in the two places where agent evaluation actually happens.\n\nThe **inner loop** is the developer desk. You invoke the agent, read the scores, adjust a tool description, and invoke again. The cycle is minutes. The problem isn’t running evaluations at this stage, it’s that the test cases tend to be whatever was nearby: questions someone wrote last week or a session you happened to save. When a score improves you want to believe the fix worked. But without stable inputs underneath it, you can’t know if the agent got better or the questions got easier.\n\nThe **outer loop** is the CI/CD pipeline. Before a change ships, something needs to say it didn’t break anything. Most teams have this gate. What they often don’t have is a stable, versioned set of inputs with explicit assertions beneath it. This means the gate is testing whatever someone last pointed it at, with no ground truth to check against. A pipeline that passes a build because the questions changed isn’t catching regressions, it’s missing them.\n\nA versioned dataset closes that gap. The developer curates failures into the draft during the inner loop. In the outer loop, a published version of that draft becomes the gate. It’s immutable, ground truth intact, and tests the same scenarios it tested last sprint and the sprint before that. The score that told a developer the fix worked is the same score the pipeline uses to decide whether it ships.\n\n## Two types of test scenarios\n\nDatasets in Amazon Bedrock AgentCore support two schema types that serve these two loops differently.\n\n**Predefined scenarios** are backward-looking. You have defined the exact queries your user will send to your agent and you know what correct looks like: the expected response, the tool sequence, and the assertions that must hold. You write them down and the evaluator checks whether the agent met them. Once a failure is formalized as a predefined scenario, it stays in every future evaluation run. They belong in the outer loop gate because the pass and fail criteria are explicit, repeatable, and don’t depend on how the conversation went.\n\n**User simulation scenarios** are forward-looking. Instead of scripting turns, you describe a persona: who the actor is, what they want to achieve, and how they communicate. An LLM-backed actor drives a real multi-turn conversation with your agent until the goal is met or the turn limit is reached. You don’t script what the actor says. Coverage emerges from the interaction. For more information, see [User simulation](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/user-simulation.html) in the AgentCore User Guide.\n\nThis is different from anything in the standard evaluation toolkit. A predefined scenario tests whether your agent handles a specific input correctly. A simulated scenario tests whether your agent can satisfy a type of user across whatever path that user takes.\n\nThroughout this post we use a [Market Trends Agent](https://github.com/awslabs/agentcore-samples/tree/main/02-use-cases/market-trends-agent) as the running example. The agent serves investment brokers at financial institutions. A broker messages the agent with something like “I’m Sarah Chen from Morgan Stanley, focused on tech and clean energy — what’s happening with NVDA today?” The agent identifies the broker, stores their preferences in AgentCore Memory, retrieves the current NVIDIA stock price, and searches for relevant news across Bloomberg and Reuters. It then delivers a personalized briefing that connects the data to Sarah’s stated investment focus. When Sarah comes back the next day, the agent remembers her profile and tailors its response accordingly.\n\nFor a predefined scenario, you might have production traces of how a user interacted with your agent, and you can curate them for your evaluation dataset. An example of this for the Market Trends Agent looks like this:\n\nA simulated senior tech analyst might open with a broad question about NVIDIA Corporation Common Stock ([NVDA](https://www.nasdaq.com/market-activity/stocks/nvda)). She pushes back when the response feels thin, asks for a comparison to Advanced Micro Devices, Inc. Common Stock ([AMD](https://www.nasdaq.com/market-activity/stocks/amd)), and only signals completion when she has something citable for a client call. No one scripted those turns. The actor generated them from the profile. You can define this scenario as follows:\n\nUser simulation is particularly useful in the inner loop, when you’re not sure what failure modes you haven’t found yet. The failures that surface become candidates for predefined scenarios in the next dataset version, feeding directly into the outer loop gate.\n\nA few things are worth knowing about how simulation works under the hood. The actor runs on a Bedrock model you specify in `SimulationConfig`\n\n. At each turn, the actor receives the agent’s response and produces three things: its internal reasoning about whether the goal was met, the next message to send, and a stop signal. The conversation ends when the actor signals completion, when `max_turns`\n\nis reached, or when the actor produces no next message. Because the conversation path is dynamic, simulated scenarios don’t support `expected_trajectory`\n\nor per-turn `expected_response`\n\n. Use assertions for ground truth instead, and describe the outcome you expect regardless of how the conversation got there.\n\nFor the Market Trends Agent, a predefined scenario covers the price drift bug that already burned a broker. A simulated scenario covers the environmental, social, and governance (ESG) specialist who hasn’t surfaced a failure yet but represents a real user type the agent needs to handle well.\n\n## How datasets in AgentCore work\n\n[Datasets](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/datasets-manage.html) are built into AgentCore as a first-class resource with ARNs, IAM authorization, and tags. There are no Amazon Simple Storage Service (Amazon S3) buckets to provision or external services to configure.\n\n**Draft and publish.** Every dataset has one mutable draft where you add and remove scenarios freely. When you want a stable checkpoint, publish it. The draft becomes an immutable numbered version. Pin an evaluation run to Version 3, and it will use the exact scenarios that were in Version 3, regardless of what you’ve added to the draft since.**Schema validation at write time.** You declare a schema type when you create the dataset, and every scenario is validated against that schema before it’s accepted. Malformed examples are rejected at ingest rather than surfacing as errors halfway through a 30-minute evaluation run.**One dataset, multiple runners.** Load a dataset with`DatasetManagementServiceProvider`\n\nand pass it to either the on-demand runner for fast per-scenario feedback, or the batch runner for aggregate scoring across many sessions. The same scenarios, assertions, and dataset ID apply whether you’re iterating at your desk or gating a deployment.\n\n## The agent: Market Trends Assistant\n\nWe use the [Market Trends Agent](https://github.com/awslabs/agentcore-samples/tree/main/02-use-cases/market-trends-agent), a LangGraph application deployed on AgentCore Runtime, as the running example. The full source is available in the AgentCore samples repository under `02-use-cases/market-trends-agent`\n\n. The agent has the following tools that help it serve queries from financial brokers:\n\nTool |\nWhat it does |\n`get_stock_data` |\nCurrent price, daily change, volume for a ticker. |\n`search_news` |\nMulti-source financial news search (Bloomberg, Reuters, CNBC, WSJ, FT). |\n`identify_broker` |\nExtracts broker identity from the message for memory lookup. |\n`get_broker_financial_profile` |\nReads stored preferences, risk tolerance, sector focus. |\n`update_broker_financial_interests` |\nWrites new preferences to long-term memory. |\n\nThree failure modes come up often enough to warrant permanent test cases:\n\n**Stale prices**— the agent quotes a number that’s drifted more than 2% from the live value, usually because it reused a cached tool response rather than making a fresh call.**Skipped identity check**— the agent jumps straight to`get_broker_financial_profile`\n\nwithout calling`identify_broker`\n\nfirst, which can result in pulling the wrong broker’s preferences and delivering a response tailored to someone else’s portfolio.**PII bleed**— personally identifiable information from one broker’s profile leaks into a response to a different session, typically when session boundaries aren’t respected in the memory layer.\n\nEach of these failures is subtle enough to pass a manual spot check but serious enough to erode trust with the brokers who depend on the agent daily.\n\n## Implementation\n\nThis hands-on walkthrough takes about 30 minutes.\n\n### Prerequisites\n\nYou need the following:\n\n- An AWS account with permissions for AgentCore Runtime, Memory, Evaluations, and Amazon CloudWatch.\n- AWS Command Line Interface (AWS CLI) configured.\n- CloudWatch Transaction Search enabled (one-time account opt-in).\n- The samples repo cloned and the Market Trends Agent deployed (\n`uv run python deploy.py`\n\n).\n\nThe full sample is available in the `02-use-cases/market-trends-agent`\n\ndirectory of the [AgentCore samples repository](https://github.com/aws-samples/amazon-bedrock-agentcore-samples).\n\n### Walkthrough\n\n**Deploy the Market Trends Agent.** Run`uv run python deploy.py`\n\nto provision the AgentCore Runtime, Memory, IAM role, and ECR container. The agent ARN is written to`.agent_arn`\n\n.**Create and version evaluation datasets.** Run`uv run python optimization/manage_dataset.py --no-cleanup`\n\nto create two datasets and publish an immutable version of each.The**predefined dataset** includes five scripted test cases covering the agent’s core failure modes: broker onboarding, stock data retrieval, multi-turn profile followed by news, memory recall for a returning broker, and a PII safety check. The PII case uses a fabricated SSN in the user’s message that should never appear in the response.The**simulated dataset** includes three actor-profile scenarios. The first is a senior tech broker who needs a momentum briefing before a client call. The second is an ESG specialist reviewing portfolio alignment. The third is a dividend-focused investor screening for income opportunities. Each actor drives its own conversation without scripted turns.The script also demonstrates the day-to-day curation workflow: adding new examples, updating existing ones, and deleting stale cases before publishing.**Run evaluation against the versioned dataset.** Run`uv run python optimization/user_simulated_dataset.py`\n\nto load the simulated scenarios, invoke the agent against each one, and wait for spans to land in CloudWatch. The script then submits a batch evaluation with Correctness, Helpfulness, and GoalSuccessRate. Per-scenario scores and explanations print to the console.**Iterate: fix the agent and re-evaluate.** Update the tool description or system prompt based on evaluation explanations. Add the newly surfaced edge case to the draft with`add_examples_and_wait()`\n\n, publish a new version with`create_dataset_version_and_wait()`\n\n, and re-run. Because the scenarios and assertions are identical between runs, the before and after comparison isolates the effect of your change. A Correctness improvement on the price drift scenario now means something: the same input was tested both times. Alternately, you can perform this iteration of improving the agent using recommendations from AgentCore directly which uses a evaluator as a signal and provides suggestions on improving the system prompt and tool descriptions of an agent, as demonstrated by the[optimize_agent.py](https://github.com/awslabs/agentcore-samples/blob/main/02-use-cases/market-trends-agent/optimization/optimize_agent.py)in the Market Trends Agent sample.**View results.** Scores show in the AgentCore Observability console and in a dedicated CloudWatch log group. The explanation field tells you why a scenario passed or failed: “`identify_broker`\n\nwas never called before`get_broker_financial_profile`\n\n” or “agent response contained SSN pattern matching PII safety assertion.”\n\nAfter completing these steps, the dataset persists as a managed resource in your AWS account. Future evaluation jobs can reference the same dataset ID and version, whether triggered from a developer’s machine, a CI/CD pipeline, or a scheduled regression check.\n\n## Using the dataset across your workflow\n\nMode |\nUse case |\nWhat happens |\n| On-demand runner | Development, CI/CD gates | SDK invokes the agent, collects spans, scores immediately, returns per-scenario detail. |\n| Batch runner | Baselines, large-scale comparison | Service handles scoring asynchronously, writes aggregate results to CloudWatch. |\n\nUse the on-demand runner when you need immediate per-scenario feedback during development on smaller datasets when you manage the concurrency. Use the batch runner when measuring aggregate quality across a large dataset or comparing two agent versions at scale.\n\nFor deployment gates, pin the evaluation to a published version and fail the build if any evaluator drops below your threshold. Because the version is immutable, the gate tests the same scenarios on every PR regardless of what’s been added to the draft since.\n\nThe same versioned dataset also feeds into AgentCore Optimization. When you use the Recommendations API to generate improved system prompts or tool descriptions, the evaluation scores driving those decisions are grounded in your dataset. The same applies when you set up an A/B test to validate those improvements against live traffic. Stable inputs make the optimization loop trustworthy rather than a side effect of a shifting test set.\n\n## Practices worth adopting early\n\n**Ground your test suite in real incidents.** The scenarios that catch the most are the ones sourced from actual production failures. A broker received a stale price, a profile got crossed between sessions, or PII appeared where it shouldn’t. These target weaknesses your agent has already demonstrated in front of real users. Invented questions target imagined ones.\n\n**Use predefined for depth, simulated for breadth.** Predefined scenarios guard the bugs you’ve already found. Simulated scenarios surface the ones you haven’t. A healthy dataset includes both. For the Market Trends Agent, predefined scenarios cover the pricing drift and workflow-order bugs that already burned a user. Simulated scenarios exercise varied broker personas that push the agent in directions no one anticipated.\n\n**Publish a version before every change.** Versions are immutable and cost nothing to keep. When you’re debugging a score regression months later, you’ll want to know exactly which scenarios were in play at each checkpoint.\n\n**One dataset, many versions.** Resist creating a new dataset every sprint. The value is in continuity. The same dataset ID accumulates every failure your agent has ever encountered, and every future evaluation inherits that history. Publishing a new version means building on everything you’ve already learned. Creating a new dataset means starting from scratch.\n\n## Cleanup\n\nTo avoid incurring future charges, delete the dataset and its versions with `DatasetClient.delete_dataset()`\n\n. Remove the Market Trends Agent by following the cleanup section in the repository README.\n\n## Conclusion\n\nA test suite is only useful if it holds still. When the inputs change between runs, your scores measure drift in the test set rather than improvement in the agent.\n\nManaging datasets in AgentCore gives you versioned, schema-validated test cases as a managed resource. Production failures become permanent regression scenarios, simulated personas generate coverage you couldn’t script by hand, and immutable versions let you compare honestly across agent releases.\n\nThe pricing bug that a broker caught last quarter is a test case now. Every change to the system prompt, every tool description update, and every model swap gets evaluated against it. The suite accumulates institutional knowledge about how your agent has failed, and it holds every future version accountable to that history.\n\nTo get started, see the [Amazon Bedrock AgentCore documentation](https://docs.aws.amazon.com/bedrock-agentcore/) and the [Market Trends Agent sample](https://github.com/aws-samples/amazon-bedrock-agentcore-samples/tree/main/02-use-cases/market-trends-agent).", "url": "https://wpnews.pro/news/build-a-test-suite-that-grows-with-your-agent-with-dataset-management-in-amazon", "canonical_source": "https://aws.amazon.com/blogs/machine-learning/build-a-test-suite-that-grows-with-your-agent-with-dataset-management-in-amazon-bedrock-agentcore/", "published_at": "2026-05-28 18:10:47+00:00", "updated_at": "2026-05-28 18:32:12.220572+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-agents", "ai-tools", "ai-products"], "entities": ["Amazon Bedrock AgentCore", "Amazon Web Services"], "alternates": {"html": "https://wpnews.pro/news/build-a-test-suite-that-grows-with-your-agent-with-dataset-management-in-amazon", "markdown": "https://wpnews.pro/news/build-a-test-suite-that-grows-with-your-agent-with-dataset-management-in-amazon.md", "text": "https://wpnews.pro/news/build-a-test-suite-that-grows-with-your-agent-with-dataset-management-in-amazon.txt", "jsonld": "https://wpnews.pro/news/build-a-test-suite-that-grows-with-your-agent-with-dataset-management-in-amazon.jsonld"}}