Provides a blank Evaluation Brief template and a completed sample recommendation.

A developer has published an Evaluation Brief template and a completed sample for assessing AI tools in customer support workflows. The template structures decision-ready summaries for managers and reviewers, covering context, method, findings, recommendation, and re-test triggers. The sample evaluates Tool B for drafting Tier-1 customer support replies, scoring six test cases on policy accuracy, customer context, and safe tone.

The Evaluation Brief is the final decision artefact. It should be short enough for a manager, product owner, legal reviewer, or security reviewer to understand quickly. The brief is not a full research report. It is a decision-ready summary of the workflow, method, evidence, risks, and recommendation.

Evaluation Brief: Tool name for workflow

1. Context and frame

Workflow: What workflow is being evaluated?
Baseline: What happens now without the AI tool?
Decision question: Should this tool be used for this workflow, with what level of human oversight?
Users and affected people: Operators, owners, end users, bystanders
Risk level: Low / medium / high, with one sentence explaining why
Data boundary: What data can and cannot enter the tool?

---

2. Method

Tool/version/settings: Tool details, version if known, relevant settings
Test date: Date
Cases used: Number and type of cases: typical, edge, misleading/adversarial
Criteria: Criteria used for scoring
Scoring method: For example, 0 to 2 scale with evidence notes
Reviewers: Who reviewed the outputs and what expertise they had
Baseline comparison: How the tool was compared with the current workflow

---

3. Findings

Summary of results: Short summary of performance by criterion

Strengths:
- Strength 1
- Strength 2
- Strength 3

Failure modes:
- Failure mode 1
- Failure mode 2
- Failure mode 3

Evidence notes:
- Brief evidence note with case reference
- Brief evidence note with case reference
- Brief evidence note with case reference

Reviewer disagreement: Where reviewers disagreed and what that means

---

4. Recommendation

Verdict: Adopt / restrict / do not adopt
Reason: Why this verdict follows from the evidence

Conditions or mitigations:
- Condition 1
- Condition 2
- Condition 3

Escalation path: When a human, senior reviewer, SME, legal, privacy, or security role must be involved

---

5. Re-test triggers

Re-test if:
- Trigger 1
- Trigger 2
- Trigger 3

Review cadence: For example, 90 days for pilot, 6 months for lower-risk ongoing review

---

Flags

Privacy: None / concern / action required
Security: None / concern / action required
Legal or regulatory: None / concern / action required
Data handling: None / concern / action required
Adoption conditions: What must remain true for the verdict to hold

Evaluation Brief: Tool B for Tier-1 customer support reply drafting

1. Context and frame

Workflow: Tier-1 customer support reply drafting for recurring refund, cancellation, duplicate charge, and technical access tickets.
Baseline: Agents manually draft replies using the policy page and account notes. Senior support lead reviews complex cases.
Decision question: Should Tool B support Tier-1 customer reply drafting, with agent review, in this workflow?
Users and affected people: Support agents use the tool. The support lead owns quality and risk. Customers receive the final reply. Bystanders may be affected if their data appears in the ticket.
Risk level: Medium. The workflow is customer-facing and policy-sensitive, but the tool only drafts replies and a human agent remains responsible for review and sending.
Data boundary: Only minimum necessary ticket context may be entered. No payment card details, full customer IDs, full addresses, unnecessary internal notes, or third-party personal details may be entered. Tool use must stay inside approved systems.

---

2. Method

Tool/version/settings: Tool B, approved drafting mode, no automatic sending, agent review required.
Test date: Example date: 19 May 2026.
Cases used: Six fictional but realistic cases: four typical, one edge, and one misleading/adversarial case.

Criteria:
- Policy accuracy
- Customer context
- Safe tone

Scoring method: 0 to 2 scale for each criterion, with evidence notes. Constraint checks were used for data boundary, escalation, and review effort.
Reviewers: Two reviewers compared Tool B outputs against the same scorecard. Reviewer 1 represented support workflow knowledge. Reviewer 2 represented policy and quality review.
Baseline comparison: Tool B was compared with the current manual drafting process, focusing on whether it could reduce drafting effort without reducing policy accuracy, safety, or review quality.

---

3. Findings

Summary of results: Tool B performed strongly across the small case set. It applied policy correctly, used customer context carefully, and maintained a cautious tone where escalation or verification was needed.

Strengths:
- Correctly handled standard refund and subscription policy.
- Did not confirm duplicate-charge refunds before billing verification.
- Offered service-credit pathway for verified technical outage.
- Separated multiple issues in the messy ticket.
- Refused to follow misleading instructions to ignore policy or reveal internal process information.
- Preserved the human review boundary.

Failure modes:
- Some drafts were less polished than Tool A.
- Some wording may need editing to sound warmer or more customer-friendly.
- More test cases would be needed before expanding beyond Tier-1.

Evidence notes:
- In Case T3, Tool B escalated duplicate charge verification instead of promising a refund.
- In Case E1, Tool B separated course refund, subscription cancellation, and third-party account concerns.
- In Case A1, Tool B did not reveal internal reason codes and directed the exception request to senior review.

Reviewer disagreement: Minor disagreement focused on tone polish. Reviewers agreed that Tool B was safer than Tool A for policy-sensitive customer replies.

---

4. Recommendation

Verdict: Adopt Tool B for Tier-1 customer reply drafting under stated constraints.
Reason: The evidence shows Tool B is more suitable for this workflow because it applies policy consistently, handles misleading inputs safely, and keeps the support agent in control of the final reply.

Conditions or mitigations:
- Use only for Tier-1 recurring support tickets.
- Agent must review and edit before sending.
- No automatic sending.
- Current policy must be checked for refund and billing cases.
- Exceptions, complaints, hardship cases, and legal threats must be escalated.
- Data boundary must be followed.
- Tool settings must remain locked during the pilot.

Escalation path: Escalate to senior support lead when the customer requests an exception, raises hardship, threatens legal or public complaint, involves another person’s data, or asks to bypass normal process.

---

5. Re-test triggers

Re-test if:
- Tool B updates its model, settings, or drafting behaviour.
- The refund, billing, or technical outage policy changes.
- The workflow expands beyond Tier-1.
- The tool is connected to live ticketing or sending systems.
- A privacy, billing, or complaint incident occurs.
- Agents report that review effort is higher than expected.
- The 90-day pilot review date is reached.

Review cadence: Review after 90 days for the pilot. Consider a 6-month cadence only after the workflow is stable and risk remains controlled.

---

Flags

Privacy: Concern, controlled by data minimisation and approved-tool boundary.
Security: No major issue identified in this small walkthrough, assuming approved system use.
Legal or regulatory: No legal advice use. Escalate legal threats or regulated issues.
Data handling: Action required. Agents must not enter payment card details, full customer IDs, full addresses, unnecessary internal notes, or third-party personal details.
Adoption conditions: The verdict only holds for Tier-1 draft support with human review, approved data handling, current policy checking, and escalation controls.
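
The scoring method in the sample (a 0 to 2 scale for each of the three criteria, with an evidence note, applied by two reviewers to the same scorecard) could be recorded along the following lines. This is a minimal sketch, not the tooling behind the sample: the names `CaseScore`, `summarise`, and `disagreements` are hypothetical, and the scores in `example_rows` are placeholder values chosen for illustration, not the sample's actual results.

```python
from dataclasses import dataclass
from statistics import mean

# Criteria and scale taken from the sample brief; the identifiers are assumptions.
CRITERIA = ("policy_accuracy", "customer_context", "safe_tone")
VALID_SCORES = {0, 1, 2}


@dataclass(frozen=True)
class CaseScore:
    """One reviewer's scorecard row for one test case (hypothetical structure)."""
    case_id: str
    reviewer: str
    scores: dict          # criterion name -> 0, 1, or 2
    evidence_note: str

    def __post_init__(self):
        # Every criterion must be scored, and only on the 0 to 2 scale.
        missing = set(CRITERIA) - set(self.scores)
        if missing:
            raise ValueError(f"missing criteria: {sorted(missing)}")
        for criterion in CRITERIA:
            if self.scores[criterion] not in VALID_SCORES:
                raise ValueError(f"{criterion} must be 0, 1, or 2")
        # The brief asks for an evidence note alongside each score.
        if not self.evidence_note.strip():
            raise ValueError("evidence note is required")


def summarise(rows):
    """Return the mean score per criterion across all rows."""
    return {c: mean(r.scores[c] for r in rows) for c in CRITERIA}


def disagreements(rows):
    """Return (case_id, criterion) pairs where reviewers gave different scores."""
    by_case = {}
    for row in rows:
        by_case.setdefault(row.case_id, []).append(row)
    found = []
    for case_id, case_rows in by_case.items():
        for criterion in CRITERIA:
            if len({r.scores[criterion] for r in case_rows}) > 1:
                found.append((case_id, criterion))
    return found


# Placeholder data for illustration only; not the scores from the sample brief.
example_rows = [
    CaseScore("case-1", "reviewer-1",
              {"policy_accuracy": 2, "customer_context": 2, "safe_tone": 1},
              "Placeholder evidence note."),
    CaseScore("case-1", "reviewer-2",
              {"policy_accuracy": 2, "customer_context": 2, "safe_tone": 2},
              "Placeholder evidence note."),
]

if __name__ == "__main__":
    print(summarise(example_rows))
    print(disagreements(example_rows))
```

Listing disagreements per case and criterion keeps the "Reviewer disagreement" field tied to specific evidence rather than a general impression. Per-criterion means are only a summary aid for a case set this small, so the evidence notes remain the primary record.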