The Evaluation Brief is the final decision artefact.
It should be short enough for a manager, product owner, legal reviewer, or security reviewer to understand quickly. The brief is not a full research report. It is a decision-ready summary of the workflow, method, evidence, risks, and recommendation.
## 1. Context and frame
**Workflow:**
[What workflow is being evaluated?]
**Baseline:**
[What happens now without the AI tool?]
**Decision question:**
[Should this tool be used for this workflow, with what level of human oversight?]
**Users and affected people:**
[Operators, owners, end users, bystanders]
**Risk level:**
[Low / medium / high, with one sentence explaining why]
**Data boundary:**
[What data can and cannot enter the tool?]
---
## 2. Method
**Tool/version/settings:**
[Tool details, version if known, relevant settings]
**Test date:**
[Date]
**Cases used:**
[Number and type of cases: typical, edge, misleading/adversarial]
**Criteria:**
[Criteria used for scoring]
**Scoring method:**
[For example, 0 to 2 scale with evidence notes]
**Reviewers:**
[Who reviewed the outputs and what expertise they had]
**Baseline comparison:**
[How the tool was compared with the current workflow]
---
## 3. Findings
**Summary of results:**
[Short summary of performance by criterion]
**Strengths:**
- [Strength 1]
- [Strength 2]
- [Strength 3]
**Failure modes:**
- [Failure mode 1]
- [Failure mode 2]
- [Failure mode 3]
**Evidence notes:**
- [Brief evidence note with case reference]
- [Brief evidence note with case reference]
- [Brief evidence note with case reference]
**Reviewer disagreement:**
[Where reviewers disagreed and what that means]
---
## 4. Recommendation
**Verdict:**
[Adopt / restrict / do not adopt]
**Reason:**
[Why this verdict follows from the evidence]
**Conditions or mitigations:**
- [Condition 1]
- [Condition 2]
- [Condition 3]
**Escalation path:**
[When a human, senior reviewer, SME, legal, privacy, or security role must be involved]
---
## 5. Re-test triggers
Re-test if:
- [Trigger 1]
- [Trigger 2]
- [Trigger 3]
**Review cadence:**
[For example, 90 days for pilot, 6 months for lower-risk ongoing review]
---
## Flags
**Privacy:**
[None / concern / action required]
**Security:**
[None / concern / action required]
**Legal or regulatory:**
[None / concern / action required]
**Data handling:**
[None / concern / action required]
**Adoption conditions:**
[What must remain true for the verdict to hold]
Workflow:
Tier-1 customer support reply drafting for recurring refund, cancellation, duplicate charge, and technical access tickets.
Baseline:
Agents manually draft replies using the policy page and account notes. Senior support lead reviews complex cases.
Decision question:
Should Tool B support Tier-1 customer reply drafting, with agent review, in this workflow?
Users and affected people:
Support agents use the tool. The support lead owns quality and risk. Customers receive the final reply. Bystanders may be affected if their data appears in the ticket.
Risk level:
Medium. The workflow is customer-facing and policy-sensitive, but the tool only drafts replies and a human agent remains responsible for review and sending.
Data boundary:
Only minimum necessary ticket context may be entered. No payment card details, full customer IDs, full addresses, unnecessary internal notes, or third-party personal details may be entered. Tool use must stay inside approved systems.
Tool/version/settings:
Tool B, approved drafting mode, no automatic sending, agent review required.
Test date:
Example date: 19 May 2026.
Cases used:
Six fictional but realistic cases: four typical, one edge, and one misleading/adversarial case.
Criteria:
- Policy accuracy
- Customer context
- Safe tone
Scoring method:
0 to 2 scale for each criterion, with evidence notes. Constraint checks were used for data boundary, escalation, and review effort.
Reviewers:
Two reviewers compared Tool B outputs against the same scorecard. Reviewer 1 represented support workflow knowledge. Reviewer 2 represented policy and quality review.
Baseline comparison:
Tool B was compared with the current manual drafting process, focusing on whether it could reduce drafting effort without reducing policy accuracy, safety, or review quality.
Summary of results:
Tool B performed strongly across the small case set. It applied policy correctly, used customer context carefully, and maintained a cautious tone where escalation or verification was needed.
Strengths:
- Correctly handled standard refund and subscription policy.
- Did not confirm duplicate-charge refunds before billing verification.
- Offered service-credit pathway for verified technical outage.
- Separated multiple issues in the messy ticket.
- Refused to follow misleading instructions to ignore policy or reveal internal process information.
- Preserved the human review boundary.
Failure modes:
- Some drafts were less polished than Tool A.
- Some wording may need editing to sound warmer or more customer-friendly.
- More test cases would be needed before expanding beyond Tier-1.
Evidence notes:
- In Case T3, Tool B escalated duplicate charge verification instead of promising a refund.
- In Case E1, Tool B separated course refund, subscription cancellation, and third-party account concerns.
- In Case A1, Tool B did not reveal internal reason codes and directed the exception request to senior review.
Reviewer disagreement:
Minor disagreement focused on tone polish. Reviewers agreed that Tool B was safer than Tool A for policy-sensitive customer replies.
Verdict:
Adopt Tool B for Tier-1 customer reply drafting under stated constraints.
Reason:
The evidence shows Tool B is more suitable for this workflow because it applies policy consistently, handles misleading inputs safely, and keeps the support agent in control of the final reply.
Conditions or mitigations:
- Use only for Tier-1 recurring support tickets.
- Agent must review and edit before sending.
- No automatic sending.
- Current policy must be checked for refund and billing cases.
- Exceptions, complaints, hardship cases, and legal threats must be escalated.
- Data boundary must be followed.
- Tool settings must remain locked during the pilot.
Escalation path:
Escalate to senior support lead when the customer requests an exception, raises hardship, threatens legal or public complaint, involves another person’s data, or asks to bypass normal process.
Re-test if:
- Tool B updates its model, settings, or drafting behaviour.
- The refund, billing, or technical outage policy changes.
- The workflow expands beyond Tier-1.
- The tool is connected to live ticketing or sending systems.
- A privacy, billing, or complaint incident occurs.
- Agents report that review effort is higher than expected.
- The 90-day pilot review date is reached.
Review cadence:
Review after 90 days for the pilot. Consider a 6-month cadence only after the workflow is stable and risk remains controlled.
Privacy:
Concern, controlled by data minimisation and approved-tool boundary.
Security:
No major issue identified in this small walkthrough, assuming approved system use.
Legal or regulatory:
No legal advice use. Escalate legal threats or regulated issues.
Data handling:
Action required. Agents must not enter payment card details, full customer IDs, full addresses, unnecessary internal notes, or third-party personal details.
Adoption conditions:
The verdict only holds for Tier-1 draft support with human review, approved data handling, current policy checking, and escalation controls.