# Provides a blank Evaluation Brief template and a completed sample recommendation.

> Source: <https://gist.github.com/sebinbenjamin/dc519db975224db854de1d308589c9c0>
> Published: 2026-05-19 03:02:16+00:00

The Evaluation Brief is the final decision artefact.

It should be short enough for a manager, product owner, legal reviewer, or security reviewer to understand quickly. The brief is not a full research report. It is a decision-ready summary of the workflow, method, evidence, risks, and recommendation.

```
# Evaluation Brief: [Tool name] for [workflow]

## 1. Context and frame

**Workflow:**  
[What workflow is being evaluated?]

**Baseline:**  
[What happens now without the AI tool?]

**Decision question:**  
[Should this tool be used for this workflow, with what level of human oversight?]

**Users and affected people:**  
[Operators, owners, end users, bystanders]

**Risk level:**  
[Low / medium / high, with one sentence explaining why]

**Data boundary:**  
[What data can and cannot enter the tool?]

---

## 2. Method

**Tool/version/settings:**  
[Tool details, version if known, relevant settings]

**Test date:**  
[Date]

**Cases used:**  
[Number and type of cases: typical, edge, misleading/adversarial]

**Criteria:**  
[Criteria used for scoring]

**Scoring method:**  
[For example, 0 to 2 scale with evidence notes]

**Reviewers:**  
[Who reviewed the outputs and what expertise they had]

**Baseline comparison:**  
[How the tool was compared with the current workflow]

---

## 3. Findings

**Summary of results:**  
[Short summary of performance by criterion]

**Strengths:**  
- [Strength 1]
- [Strength 2]
- [Strength 3]

**Failure modes:**  
- [Failure mode 1]
- [Failure mode 2]
- [Failure mode 3]

**Evidence notes:**  
- [Brief evidence note with case reference]
- [Brief evidence note with case reference]
- [Brief evidence note with case reference]

**Reviewer disagreement:**  
[Where reviewers disagreed and what that means]

---

## 4. Recommendation

**Verdict:**  
[Adopt / restrict / do not adopt]

**Reason:**  
[Why this verdict follows from the evidence]

**Conditions or mitigations:**  
- [Condition 1]
- [Condition 2]
- [Condition 3]

**Escalation path:**  
[When a human, senior reviewer, SME, legal, privacy, or security role must be involved]

---

## 5. Re-test triggers

Re-test if:

- [Trigger 1]
- [Trigger 2]
- [Trigger 3]

**Review cadence:**  
[For example, 90 days for pilot, 6 months for lower-risk ongoing review]

---

## Flags

**Privacy:**  
[None / concern / action required]

**Security:**  
[None / concern / action required]

**Legal or regulatory:**  
[None / concern / action required]

**Data handling:**  
[None / concern / action required]

**Adoption conditions:**  
[What must remain true for the verdict to hold]
```

**Workflow:**

Tier-1 customer support reply drafting for recurring refund, cancellation, duplicate charge, and technical access tickets.

**Baseline:**

Agents manually draft replies using the policy page and account notes. Senior support lead reviews complex cases.

**Decision question:**

Should Tool B support Tier-1 customer reply drafting, with agent review, in this workflow?

**Users and affected people:**

Support agents use the tool. The support lead owns quality and risk. Customers receive the final reply. Bystanders may be affected if their data appears in the ticket.

**Risk level:**

Medium. The workflow is customer-facing and policy-sensitive, but the tool only drafts replies and a human agent remains responsible for review and sending.

**Data boundary:**

Only minimum necessary ticket context may be entered. No payment card details, full customer IDs, full addresses, unnecessary internal notes, or third-party personal details may be entered. Tool use must stay inside approved systems.

**Tool/version/settings:**

Tool B, approved drafting mode, no automatic sending, agent review required.

**Test date:**

Example date: 19 May 2026.

**Cases used:**

Six fictional but realistic cases: four typical, one edge, and one misleading/adversarial case.

**Criteria:**

- Policy accuracy
- Customer context
- Safe tone

**Scoring method:**

0 to 2 scale for each criterion, with evidence notes. Constraint checks were used for data boundary, escalation, and review effort.

**Reviewers:**

Two reviewers compared Tool B outputs against the same scorecard. Reviewer 1 represented support workflow knowledge. Reviewer 2 represented policy and quality review.

**Baseline comparison:**

Tool B was compared with the current manual drafting process, focusing on whether it could reduce drafting effort without reducing policy accuracy, safety, or review quality.

**Summary of results:**

Tool B performed strongly across the small case set. It applied policy correctly, used customer context carefully, and maintained a cautious tone where escalation or verification was needed.

**Strengths:**

- Correctly handled standard refund and subscription policy.
- Did not confirm duplicate-charge refunds before billing verification.
- Offered service-credit pathway for verified technical outage.
- Separated multiple issues in the messy ticket.
- Refused to follow misleading instructions to ignore policy or reveal internal process information.
- Preserved the human review boundary.

**Failure modes:**

- Some drafts were less polished than Tool A.
- Some wording may need editing to sound warmer or more customer-friendly.
- More test cases would be needed before expanding beyond Tier-1.

**Evidence notes:**

- In Case T3, Tool B escalated duplicate charge verification instead of promising a refund.
- In Case E1, Tool B separated course refund, subscription cancellation, and third-party account concerns.
- In Case A1, Tool B did not reveal internal reason codes and directed the exception request to senior review.

**Reviewer disagreement:**

Minor disagreement focused on tone polish. Reviewers agreed that Tool B was safer than Tool A for policy-sensitive customer replies.

**Verdict:**

Adopt Tool B for Tier-1 customer reply drafting under stated constraints.

**Reason:**

The evidence shows Tool B is more suitable for this workflow because it applies policy consistently, handles misleading inputs safely, and keeps the support agent in control of the final reply.

**Conditions or mitigations:**

- Use only for Tier-1 recurring support tickets.
- Agent must review and edit before sending.
- No automatic sending.
- Current policy must be checked for refund and billing cases.
- Exceptions, complaints, hardship cases, and legal threats must be escalated.
- Data boundary must be followed.
- Tool settings must remain locked during the pilot.

**Escalation path:**

Escalate to senior support lead when the customer requests an exception, raises hardship, threatens legal or public complaint, involves another person’s data, or asks to bypass normal process.

Re-test if:

- Tool B updates its model, settings, or drafting behaviour.
- The refund, billing, or technical outage policy changes.
- The workflow expands beyond Tier-1.
- The tool is connected to live ticketing or sending systems.
- A privacy, billing, or complaint incident occurs.
- Agents report that review effort is higher than expected.
- The 90-day pilot review date is reached.

**Review cadence:**

Review after 90 days for the pilot. Consider a 6-month cadence only after the workflow is stable and risk remains controlled.

**Privacy:**

Concern, controlled by data minimisation and approved-tool boundary.

**Security:**

No major issue identified in this small walkthrough, assuming approved system use.

**Legal or regulatory:**

No legal advice use. Escalate legal threats or regulated issues.

**Data handling:**

Action required. Agents must not enter payment card details, full customer IDs, full addresses, unnecessary internal notes, or third-party personal details.

**Adoption conditions:**

The verdict only holds for Tier-1 draft support with human review, approved data handling, current policy checking, and escalation controls.
