# Building a Multi-Modal Evidence Review Agent for Damage Claims

> Source: <https://dev.to/arul_cornelious/building-a-multi-modal-evidence-review-agent-for-damage-claims-2nc6>
> Published: 2026-06-30 16:20:48+00:00

GitHub: `Arul1998/hackerrank-orchestrate-solution`

Insurance and warranty claims appear straightforward: customers describe the issue and upload photos. In reality, evidence is often incomplete, contradictory, or even intentionally misleading. Building an AI system that produces consistent, explainable decisions requires reasoning across text, images, and historical context — not simply running a vision model.

I built this for the **HackerRank Orchestrate** June 2026 challenge — a 24-hour hackathon to design a system that verifies damage claims across **cars**, **laptops**, and **packages**.

The complete source code, prompts, evaluation scripts, and report are available on GitHub:

🔗 [https://github.com/Arul1998/hackerrank-orchestrate-solution](https://github.com/Arul1998/hackerrank-orchestrate-solution)

Built with **Python, OpenAI GPT-4o, GPT-4o-mini, structured prompting, and CSV-based orchestration**.

In practice, automated claim review is messy.

Structured outputs are easier to validate, audit, integrate into downstream systems, and compare against human review. That is why the challenge requires a fixed CSV schema with fields like `claim_status`, `risk_flags`, `severity`, and image-grounded justifications.

The system reads `claims.csv`, inspects local images, and produces `output.csv`, with one structured decision per claim.
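A minimal sketch of that read-review-write loop, using only the standard library. The `claim_id` key column, the `run_batch` name, and the `review_claim` callback are assumptions for illustration; the real column layout is defined by the challenge files in the repository.

```python
import csv
from pathlib import Path
from typing import Callable, Dict

# Output columns taken from the article's field list; `claim_id` is an assumed key column.
OUTPUT_FIELDS = [
    "claim_id",
    "evidence_standard_met",
    "claim_status",
    "issue_type",
    "object_part",
    "risk_flags",
    "supporting_image_ids",
    "severity",
]


def run_batch(
    claims_csv: Path,
    output_csv: Path,
    review_claim: Callable[[Dict[str, str]], Dict[str, str]],
) -> int:
    """Read every claim row, review it, and write exactly one decision row per claim."""
    with claims_csv.open(newline="", encoding="utf-8") as src, \
            output_csv.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=OUTPUT_FIELDS)
        writer.writeheader()
        count = 0
        for row in reader:
            decision = review_claim(row)
            # Fill any field the reviewer omitted so the schema stays fixed.
            writer.writerow({name: decision.get(name, "") for name in OUTPUT_FIELDS})
            count += 1
    return count
```

Passing the reviewer in as a callback keeps the CSV handling testable without any model calls.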

For every claim row, the agent outputs:

| Field | Meaning |
|---|---|
| `evidence_standard_met` | Are the images sufficient to evaluate the claim? |
| `claim_status` | `supported`, `contradicted`, or `not_enough_information` |
| `issue_type` / `object_part` | What damage is visible, and where? |
| `risk_flags` | Quality, mismatch, manipulation, or history risks |
| `supporting_image_ids` | Which images actually back the decision |
| `severity` | `none` → `high` |
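Because the schema is fixed, each row can be checked mechanically before it is written. The sketch below validates only what the table states: the three `claim_status` values, and that cited images belong to the claim. The `validate_decision` name, the `;` separator for image IDs, and the rule that a supported claim must cite at least one image are assumptions, not the challenge's specification.

```python
from typing import Dict, List, Set

# The three statuses listed in the field table.
ALLOWED_STATUSES = {"supported", "contradicted", "not_enough_information"}


def validate_decision(
    decision: Dict[str, str],
    available_image_ids: Set[str],
    id_separator: str = ";",  # assumed encoding of multiple IDs in one CSV cell
) -> List[str]:
    """Return a list of problems; an empty list means the row passed."""
    problems: List[str] = []

    status = decision.get("claim_status", "")
    if status not in ALLOWED_STATUSES:
        problems.append(f"unknown claim_status: {status!r}")

    raw_ids = decision.get("supporting_image_ids", "")
    cited = [part.strip() for part in raw_ids.split(id_separator) if part.strip()]
    for image_id in cited:
        if image_id not in available_image_ids:
            problems.append(f"supporting image not in claim: {image_id!r}")

    # Assumed consistency rule: a supported claim should cite at least one image.
    if status == "supported" and not cited:
        problems.append("supported claim cites no supporting images")

    return problems
```

A check like this catches malformed model output early, before it reaches downstream review.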

Images are treated as the **primary evidence** because they directly represent the reported damage. Chat transcripts provide context, while historical claims influence risk assessment without overriding visual evidence.
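One way to encode that precedence is to let claim history add a risk flag while leaving the image-based status untouched. This is only an illustrative sketch: the `Decision` type, the `history_risk` flag name, and the threshold of three prior claims are all hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class Decision:
    claim_status: str
    risk_flags: List[str] = field(default_factory=list)


def apply_history(visual: Decision, prior_claim_count: int, threshold: int = 3) -> Decision:
    """Add a history risk flag without changing the image-based claim_status."""
    if prior_claim_count < threshold:
        return visual
    flags = list(visual.risk_flags)  # copy so the input decision is not mutated
    if "history_risk" not in flags:  # hypothetical flag name
        flags.append("history_risk")
    return replace(visual, risk_flags=flags)
```

The function never writes to `claim_status`, so a claim with clear visual support stays `supported` even when the history looks unusual. The flag simply surfaces the concern for a reviewer.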

These principles guided every architectural and prompt decision:

- ... (`not_enough_information`) rather than guessing.

I compared two strategies:

The multi-stage pipeline won on the sample set, especially for wrong-object photos, conflicting multi-image evidence, and prompt-injection attempts.

```text
┌─────────────┐     ┌──────────────────┐     ┌──────────────────────┐
│ User claim  │────▶│ Claim extraction │────▶│ Structured intent    │
│ (chat text) │     │ (GPT-4o mini)    │     │ issue, part, summary │
└─────────────┘     └──────────────────┘     └──────────┬───────────┘
                                                        │
┌─────────────┐     ┌──────────────────┐                │
│ Images 1..N │────▶│ Per-image VLM    │◀───────────────┘
│             │     │ (GPT-4o)         │
└─────────────┘     └────────┬─────────┘
                             │
                    ┌────────▼──────────┐
                    │ Decision synthesis│
                    │ (GPT-4o mini)     │
                    └────────┬──────────┘
                             │
                    ┌────────▼──────────┐
                    │ Structured output │
                    │ output.csv        │
                    └───────────────────┘
```
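The flow in the diagram can be sketched as three injected stage functions, each of which would wrap one model call. All names and signatures here (`ClaimIntent`, `review_claim`, `extract`, `inspect`, `synthesize`) are hypothetical; the actual prompts and API calls live in the repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ClaimIntent:
    """Structured intent from the chat text: issue, part, summary."""
    issue: str
    part: str
    summary: str


# Hypothetical stage signatures; each would wrap a single model call.
ExtractFn = Callable[[str], ClaimIntent]
InspectFn = Callable[[str, ClaimIntent], Dict[str, str]]
SynthesizeFn = Callable[[ClaimIntent, List[Dict[str, str]]], Dict[str, str]]


def review_claim(
    chat_text: str,
    image_paths: List[str],
    extract: ExtractFn,
    inspect: InspectFn,
    synthesize: SynthesizeFn,
) -> Dict[str, str]:
    """Run the three stages in the order shown in the diagram."""
    # Stage 1: turn free-form chat text into a structured intent.
    intent = extract(chat_text)
    # Stage 2: examine each image separately, guided by the extracted intent.
    findings = [inspect(path, intent) for path in image_paths]
    # Stage 3: combine the intent and per-image findings into one decision.
    return synthesize(intent, findings)
```

Keeping the stages as plain callables means each one can be replaced with a stub in tests, or pointed at a different model, without touching the orchestration.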


