Building a Multi-Modal Evidence Review Agent for Damage Claims

A developer built a multi-modal evidence review agent for damage claims using Python, OpenAI GPT-4o, and structured prompting. The system processes text, images, and historical context to produce consistent, explainable decisions for car, laptop, and package claims. The multi-stage pipeline outperformed a single-shot approach on sample data, particularly for conflicting evidence and prompt injection attempts.

GitHub: Arul1998/hackerrank-orchestrate-solution

Insurance and warranty claims appear straightforward: customers describe the issue and upload photos. In reality, evidence is often incomplete, contradictory, or even intentionally misleading. Building an AI system that produces consistent, explainable decisions requires reasoning across text, images, and historical context, not simply running a vision model.

I built this for the HackerRank Orchestrate June 2026 challenge, a 24-hour hackathon to design a system that verifies damage claims across cars, laptops, and packages. The complete source code, prompts, evaluation scripts, and report are available on GitHub:

🔗 https://github.com/Arul1998/hackerrank-orchestrate-solution

Built with Python, OpenAI GPT-4o, GPT-4o-mini, structured prompting, and CSV-based orchestration.

In practice, automated claim review is messy. Structured outputs are easier to validate, audit, integrate into downstream systems, and compare against human review. That is why the challenge requires a fixed CSV schema with fields like claim status, risk flags, severity, and image-grounded justifications.

The system reads `claims.csv`, inspects local images, and produces `output.csv`: one structured decision per claim. For every claim row, the agent outputs:

| Field | Meaning |
|---|---|
| evidence standard met | Are the images sufficient to evaluate the claim? |
| claim status | supported, contradicted, or not enough information |
| issue type / object part | What damage is visible, and where? |
| risk flags | Quality, mismatch, manipulation, or history risks |
| supporting image ids | Which images actually back the decision |
| severity | none → high |

Images are treated as the primary evidence because they directly represent the reported damage. Chat transcripts provide context, while historical claims influence risk assessment without overriding visual evidence.
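
To make the fixed schema concrete, here is a minimal sketch of how one output row could be represented and checked before it is written to CSV. It is not taken from the repository. The snake_case column names, the `claim_id` key, the `;` list separator, and the rule that a supported claim must cite at least one image are all assumptions made for illustration. Only the three claim status values come from the field list above.

```python
import csv
import io
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Set

# Status values as listed in the article; everything else here is assumed.
ALLOWED_STATUSES = {"supported", "contradicted", "not enough information"}


@dataclass
class ClaimDecision:
    claim_id: str  # hypothetical key column, not named in the article
    evidence_standard_met: bool
    claim_status: str
    issue_type: str
    object_part: str
    risk_flags: List[str] = field(default_factory=list)
    supporting_image_ids: List[str] = field(default_factory=list)
    severity: str = "none"

    def validate(self, available_image_ids: Set[str]) -> List[str]:
        """Return a list of problems; an empty list means the row is usable."""
        problems = []
        if self.claim_status not in ALLOWED_STATUSES:
            problems.append(f"unknown claim_status: {self.claim_status!r}")
        # A decision should only cite images that belong to the claim.
        unknown = set(self.supporting_image_ids) - available_image_ids
        if unknown:
            problems.append(f"cites images not in the claim: {sorted(unknown)}")
        # Assumed rule: a supported claim needs at least one backing image.
        if self.claim_status == "supported" and not self.supporting_image_ids:
            problems.append("supported claim cites no images")
        return problems

    def to_row(self) -> Dict[str, str]:
        """Flatten list and boolean fields into CSV-friendly strings."""
        row = asdict(self)
        row["risk_flags"] = ";".join(self.risk_flags)
        row["supporting_image_ids"] = ";".join(self.supporting_image_ids)
        row["evidence_standard_met"] = str(self.evidence_standard_met).lower()
        return row


def write_decisions(decisions, stream) -> None:
    """Write a header plus one row per decision to an open text stream."""
    fieldnames = list(ClaimDecision.__dataclass_fields__)
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    for decision in decisions:
        writer.writerow(decision.to_row())
```

Running `validate` on each row before writing catches malformed model output early, which is the practical payoff of a fixed schema: a bad status or a citation to a nonexistent image fails loudly instead of reaching downstream review.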

These principles guided every architectural and prompt decision, including returning not enough information rather than guessing.

I compared two strategies: a single-shot approach and a multi-stage pipeline. The multi-stage pipeline won on the sample set, especially for wrong-object photos, conflicting multi-image evidence, and prompt-injection attempts.

```text
┌─────────────┐     ┌──────────────────┐     ┌──────────────────────┐
│ User claim  │────▶│ Claim extraction │────▶│ Structured intent    │
│ chat text   │     │ GPT-4o mini      │     │ issue, part, summary │
└─────────────┘     └──────────────────┘     └──────────┬───────────┘
                                                        │
┌─────────────┐     ┌──────────────────┐                │
│ Images 1..N │────▶│ Per-image VLM    │◀───────────────┘
│             │     │ GPT-4o           │
└─────────────┘     └────────┬─────────┘
                             │
                    ┌────────▼──────────┐
                    │ Decision synthesis│
                    │ GPT-4o mini       │
                    └────────┬──────────┘
                             │
                    ┌────────▼──────────┐
                    │ Structured output │
                    │ output.csv        │
                    └───────────────────┘
```
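
The stages in the diagram above can be sketched as a small orchestration function. This is a hedged illustration of the control flow only, not the repository's code. The three stages are passed in as plain callables so the sketch runs without any API access. The names `Intent`, `ImageFinding`, and `review_claim`, the canned stand-in functions, and the toy synthesis rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Intent:
    """Structured intent from the chat text (issue, part, summary)."""
    issue: str
    part: str
    summary: str


@dataclass
class ImageFinding:
    """Result of inspecting one image against the extracted intent."""
    image_id: str
    shows_claimed_damage: bool
    note: str


def review_claim(
    chat_text: str,
    image_ids: List[str],
    extract_intent: Callable[[str], Intent],
    inspect_image: Callable[[str, Intent], ImageFinding],
    synthesize: Callable[[Intent, List[ImageFinding]], Dict[str, object]],
) -> Dict[str, object]:
    # Stage 1: turn the free-form chat into structured intent.
    intent = extract_intent(chat_text)
    # Stage 2: inspect each image separately, conditioned on that intent.
    findings = [inspect_image(image_id, intent) for image_id in image_ids]
    # Stage 3: combine intent and per-image findings into one decision.
    return synthesize(intent, findings)


# Stand-ins for the model calls, so the sketch can run offline.
CANNED_RESULTS = {"img_1": True, "img_2": False}


def fake_extract_intent(chat_text: str) -> Intent:
    return Intent(issue="cracked screen", part="display", summary=chat_text[:80])


def fake_inspect_image(image_id: str, intent: Intent) -> ImageFinding:
    seen = CANNED_RESULTS.get(image_id, False)
    return ImageFinding(image_id, seen, f"checked for {intent.issue} on {intent.part}")


def simple_synthesize(intent: Intent, findings: List[ImageFinding]) -> Dict[str, object]:
    # Placeholder rule for illustration only; the real synthesis is a model call.
    supporting = [f.image_id for f in findings if f.shows_claimed_damage]
    if not findings:
        status = "not enough information"
    elif supporting:
        status = "supported"
    else:
        status = "contradicted"
    return {"claim_status": status, "supporting_image_ids": supporting}
```

Calling `review_claim("My laptop screen cracked", ["img_1", "img_2"], fake_extract_intent, fake_inspect_image, simple_synthesize)` with these stand-ins returns a decision that cites only `img_1`. Keeping each stage behind its own callable also means a stage can be swapped or tested in isolation, which is one plausible reason a staged design handles conflicting evidence better than a single prompt.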