
Contrastive Reflection for Iterative Prompt Optimization

Researchers introduced Contrastive Reflection, an iterative prompt-optimization framework for agentic information retrieval workflows. It pairs error-anchored behavioral slices with nearby successful examples so that a Teacher LLM can propose targeted prompt edits, which are accepted only when validation performance improves. On a public HotpotQA retrieval-augmented QA setup, one tree-selected contrastive repair improved held-out exact-match accuracy from 51.4% to 60.4%. Failure-only and random-evidence variants improved less and broke more previously correct examples. In a light instruction-only comparison, MIPROv2 reached 59.4% and GEPA 57.0%.

Published Jul 1, 2026

arXiv:2606.30840v1

Abstract: LLM agents are becoming central to information retrieval: they issue retrieval queries, synthesize answers, and increasingly serve as judges for IR evaluation. Improving the prompts that control these agents is an optimization problem, but in applied IR settings it often looks less like blind search and more like debugging. Engineers need to know which behavior failed, which nearby behavior still worked, what distinguishes the two, and whether a prompt edit improves held-out quality without introducing regressions. We present Contrastive Reflection, an iterative prompt-optimization framework for agentic IR workflows. The framework starts from a task-centric quality definition: QA agents expose retrieval or reasoning traces, and grading agents expose dimension-level scores and rationales. These structured traces are used to identify error-anchored behavioral slices, add nearby successful examples from the same region, and ask a Teacher LLM to propose a targeted prompt edit. Candidate edits are accepted only when validation performance improves, optionally subject to regression checks. We instantiate the framework with a tree-based slice selector, but the contribution is the contrastive reflection loop rather than the tree itself. On a public HotpotQA retrieval-augmented QA setup, one tree-selected contrastive repair improves held-out exact-match accuracy from 51.4% to 60.4%. Failure-only and random-evidence variants improve less and break more previously correct examples. A light instruction-only comparison places the method near modern prompt optimizers: MIPROv2 reaches 59.4% and GEPA 57.0%. The result is an interpretable optimization loop for IR agents, aimed at making prompt repair more inspectable and validation-driven.
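To make the loop described in the abstract concrete, here is a minimal Python sketch of one repair step: pick an error-anchored slice, gather failures and nearby successes from it, ask a teacher for an edit, and accept the edit only if validation improves. This is an illustration under stated assumptions, not the authors' implementation. The names `Example`, `slice_id`, `contrastive_step`, and `max_regressions` are hypothetical, the agent and teacher are plain callables standing in for LLM calls, and the slice is chosen by a simple failure count rather than the paper's tree-based selector.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Example:
    question: str
    gold: str
    slice_id: str  # hypothetical behavioral-slice label (assumed precomputed)


# Stand-ins for LLM calls (assumptions, not a real library API).
Agent = Callable[[str, str], str]  # (prompt, question) -> answer
Teacher = Callable[[str, List[Example], List[Example]], str]  # -> edited prompt


def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match (simplified metric)."""
    return pred.strip().lower() == gold.strip().lower()


def correct_flags(agent: Agent, prompt: str, examples: Sequence[Example]) -> List[bool]:
    """Per-example correctness of the agent under a given prompt."""
    return [exact_match(agent(prompt, ex.question), ex.gold) for ex in examples]


def contrastive_step(
    agent: Agent,
    teacher: Teacher,
    prompt: str,
    train: Sequence[Example],
    val: Sequence[Example],
    max_regressions: Optional[int] = None,
) -> str:
    """Run one contrastive repair; return the new prompt, or the old one if rejected."""
    train_ok = correct_flags(agent, prompt, train)
    failures = [ex for ex, ok in zip(train, train_ok) if not ok]
    if not failures:
        return prompt  # nothing to repair

    # Error-anchored slice: here simply the slice with the most failures.
    anchor = Counter(ex.slice_id for ex in failures).most_common(1)[0][0]
    slice_failures = [ex for ex in failures if ex.slice_id == anchor]
    # Contrast set: examples from the same slice that the agent still gets right.
    slice_successes = [
        ex for ex, ok in zip(train, train_ok) if ok and ex.slice_id == anchor
    ]

    candidate = teacher(prompt, slice_failures, slice_successes)

    # Validation gate: require a strict improvement on held-out examples.
    before = correct_flags(agent, prompt, val)
    after = correct_flags(agent, candidate, val)
    improved = sum(after) > sum(before)

    # Optional regression check: count examples that were correct and now fail.
    regressions = sum(1 for b, a in zip(before, after) if b and not a)
    within_budget = max_regressions is None or regressions <= max_regressions

    return candidate if improved and within_budget else prompt
```

Calling `contrastive_step` repeatedly, feeding each returned prompt back in, gives the iterative version. Setting `max_regressions=0` rejects any edit that breaks a previously correct validation example, which is one plausible reading of the optional regression checks mentioned in the abstract.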