Zalando researchers published "Retrieve, Annotate, Evaluate, Repeat" (arXiv:2409.11860, Sep 2024; ECIR 2025), a framework that uses multimodal LLMs (MLLMs) to automate large-scale product retrieval evaluation. Annotation runs in two steps: the model first generates query-specific annotation guidelines, then makes multimodal relevance assessments over query-product pairs using both text and product images. Evaluated on 20,000 examples from real production search logs in English and German, the method matched human annotator accuracy while being up to 1,000 times cheaper and cutting evaluation time from weeks to approximately 20 minutes, according to the paper. GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo were compared on accuracy and cost. The framework has been deployed in production at Zalando for continuous search quality monitoring.
What happened
Zalando SE researchers published "Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation" (arXiv:2409.11860, submitted Sep 18, 2024), announced on the Zalando Engineering Blog in November 2024 and accepted at ECIR 2025 (Advances in Information Retrieval). The paper proposes a four-stage pipeline for evaluating product search engines at scale:
- extracting query-product pairs from production search logs;
- generating query-specific annotation guidelines via an LLM;
- performing multimodal relevance annotation using both text and product images;
- storing labeled pairs for continuous retrieval system evaluation.
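The four stages above can be sketched as a small driver loop. This is a minimal illustration, not the paper's implementation: the `Pair` fields, the log row keys, and the `make_guideline` and `judge` callables (stand-ins for the LLM and MLLM calls) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Pair:
    """One query-product pair; field names are assumptions for this sketch."""
    query: str
    product_id: str
    title: str
    image_url: str


def extract_pairs(log_rows: Iterable[dict]) -> list[Pair]:
    """Stage 1: collect unique (query, product) pairs from search log rows."""
    seen: set[tuple[str, str]] = set()
    pairs: list[Pair] = []
    for row in log_rows:
        key = (row["query"], row["product_id"])
        if key not in seen:
            seen.add(key)
            pairs.append(
                Pair(row["query"], row["product_id"], row["title"], row["image_url"])
            )
    return pairs


def run_pipeline(
    log_rows: Iterable[dict],
    make_guideline: Callable[[str], str],  # hypothetical LLM wrapper
    judge: Callable[[Pair, str], str],     # hypothetical MLLM wrapper
) -> list[dict]:
    """Run stages 1-4 and return the labeled pairs."""
    guidelines: dict[str, str] = {}
    store: list[dict] = []
    for pair in extract_pairs(log_rows):
        # Stage 2: generate the guideline once per distinct query.
        if pair.query not in guidelines:
            guidelines[pair.query] = make_guideline(pair.query)
        # Stage 3: multimodal relevance judgment for this pair.
        label = judge(pair, guidelines[pair.query])
        # Stage 4: keep the labeled pair for later evaluation runs.
        store.append(
            {"query": pair.query, "product_id": pair.product_id, "label": label}
        )
    return store
```

Caching the guideline per query means the guideline call happens once per distinct query rather than once per pair, which is one plausible way to keep the per-pair cost low.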
Key results
Evaluated on 20,000 examples from real-world production queries in English and German, the multimodal LLM (MLLM) framework matched human annotator accuracy. According to the paper, the approach is up to 1,000 times cheaper than human annotation and reduces evaluation time from weeks to approximately 20 minutes. Three models were compared: GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo.
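To make "matched human annotator accuracy" concrete, here is a minimal sketch of how two annotators' labels could be compared against a reference set. The summary does not state which metrics the paper uses, so the choice of plain accuracy plus Cohen's kappa is an assumption, as are the function names.

```python
from collections import Counter
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of items where the predicted label equals the reference label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(a)
    observed = accuracy(a, b)  # raw agreement rate
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both annotators labeled independently
    # according to their own label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Running `accuracy` once for the human labels and once for the MLLM labels against the same reference set gives directly comparable numbers; kappa additionally discounts agreement that would occur by chance when one label dominates.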
Methodology
The framework assigns one of three relevance labels - "highly relevant," "acceptable substitute," or "irrelevant" - to each query-product pair. Annotation guidelines are generated per query, encoding key attributes (product type, color, target demographic) with importance weights. GPT-4o's vision capability is used to analyze product images alongside textual attributes for the final relevance judgment.
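The label set and weighted guideline can be represented with two small types. The sketch below is illustrative only: the `Guideline` layout, the 0-1 weight scale, the prompt wording, and the `parse_label` helper are assumptions, and the call to the vision model (which would also receive the product image) is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class Relevance(Enum):
    """The three relevance labels described in the text."""
    HIGHLY_RELEVANT = "highly relevant"
    ACCEPTABLE_SUBSTITUTE = "acceptable substitute"
    IRRELEVANT = "irrelevant"


@dataclass
class Guideline:
    """Per-query guideline: attribute name -> (expected value, importance weight)."""
    query: str
    attributes: dict[str, tuple[str, float]]


def build_prompt(guideline: Guideline, product_text: str) -> str:
    """Assemble the text part of the annotation prompt, most important attribute first."""
    lines = [f"Query: {guideline.query}", "Attributes to check (importance 0-1):"]
    ranked = sorted(guideline.attributes.items(), key=lambda item: -item[1][1])
    for name, (value, weight) in ranked:
        lines.append(f"- {name}: {value} (importance {weight:.1f})")
    lines.append(f"Product: {product_text}")
    lines.append("Answer with one of: " + ", ".join(r.value for r in Relevance))
    return "\n".join(lines)


def parse_label(response: str) -> Relevance:
    """Map a free-text model response onto one of the three labels."""
    text = response.strip().lower()
    for label in Relevance:
        if label.value in text:
            return label
    raise ValueError(f"no known relevance label in response: {response!r}")
```

Rejecting responses that contain none of the three labels, rather than defaulting to one, keeps malformed model output from silently skewing the evaluation. Substring matching is a simplification: a negated answer such as "not highly relevant" would be misparsed, so a stricter output format would be safer in practice.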
Error analysis
Human and MLLM annotators exhibit different error patterns. Human annotators make more brand- and category-level errors, attributed to annotation fatigue. MLLMs tend to be overly strict and occasionally misread brand names as common terms - for example, interpreting "On Vacation" as a holiday phrase rather than a brand name. Zalando recommends a hybrid approach: MLLMs for high-frequency, straightforward queries, humans for ambiguous or style-sensitive cases.
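The hybrid recommendation amounts to a routing rule in front of the annotators. The sketch below is one possible reading, not Zalando's implementation: the brand lexicon, the `style_sensitive` flag, the frequency threshold of 100, and the decision to send brand-name queries to humans are all assumptions made for illustration.

```python
# Hypothetical lexicon of brand names that read like ordinary phrases.
KNOWN_BRANDS = {"on vacation"}


def route(
    query: str,
    daily_count: int,
    style_sensitive: bool,
    min_count: int = 100,  # assumed cutoff for "high-frequency"
) -> str:
    """Return "mllm" or "human" for a query, following the hybrid split."""
    normalized = query.lower()
    # Brand names that look like common terms are a known MLLM failure mode,
    # so this sketch sends them to human annotators.
    if any(brand in normalized for brand in KNOWN_BRANDS):
        return "human"
    # Style-sensitive or low-frequency queries also go to humans.
    if style_sensitive or daily_count < min_count:
        return "human"
    # High-frequency, straightforward queries go to the MLLM.
    return "mllm"
```

An alternative to routing brand queries away from the model would be to pass the matched brand into the annotation prompt as a hint; which option works better is an empirical question the summary does not answer.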
Production context
The framework is deployed at Zalando for continuous monitoring of high-frequency search queries and identification of low-performing queries for algorithmic improvement. The paper notes that semantic relevance is one signal in a broader ranking system that also incorporates personalization, availability, and pricing signals.
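Once labeled pairs are stored, flagging low-performing queries can be as simple as averaging a per-label gain for each query. The gain values and the 0.6 threshold below are assumptions for illustration; the summary does not say how Zalando scores queries.

```python
from collections import defaultdict
from typing import Iterable

# Assumed gain per label; not taken from the paper.
GAIN = {
    "highly relevant": 1.0,
    "acceptable substitute": 0.5,
    "irrelevant": 0.0,
}


def low_performing(
    labels: Iterable[tuple[str, str]],  # (query, relevance label) pairs
    threshold: float = 0.6,             # assumed cutoff
) -> list[str]:
    """Return queries whose mean gain is below the threshold, worst first."""
    gains: dict[str, list[float]] = defaultdict(list)
    for query, label in labels:
        gains[query].append(GAIN[label])
    scores = {query: sum(values) / len(values) for query, values in gains.items()}
    flagged = [query for query, score in scores.items() if score < threshold]
    return sorted(flagged, key=lambda query: scores[query])
```

Because this score covers semantic relevance only, a flagged query is a starting point for investigation rather than a verdict on the full ranking system, which also weighs personalization, availability, and pricing.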
What to watch
Per-model agreement breakdowns and error analyses from the full paper; replication on other e-commerce platforms; follow-up operational results from Zalando's subsequent engineering work on LLM-as-judge for search quality assurance.
Scoring Rationale
A peer-reviewed industry paper (ECIR 2025) from Zalando reporting up to 1,000x cost reduction for production-scale relevance evaluation using MLLMs, with concrete results on real production traffic. Detailed error analysis and production deployment support its practical utility for e-commerce and IR practitioners, though the scope is domain-specific rather than a general LLM capability advance.