cd /news/large-language-models/zalando-presents-mllm-based-product-… · home topics large-language-models article
[ARTICLE · art-35249] src=letsdatascience.com ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

Zalando Presents MLLM-Based Product Retrieval Evaluation

Zalando researchers published a framework using multimodal LLMs to automate product retrieval evaluation, achieving human-level accuracy at up to 1,000 times lower cost and reducing evaluation time from weeks to 20 minutes. The method, tested on 20,000 real production queries in English and German, has been deployed at Zalando for continuous search quality monitoring.

read3 min views1 publishedJun 21, 2026
Zalando Presents MLLM-Based Product Retrieval Evaluation
Image: Letsdatascience (auto-discovered)

Zalando researchers published "Retrieve, Annotate, Evaluate, Repeat" (arXiv:2409.11860, Sep 2024; ECIR 2025), a framework using multimodal LLMs to automate large-scale product retrieval evaluation. The two-stage approach first generates query-specific annotation guidelines, then performs multimodal relevance assessments over query-product pairs using both text and product images. Evaluated on 20,000 examples from real production search logs in English and German, the method matched human annotator accuracy while being up to 1,000 times cheaper and reducing evaluation time from weeks to approximately 20 minutes, according to the paper. GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo were compared across accuracy and cost. The framework has been deployed in production at Zalando for continuous search quality monitoring.

What happened

Zalando SE researchers published "Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation" (arXiv:2409.11860, submitted Sep 18, 2024), announced on the Zalando Engineering Blog in November 2024 and accepted at ECIR 2025 (Advances in Information Retrieval). The paper proposes a four-stage pipeline for evaluating product search engines at scale:

  • •extracting query-product pairs from production search logs;
  • •generating query-specific annotation guidelines per query via an LLM;
  • •performing multimodal relevance annotation using both text and product images;
  • •storing labeled pairs for continuous retrieval system evaluation.

Key results

Evaluated on 20,000 examples from real-world production queries in English and German, the MLLM framework matched human annotator accuracy. According to the paper (reported by Zalando), the approach is up to 1,000 times cheaper than human annotation and reduces evaluation time from weeks to approximately 20 minutes. Three model variants were compared: GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo.

Methodology

The framework assigns one of three relevance labels - "highly relevant," "acceptable substitute," or "irrelevant" - to each query-product pair. Annotation guidelines are generated per query, encoding key attributes (product type, color, target demographic) with importance weights. GPT-4o's vision capability is used to analyze product images alongside textual attributes for the final relevance judgment.

Error analysis

Human and MLLM annotators exhibit different error patterns. Human annotators make more brand- and category-level errors, attributed to annotation fatigue. MLLMs tend to be overly strict and occasionally misread brand names as common terms - for example, interpreting "On Vacation" as a holiday phrase rather than a brand name. Zalando recommends a hybrid approach: MLLMs for high-frequency, straightforward queries, humans for ambiguous or style-sensitive cases.

Production context

The framework is deployed at Zalando for continuous monitoring of high-frequency search queries and identification of low-performing queries for algorithmic improvement. The paper notes that semantic relevance is one signal in a broader ranking system that also incorporates personalization, availability, and pricing signals.

What to watch

Per-model agreement breakdowns and error analyses from the full paper; replication on other e-commerce platforms; follow-up operational results from Zalando's subsequent engineering work on LLM-as-judge for search quality assurance.

Scoring Rationale #

A peer-reviewed industry paper (ECIR 2025) from Zalando demonstrating up to 1,000x cost reduction for production-scale relevance evaluation using MLLMs, with concrete results on real production traffic. Detailed error analysis and production deployment confirm practical utility for e-commerce and IR practitioners, though the scope is domain-specific rather than a general LLM capability advance.

Practice with real Retail & eCommerce data

90 SQL & Python problems · 15 industry datasets

250 free problems · No credit card

See all Retail & eCommerce problems

── more in #large-language-models 4 stories · sorted by recency
── more on @zalando 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/zalando-presents-mll…] indexed:0 read:3min 2026-06-21 ·