Zalando Presents MLLM-Based Product Retrieval Evaluation


Source: letsdatascience.com · Published: 2026-06-21T01:37Z · Topic: large-language-models

Zalando researchers published a framework using multimodal LLMs to automate product retrieval evaluation, achieving human-level accuracy at up to 1,000 times lower cost and reducing evaluation time from weeks to 20 minutes. The method, tested on 20,000 real production queries in English and German, has been deployed at Zalando for continuous search quality monitoring.

Zalando researchers published "Retrieve, Annotate, Evaluate, Repeat" (arXiv:2409.11860, Sep 2024; ECIR 2025), a framework that uses multimodal LLMs (MLLMs) to automate large-scale product retrieval evaluation. Annotation runs in two stages: the model first generates query-specific annotation guidelines, then assesses the relevance of each query-product pair using both text and product images. Evaluated on 20,000 examples from real production search logs in English and German, the method matched human annotator accuracy while being up to 1,000 times cheaper and reducing evaluation time from weeks to approximately 20 minutes, according to the paper. The paper compares GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo on accuracy and cost. The framework has been deployed in production at Zalando for continuous search quality monitoring.

What happened

Zalando SE researchers published "Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation" (arXiv:2409.11860, submitted Sep 18, 2024), announced on the Zalando Engineering Blog in November 2024 and accepted at ECIR 2025 (Advances in Information Retrieval). The paper proposes a four-stage pipeline for evaluating product search engines at scale:

• extracting query-product pairs from production search logs;
• generating query-specific annotation guidelines per query via an LLM;
• performing multimodal relevance annotation using both text and product images;
• storing labeled pairs for continuous retrieval system evaluation.
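
The four stages above can be sketched as a small orchestration function. This is an illustrative skeleton, not Zalando's implementation: `Pair`, `run_pipeline`, `make_guideline`, and `annotate` are hypothetical names, and the two callables stand in for whatever LLM and MLLM calls a real system would make. The label set is the one the article reports.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# The three relevance labels reported in the article.
LABELS = ("highly relevant", "acceptable substitute", "irrelevant")


@dataclass(frozen=True)
class Pair:
    """One query-product pair taken from the search logs (hypothetical type)."""
    query: str
    product_id: str


def run_pipeline(
    log_rows: Iterable[tuple[str, str]],
    make_guideline: Callable[[str], str],
    annotate: Callable[[str, str, str], str],
) -> dict[Pair, str]:
    """Run the four stages once and return the labeled pairs."""
    # Stage 1: extract unique query-product pairs from the log rows.
    pairs = sorted(
        {Pair(q, p) for q, p in log_rows},
        key=lambda pair: (pair.query, pair.product_id),
    )

    # Stage 2: generate one guideline per distinct query (not per pair).
    guidelines = {q: make_guideline(q) for q in sorted({p.query for p in pairs})}

    # Stage 3: annotate each pair against its query's guideline.
    store: dict[Pair, str] = {}
    for pair in pairs:
        label = annotate(pair.query, pair.product_id, guidelines[pair.query])
        if label not in LABELS:
            raise ValueError(f"unexpected label {label!r} for {pair}")
        # Stage 4: store the labeled pair for later evaluation runs.
        store[pair] = label
    return store
```

Passing the two model calls in as plain callables keeps the sketch testable with stubs and makes it easy to swap models, which is how the paper's GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo comparison could be reproduced in principle.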

Key results

Evaluated on 20,000 examples from real-world production queries in English and German, the MLLM framework matched human annotator accuracy. According to the paper (reported by Zalando), the approach is up to 1,000 times cheaper than human annotation and reduces evaluation time from weeks to approximately 20 minutes. Three model variants were compared: GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo.
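
To make "matched human annotator accuracy" concrete, the comparison can be read as scoring each annotator against a trusted reference labeling. The sketch below assumes such a gold label list exists; the function names and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of pairs where the annotator's label equals the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def confusion(predicted: Sequence[str], gold: Sequence[str]) -> Counter:
    """Count (gold, predicted) label combinations to expose error patterns."""
    return Counter(zip(gold, predicted))


# Toy data only: four pairs, one of which the model labels too strictly.
gold = ["highly relevant", "irrelevant", "acceptable substitute", "irrelevant"]
mllm = ["highly relevant", "irrelevant", "irrelevant", "irrelevant"]

print(accuracy(mllm, gold))
print(confusion(mllm, gold)[("acceptable substitute", "irrelevant")])
```

Running the same two functions over human labels and over each model's labels would give the per-annotator figures that a comparison like the paper's rests on.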

Methodology

The framework assigns one of three relevance labels - "highly relevant," "acceptable substitute," or "irrelevant" - to each query-product pair. Annotation guidelines are generated per query, encoding key attributes (product type, color, target demographic) with importance weights. GPT-4o's vision capability is used to analyze product images alongside textual attributes for the final relevance judgment.
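
A per-query guideline with weighted attributes could be represented roughly as follows. The article does not say how the weights are encoded, so the 0 to 1 scale, the `Guideline` class, and the `parse_label` helper are assumptions made for illustration; the rendered text is what would be placed in the annotation prompt next to the product text and image.

```python
from dataclasses import dataclass, field

# The three relevance labels reported in the article.
LABELS = ("highly relevant", "acceptable substitute", "irrelevant")


@dataclass
class Guideline:
    """Query-specific guideline (hypothetical structure)."""
    query: str
    # attribute name -> (expected value, importance weight; 0..1 is an assumption)
    attributes: dict[str, tuple[str, float]] = field(default_factory=dict)

    def render(self) -> str:
        """Format the guideline as prompt text, most important attribute first."""
        lines = [f"Query: {self.query}", "Attributes (most important first):"]
        ranked = sorted(self.attributes.items(), key=lambda kv: -kv[1][1])
        for name, (value, weight) in ranked:
            lines.append(f"- {name}: {value} (importance {weight:.1f})")
        return "\n".join(lines)


def parse_label(raw: str) -> str:
    """Normalize a model reply and reject anything outside the label set."""
    cleaned = raw.strip().strip('".').lower()
    if cleaned not in LABELS:
        raise ValueError(f"not a valid relevance label: {raw!r}")
    return cleaned


guideline = Guideline(
    query="black running shoes women",
    attributes={
        "color": ("black", 0.6),
        "product type": ("running shoes", 1.0),
        "target demographic": ("women", 0.8),
    },
)
print(guideline.render())
```

Validating the reply against a closed label set is a defensive choice: free-text model output that drifts from the three labels fails loudly instead of being stored as a fourth, accidental category.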

Error analysis

Human and MLLM annotators exhibit different error patterns. Human annotators make more brand- and category-level errors, which the paper attributes to annotation fatigue. MLLMs tend to be overly strict and occasionally misread brand names as common terms - for example, interpreting "On Vacation" as a holiday phrase rather than a brand name. Zalando recommends a hybrid approach: MLLMs for high-frequency, straightforward queries, and humans for ambiguous or style-sensitive cases.
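
The hybrid recommendation amounts to a routing rule in front of the annotators. The sketch below is one possible reading, not Zalando's rule: the `QueryStats` fields, the frequency threshold, and the lexicon of brand names that double as everyday phrases are all hypothetical, and sending low-frequency queries to humans is a conservative assumption the article does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStats:
    """Minimal per-query facts needed for routing (hypothetical fields)."""
    text: str
    daily_count: int
    style_sensitive: bool = False


def route(
    query: QueryStats,
    ambiguous_brands: frozenset[str],
    min_daily_count: int = 100,  # assumed threshold, not from the paper
) -> str:
    """Return "mllm" for frequent, straightforward queries, else "human"."""
    text = query.text.lower()
    # Brand names that read like ordinary phrases are a known MLLM failure mode.
    brand_risk = any(brand in text for brand in ambiguous_brands)
    if query.style_sensitive or brand_risk or query.daily_count < min_daily_count:
        return "human"
    return "mllm"


brands = frozenset({"on vacation"})
print(route(QueryStats("black sneakers", 500), brands))
print(route(QueryStats("On Vacation t-shirt", 500), brands))
```

Defaulting to the human path whenever any risk signal fires keeps the cheaper MLLM path limited to the cases where the article says it performs well.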

Production context

The framework is deployed at Zalando for continuous monitoring of high-frequency search queries and identification of low-performing queries for algorithmic improvement. The paper notes that semantic relevance is one signal in a broader ranking system that also incorporates personalization, availability, and pricing signals.
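
Once labeled pairs are stored, low-performing queries can be surfaced with a simple aggregate. The article does not say which metric Zalando monitors, so the share of "irrelevant" labels per query and the 0.3 threshold used here are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Iterable


def low_performing_queries(
    labeled_pairs: Iterable[tuple[str, str]],
    max_irrelevant_share: float = 0.3,  # assumed threshold
) -> dict[str, float]:
    """Map each flagged query to its share of results labeled "irrelevant"."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [irrelevant, total]
    for query, label in labeled_pairs:
        counts[query][1] += 1
        if label == "irrelevant":
            counts[query][0] += 1

    flagged = {
        query: bad / total
        for query, (bad, total) in counts.items()
        if bad / total > max_irrelevant_share
    }
    # Worst queries first, so they can be triaged in order.
    return dict(sorted(flagged.items(), key=lambda kv: -kv[1]))


labeled = [
    ("red dress", "highly relevant"),
    ("red dress", "highly relevant"),
    ("red dress", "highly relevant"),
    ("red dress", "irrelevant"),
    ("on vacation cap", "irrelevant"),
    ("on vacation cap", "irrelevant"),
    ("on vacation cap", "acceptable substitute"),
]
print(low_performing_queries(labeled))
```

Because relevance is only one ranking signal, a flagged query is a prompt for investigation rather than proof that the ranking is wrong.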

What to watch

Per-model agreement breakdowns and error analyses from the full paper; replication on other e-commerce platforms; follow-up operational results from Zalando's subsequent engineering work on LLM-as-judge for search quality assurance.

Scoring rationale

A peer-reviewed industry paper (ECIR 2025) from Zalando demonstrating up to 1,000x cost reduction for production-scale relevance evaluation using MLLMs, with concrete results on real production traffic. Detailed error analysis and production deployment confirm practical utility for e-commerce and IR practitioners, though the scope is domain-specific rather than a general LLM capability advance.
