📄Paper: RORA-VLM: Robust Retrieval Augmentation for Vision Language Models

Researchers have developed RORA-VLM, a robust retrieval-augmented framework for vision-language models that aims to keep question answering accurate when the retrieved external knowledge contains noise. The system employs a two-stage retrieval process (first using the query image to find similar entities in a 37-million-image database, then expanding the query with entity names for text retrieval), along with query-oriented visual token refinement and noise-resilient training that deliberately introduces incorrect retrievals to teach the model to ignore irrelevant information. Submitted to ICLR 2025, the approach addresses visual question-answering tasks where answers require background knowledge not present in the image itself.

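Submitted to: International Conference on Learning Representations (ICLR 2025)

💡 Why I read this

I was recently looking for paper ideas and came across this one. It was submitted to ICLR 2025 but was rejected, which is a bit of a pity.

The paper applies RAG to VLMs so that the model can draw on external knowledge when answering questions.

In many VQA tasks, the answer is not actually in the image and requires extra background knowledge. For example, an image shows a kind of bird and the question is: "Where is this bird mainly found?" The image only tells you what the bird looks like; information such as its habitat has to be looked up.

The main question the paper addresses is: "When the retrieved knowledge is noisy, how can a VLM still reason reliably?"

The authors propose a robust retrieval framework for VLMs:

1. Two-stage retrieval

First use the image to retrieve similar entities, then use entity expansion for text retrieval.

In the first stage, they treat the query image as an "anchor" and search a database for many visually similar images. The database they use is WIT (https://github.com/google-research-datasets/wit), which contains 37 million images, each paired with an entity name and description.

In the second stage, they add the entity names and descriptions obtained in the first stage to the original question, turning it into a more specific query, and then look up knowledge with Google through an API call.

✨ For Example
- Original question:
  - which year was this building built?
- Retrieved entity:
  - Castle of Good Hope
- New query (original question + entity):
  - which year was Castle of Good Hope built?

2. Query-oriented visual token refinement

Keep only the visual tokens most relevant to the query, reducing image background noise.

There are two inputs at the start: the question and the image. In a VLM, an image is split into many patches, and each patch becomes a visual token. The model then computes, based on the content of the question, how relevant each image patch is to the query. Patches that are more relevant to the question are kept, and irrelevant ones are ignored.

The same filtering is applied to every retrieved image, using the more important patches of the query image as the reference, so that only regions related to the query image are kept.

The patches that remain are converted into their corresponding visual tokens and arranged as a sequence (the refined visual tokens), which becomes the input to the VLM. In other words, the image information the model finally sees has already been filtered.

In the figure, the green blocks in the middle represent the relevance score between each patch and the question.

3. Noise-resilient RAG

Incorrect retrievals are deliberately added during training so that the model learns to ignore irrelevant knowledge.

The VLM sees the original image, the question, and several pieces of retrieved knowledge (image + text) at the same time. Some of these retrieval results are correct and some are wrong. The model's job is to decide which piece of information to trust, based on how relevant each retrieved image is to the query image.

👉 Green = high attention
👉 Red = ignored

After this process the model can answer the question, for example that the building was built in 1666–1679.

📄Source
https://openreview.net/pdf/1dff65b976d44f89183d623a8d26842e17ed51da.pdf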
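To make step 1 (two-stage retrieval) more concrete, here is a minimal sketch of the idea in plain Python. It is not the authors' implementation: the function names (`retrieve_entities`, `expand_query`), the three-dimensional toy vectors, and the tiny in-memory database are all assumptions for illustration. The real system searches WIT with a learned image encoder and sends the expanded query to a web search API, neither of which is modelled here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_entities(query_vec, database, k=1):
    """Stage 1: rank database images by similarity to the query image."""
    ranked = sorted(database, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return ranked[:k]

def expand_query(question, entity_name, referring_phrase=None):
    """Stage 2: make the question specific by adding the entity name.

    If a referring phrase (e.g. "this building") is given and present,
    it is replaced; otherwise the entity name is simply appended.
    """
    if referring_phrase and referring_phrase in question:
        return question.replace(referring_phrase, entity_name, 1)
    return f"{question} ({entity_name})"

# Toy database: the vectors and the two placeholder entities are made up.
TOY_DB = [
    {"name": "Castle of Good Hope", "vec": [1.0, 0.0, 0.0]},
    {"name": "placeholder entity B", "vec": [0.0, 1.0, 0.0]},
    {"name": "placeholder entity C", "vec": [0.0, 0.0, 1.0]},
]

query_image_vec = [0.9, 0.1, 0.0]  # pretend embedding of the query image
best = retrieve_entities(query_image_vec, TOY_DB, k=1)[0]
print(expand_query("which year was this building built?", best["name"], "this building"))
# prints: which year was Castle of Good Hope built?
```

The expanded string is what would be handed to the text search in the second stage.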
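For step 2 (query-oriented visual token refinement), the filtering can be sketched as a top-k selection over patch scores. This is only a rough illustration under assumptions: the note does not say how the relevance score is computed, so a plain dot product is used here, and the function names, the value of `k`, and the two-dimensional toy embeddings are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in original patch order."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(keep)

def refine_query_image(patches, question_vec, k):
    """Keep the k patches of the query image most relevant to the question."""
    scores = [dot(p, question_vec) for p in patches]
    idx = top_k_indices(scores, k)
    return idx, [patches[i] for i in idx]

def refine_retrieved_image(patches, kept_query_patches, k):
    """Keep the k patches of a retrieved image closest to any kept query patch."""
    scores = [max(dot(p, q) for q in kept_query_patches) for p in patches]
    idx = top_k_indices(scores, k)
    return idx, [patches[i] for i in idx]

# Toy patch embeddings (assumed values, 2-dimensional for readability).
query_patches = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.1, 0.1]]
question_vec = [1.0, 0.0]
kept_idx, kept = refine_query_image(query_patches, question_vec, k=2)

retrieved_patches = [[0.0, 1.0], [0.9, 0.0], [0.5, 0.5]]
ret_idx, refined = refine_retrieved_image(retrieved_patches, kept, k=1)

# The refined visual tokens are the kept patches, concatenated as one sequence.
refined_sequence = kept + refined
print(kept_idx, ret_idx, len(refined_sequence))
```

Keeping the selected indices in their original order preserves the spatial ordering of the surviving patches, which is one reasonable choice and not something the note specifies.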
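For step 3 (noise-resilient training), one way to picture "deliberately adding incorrect retrievals" is to mix distractor passages into each training context. The sketch below is an assumption about how such a context could be built, not the paper's recipe: the function name `build_noisy_context`, the number of distractors, the placeholder passages, and the `is_relevant` flag (used here only for inspection) are all hypothetical.

```python
import random

def build_noisy_context(relevant, distractor_pool, n_noise, rng):
    """Mix correct retrievals with randomly drawn irrelevant ones.

    relevant:        passages that actually match the query image
    distractor_pool: passages retrieved for other, unrelated samples
    n_noise:         how many distractors to inject (capped by pool size)
    rng:             a random.Random instance, for reproducibility
    """
    n_noise = min(n_noise, len(distractor_pool))
    noise = rng.sample(distractor_pool, n_noise)
    context = [{"text": t, "is_relevant": True} for t in relevant]
    context += [{"text": t, "is_relevant": False} for t in noise]
    rng.shuffle(context)  # so position gives no hint about which entries are correct
    return context

rng = random.Random(0)
context = build_noisy_context(
    relevant=["Castle of Good Hope was built between 1666 and 1679."],
    distractor_pool=["unrelated passage A", "unrelated passage B", "unrelated passage C"],
    n_noise=2,
    rng=rng,
)
for entry in context:
    print(entry["is_relevant"], entry["text"])
```

During training the model would see only the texts (and the matching images), with the correct answer as the target, so it has to learn on its own which entries to rely on.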