# Build a Multimodal Agentic RAG App with Gemini Embedding 2 and Google ADK

Google has released a new open-source tutorial demonstrating how to build a multimodal agentic retrieval-augmented generation (RAG) application using Gemini Embedding 2 and the Google Agent Development Kit (ADK). The application unifies text, PDFs, images, audio, and video into a single embedding space, allowing a single retrieval to power both an AI agent's grounded answer and a visual citation panel.

100% open source

If you have built a RAG app before, you know how quickly the "just retrieve the right chunk" problem fragments the moment your sources stop being plain text. Product PDFs, UI screenshots, recorded calls, demo videos, and support notes all carry the answer your user is asking for, but each lives in its own embedding silo. Stitching them together usually means three pipelines, two vector stores, and a glue layer you regret pretty soon.

In this tutorial, you'll build a fully working multimodal agentic RAG app where text, URLs, PDFs, images, audio, and video all share a single 768-dimension embedding space, and a small Google Agent Development Kit (ADK) coordinator turns the retrieved evidence into a grounded, cited answer.

The two pieces doing the heavy lifting are Gemini Embedding 2, which embeds every modality into the same vector space, and Google ADK, which wraps the retrieval call in an agent that inspects the workspace, calls the retrieval tool, and writes the answer. You'll see exactly how those two pieces compose without any extra orchestration framework.

## What We're Building

We're building a multimodal agentic RAG demo where you can drop in any file or URL and ask questions across the whole index. The same retrieval packet powers both the answer and the citation panel, so the UI never disagrees with the model.
Key features:

- Truly multimodal index: text, URLs, PDFs, images, audio, and video all live in one cosine-similarity space.
- Gemini Embedding 2 with task prefixes: separate prefixes for documents and queries to improve retrieval quality.
- Google ADK agent: coordinates the `inspect_embedding_space` and `retrieve_relevant_context` tools, then synthesizes a grounded answer.
- Single retrieval, two consumers: `/ask` retrieves once, then passes the same packet to the agent and to the UI.
- 3D PCA embedding view: every source is one point; ask a question and the query and cited sources light up in the same projection.
- SSRF-safe URL ingestion: private and loopback IPs are blocked unless you opt in.

## How It Works

End-to-end, one question flows like this:

1. You add sources. Each source is chunked (text/URL) or uploaded once (PDF/image/audio/video). Every chunk gets a Gemini Embedding 2 vector with the `task: retrieval document` prefix. Files get a media vector blended with a text annotation vector, so titles still help retrieval.
2. You ask a question. `/ask` embeds the query with the `task: question answering | query` prefix, scores every chunk by cosine similarity, keeps the best chunk per source, takes the top `k`, and projects everything into 3D using power-iteration PCA.
3. The agent runs. `run_adk_agent` builds a per-request agent whose `retrieve_relevant_context` tool is a closure over the already-computed retrieval packet. The agent calls `inspect_embedding_space`, then "calls" the retrieval tool, then writes a grounded answer with no inline citation IDs.
4. The UI renders. The frontend shows the answer text, the citation panel built from the same `matches`, the agent trace, and the updated 3D view with the query point and highlighted sources.

The architectural insight is the single-retrieval contract: one query embedding, one ranked list of matches, two consumers (the agent and the UI). That's what keeps citations honest.
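The retrieval step above mentions power-iteration PCA for the 3D view. The sketch below shows one way that technique can work in pure Python: center the vectors, find a dominant direction by repeated multiplication, deflate, and repeat three times. The function name `pca_project_3d`, the iteration count, and the fixed seed are assumptions made for illustration, not the tutorial's actual implementation.

```python
import math
import random


def pca_project_3d(vectors, iterations=50, seed=0):
    """Project equal-length vectors onto their top three principal components."""
    n, dim = len(vectors), len(vectors[0])
    # Center the data so components describe variance around the mean.
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    residual = [row[:] for row in centered]
    rng = random.Random(seed)  # assumed fixed seed, so repeated calls agree
    components = []
    for _ in range(3):
        # Start from a random unit vector.
        w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        start_norm = math.sqrt(sum(x * x for x in w))
        w = [x / start_norm for x in w]
        for _ in range(iterations):
            # Apply X^T X to w without building the dim x dim matrix.
            scores = [sum(r[i] * w[i] for i in range(dim)) for r in residual]
            w_next = [sum(s * r[i] for s, r in zip(scores, residual)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w_next))
            if norm < 1e-12:
                w = [0.0] * dim  # no variance left: emit a flat axis
                break
            w = [x / norm for x in w_next]
        components.append(w)
        # Deflate: remove the found direction so the next pass finds a new one.
        scores = [sum(r[i] * w[i] for i in range(dim)) for r in residual]
        residual = [[r[i] - s * w[i] for i in range(dim)] for r, s in zip(residual, scores)]
    # Coordinates of every original vector in the three-component basis.
    return [[sum(row[i] * c[i] for i in range(dim)) for c in components] for row in centered]
```

Passing the source vectors and the query vector in the same call gives every point coordinates in one shared basis, which is the property the 3D view depends on.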
## Prerequisites

Before we begin, make sure you have the following:

- Python installed on your machine (version 3.12 is recommended)
- Your [Gemini API key](https://aistudio.google.com/api-keys) for using Gemini Embedding 2
- A code editor of your choice
- Basic Python and FastAPI familiarity

## Code Walkthrough

### Setting Up the Environment

First, let's get our development environment ready:

1. Clone the GitHub repository:

   ```bash
   git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
   ```

2. From the repository root, go to the `multimodal_agentic_rag` backend folder:

   ```bash
   cd rag_tutorials/multimodal_agentic_rag/backend
   ```

3. Install the required dependencies listed in `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   ```

4. Grab your Gemini API key from [Google AI Studio](https://aistudio.google.com/api-keys) and set it in your current session:

   ```bash
   export GOOGLE_API_KEY="your-google-ai-studio-key"
   ```

### Creating the App

Project structure:

```
rag_tutorials/multimodal_agentic_rag/
|-- README.md
|-- assets/
|   `-- multimodal-agentic-rag-architecture.png
|-- backend/
|   |-- app_state.py
|   |-- rag_store.py
|   |-- requirements.txt
|   |-- server.py
|   `-- agentic_rag_agent/
|       |-- __init__.py
|       `-- agent.py
`-- frontend/
    |-- index.html
    |-- package.json
    |-- src/
    |   |-- App.tsx
    |   |-- main.tsx
    |   `-- styles.css
    |-- tsconfig.json
    `-- vite.config.ts
```

We'll skip the frontend code walkthrough and focus on the backend architecture.

#### 1. The Shared Store (`app_state.py`)

A single line keeps the in-memory index addressable from both FastAPI and the ADK tools:

```python
from rag_store import MultimodalRagStore

RAG_STORE = MultimodalRagStore()
```

Both `server.py` and the ADK tool functions import `RAG_STORE` from here, so the agent always sees the same sources you uploaded through the UI.

#### 2. The Multimodal Store (`rag_store.py`)

This is where most of the interesting code lives. A few constants set the contract:

```python
EMBED_MODEL = "gemini-embedding-2"
DEFAULT_DIMENSIONS = 768
CHUNK_WORDS = 170
CHUNK_OVERLAP = 35
INLINE_MEDIA_LIMIT_BYTES = 18 * 1024 * 1024
```

We chunk text into roughly 170-word windows with a 35-word overlap and embed each chunk separately. Anything bigger than ~18 MB, and any audio or video file, goes through the Gemini File API instead of inline bytes.

##### Embedding text with task prefixes

Gemini Embedding 2 supports task prefixes: small instructions like `"task: retrieval document"` or `"task: question answering | query"` that tell the model how this content will be used. Documents and queries get different prefixes, which is intended to improve retrieval quality:

```python
def _embed_text(self, text: str, task_prefix: str) -> list[float]:
    # Prepend the task prefix so the model knows how the text will be used.
    content = f"{task_prefix}: {text}"
    client = self._require_client()
    result = client.models.embed_content(
        model=EMBED_MODEL,
        contents=[content],
        config=types.EmbedContentConfig(output_dimensionality=self.dimensions),
    )
    return result.embeddings[0].values
```

The interesting bit is `output_dimensionality=768`: Gemini Embedding 2 supports truncating to smaller, latency-friendlier vectors right at the API call, so you don't have to pay for storage or cosine math on the full embedding width.

##### Embedding files (PDFs, images, audio, video)

Multimodal is where Gemini Embedding 2 earns its keep. Small images and PDFs go inline; large files and all audio and video go through the File API:

```python
def _embed_file(self, data, mime_type, title, notes):
    client = self._require_client()
    # Large payloads and all audio/video are uploaded instead of sent inline.
    use_file_api = (
        len(data) > INLINE_MEDIA_LIMIT_BYTES
        or mime_type.startswith("video/")
        or mime_type.startswith("audio/")
    )
    if use_file_api:
        return self._embed_uploaded_file(data, mime_type, title), "gemini-file-api"
    part = types.Part.from_bytes(data=data, mime_type=mime_type)
    result = client.models.embed_content(
        model=EMBED_MODEL,
        contents=[part],
        config=types.EmbedContentConfig(output_dimensionality=self.dimensions),
    )
    return result.embeddings[0].values, "gemini-inline"
```

The File API path uploads the file, polls until its state is `ACTIVE` / `SUCCEEDED`, embeds via `Part.from_uri`, and then deletes the uploaded file in a `finally` block. That cleanup matters: without it you leak storage on every upload.

To make a PDF or image still findable by its title (e.g., "the launch deck"), we blend the media vector with a text vector of the title plus user-provided notes:

```python
media_vector, embedding_path = self._embed_file(...)
annotation_vector = self._embed_text(f"{title}. {notes}", "task: retrieval document")
vector = blend_vectors(media_vector, annotation_vector)  # 68% media / 32% text
```

This is a small but very effective trick: native multimodal embeddings are great at semantic content, but humans often search by the label they gave the file.

##### Search: cosine similarity per chunk, deduplicated per source

```python
def search(self, query: str, top_k: int = 6) -> dict[str, Any]:
    query_vector = self._embed_text(query, "task: question answering | query")
    source_vectors = self._source_vectors()
    # Project sources and the query together so they share one 3D basis.
    projections = self._pca_projection({**source_vectors, query_id: query_vector})
    ...
    for chunk in self.chunks:
        score = round(cosine(query_vector, chunk.vector), 4)
        current = source_matches.get(chunk.source_id)
        # Keep only the best-scoring chunk for each source.
        if not current or score > current["score"]:
            source_matches[chunk.source_id] = { ... }
    matches = sorted(
        source_matches.values(), key=lambda m: m["score"], reverse=True
    )[:top_k]
```

Three subtle decisions here: we score every chunk but keep only the best chunk per source, we project source vectors and the query vector together so the 3D view shares the same basis, and we return a fully formed `space` snapshot so the frontend never has to ask twice.

##### PCA projection in pure Python

The `_pca_projection` method runs power iteration to find the top three principal components and projects every vector into 3D, with no NumPy and no scikit-learn. That keeps the dependency list short and the projection deterministic per request.

##### The retrieval payload

The agent doesn't see raw chunks; it sees a clean, model-friendly payload:

```python
def retrieval_payload(self, results):
    return {
        "provider": self.embedding_provider,
        "matches": [
            {
                "citation": m["id"],
                "source": m["title"],
                "modality": m["modality"],
                "similarity": m["score"],
                "evidence": m["text"],
            }
            for m in results["matches"]
        ],
    }
```

This payload is built from the same `matches` that `/ask` returns to the frontend, which is how we guarantee the answer and the citation panel never drift.

#### 3. The ADK Agent (`agentic_rag_agent/agent.py`)

A short, sharp ADK agent with two tools and a focused instruction:

```python
def retrieve_relevant_context(query: str, top_k: int = 5) -> dict:
    """Retrieve the most relevant multimodal source evidence for a user question."""
    return RAG_STORE.retrieval_tool(query=query, top_k=top_k)


def inspect_embedding_space() -> dict:
    """Inspect current sources, modalities, dimensions, and embedding provider."""
    return RAG_STORE.space_tool()


def build_agent(retrieval_tool=retrieve_relevant_context) -> Agent:
    return Agent(
        name="multimodal_agentic_rag_agent",
        model="gemini-3-flash-preview",
        description="Agentic RAG coordinator for a multimodal Gemini Embedding 2 workspace.",
        instruction="""
You are the Google ADK coordinator for a multimodal agentic RAG workspace.
For every user question:
1. Use inspect_embedding_space to understand the current workspace.
2. Use retrieve_relevant_context with the user's question before answering.
3. Ground the answer in the retrieved evidence. Do not invent facts...
4. Do not include raw citation ids, source ids, bracket citations...
5. Start with a clear direct answer in 2-3 sentences.
6. If helpful, add a short "Key points:" section with simple hyphen bullets.
""",
        tools=[inspect_embedding_space, retrieval_tool],
        generate_content_config=genai_types.GenerateContentConfig(
            temperature=0.25,
            max_output_tokens=900,
        ),
    )
```

`build_agent` accepts an injectable `retrieval_tool`. That's how `server.py` swaps in a closure that returns the already-computed retrieval packet, instead of letting the agent embed the query a second time.

#### 4. The FastAPI Server (`server.py`)

The endpoint surface is small and predictable:

| Method | Endpoint | What it does |
|---|---|---|
| GET | `/health` | Liveness, ADK availability, dimensions, source counts |
| | | Current sources, points, events, projection metadata |
| | | Add a text source |
| | | Fetch and index a public URL (SSRF-protected) |
| | | Upload PDF, image, audio, or video |
| | | Remove a source and its chunks |
| POST | `/ask` | Retrieve once, run ADK answer flow, return citations |

The key piece is `/ask`.
It retrieves once, builds a clean payload, and injects a closure into the agent so it can't redo the embedding:

```python
@app.post("/ask")
async def ask(req: AskRequest):
    # Run the blocking search off the event loop, exactly once per request.
    retrieval = await run_in_threadpool(RAG_STORE.search, req.question, req.top_k)
    retrieval_payload = RAG_STORE.retrieval_payload(retrieval)
    answer = await run_adk_agent(req.question, retrieval_payload)
    trace = [
        {"agent": "space inspector", "status": "complete", "detail": ...},
        {"agent": "retrieval tool", "status": "complete", "detail": ...},
        {"agent": "answer synthesizer", "status": "complete", "detail": ...},
    ]
    return {
        "answer": answer,
        "matches": retrieval["matches"],
        "query_point": retrieval["query_point"],
        "trace": trace,
        "space": retrieval["space"],
    }
```

And the closure injection inside `run_adk_agent`:

```python
async def run_adk_agent(question: str, retrieval: dict[str, Any]) -> str:
    def retrieve_relevant_context(query: str, top_k: int = 6) -> dict:
        """Return the exact retrieval packet already embedded for this request."""
        return retrieval

    request_agent = build_agent(retrieve_relevant_context)
    request_runner = Runner(
        agent=request_agent, app_name=APP_NAME, session_service=session_service
    )
    session = await session_service.create_session(app_name=APP_NAME, user_id=USER_ID)
    content = genai_types.Content(
        role="user",
        parts=[
            genai_types.Part(
                text=f"Question: {question}\nUse the retrieval tool result for this exact question."
            )
        ],
    )
    final_text = ""
    async for event in request_runner.run_async(
        user_id=USER_ID, session_id=session.id, new_message=content
    ):
        text = event_text(event)
        if text:
            final_text = text  # keep the latest non-empty text as the answer
    return final_text
```

The agent thinks it's calling a real retrieval tool. It is; the tool just returns a cached result. This is a clean way to keep agent semantics while skipping a redundant embedding round-trip.

A couple of safety details worth highlighting:

- SSRF protection: `validate_fetch_url` rejects non-HTTP schemes and resolves the hostname; if any returned IP is private, loopback, link-local, or reserved, ingestion fails. Set `ALLOW_PRIVATE_URLS=true` only when you really need it.
- Threadpool offloading: every blocking call (text chunking, file reads, search, PCA) runs in `run_in_threadpool` so the FastAPI event loop stays responsive.
- Configurable CORS: `ALLOWED_ORIGINS` is read from the env, defaulting to the Vite dev server.

#### 5. The Frontend (very brief)

The frontend is a single React/Vite app (`frontend/src/App.tsx`) that wraps three panels: a source manager for adding text/URLs/files, a Q&A panel that calls `/ask` and renders the answer plus a separate citations list, and a 3D embedding view built on Three.js that uses the projection coordinates returned by the backend. Every source is one colored point (color encodes modality), and after a question the query point and the cited sources are highlighted in the same PCA basis.

## Running the App

With our code in place, it's time to launch the app.

1. Start the backend:

   ```bash
   python server.py
   ```

   The backend listens on `http://localhost:8897`.

2. Start the frontend in a second terminal:

   ```bash
   cd multimodal_agentic_rag/frontend
   npm install
   npm run dev -- --port 5177
   ```

   If your backend lives on a different port, point the frontend at it:

   ```bash
   VITE_API_URL=http://localhost:8897 npm run dev -- --port 5177
   ```

3. Add a few sources: try a paragraph of text, a public URL, a PDF, and an image. Watch them appear as colored points in the embedding view.
4. Ask a question in the Q&A panel. Inspect the answer, the cited sources, and the agent trace. Notice the orange query point land near the sources the agent cites.
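The question step can also be exercised without the frontend. The sketch below posts a question to `/ask` using only the standard library and prints the answer with one line per returned match. It assumes the request body uses `question` and `top_k` fields (mirroring `req.question` and the top-k argument in the endpoint code) and that each match carries `title` and `score` keys; the helper names `build_ask_request`, `ask`, and `summarize` are hypothetical.

```python
import json
import urllib.request

API_URL = "http://localhost:8897"  # backend address from the steps above


def build_ask_request(question: str, top_k: int = 6) -> urllib.request.Request:
    """Build a POST /ask request (field names are assumed, see lead-in)."""
    body = json.dumps({"question": question, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, top_k: int = 6) -> dict:
    """Send the question to a running backend and decode the JSON response."""
    request = build_ask_request(question, top_k)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


def summarize(result: dict) -> str:
    """Format the answer followed by one line per cited source."""
    lines = [result.get("answer", "")]
    for match in result.get("matches", []):
        lines.append(f"- {match.get('title')} (score {match.get('score')})")
    return "\n".join(lines)
```

With the backend running and at least one source indexed, `print(summarize(ask("your question")))` should show the answer and the sources behind it, which makes it easy to compare against what the citation panel displays.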
A quick health check from the terminal:

```bash
curl http://localhost:8897/health
```

Expected response shape on a fresh start (the store begins empty):

```json
{
  "status": "ok",
  "adk": true,
  "setup_error": "",
  "sources": 0,
  "chunks": 0,
  "dimensions": 768,
  "provider": "gemini-embedding-2",
  "modalities": {},
  "chunk_modalities": {},
  "projection": "pca 3d"
}
```

## Conclusion

You've now built a multimodal agentic RAG app that puts text, URLs, PDFs, images, audio, and video into a single Gemini Embedding 2 space, retrieves with cosine similarity over chunked vectors, and uses a tightly scoped Google ADK agent to write grounded, citation-friendly answers, without a separate vector database, in a few hundred lines of Python.

A few directions worth exploring from here:

- Swap the in-memory store for a managed vector DB (pgvector, Qdrant, Vertex AI Vector Search) and persist the chunk metadata.
- Add re-ranking with a cross-encoder or a Gemini reranker between cosine retrieval and the agent.
- Add background ingestion with a queue (Celery, RQ, or a simple async worker) so large videos don't block the API.
- Evals: wire a small eval set with question/answer pairs and track citation precision and answer faithfulness over changes.
- Auth and multi-tenancy so different users see different workspaces.
- Observability: log the retrieval packet alongside the final answer; the single-retrieval contract makes faithfulness audits straightforward.

Keep experimenting with different configurations and features to build more sophisticated AI applications.