{"slug": "build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk", "title": "Build a Multimodal Agentic RAG App with Gemini Embedding 2 and Google ADK", "summary": "Google has released a new open-source tutorial demonstrating how to build a multimodal agentic retrieval-augmented generation (RAG) application using Gemini Embedding 2 and the Google Agent Development Kit (ADK). The application unifies text, PDFs, images, audio, and video into a single embedding space, allowing a single retrieval to power both an AI agent’s grounded answer and a visual citation panel.", "body_md": "# Build a Multimodal Agentic RAG App with Gemini Embedding 2 and Google ADK\n\n## (100% open source)\n\nIf you have built a RAG app before, you know how quickly the \"just retrieve the right chunk\" problem fragments the moment your sources stop being plain text. Product PDFs, UI screenshots, recorded calls, demo videos, and support notes all carry the answer your user is asking for, but each lives in its own embedding silo.\n\nStitching them together usually means three pipelines, two vector stores, and a glue layer you regret pretty soon.\n\nIn this tutorial, you'll build a fully-working **multimodal agentic RAG app** where text, URLs, PDFs, images, audio, and video all share a single 768-dimension embedding space, and a small Google Agent Development Kit (ADK) coordinator turns the retrieved evidence into a grounded, cited answer.\n\nThe two pieces doing the heavy lifting are **Gemini Embedding 2**, which embeds every modality into the same vector space, and **Google ADK**, which wraps the retrieval call in an agent that inspects the workspace, calls the retrieval tool, and writes the answer.\n\nYou'll see exactly how those two pieces compose without any extra orchestration framework.\n\n**What We’re Building**\n\nA multimodal Agentic RAG demo where you can drop in 
any file or URL and ask questions across the whole index. The same retrieval packet powers both the answer and the citation panel, so the UI never disagrees with the model.\n\n**Key features:**\n\n- **Truly multimodal index:** text, URLs, PDFs, images, audio, and video all live in one cosine-similarity space.\n- **Gemini Embedding 2 with task prefixes:** separate prefixes for documents and queries to improve retrieval quality.\n- **Google ADK agent:** coordinates the `inspect_embedding_space` and `retrieve_relevant_context` tools, then synthesizes a grounded answer.\n- **Single retrieval, two consumers:** `/ask` retrieves once, then passes the same packet to the agent and to the UI.\n- **3D PCA embedding view:** every source is one point; ask a question and the query and cited sources light up in the same projection.\n- **SSRF-safe URL ingestion:** private and loopback IPs are blocked unless you opt in.\n\n**How It Works**\n\nEnd-to-end, one question flows like this:\n\n- **You add sources.** Each source is chunked (text/URL) or uploaded once (PDF/image/audio/video). Every chunk gets a Gemini Embedding 2 vector with the `task: retrieval document` prefix. Files get a media vector blended with a text annotation vector, so titles still help retrieval.\n- **You ask a question.** `/ask` embeds the query with the `task: question answering | query` prefix, scores every chunk by cosine similarity, keeps the best chunk per source, takes the top *k*, and projects everything into 3D using power-iteration PCA.\n- **The agent runs.** `_run_adk_agent` builds a per-request agent whose `retrieve_relevant_context` tool is a closure over the already-computed retrieval packet. 
The agent calls `inspect_embedding_space`, then \"calls\" the retrieval tool, then writes a grounded answer with no inline citation IDs.\n- **The UI renders.** The frontend shows the answer text, the citation panel (built from the same `matches`), the agent trace, and the updated 3D view with the query point and highlighted sources.\n\nThe architectural insight is the **single-retrieval contract**: one query embedding, one ranked list of matches, two consumers (the agent and the UI). That's what keeps citations honest.\n\n**Prerequisites**\n\nBefore we begin, make sure you have the following:\n\n- Python installed on your machine (version 3.12 is recommended)\n- Your [Gemini API key](https://aistudio.google.com/api-keys?utm_source=www.theunwindai.com&utm_medium=referral&utm_campaign=build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk) for using Gemini Embedding 2\n- A code editor of your choice\n- Basic Python and FastAPI familiarity\n\n**Code Walkthrough**\n\n**Setting Up the Environment**\n\nFirst, let's get our development environment ready:\n\nClone the GitHub repository:\n\n```\ngit clone https://github.com/Shubhamsaboo/awesome-llm-apps.git\n```\n\nGo to the backend folder of **multimodal_agentic_rag** inside the cloned repository:\n\n```\ncd awesome-llm-apps/rag_tutorials/multimodal_agentic_rag/backend\n```\n\nInstall the [required dependencies](https://github.com/Shubhamsaboo/awesome-llm-apps/blob/main/voice_ai_agents/insurance_claim_live_agent_team/requirements.txt?utm_source=www.theunwindai.com&utm_medium=referral&utm_campaign=build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk):\n\n```\npip install -r requirements.txt\n```\n\nGrab your [Gemini API key from Google AI Studio](https://aistudio.google.com/api-keys?utm_source=www.theunwindai.com&utm_medium=referral&utm_campaign=build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk) and set it in your current session:\n\n```\nexport GOOGLE_API_KEY=\"your-google-ai-studio-key\"\n```\n\n**Creating 
the App**\n\nProject structure:\n\n```\nrag_tutorials/multimodal_agentic_rag/\n|-- README.md\n|-- assets/\n|   `-- multimodal-agentic-rag-architecture.png\n|-- backend/\n|   |-- app_state.py\n|   |-- rag_store.py\n|   |-- requirements.txt\n|   |-- server.py\n|   `-- agentic_rag_agent/\n|       |-- __init__.py\n|       `-- agent.py\n`-- frontend/\n    |-- index.html\n    |-- package.json\n    |-- src/\n    |   |-- App.tsx\n    |   |-- main.tsx\n    |   `-- styles.css\n    |-- tsconfig.json\n    `-- vite.config.ts\n```\n\nWe’ll skip the frontend code walkthrough and focus on the backend architecture.\n\n**1. The Shared Store (`app_state.py`)**\n\nA single module-level instance keeps the in-memory index addressable from both FastAPI and the ADK tools:\n\n```python\nfrom rag_store import MultimodalRagStore\n\nRAG_STORE = MultimodalRagStore()\n```\n\nBoth `server.py` and the ADK tool functions import `RAG_STORE` from here, so the agent always sees the same sources you uploaded through the UI.\n\n**2. The Multimodal Store (`rag_store.py`)**\n\nThis is where most of the interesting code lives. A few constants set the contract:\n\n```\nEMBED_MODEL = \"gemini-embedding-2\"\nDEFAULT_DIMENSIONS = 768\nCHUNK_WORDS = 170\nCHUNK_OVERLAP = 35\nINLINE_MEDIA_LIMIT_BYTES = 18 * 1024 * 1024\n```\n\nWe chunk text into roughly 170-word windows with 35-word overlap and embed each chunk separately. Anything bigger than ~18 MB, and any audio or video file regardless of size, goes through the **Gemini File API** instead of inline bytes.\n\n**Embedding text with task prefixes**\n\nGemini Embedding 2 supports task prefixes: small instructions like `\"task: retrieval document\"` or `\"task: question answering | query\"` that tell the model how this content will be used. 
Documents and queries get different prefixes, which is meant to improve retrieval quality:\n\n```python\ndef _embed_text(self, text: str, task_prefix: str) -> list[float]:\n    content = f\"{task_prefix}: {text}\"\n    client = self._require_client()\n\n    result = client.models.embed_content(\n        model=EMBED_MODEL,\n        contents=[content],\n        config=types.EmbedContentConfig(output_dimensionality=self.dimensions),\n    )\n    return result.embeddings[0].values\n```\n\nThe interesting bit is `output_dimensionality` (768 here): Gemini Embedding 2 supports truncating to smaller, latency-friendlier vectors right at the API call, so you don't have to pay for storage or cosine math on the full embedding width.\n\n**Embedding files (PDFs, images, audio, video)**\n\nMultimodal is where Gemini Embedding 2 earns its keep. Small images and PDFs go inline; large files and all audio/video go through the File API:\n\n```python\ndef _embed_file(self, data, mime_type, title, notes):\n    client = self._require_client()\n\n    use_file_api = (\n        len(data) > INLINE_MEDIA_LIMIT_BYTES\n        or mime_type.startswith(\"video/\")\n        or mime_type.startswith(\"audio/\")\n    )\n    if use_file_api:\n        return self._embed_uploaded_file(data, mime_type, title), \"gemini-file-api\"\n\n    part = types.Part.from_bytes(data=data, mime_type=mime_type)\n    result = client.models.embed_content(\n        model=EMBED_MODEL,\n        contents=[part],\n        config=types.EmbedContentConfig(output_dimensionality=self.dimensions),\n    )\n    return result.embeddings[0].values, \"gemini-inline\"\n```\n\nThe File API path uploads the file, polls until its state is `ACTIVE`/`SUCCEEDED`, embeds via `Part.from_uri`, and then **deletes the uploaded file** in a `finally` block, which matters so you don't leak storage on every upload.\n\nTo make a PDF or image still findable by its title (e.g., \"the launch deck\"), we blend the media vector with a text vector of the title plus 
user-provided notes:\n\n```\nmedia_vector, embedding_path = self._embed_file(...)\nannotation_vector = self._embed_text(f\"{title}. {notes}\", \"task: retrieval document\")\nvector = _blend_vectors(media_vector, annotation_vector)  # 68% media / 32% text\n```\n\nThis is a small but useful trick: native multimodal embeddings are great at semantic content, but humans often search by the label they gave the file.\n\n**Search: cosine similarity per chunk, deduplicated per source**\n\n```python\ndef search(self, query: str, top_k: int = 6) -> dict[str, Any]:\n    query_vector = self._embed_text(query, \"task: question answering | query\")\n    source_vectors = self._source_vectors()\n    projections = self._pca_projection({**source_vectors, query_id: query_vector})\n    ...\n    for chunk in self.chunks:\n        score = round(_cosine(query_vector, chunk.vector), 4)\n        current = source_matches.get(chunk.source_id)\n        if not current or score > current[\"score\"]:\n            source_matches[chunk.source_id] = { ... }\n    matches = sorted(source_matches.values(), key=lambda m: m[\"score\"], reverse=True)[:top_k]\n```\n\nThree subtle decisions here: we score every chunk but keep only the **best chunk per source**, we project source vectors and the query vector together so the 3D view shares the same basis, and we return a fully-formed `space` snapshot so the frontend never has to ask twice.\n\n**PCA projection in pure Python**\n\nThe `_pca_projection` method runs power iteration to find the top three principal components and projects every vector into 3D, with no NumPy and no scikit-learn. 
That keeps the dependency list short and the projection deterministic per request.\n\n**The retrieval payload**\n\nThe agent doesn't see raw chunks; it sees a clean, model-friendly payload:\n\n```python\ndef retrieval_payload(self, results):\n    return {\n        \"provider\": self.embedding_provider,\n        \"matches\": [\n            {\n                \"citation\": m[\"id\"],\n                \"source\": m[\"title\"],\n                \"modality\": m[\"modality\"],\n                \"similarity\": m[\"score\"],\n                \"evidence\": m[\"text\"],\n            }\n            for m in results[\"matches\"]\n        ],\n    }\n```\n\nThis payload is built from the same `matches` that `/ask` returns to the frontend, which is how we guarantee the answer and the citation panel never drift.\n\n**3. The ADK Agent (`agentic_rag_agent/agent.py`)**\n\nA short, sharp ADK agent with two tools and a focused instruction:\n\n```python\n# Imports assumed from how the snippet uses these names.\nfrom google.adk.agents import Agent\nfrom google.genai import types as genai_types\n\nfrom app_state import RAG_STORE\n\ndef retrieve_relevant_context(query: str, top_k: int = 5) -> dict:\n    \"\"\"Retrieve the most relevant multimodal source evidence for a user question.\"\"\"\n    return RAG_STORE.retrieval_tool(query=query, top_k=top_k)\n\ndef inspect_embedding_space() -> dict:\n    \"\"\"Inspect current sources, modalities, dimensions, and embedding provider.\"\"\"\n    return RAG_STORE.space_tool()\n\ndef build_agent(retrieval_tool=retrieve_relevant_context) -> Agent:\n    return Agent(\n        name=\"multimodal_agentic_rag_agent\",\n        model=\"gemini-3-flash-preview\",\n        description=\"Agentic RAG coordinator for a multimodal Gemini Embedding 2 workspace.\",\n        instruction=\"\"\"\nYou are the Google ADK coordinator for a multimodal agentic RAG workspace.\n\nFor every user question:\n1. Use inspect_embedding_space to understand the current workspace.\n2. Use retrieve_relevant_context with the user's question before answering.\n3. Ground the answer in the retrieved evidence. Do not invent facts...\n4. 
Do not include raw citation ids, source ids, bracket citations...\n5. Start with a clear direct answer in 2-3 sentences.\n6. If helpful, add a short \"Key points:\" section with simple hyphen bullets.\n\"\"\",\n        tools=[inspect_embedding_space, retrieval_tool],\n        generate_content_config=genai_types.GenerateContentConfig(\n            temperature=0.25,\n            max_output_tokens=900,\n        ),\n    )\n```\n\n`build_agent` accepts an injectable `retrieval_tool`. That's how `server.py` swaps in a closure that returns the **already-computed** retrieval packet, instead of letting the agent embed the query a second time.\n\n**4. The FastAPI Server (`server.py`)**\n\nThe endpoint surface is small and predictable:\n\n| Method | Endpoint | What it does |\n|---|---|---|\n| GET | `/health` | Liveness, ADK availability, dimensions, source counts |\n| | | Current sources, points, events, projection metadata |\n| | | Add a text source |\n| | | Fetch and index a public URL (SSRF-protected) |\n| | | Upload PDF, image, audio, or video |\n| | | Remove a source and its chunks |\n| POST | `/ask` | Retrieve once, run ADK answer flow, return citations |\n\nThe key piece is `/ask`. 
It retrieves once, builds a clean payload, and injects a closure into the agent so it can't redo the embedding:\n\n```python\n@app.post(\"/ask\")\nasync def ask(req: AskRequest):\n    retrieval = await run_in_threadpool(RAG_STORE.search, req.question, req.top_k)\n    retrieval_payload = RAG_STORE.retrieval_payload(retrieval)\n    answer = await _run_adk_agent(req.question, retrieval_payload)\n    trace = [\n        {\"agent\": \"space_inspector\",   \"status\": \"complete\", \"detail\": ...},\n        {\"agent\": \"retrieval_tool\",    \"status\": \"complete\", \"detail\": ...},\n        {\"agent\": \"answer_synthesizer\",\"status\": \"complete\", \"detail\": ...},\n    ]\n    return {\n        \"answer\": answer,\n        \"matches\": retrieval[\"matches\"],\n        \"query_point\": retrieval[\"query_point\"],\n        \"trace\": trace,\n        \"space\": retrieval[\"space\"],\n    }\n```\n\nAnd the closure injection inside `_run_adk_agent`:\n\n```python\nasync def _run_adk_agent(question: str, retrieval: dict[str, Any]) -> str:\n    def retrieve_relevant_context(query: str, top_k: int = 6) -> dict:\n        \"\"\"Return the exact retrieval packet already embedded for this request.\"\"\"\n        return retrieval\n\n    request_agent = build_agent(retrieve_relevant_context)\n    request_runner = Runner(agent=request_agent, app_name=APP_NAME, session_service=session_service)\n    session = await session_service.create_session(app_name=APP_NAME, user_id=USER_ID)\n    content = genai_types.Content(\n        role=\"user\",\n        parts=[genai_types.Part(text=f\"Question: {question}\\nUse the retrieval tool result for this exact question.\")],\n    )\n    final_text = \"\"\n    async for event in request_runner.run_async(user_id=USER_ID, session_id=session.id, new_message=content):\n        text = _event_text(event)\n        if text:\n            final_text = text\n    return final_text\n```\n\nThe agent thinks it's calling a real retrieval tool. 
It is; the tool just returns a cached result. This is a clean way to keep agent semantics while skipping a redundant embedding round-trip.\n\nA few safety and operational details worth highlighting:\n\n- **SSRF protection:** `_validate_fetch_url` rejects non-HTTP schemes and resolves the hostname; if any returned IP is private, loopback, link-local, or reserved, ingestion fails. Set `ALLOW_PRIVATE_URLS=true` only when you really need it.\n- **Threadpool offloading:** every blocking call (text chunking, file reads, search, PCA) runs in `run_in_threadpool` so the FastAPI event loop stays responsive.\n- **Configurable CORS:** `ALLOWED_ORIGINS` is read from the env, defaulting to the Vite dev server.\n\n**5. The Frontend (very brief)**\n\nThe frontend is a single React/Vite app (`frontend/src/App.tsx`) that wraps three panels: a **source manager** for adding text/URLs/files, a **Q&A panel** that calls `/ask` and renders the answer plus a separate citations list, and a **3D embedding view** built on Three.js that uses the projection coordinates returned by the backend. 
Every source is one colored point (color encodes modality), and after a question the query point and the cited sources are highlighted in the same PCA basis.\n\n**Running the App**\n\nWith our code in place, it's time to launch the app.\n\nStart the backend:\n\n```\npython server.py\n```\n\nThe backend listens on `http://localhost:8897`.\n\nStart the frontend in a second terminal, from the repository root:\n\n```\ncd rag_tutorials/multimodal_agentic_rag/frontend\nnpm install\nnpm run dev -- --port 5177\n```\n\nIf your backend lives on a different host or port, point the frontend at it with `VITE_API_URL` (shown here with the default address):\n\n```\nVITE_API_URL=http://localhost:8897 npm run dev -- --port 5177\n```\n\nThen try it out:\n\n- Add a few sources: try a paragraph of text, a public URL, a PDF, and an image.\n- Watch them appear as colored points in the embedding view.\n- Ask a question in the Q&A panel.\n- Inspect the answer, the cited sources, and the agent trace.\n- Notice the orange query point land near the sources the agent cites.\n\nA quick health check from the terminal:\n\n```\ncurl http://localhost:8897/health\n```\n\nExpected response shape on a fresh start (the store begins empty):\n\n```\n{\n  \"status\": \"ok\",\n  \"adk\": true,\n  \"setup_error\": \"\",\n  \"sources\": 0,\n  \"chunks\": 0,\n  \"dimensions\": 768,\n  \"provider\": \"gemini-embedding-2\",\n  \"modalities\": {},\n  \"chunk_modalities\": {},\n  \"projection\": \"pca_3d\"\n}\n```\n\n**Conclusion**\n\nYou've now built a multimodal agentic RAG app that puts text, URLs, PDFs, images, audio, and video into a single Gemini Embedding 2 space, retrieves with cosine similarity over chunked vectors, and uses a tightly-scoped Google ADK agent to write grounded, citation-friendly answers, without a separate vector database, in a few hundred lines of Python.\n\nA few directions worth exploring from here:\n\n- **Swap the in-memory store** for a managed vector DB (pgvector, Qdrant, Vertex AI Vector Search) and persist the chunk metadata.\n- **Add re-ranking** with a cross-encoder or a 
Gemini reranker between cosine retrieval and the agent.\n- **Background ingestion** with a queue (Celery, RQ, or a simple async worker) so large videos don't block the API.\n- **Evals:** wire a small eval set with question/answer pairs and track citation precision and answer faithfulness over changes.\n- **Auth + multi-tenancy** so different users see different workspaces.\n- **Observability:** log the retrieval packet alongside the final answer; the single-retrieval contract makes faithfulness audits straightforward.\n\nKeep experimenting with different configurations and features to build more sophisticated AI applications.", "url": "https://wpnews.pro/news/build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk", "canonical_source": "https://www.theunwindai.com/p/build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk", "published_at": "2026-05-09 23:07:20+00:00", "updated_at": "2026-06-04 14:17:10.945924+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "generative-ai", "ai-agents", "ai-tools"], "entities": ["Gemini Embedding 2", "Google ADK", "Google Agent Development Kit"], "alternates": {"html": "https://wpnews.pro/news/build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk", "markdown": "https://wpnews.pro/news/build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk.md", "text": "https://wpnews.pro/news/build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk.txt", "jsonld": "https://wpnews.pro/news/build-a-multimodal-agentic-rag-app-with-gemini-embedding-2-and-google-adk.jsonld"}}