{"slug": "analyze-images-and-pdfs-with-google-gemini-s-multimodal-api-in-python", "title": "Analyze Images and PDFs with Google Gemini's Multimodal API in Python", "summary": "Google's Gemini 1.5 Flash multimodal API can now analyze images and PDFs in Python, returning structured JSON. A new script demonstrates sending a local JPEG and PDF to the model and parsing the response using the google-generativeai SDK.", "body_md": "# Analyze Images and PDFs with Google Gemini's Multimodal API in Python\n\nSend a photo or a PDF to Gemini 1.5 Flash and get back clean, structured JSON — in under 50 lines of Python.\n\n[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza)\n\n## What you'll build\n\nA Python script that sends a local JPEG and a PDF to Gemini 1.5 Flash and parses a structured JSON object from each response, using Google's official `google-generativeai` SDK.\n\n## Prerequisites\n\n- Python **3.9 or newer** (`python --version` to check)\n- A **Google AI Studio API key**, free at [aistudio.google.com/app/apikey](https://aistudio.google.com/app/apikey)\n- pip 23+\n- A sample JPEG and a sample PDF on disk\n\nOS note: Commands below use `export` (macOS/Linux). On Windows PowerShell use `$env:GEMINI_API_KEY = \"your-key\"`.\n\n## Step 1 — Store your API key safely\n\nNever hard-code credentials. Export the key as an environment variable in your terminal session:\n\n```\nexport GEMINI_API_KEY=\"your-key-here\"\n```\n\n## Step 2 — Install dependencies\n\n```\npip install \"google-generativeai>=0.7.0\" Pillow\n```\n\n`Pillow` loads local images; the Google SDK handles everything else, including PDF uploads via the File API.\n\n## Step 3 — Write the script\n\nCreate `gemini_multimodal.py`:\n\n```python\nimport json\nimport os\nimport time\n\nimport PIL.Image\nimport google.generativeai as genai\n\n# Configure the SDK (reads the key from the environment; never commit secrets)\ngenai.configure(api_key=os.environ[\"GEMINI_API_KEY\"])\n\n# Setting response_mime_type asks Gemini to return JSON instead of free-form text\nmodel = genai.GenerativeModel(\n    model_name=\"gemini-1.5-flash\",\n    generation_config={\"response_mime_type\": \"application/json\"},\n)\n\ndef analyze_image(path: str, prompt: str) -> dict:\n    \"\"\"Open a local image (JPEG/PNG/WebP) and return parsed JSON.\"\"\"\n    with PIL.Image.open(path) as image:  # close the file handle once the request is sent\n        response = model.generate_content([image, prompt])\n    return json.loads(response.text)\n\ndef analyze_pdf(path: str, prompt: str) -> dict:\n    \"\"\"Upload a PDF via the File API, query it, then delete the upload.\"\"\"\n    uploaded = genai.upload_file(path)\n    try:\n        # Wait for Google's servers to finish processing (usually instant for small files)\n        while uploaded.state.name == \"PROCESSING\":\n            time.sleep(2)\n            uploaded = genai.get_file(uploaded.name)\n        response = model.generate_content([uploaded, prompt])\n        return json.loads(response.text)\n    finally:\n        genai.delete_file(uploaded.name)  # optional cleanup; runs even if the request fails\n\nif __name__ == \"__main__\":\n    # --- Image ---\n    img_result = analyze_image(\n        \"sample.jpg\",\n        \"Return JSON with keys: description, dominant_colors (list), has_people (bool).\",\n    )\n    print(\"IMAGE:\", json.dumps(img_result, indent=2))\n\n    # --- PDF ---\n    pdf_result = analyze_pdf(\n        \"document.pdf\",\n        \"Return JSON with keys: title, summary (one sentence), page_count_estimate (int).\",\n    )\n    print(\"PDF:\", json.dumps(pdf_result, indent=2))\n```\n\n**Why response_mime_type?** Without it, the model sometimes wraps JSON in markdown fences (`` ```json … ``` ``), breaking
`json.loads`. This config key forces clean JSON output at the model level.\n\n**Why upload_file for PDFs?** PDFs can't be opened with Pillow. The File API accepts `application/pdf` files up to 2 GB and stores each file for up to 48 hours.\n\n## Step 4 — Run it\n\nPlace `sample.jpg` and `document.pdf` in the same directory as the script, then:\n\n```\npython gemini_multimodal.py\n```\n\n## Verify it works\n\nExpected output shape (values vary by file):\n\n```\nIMAGE: {\n  \"description\": \"A golden retriever sitting in a sunny park.\",\n  \"dominant_colors\": [\n    \"yellow\",\n    \"green\",\n    \"blue\"\n  ],\n  \"has_people\": false\n}\nPDF: {\n  \"title\": \"Q3 Financial Report\",\n  \"summary\": \"Revenue grew 12% year-over-year driven by cloud services.\",\n  \"page_count_estimate\": 8\n}\n```\n\nBoth blocks must parse without error via `json.loads`. If the script exits without exceptions, you're done.\n\n## Troubleshooting\n\n| Error | Cause | Fix |\n|---|---|---|\n| `KeyError: 'GEMINI_API_KEY'` | Environment variable not set in this shell | Re-run `export GEMINI_API_KEY=\"...\"` in the same terminal |\n| `google.api_core.exceptions.InvalidArgument` on image | Unsupported format passed to Pillow/Gemini | Use JPEG, PNG, or WebP; convert other formats such as BMP or TIFF first |\n| `json.JSONDecodeError` | Older SDK wrapped JSON in markdown fences | Run `pip install -U google-generativeai`; v0.7+ respects `response_mime_type` reliably |\n| `ResourceExhausted: 429` | Free-tier rate limit (15 req/min for Flash) | Wait 60 seconds and retry; reduce call frequency |\n\n## Next steps\n\n- **Typed schemas**: Pass a `TypedDict` or Pydantic model as `response_schema` inside `GenerationConfig` (SDK ≥ 0.8) to validate field types automatically.\n- **Multi-image comparison**: Pass a list such as `[img1, img2, \"Compare these two images\"]`.\n- **Vertex AI**: Replace `google-generativeai` with the `vertexai` SDK for IAM auth, VPC controls, and enterprise quotas.\n- Official vision docs: [ai.google.dev/gemini-api/docs/vision](https://ai.google.dev/gemini-api/docs/vision)\n\n[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza) · Senior Editor\n\nMariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.", "url": "https://wpnews.pro/news/analyze-images-and-pdfs-with-google-gemini-s-multimodal-api-in-python", "canonical_source": "https://www.devclubhouse.com/a/analyze-images-and-pdfs-with-google-geminis-multimodal-api-in-python", "published_at": "2026-06-21 07:34:20+00:00", "updated_at": "2026-06-21 07:38:59.576010+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "generative-ai", "computer-vision", "natural-language-processing"], "entities": ["Google", "Gemini", "Google AI Studio", "Pillow", "Mariana Souza"], "alternates": {"html": "https://wpnews.pro/news/analyze-images-and-pdfs-with-google-gemini-s-multimodal-api-in-python", "markdown": "https://wpnews.pro/news/analyze-images-and-pdfs-with-google-gemini-s-multimodal-api-in-python.md", "text": "https://wpnews.pro/news/analyze-images-and-pdfs-with-google-gemini-s-multimodal-api-in-python.txt", "jsonld": "https://wpnews.pro/news/analyze-images-and-pdfs-with-google-gemini-s-multimodal-api-in-python.jsonld"}}