Analyze Images and PDFs with Google Gemini's Multimodal API in Python

Google's Gemini 1.5 Flash multimodal API can now analyze images and PDFs in Python, returning structured JSON. A new script demonstrates sending a local JPEG and PDF to the model and parsing the response using the google-generativeai SDK.

Send a photo or a PDF to Gemini 1.5 Flash and get back clean, structured JSON in under 50 lines of Python.

By Mariana Souza, Senior Editor (https://www.devclubhouse.com/u/mariana_souza)

What you'll build

A Python script that sends a local JPEG and a PDF to Gemini 1.5 Flash and parses a structured JSON object from each response, using Google's official `google-generativeai` SDK.

Prerequisites

- Python 3.9 or newer (run `python --version` to check)
- A Google AI Studio API key, free at https://aistudio.google.com/app/apikey
- pip 23+
- A sample JPEG and a sample PDF on disk

OS note: Commands below use `export` (macOS/Linux). On Windows PowerShell, use `$env:GEMINI_API_KEY = "your-key"`.

Step 1: Store your API key safely

Never hard-code credentials. Export the key as an environment variable in your terminal session:

```bash
export GEMINI_API_KEY="your-key-here"
```

Step 2: Install dependencies

```bash
pip install "google-generativeai>=0.7.0" Pillow
```

Pillow loads local images; the Google SDK handles everything else, including PDF uploads via the File API.

Step 3: Write the script

Create `gemini_multimodal.py`:
```python
import json
import os
import time

import PIL.Image
import google.generativeai as genai

# Configure the SDK (reads the key from the environment; never commit secrets)
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Setting response_mime_type tells Gemini to emit valid JSON every time
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config={"response_mime_type": "application/json"},
)


def analyze_image(path: str, prompt: str) -> dict:
    """Open a local image (JPEG/PNG/WebP) and return parsed JSON."""
    image = PIL.Image.open(path)
    response = model.generate_content([image, prompt])
    return json.loads(response.text)


def analyze_pdf(path: str, prompt: str) -> dict:
    """Upload a PDF via the File API, query it, then delete the upload."""
    uploaded = genai.upload_file(path)

    # Wait for Google's servers to finish processing (usually instant for small files)
    while uploaded.state.name == "PROCESSING":
        time.sleep(2)
        uploaded = genai.get_file(uploaded.name)

    response = model.generate_content([uploaded, prompt])
    genai.delete_file(uploaded.name)  # optional: remove from Google servers immediately
    return json.loads(response.text)


if __name__ == "__main__":
    # --- Image ---
    img_result = analyze_image(
        "sample.jpg",
        "Return JSON with keys: description, dominant_colors (list), has_people (bool).",
    )
    print("IMAGE:", json.dumps(img_result, indent=2))

    # --- PDF ---
    pdf_result = analyze_pdf(
        "document.pdf",
        "Return JSON with keys: title, summary (one sentence), page_count_estimate (int).",
    )
    print("PDF:", json.dumps(pdf_result, indent=2))
```

Why `response_mime_type`? Without it, the model sometimes wraps its JSON in a markdown code fence tagged `json`, which breaks `json.loads`. This config key forces clean JSON output at the model level.

Why `upload_file` for PDFs? PDFs can't be opened with Pillow. The File API accepts `application/pdf` up to 2 GB and stores the file for up to 48 hours.
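Even with `response_mime_type` set, a defensive parser can serve as a cheap safety net against the markdown-fence failure described above. The following is a minimal sketch, not part of the original script: `parse_model_json` is a hypothetical helper name, and it assumes the model output is either bare JSON or a single JSON value wrapped in one fence.

```python
import json
import re

# Three backticks, built programmatically so this listing is easy to embed.
FENCE = "`" * 3

# Optional opening fence (with an optional language tag such as "json"),
# the payload, then the closing fence.
_FENCE_RE = re.compile(
    r"^" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_model_json(text: str):
    """Parse JSON from model output, tolerating one surrounding markdown fence.

    Hypothetical helper for illustration; it is not part of the Google SDK.
    """
    cleaned = text.strip()
    match = _FENCE_RE.match(cleaned)
    if match:
        # Keep only what sits between the opening and closing fence.
        cleaned = match.group(1).strip()
    return json.loads(cleaned)
```

In the script, `json.loads(response.text)` could be swapped for `parse_model_json(response.text)` if fenced output ever shows up; input that is not valid JSON after unwrapping still raises `json.JSONDecodeError`.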
Step 4: Run it

Place `sample.jpg` and `document.pdf` in the same directory as the script, then run:

```bash
python gemini_multimodal.py
```

Verify it works

Expected output shape (values vary by file):

```
IMAGE: {
  "description": "A golden retriever sitting in a sunny park.",
  "dominant_colors": [
    "yellow",
    "green",
    "blue"
  ],
  "has_people": false
}
PDF: {
  "title": "Q3 Financial Report",
  "summary": "Revenue grew 12% year-over-year driven by cloud services.",
  "page_count_estimate": 8
}
```

Both blocks must parse without error via `json.loads`. If the script exits without exceptions, you're done.

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| `KeyError: 'GEMINI_API_KEY'` | Environment variable not set in this shell | Re-run `export GEMINI_API_KEY="..."` in the same terminal |
| `google.api_core.exceptions.InvalidArgument` on image | Unsupported format passed to Pillow/Gemini | Use JPEG, PNG, WebP, or GIF; convert BMP/TIFF first |
| `json.JSONDecodeError` | Older SDK wrapped JSON in markdown fences | Run `pip install -U google-generativeai`; v0.7+ respects `response_mime_type` reliably |
| `ResourceExhausted: 429` | Free-tier rate limit (15 req/min for Flash) | Wait 60 seconds and retry; reduce call frequency |

Next steps

- Typed schemas: Pass a TypedDict or Pydantic model as `response_schema` inside `GenerationConfig` (SDK ≥ 0.8) to validate field types automatically.
- Multi-image comparison: Pass a list: `[img1, img2, "Compare these two images"]`.
- Vertex AI: Replace `google-generativeai` with the `vertexai` SDK for IAM auth, VPC controls, and enterprise quotas.
- Official vision docs: https://ai.google.dev/gemini-api/docs/vision
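The troubleshooting advice for `ResourceExhausted: 429` (wait, then retry) can be automated. Below is a minimal sketch of retrying with exponential backoff, under stated assumptions: `call_with_retry` is a hypothetical helper, the exception types to retry on are passed in by the caller so the sketch runs without the SDK or a network connection, and the default delays are illustrative rather than tuned to Google's quotas.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff on the given exceptions.

    Hypothetical helper for illustration; it is not part of the Google SDK.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In the tutorial's script this might wrap a call as `call_with_retry(lambda: analyze_image("sample.jpg", prompt), (ResourceExhausted,))`, with `ResourceExhausted` imported from `google.api_core.exceptions`. Raising `base_delay` toward the 60 seconds the table suggests may suit the free tier better than the short default used here.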