Analyze Images and PDFs with Google Gemini's Multimodal API in Python


[ARTICLE · art-35386] src=devclubhouse.com ↗ pub=2026-06-21T07:34Z topic=artificial-intelligence verified=true sentiment=↑ positive

Google's Gemini 1.5 Flash multimodal API can now analyze images and PDFs in Python, returning structured JSON. A new script demonstrates sending a local JPEG and PDF to the model and parsing the response using the google-generativeai SDK.

4 min read · Published Jun 21, 2026

Send a photo or a PDF to Gemini 1.5 Flash and get back clean, structured JSON — in under 50 lines of Python.

Mariana Souza

What you'll build #

A Python script that sends a local JPEG and a PDF to Gemini 1.5 Flash and parses a structured JSON object from each response, using Google's official google-generativeai SDK.

Prerequisites #

- Python 3.9 or newer (python --version to check)
- A Google AI Studio API key, free at aistudio.google.com/app/apikey
- pip 23+
- A sample JPEG and a sample PDF on disk

OS note: Commands below use export (macOS/Linux). On Windows PowerShell, use $env:GEMINI_API_KEY = "your-key"

Step 1 — Store your API key safely #

Never hard-code credentials. Export the key as an environment variable in your terminal session:

export GEMINI_API_KEY="your-key-here"
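A minimal sketch of failing fast when the variable is missing, so the script reports a readable message instead of a bare KeyError. The helper name require_api_key is hypothetical and not part of the SDK; it only reads the environment.

```python
import os


def require_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the API key from the environment or raise a readable error.

    `require_api_key` is a hypothetical helper, not part of any SDK.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in this shell before running the script."
        )
    return key
```

If you adopt something like this, its return value could be passed as api_key to genai.configure in Step 3 in place of the direct os.environ lookup.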

Step 2 — Install dependencies #

pip install "google-generativeai>=0.7.0" Pillow

Pillow loads local images; the Google SDK handles everything else, including PDF uploads via the File API.
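To confirm that the installed SDK meets the 0.7.0 minimum used in the install command, a rough check along these lines may help. It is a sketch using only the standard library; version_tuple and installed_version are hypothetical helper names, and the comparison only looks at the leading digits of each version component.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '0.8.3' (or '0.8.0rc1') into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(package="google-generativeai"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("google-generativeai is not installed")
    elif version_tuple(found) < (0, 7, 0):
        print(f"google-generativeai {found} is older than 0.7.0; upgrade with pip")
    else:
        print(f"google-generativeai {found} is new enough")
```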

Step 3 — Write the script #

Create gemini_multimodal.py

import json
import os
import time

import PIL.Image
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config={"response_mime_type": "application/json"},
)

def analyze_image(path: str, prompt: str) -> dict:
    """Open a local image (JPEG/PNG/WebP) and return parsed JSON."""
    image = PIL.Image.open(path)
    response = model.generate_content([image, prompt])
    return json.loads(response.text)

def analyze_pdf(path: str, prompt: str) -> dict:
    """Upload a PDF via the File API, query it, then delete the upload."""
    uploaded = genai.upload_file(path)
    try:
        # Poll until the File API finishes processing the upload.
        while uploaded.state.name == "PROCESSING":
            time.sleep(2)
            uploaded = genai.get_file(uploaded.name)
        if uploaded.state.name == "FAILED":
            raise RuntimeError(f"File API could not process {path}")

        response = model.generate_content([uploaded, prompt])
        return json.loads(response.text)
    finally:
        # Optional cleanup: remove the file from Google's servers even if the call fails.
        genai.delete_file(uploaded.name)

if __name__ == "__main__":
    img_result = analyze_image(
        "sample.jpg",
        "Return JSON with keys: description, dominant_colors (list), has_people (bool).",
    )
    print("IMAGE:", json.dumps(img_result, indent=2))

    pdf_result = analyze_pdf(
        "document.pdf",
        "Return JSON with keys: title, summary (one sentence), page_count_estimate (int).",
    )
    print("PDF:", json.dumps(pdf_result, indent=2))
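Both functions in the script hand response.text straight to json.loads. As a defensive sketch for the case where a reply still arrives wrapped in a markdown code fence (for example on an older SDK), a lenient parser could strip one surrounding fence first. The names FENCE and loads_lenient are hypothetical; nothing here is part of the Google SDK.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable

# Matches a reply whose whole body sits inside one fence, optionally tagged "json".
_WRAPPED = re.compile(
    r"^\s*" + FENCE + r"(?:json)?[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL | re.IGNORECASE,
)


def loads_lenient(text):
    """Parse JSON, first removing one surrounding markdown fence if present."""
    match = _WRAPPED.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)
```

Swapping json.loads(response.text) for loads_lenient(response.text) would be one way to use it; unfenced replies pass through unchanged.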

Why response_mime_type? Without it, the model sometimes wraps JSON in markdown code fences tagged json, breaking json.loads. This config key forces clean JSON output at the model level.

Why upload_file for PDFs? PDFs can't be opened with Pillow. The File API accepts application/pdf up to 2 GB and stores the file for up to 48 hours.

Step 4 — Run it #

Place sample.jpg and document.pdf in the same directory, then:

python gemini_multimodal.py

Verify it works #

Expected output shape (values vary by file):

IMAGE: {
  "description": "A golden retriever sitting in a sunny park.",
  "dominant_colors": ["yellow", "green", "blue"],
  "has_people": false
}
PDF: {
  "title": "Q3 Financial Report",
  "summary": "Revenue grew 12 % year-over-year driven by cloud services.",
  "page_count_estimate": 8
}

Both blocks must parse without error via json.loads. If the script exits without exceptions, you're done.
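Parsing without error only shows the reply is valid JSON, not that it has the keys the prompts asked for. A minimal sketch of a shape check follows, assuming the key names and types from the two prompts in the script; shape_problems and the two EXPECTED_* dictionaries are hypothetical names introduced here.

```python
EXPECTED_IMAGE_SHAPE = {"description": str, "dominant_colors": list, "has_people": bool}
EXPECTED_PDF_SHAPE = {"title": str, "summary": str, "page_count_estimate": int}


def shape_problems(result, expected):
    """Return a list of readable problems; an empty list means the shape matches."""
    if not isinstance(result, dict):
        return [f"expected a JSON object, got {type(result).__name__}"]
    problems = []
    for key, wanted in expected.items():
        if key not in result:
            problems.append(f"missing key: {key}")
        elif wanted is int and isinstance(result[key], bool):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"{key}: expected int, got bool")
        elif not isinstance(result[key], wanted):
            problems.append(
                f"{key}: expected {wanted.__name__}, got {type(result[key]).__name__}"
            )
    return problems
```

Calling shape_problems(img_result, EXPECTED_IMAGE_SHAPE) after the image request returns an empty list when every requested key is present with the expected type.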

Troubleshooting #

Error	Cause	Fix
`KeyError: 'GEMINI_API_KEY'`	Environment variable not set in this shell	Re-run `export GEMINI_API_KEY="..."` in the same terminal
`google.api_core.exceptions.InvalidArgument` on image	Unsupported format passed to Pillow/Gemini	Use JPEG, PNG, WebP, or GIF; convert BMP/TIFF first
`json.JSONDecodeError`	Older SDK wrapped JSON in markdown fences	Run `pip install -U google-generativeai`; v0.7+ respects `response_mime_type` reliably
`ResourceExhausted: 429`	Free-tier rate limit (15 req/min for Flash)	Wait 60 seconds and retry; reduce call frequency
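For the rate-limit row, the wait-and-retry advice can be automated. Below is a generic sketch of retrying with exponential backoff; call_with_backoff is a hypothetical helper, and the delay values are illustrative assumptions rather than figures from Google's documentation.

```python
import random
import time


def call_with_backoff(fn, retry_on, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on a `retry_on` exception, wait and try again.

    The wait is base_delay * 2**attempt plus up to one second of jitter.
    The last failure is re-raised once `attempts` calls have been made.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```

With the script from Step 3, one possible use is call_with_backoff(lambda: analyze_image("sample.jpg", prompt), ResourceExhausted), where ResourceExhausted is imported from google.api_core.exceptions. For a per-minute quota, a larger base_delay may be more appropriate.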

Next steps #

- Typed schemas: Pass a TypedDict or Pydantic model as response_schema inside GenerationConfig (SDK ≥ 0.8) to validate field types automatically.
- Multi-image comparison: Pass a list: [img1, img2, "Compare these two images"].
- Vertex AI: Replace google-generativeai with the vertexai SDK for IAM auth, VPC controls, and enterprise quotas.
- Official vision docs: ai.google.dev/gemini-api/docs/vision
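As an illustration of the typed-schema idea, here is a sketch that describes the image result from Step 3 as a TypedDict and passes it as response_schema. ImageReport and build_typed_model are hypothetical names; the sketch assumes google-generativeai is installed at a version that supports response_schema and that genai.configure has already been called. Whether the standard-library TypedDict is accepted in place of the typing_extensions one may depend on your Python and SDK versions.

```python
try:
    from typing_extensions import TypedDict  # the form used in Google's SDK examples
except ImportError:  # fall back to the standard library
    from typing import TypedDict


class ImageReport(TypedDict):
    """Hypothetical schema mirroring the keys requested in the image prompt."""

    description: str
    dominant_colors: list[str]
    has_people: bool


def build_typed_model(model_name="gemini-1.5-flash"):
    """Return a model configured to emit JSON matching ImageReport.

    Assumes google-generativeai is installed and genai.configure() was called.
    """
    import google.generativeai as genai  # imported lazily so the schema works alone

    return genai.GenerativeModel(
        model_name=model_name,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema=ImageReport,
        ),
    )
```

With a schema in place, the prompt no longer needs to spell out the key names, though keeping a short instruction in the prompt does no harm.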

Mariana Souza · Senior Editor

Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.
