# Analyze Images and PDFs with Google Gemini's Multimodal API in Python

> Source: <https://www.devclubhouse.com/a/analyze-images-and-pdfs-with-google-geminis-multimodal-api-in-python>
> Published: 2026-06-21 07:34:20+00:00

Send a photo or a PDF to Gemini 1.5 Flash and get back clean, structured JSON — in under 50 lines of Python.

[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza)

## What you'll build

A Python script that sends a local JPEG and a PDF to Gemini 1.5 Flash and parses a structured JSON object from each response, using Google's official `google-generativeai` SDK.

## Prerequisites

- Python **3.9 or newer** (`python --version` to check)
- A **Google AI Studio API key**, free at [aistudio.google.com/app/apikey](https://aistudio.google.com/app/apikey)
- pip 23+
- A sample JPEG and a sample PDF on disk

**OS note:** Commands below use `export` (macOS/Linux). On Windows PowerShell use `$env:GEMINI_API_KEY = "your-key"`.

## Step 1 — Store your API key safely

Never hard-code credentials. Export the key as an environment variable in your terminal session:

```
export GEMINI_API_KEY="your-key-here"
```
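Inside the script, a missing variable surfaces as a bare `KeyError`. A small helper can turn that into a readable message. This is a minimal sketch; the function name `load_api_key` is hypothetical and not part of the Google SDK.

```python
import os


def load_api_key(var_name: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment, or raise a readable error."""
    # Treat an empty or whitespace-only value the same as a missing one.
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it in this shell before running the script."
        )
    return key
```

The result could then be passed to the SDK as `genai.configure(api_key=load_api_key())`.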

## Step 2 — Install dependencies

```
pip install "google-generativeai>=0.7.0" Pillow
```

`Pillow` loads local images; the Google SDK handles everything else, including PDF uploads via the File API.
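To confirm that the installed SDK meets the `>=0.7.0` pin, the version can be read from package metadata using only the standard library. The helper names `version_tuple` and `installed_at_least` are hypothetical, and the comparison is deliberately simple: it looks only at the leading integers of each version component.

```python
import re
from importlib import metadata


def version_tuple(version: str, width: int = 3) -> tuple:
    """Turn '0.8.3' or '0.8.0rc1' into a fixed-width tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts = parts[:width]
    parts += [0] * (width - len(parts))  # pad '1.2' to (1, 2, 0)
    return tuple(parts)


def installed_at_least(package: str, minimum: str) -> bool:
    """Return True if `package` is installed at `minimum` or newer."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)
```

Calling `installed_at_least("google-generativeai", "0.7.0")` returns `False` when the package is missing or older than the pin.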

## Step 3 — Write the script

Create `gemini_multimodal.py`:

``` python
import json
import os
import time

import PIL.Image
import google.generativeai as genai

# Configure the SDK (reads key from the environment — never commit secrets)
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Setting response_mime_type tells Gemini to emit valid JSON every time
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config={"response_mime_type": "application/json"},
)

def analyze_image(path: str, prompt: str) -> dict:
    """Open a local image (JPEG/PNG/WebP) and return parsed JSON."""
    image = PIL.Image.open(path)
    response = model.generate_content([image, prompt])
    return json.loads(response.text)

def analyze_pdf(path: str, prompt: str) -> dict:
    """Upload a PDF via the File API, query it, then delete the upload."""
    uploaded = genai.upload_file(path)

    # Wait for Google's servers to finish processing (usually instant for small files)
    while uploaded.state.name == "PROCESSING":
        time.sleep(2)
        uploaded = genai.get_file(uploaded.name)

    try:
        response = model.generate_content([uploaded, prompt])
    finally:
        # Remove the upload from Google's servers even if the request fails
        genai.delete_file(uploaded.name)
    return json.loads(response.text)

if __name__ == "__main__":
    # --- Image ---
    img_result = analyze_image(
        "sample.jpg",
        "Return JSON with keys: description, dominant_colors (list), has_people (bool).",
    )
    print("IMAGE:", json.dumps(img_result, indent=2))

    # --- PDF ---
    pdf_result = analyze_pdf(
        "document.pdf",
        "Return JSON with keys: title, summary (one sentence), page_count_estimate (int).",
    )
    print("PDF:", json.dumps(pdf_result, indent=2))
```
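The polling loop in `analyze_pdf` runs until the file leaves the `PROCESSING` state, so it would wait forever on a stuck upload and would not notice a failed one. A bounded variant might look like the sketch below. The names `wait_until_active` and `refresh` are hypothetical, and the `ACTIVE` state name is an assumption about the File API's state values, since the script above only shows `PROCESSING`.

```python
import time
from typing import Any, Callable


def wait_until_active(
    file_obj: Any,
    refresh: Callable[[str], Any],
    timeout_s: float = 60.0,
    poll_s: float = 2.0,
) -> Any:
    """Poll until the upload leaves PROCESSING; raise on timeout or failure."""
    deadline = time.monotonic() + timeout_s
    while file_obj.state.name == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{file_obj.name} still processing after {timeout_s}s")
        time.sleep(poll_s)
        file_obj = refresh(file_obj.name)  # fetch the latest state
    # Assumption: a usable file reports the state name "ACTIVE".
    if file_obj.state.name != "ACTIVE":
        raise RuntimeError(f"{file_obj.name} ended in state {file_obj.state.name}")
    return file_obj
```

In the script this would replace the `while` loop, for example `uploaded = wait_until_active(genai.upload_file(path), genai.get_file)`. Passing the refresh function in as an argument keeps the helper testable without network access.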

**Why response_mime_type?** Without it, the model sometimes wraps JSON in markdown code fences tagged `json`, breaking `json.loads`. This config key forces clean JSON output at the model level.

**Why upload_file for PDFs?** PDFs can't be opened with Pillow. The File API accepts `application/pdf` up to 2 GB and stores the file for up to 48 hours.

## Step 4 — Run it

Place `sample.jpg` and `document.pdf` in the same directory as the script, then run it from that directory:

```
python gemini_multimodal.py
```

## Verify it works

Expected output shape (values vary by file):

```
IMAGE: {
  "description": "A golden retriever sitting in a sunny park.",
  "dominant_colors": ["yellow", "green", "blue"],
  "has_people": false
}
PDF: {
  "title": "Q3 Financial Report",
  "summary": "Revenue grew 12 % year-over-year driven by cloud services.",
  "page_count_estimate": 8
}
```

Both blocks must parse without error via `json.loads`. If the script exits without exceptions, you're done.
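A script that exits cleanly shows that the reply was valid JSON, not that it has the requested keys. The sketch below parses a reply while tolerating a surrounding markdown fence, then checks the keys and value types named in the two prompts. The helper names and the `IMAGE_SHAPE` and `PDF_SHAPE` dictionaries are hypothetical, introduced only for illustration.

```python
import json
import re

# Three backticks, built this way so the literal never appears in this listing.
FENCE = "`" * 3
_WRAPPED = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"\s*$", re.DOTALL
)


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as JSON, tolerating a surrounding markdown fence."""
    match = _WRAPPED.match(text)
    if match:
        text = match.group(1)  # keep only what sits between the fences
    return json.loads(text)


def missing_or_mistyped(result: dict, expected: dict) -> list:
    """Return the expected keys that are absent or hold a value of another type."""
    # Exact type comparison, so a bool is not accepted where an int is expected.
    return [
        key
        for key, kind in expected.items()
        if key not in result or type(result[key]) is not kind
    ]


IMAGE_SHAPE = {"description": str, "dominant_colors": list, "has_people": bool}
PDF_SHAPE = {"title": str, "summary": str, "page_count_estimate": int}
```

An empty list from `missing_or_mistyped(img_result, IMAGE_SHAPE)` means the image result has the shape the prompt asked for.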

## Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| `KeyError: 'GEMINI_API_KEY'` | Environment variable not set in this shell | Re-run `export GEMINI_API_KEY="..."` in the same terminal |
| `google.api_core.exceptions.InvalidArgument` on image | Unsupported format passed to Pillow/Gemini | Use JPEG, PNG, WebP, or GIF; convert BMP/TIFF first |
| `json.JSONDecodeError` | Older SDK wrapped JSON in markdown fences | Run `pip install -U google-generativeai`; v0.7+ respects `response_mime_type` reliably |
| `ResourceExhausted: 429` | Free-tier rate limit (15 req/min for Flash) | Wait 60 seconds and retry; reduce call frequency |
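The "wait and retry" advice for rate limits can be automated with exponential backoff. The following is a generic sketch, not an SDK feature: `call_with_backoff` is a hypothetical helper, and the default attempt count and delay are illustrative values rather than anything Google recommends.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 4,
    base_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on a listed exception, wait with doubling delays and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            sleep(base_delay_s * (2 ** attempt) + random.uniform(0, 1))
    raise ValueError("max_attempts must be at least 1")
```

With the script's functions, a call might look like `call_with_backoff(lambda: analyze_image("sample.jpg", prompt), retry_on=(ResourceExhausted,))`, where `ResourceExhausted` is imported from `google.api_core.exceptions`.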

## Next steps

- **Typed schemas:** Pass a `TypedDict` or Pydantic model as `response_schema` inside `GenerationConfig` (SDK ≥ 0.8) to validate field types automatically.
- **Multi-image comparison:** Pass a list: `[img1, img2, "Compare these two images"]`.
- **Vertex AI:** Replace `google-generativeai` with the `vertexai` SDK for IAM auth, VPC controls, and enterprise quotas.
- Official vision docs: [ai.google.dev/gemini-api/docs/vision](https://ai.google.dev/gemini-api/docs/vision)
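As a starting point for the typed-schema idea, the image prompt's keys could be declared as a `TypedDict`. This is a sketch under assumptions: the class name `ImageReport` and the helper `build_generation_config` are hypothetical, and it assumes the dict form of `generation_config` used in the script also accepts a `response_schema` entry, which is worth checking against the SDK version you have installed.

```python
from typing import List, TypedDict, get_type_hints


class ImageReport(TypedDict):
    """Field names and types matching the image prompt in the script."""

    description: str
    dominant_colors: List[str]
    has_people: bool


def build_generation_config(schema: type) -> dict:
    """Return the script's generation_config plus a response_schema entry."""
    # Assumption: the SDK reads "response_schema" from this dict.
    return {"response_mime_type": "application/json", "response_schema": schema}


config = build_generation_config(ImageReport)
print(sorted(get_type_hints(ImageReport)))
# prints ['description', 'dominant_colors', 'has_people']
```

The resulting `config` would be passed as `generation_config` when constructing `genai.GenerativeModel`, in place of the literal dict in the script.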

[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza) · Senior Editor

Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.

