Send a photo or a PDF to Gemini 1.5 Flash and get back clean, structured JSON — in under 50 lines of Python.
What you'll build #
A Python script that sends a local JPEG and a PDF to Gemini 1.5 Flash and parses a structured JSON object from each response, using Google's official `google-generativeai` SDK.
Prerequisites #
- Python 3.9 or newer (run `python --version` to check)
- A Google AI Studio API key, free at aistudio.google.com/app/apikey
- pip 23+
- A sample JPEG and a sample PDF on disk

OS note: Commands below use `export` (macOS/Linux). On Windows PowerShell, use `$env:GEMINI_API_KEY = "your-key"`.
Step 1 — Store your API key safely #
Never hard-code credentials. Export the key as an environment variable in your terminal session:
export GEMINI_API_KEY="your-key-here"
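If the variable is missing, reading `os.environ["GEMINI_API_KEY"]` fails with a bare `KeyError`. The sketch below shows one way to fail early with a readable message instead. `require_env` is a hypothetical helper name used for illustration here, not part of the Google SDK.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a non-empty environment variable or raise a readable error.

    Hypothetical helper for this tutorial; `env` defaults to os.environ
    and exists only so the function can be exercised without a real shell.
    """
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in this shell before running the script."
        )
    return value
```

A script could then call `require_env("GEMINI_API_KEY")` and pass the result to `genai.configure(api_key=...)`.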
Step 2 — Install dependencies #
pip install "google-generativeai>=0.7.0" Pillow
`Pillow` loads local images; the Google SDK handles everything else, including PDF uploads via the File API.
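To confirm that the installed SDK meets the `>=0.7.0` floor used above, a small check built on the standard library's `importlib.metadata` may help. This is a rough sketch: the helper names are made up for this example, and the comparison is deliberately simple (it stops at the first non-numeric version part, so pre-release suffixes are ignored).

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.7.2' into (0, 7, 2); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package: str) -> Optional[Tuple[int, ...]]:
    """Return the installed version as a tuple, or None if it is not installed."""
    try:
        return parse_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(package: str, minimum: Tuple[int, ...]) -> bool:
    """True only if the package is installed and at least `minimum`."""
    found = installed_version(package)
    return found is not None and found >= minimum
```

For example, `meets_minimum("google-generativeai", (0, 7, 0))` should be true after the install command above succeeds.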
Step 3 — Write the script #
Create `gemini_multimodal.py`:
import json
import os
import time

import PIL.Image
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config={"response_mime_type": "application/json"},
)


def analyze_image(path: str, prompt: str) -> dict:
    """Open a local image (JPEG/PNG/WebP) and return parsed JSON."""
    image = PIL.Image.open(path)
    response = model.generate_content([image, prompt])
    return json.loads(response.text)


def analyze_pdf(path: str, prompt: str) -> dict:
    """Upload a PDF via the File API, query it, then delete the upload."""
    uploaded = genai.upload_file(path)
    try:
        # Poll until the File API has finished processing the upload.
        while uploaded.state.name == "PROCESSING":
            time.sleep(2)
            uploaded = genai.get_file(uploaded.name)
        if uploaded.state.name == "FAILED":
            raise RuntimeError(f"File API could not process {path}")
        response = model.generate_content([uploaded, prompt])
        return json.loads(response.text)
    finally:
        genai.delete_file(uploaded.name)  # optional: remove from Google servers immediately


if __name__ == "__main__":
    img_result = analyze_image(
        "sample.jpg",
        "Return JSON with keys: description, dominant_colors (list), has_people (bool).",
    )
    print("IMAGE:", json.dumps(img_result, indent=2))
    pdf_result = analyze_pdf(
        "document.pdf",
        "Return JSON with keys: title, summary (one sentence), page_count_estimate (int).",
    )
    print("PDF:", json.dumps(pdf_result, indent=2))
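The script passes `response.text` straight to `json.loads`. As a defensive fallback for responses that arrive wrapped in a markdown code fence, a tolerant parser along these lines could be swapped in. It is only a sketch: `parse_model_json` is a hypothetical helper, and it assumes the wrapper is a single fence around the whole payload.

```python
import json
import re

TICKS = "`" * 3  # the three-backtick markdown fence marker

# Matches an optional fence (with or without a "json" tag) around the payload.
_FENCED = re.compile(
    r"^\s*" + TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS + r"\s*$",
    re.DOTALL | re.IGNORECASE,
)


def parse_model_json(text: str):
    """Parse JSON, stripping one surrounding markdown fence if present."""
    match = _FENCED.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)
```

Plain JSON passes through unchanged, so the helper can replace `json.loads(response.text)` in both functions without altering the normal path.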
Why `response_mime_type`? Without it, the model sometimes wraps the JSON in markdown code fences tagged `json`, breaking `json.loads`. This config key forces clean JSON output at the model level.

Why `upload_file` for PDFs? PDFs can't be opened with Pillow. The File API accepts `application/pdf` up to 2 GB and stores the file for up to 48 hours.

Step 4 — Run it #
Place `sample.jpg` and `document.pdf` in the same directory as the script, then run:
python gemini_multimodal.py
Verify it works #
Expected output shape (values vary by file):
IMAGE: {
  "description": "A golden retriever sitting in a sunny park.",
  "dominant_colors": [
    "yellow",
    "green",
    "blue"
  ],
  "has_people": false
}
PDF: {
  "title": "Q3 Financial Report",
  "summary": "Revenue grew 12% year-over-year driven by cloud services.",
  "page_count_estimate": 8
}
Both blocks must parse without error via `json.loads`. If the script exits without exceptions, you're done.
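Parsing without error only shows that the output is valid JSON, not that it has the keys the prompts asked for. A small shape check, sketched below with the standard library only, can make that explicit. The helper name and the two shape dictionaries are assumptions for this example, derived from the prompts in the script.

```python
from typing import Any, Dict, List

# Expected key -> type, taken from the prompts used in the script.
IMAGE_SHAPE = {"description": str, "dominant_colors": list, "has_people": bool}
PDF_SHAPE = {"title": str, "summary": str, "page_count_estimate": int}


def shape_errors(result: Dict[str, Any], expected: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for key, expected_type in expected.items():
        if key not in result:
            problems.append(f"missing key: {key}")
            continue
        value = result[key]
        # bool is a subclass of int in Python, so reject it where int is expected.
        wrong_bool = expected_type is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected_type):
            problems.append(
                f"{key}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems
```

Calling `shape_errors(img_result, IMAGE_SHAPE)` after the image request returns an empty list when every key is present with the expected type.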
Troubleshooting #
| Error | Cause | Fix |
|---|---|---|
| `KeyError: 'GEMINI_API_KEY'` | Environment variable not set in this shell | Re-run `export GEMINI_API_KEY="..."` in the same terminal |
| `google.api_core.exceptions.InvalidArgument` on image | Unsupported format passed to Pillow/Gemini | Use JPEG, PNG, WebP, or GIF; convert BMP/TIFF first |
| `json.JSONDecodeError` | Older SDK wrapped JSON in markdown fences | Run `pip install -U google-generativeai`; v0.7+ respects `response_mime_type` reliably |
| `ResourceExhausted: 429` | Free-tier rate limit (15 req/min for Flash) | Wait 60 seconds and retry; reduce call frequency |
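The "wait and retry" fix for rate-limit errors can be automated. The following is a generic exponential-backoff sketch, not an SDK feature: `call_with_backoff` and its parameters are hypothetical names, and the default delays are placeholders to tune against your own quota.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            # Delays grow as base_delay, 2x, 4x, ... between attempts.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

With the script above, one would likely pass `retry_on=(google.api_core.exceptions.ResourceExhausted,)` and wrap the request as `lambda: analyze_image("sample.jpg", prompt)`. The `sleep` parameter exists so the delay can be replaced in tests.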
Next steps #
- Typed schemas: Pass a `TypedDict` or Pydantic model as `response_schema` inside `GenerationConfig` (SDK ≥ 0.8) to validate field types automatically.
- Multi-image comparison: Pass a list: `[img1, img2, "Compare these two images"]`.
- Vertex AI: Replace `google-generativeai` with the `vertexai` SDK for IAM auth, VPC controls, and enterprise quotas.
- Official vision docs: ai.google.dev/gemini-api/docs/vision
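As a rough illustration of the typed-schema idea from the list above, the image prompt's keys could be declared as a `TypedDict` and handed to the model configuration. This sketch assumes the installed SDK accepts a `TypedDict` class for `response_schema`, as the list item states; `ImageReport` and `build_typed_model` are hypothetical names, and the SDK import is deferred so the schema can be defined without the package installed.

```python
from typing import List, TypedDict


class ImageReport(TypedDict):
    """Hypothetical schema mirroring the keys requested in the image prompt."""
    description: str
    dominant_colors: List[str]
    has_people: bool


def build_typed_model(model_name: str = "gemini-1.5-flash"):
    """Return a model configured to emit JSON matching ImageReport.

    Assumes a google-generativeai release that supports `response_schema`.
    """
    import google.generativeai as genai  # imported lazily, only when called

    return genai.GenerativeModel(
        model_name=model_name,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema=ImageReport,
        ),
    )
```

The returned model would be used exactly like `model` in the script, with `generate_content([image, prompt])`.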
Mariana Souza · Senior Editor
Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.