How I automated markdown docs from UI screenshots using AI

A developer built a Python script that converts UI screenshots into markdown documentation using any OpenAI-compatible AI model. The script, which is model-agnostic and self-hostable, was created to automate documenting a React component library with 40+ components. It avoids vendor lock-in and high costs by allowing users to plug in different AI endpoints.

Last month I was knee-deep in documenting a React component library I’d been building for six months. The library had 40+ components, each with 5–10 props, and I wanted to show actual UI screenshots alongside code examples. Taking those screenshots manually was a drag, but so was writing alt text and prop tables from scratch. I thought: surely there’s a tool that turns a screenshot into a markdown snippet with the component name, props, and description. So I went hunting.

First, I tried the obvious: OCR + regex. Take a screenshot, run Tesseract, then parse the text for component names and props. That failed miserably.

Next, I looked at cloud-based AI documentation generators. Most required me to upload my entire component library, integrate with their SDK, and pay per component. I didn’t want vendor lock-in. I also didn’t want to share my codebase with a third party just to get docs.

Then I tried a public multimodal model API like OpenAI’s GPT-4o. It worked, but the cost stacked up fast when processing 40+ screenshots multiple times during iteration. Plus, managing API keys and tokens for every teammate became a mess.

I needed something cheap, self-hostable, and flexible. The idea was: write a small Python script that reads a screenshot file, sends it to any AI model that accepts images, and returns structured markdown. The script itself is the star; the AI endpoint is just a pluggable option.

The key is that the same script works with OpenAI, Claude, local models via Ollama, or even a custom endpoint like the one at ai.interwestinfo.com (I tried it as a fallback). The technique is model-agnostic.

```python
#!/usr/bin/env python3
"""
Screenshot to Markdown documentation generator.
Works with any OpenAI-compatible API.
"""
import os
import sys
import base64
import requests
from pathlib import Path


def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def image_to_markdown(image_path, api_key, endpoint="https://api.openai.com/v1/chat/completions"):
    """Convert an image to markdown via an AI model."""
    base64_image = encode_image(image_path)

    prompt = (
        "You are a UI documentation expert. Given a screenshot of a React component, "
        "generate a markdown description. Start with a second-level heading containing "
        "the component name. Then write a short description. Then create a table with "
        "columns: Prop Name, Type, Default, Description. If you cannot determine a prop, "
        "write N/A. Output only the markdown."
    )

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }

    payload = {
        "model": "gpt-4o",  # swap to other models here
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                            "detail": "low",
                        },
                    },
                ],
            }
        ],
        "max_tokens": 500,
    }

    # A timeout keeps the script from hanging forever on a stalled endpoint
    response = requests.post(endpoint, headers=headers, json=payload, timeout=60)
    if response.status_code != 200:
        raise Exception(f"API error {response.status_code}: {response.text}")

    return response.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python screenshot2docs.py <image.png>")
        sys.exit(1)

    image_path = sys.argv[1]
    if not Path(image_path).exists():
        print(f"File not found: {image_path}")
        sys.exit(1)

    api_key = os.getenv("AI_API_KEY")
    if not api_key:
        print("Set AI_API_KEY environment variable.")
        sys.exit(1)

    md = image_to_markdown(image_path, api_key)

    # Save to a file with same name but .md extension
    out_path = Path(image_path).with_suffix(".md")
    out_path.write_text(md, encoding="utf-8")
    print(f"Documentation saved to {out_path}")
```

To use it:

1. Install `requests` (`pip install requests`).
2. Set the `AI_API_KEY` environment variable (e.g., OpenAI key, or any compatible endpoint key).
3. Run `python screenshot2docs.py button.png`.
4. Review `button.md` to fix any errors.

This approach is lightweight, but it’s not perfect. Let me be honest: as written, the script handles one screenshot per invocation, so covering a whole library means looping over files or parallelizing the calls with something like `ThreadPoolExecutor`.

I’d build a small web frontend where I can drag & drop screenshots, see the generated markdown inline, and edit it before saving. The script is fine for batch-style runs, but interactivity helps with review. I’d also add a “model selector” dropdown to switch between endpoints on the fly.

Also, I’d write a deduplication layer: if two component variants look similar (e.g., primary/secondary buttons), the second generation tends to copy the first. Better to hash the image and check a cache first.

Automating documentation from screenshots saved me about 10 hours for this library. The technique of using a generic multimodal AI endpoint to generate structured data from images is reusable beyond docs: you could do it for design handoff specs, bug report screenshots, or auto-generating alt text for your blog.

Now I’d love to hear: What’s your go-to method for generating docs from visuals? Have you tried a similar image-to-markdown pipeline, or do you have a completely different workflow?
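For anyone curious about the “hash the image and check cache first” idea, here is a minimal sketch of how it might look. It is an illustration, not part of the script above: the names `file_sha256`, `cached_markdown`, and the cache file `.screenshot2docs_cache.json` are all hypothetical, and `generate` stands in for whatever function actually calls the model. Note the assumption baked in: a SHA-256 of the raw bytes only matches byte-identical screenshots, so merely similar variants (primary vs. secondary buttons) would still each trigger their own generation; catching those would need something like a perceptual hash.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cached_markdown(
    image_path: Path,
    generate: Callable[[Path], str],
    cache_file: Path = Path(".screenshot2docs_cache.json"),  # hypothetical location
) -> str:
    """Return markdown for image_path, calling generate() only on a cache miss."""
    # Load the existing cache (a JSON object mapping digest -> markdown), if any
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    else:
        cache = {}

    key = file_sha256(image_path)
    if key in cache:
        # Same bytes seen before: reuse the stored result, skip the API call
        return cache[key]

    markdown = generate(image_path)
    cache[key] = markdown
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return markdown
```

With the script above, usage could be as simple as `cached_markdown(Path("button.png"), lambda p: image_to_markdown(p, api_key))`. Passing the generator in as a callable keeps the cache independent of any particular endpoint, which fits the model-agnostic spirit of the script.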