Building a Multimodal AI Pipeline: Text → Image → Text Across Three Providers

A developer built a multimodal AI pipeline in under 55 lines of Python using yait_aichain's Skill and Model primitives, chaining Claude, an image generation model, and Qwen Vision across three providers (Anthropic, xAI, DashScope). The pipeline generates a sunset description with Claude, creates an image from that text, and then analyzes the result with Qwen Vision, demonstrating unified multimodal handling through a parts list structure.

Three providers, three modalities, under 55 lines of Python, and a PNG file on disk at the end. Claude writes a sunset description, an image generation model paints it, and Qwen Vision analyzes the result. Each model does one thing well; the script wires them together.

This article walks through building exactly that pipeline using yait_aichain's `Skill` and `Model` primitives. We'll go step by step: generate text with Claude, turn that text into an image, then feed the image to Qwen Vision for analysis.

The pipeline has three stages:

1. Text → text: Claude (`claude-3-5-sonnet-20241022`)
2. Text → image: `imagine-image-pro`
3. Image → text: Qwen Vision (`qwen-vl-max`)

Each stage uses a different provider: Anthropic, xAI, and DashScope. The output of one stage becomes the input of the next.

You need three API keys, each set as an environment variable:

```bash
export ANTHROPIC_API_KEY="your-anthropic-key"
export XAI_API_KEY="your-xai-key"
export DASHSCOPE_API_KEY="your-dashscope-key"
```

Install the library:

```bash
pip install yait_aichain
```

No extra dependencies are needed for image handling: Python's `base64` and `pathlib` modules cover the file I/O. yait_aichain handles provider routing internally, so you won't need to install the Anthropic, xAI, or DashScope SDKs separately.

`Model` represents a connection to a specific model at a specific provider. You pass the model name and an API key: no provider-specific client classes, no adapter patterns to memorize.

`Skill` is a single unit of work. It takes a `Model`, an `input` structured as `messages`, and optionally an `output` configuration. Call `.run()` and it executes.

The message format uses a `parts` list inside each message, which is how yait_aichain handles multimodal content uniformly: text, images, and mixed content all go through the same structure.

```python
import os, sys, base64, pathlib
from yait_aichain import Model, Skill

text_skill = Skill(
    model=Model("claude-3-5-sonnet-20241022", api_key=os.environ["ANTHROPIC_API_KEY"]),
    input={"messages": [
        {"role": "user", "parts": ["Describe a sunset in one sentence."]}
    ]},
)

description = text_skill.run()
print(f"text → text · Claude\n{description}\n")
```

The `input` dictionary contains a `messages` list, identical in shape to what you'd see in a chat API. Each message has a `role` and a `parts` list. For plain text, `parts` is just a list of strings.

Notice the use of `os.environ["KEY"]` rather than `os.getenv("KEY")`. This is a deliberate choice I prefer for multi-provider scripts: `os.getenv` silently returns `None` when a key is missing, which pushes the error down to the provider's API, where the message is far less useful. `os.environ` raises a `KeyError` immediately with the variable name. When you're juggling three different API keys for the first time, you want to know which one is missing.

`text_skill.run()` returns the model's response as a string. On a typical call, you'll get something like:

"The sun melted into the horizon, painting the sky in layered bands of amber, rose, and deep violet as the ocean mirrored its fading warmth."

That string becomes the input for Stage 2.

`parts` Instead of `content`?

The `parts` list is the design decision that makes multimodal work without special-casing. A text-only message uses `["some string"]`. A message with an image uses a dictionary inside `parts`. A message with both uses both. Same field, same structure, every modality.

We take Claude's text output and pass it to the image generation model as a prompt:

```python
image_skill = Skill(
    model=Model("imagine-image-pro", api_key=os.environ["XAI_API_KEY"]),
    input={"messages": [
        {"role": "user", "parts": [description]}
    ]},
    output={"modalities": ["image"], "format": {"type": "image", "size": "1024x1024"}},
)

image = image_skill.run()
img_path = pathlib.Path("output_sunset.png")
img_path.write_bytes(base64.b64decode(image["base64"]))
print(f"text → image\nsaved → {img_path}\n")
```

Two things to notice here.

First, the output configuration. This is the first time we specify how the response should come back. `"modalities": ["image"]` tells the Skill we expect an image.
The `"format"` dictionary specifies the type and dimensions. Without this, the model might return a text description instead of an image.

Second, the return value. When a Skill produces an image, `.run()` returns a dictionary with at least two keys: `"base64"` (the image data) and `"mime_type"` (e.g., `"image/png"`). We decode the base64 data and write it to disk.

`pathlib.Path("output_sunset.png")` is a relative path, so the file lands in the current working directory rather than in a location derived from `__file__`. That's deliberate: `__file__` is undefined in interactive environments like Jupyter notebooks or a REPL and raises a `NameError`. A relative path works consistently across all contexts.

`"1024x1024"` is a common default for image generation models. If you pass a size the model doesn't support, you'll get an error at runtime rather than a silently resized image. Check your provider's documentation for supported dimensions before you assume.

The image from Stage 2 goes into Qwen's vision-language model:

```python
vision_skill = Skill(
    model=Model("qwen-vl-max", api_key=os.environ["DASHSCOPE_API_KEY"]),
    input={
        "messages": [{
            "role": "user",
            "parts": [
                {"type": "image", "source": {"kind": "base64", "data": image["base64"], "mime": image["mime_type"]}},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }]
    },
)

analysis = vision_skill.run()
print(f"image → text · Qwen\n{analysis}")
```

The `parts` list now contains two items:

- An image part: a dictionary with `"type": "image"` and a `"source"` object. The source specifies `"kind": "base64"`, the actual base64 data, and the MIME type, both pulled directly from Stage 2's output dictionary.
- A text part: a dictionary with `"type": "text"` and the question.

Same `parts` structure as Stage 1. The only difference is that instead of bare strings, we use typed dictionaries to describe each piece of content. The vision model receives the image and the question in a single message, and Qwen's response comes back as a plain string, something like:

"The image shows a vivid sunset over an ocean. The sky displays gradients of orange, pink, and purple. The sun is partially below the horizon, with its reflection stretching across calm water."

```python
"""
Multimodal pipeline: Text → Image → Text, three different providers.

  1. text → text   Claude (claude-3-5-sonnet-20241022)
  2. text → image  (imagine-image-pro)
  3. image → text  Qwen (qwen-vl-max)

Required env vars:
  ANTHROPIC_API_KEY
  XAI_API_KEY
  DASHSCOPE_API_KEY
"""

import os, sys, base64, pathlib
from yait_aichain import Model, Skill

# ── 1. Text → Text (Claude) ──────────────────────────────────────────────────
text_skill = Skill(
    model=Model("claude-3-5-sonnet-20241022", api_key=os.environ["ANTHROPIC_API_KEY"]),
    input={"messages": [
        {"role": "user", "parts": ["Describe a sunset in one sentence."]}
    ]},
)
try:
    description = text_skill.run()
except Exception as e:
    print(f"Stage 1 failed: {e}")
    sys.exit(1)
print(f"text → text · Claude\n{description}\n")

# ── 2. Text → Image ──────────────────────────────────────────────────────────
image_skill = Skill(
    model=Model("imagine-image-pro", api_key=os.environ["XAI_API_KEY"]),
    input={"messages": [
        {"role": "user", "parts": [description]}
    ]},
    output={"modalities": ["image"], "format": {"type": "image", "size": "1024x1024"}},
)
try:
    image = image_skill.run()
except Exception as e:
    print(f"Stage 2 failed: {e}")
    sys.exit(1)
img_path = pathlib.Path("output_sunset.png")
img_path.write_bytes(base64.b64decode(image["base64"]))
print(f"text → image\nsaved → {img_path}\n")

# ── 3. Image → Text (Qwen Vision) ────────────────────────────────────────────
vision_skill = Skill(
    model=Model("qwen-vl-max", api_key=os.environ["DASHSCOPE_API_KEY"]),
    input={
        "messages": [{
            "role": "user",
            "parts": [
                {"type": "image", "source": {"kind": "base64", "data": image["base64"], "mime": image["mime_type"]}},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }]
    },
)
try:
    analysis = vision_skill.run()
except Exception as e:
    print(f"Stage 3 failed: {e}")
    sys.exit(1)
print(f"image → text · Qwen\n{analysis}")
```

Three providers.
Two modality transitions. Each stage is wrapped in its own `try`/`except`, so a failure at Stage 2 tells you it was Stage 2, not a cryptic traceback from somewhere inside a provider SDK you didn't even know you were calling.

There's no special "chaining" API. The variable `description` (a string) goes directly into `image_skill`'s input. The variable `image` (a dictionary) gets its fields plucked out for `vision_skill`'s input. Regular Python variables carry data between stages.

When you need to transform data between stages (truncating a description to 200 characters before image generation, for instance), you write normal Python between the calls. No callbacks, no middleware, no pipeline DSL. This is actually one of the things I like about this approach: the "pipeline" is just a script.

The `parts` list is what keeps the interface uniform across modalities:

- Text only: `"parts": ["your string here"]`
- Image only: `"parts": [{"type": "image", "source": {...}}]`
- Image and text: `"parts": [image_dict, text_dict]`

One structure, every model, every modality.

Notice what's absent from the `Skill` configurations: no Anthropic client initialization, no provider-specific headers, no DashScope SDK imports. The `Model` constructor takes a model name and an API key; provider routing happens internally. Swapping the image generation model means changing one string and one environment variable. Nothing else in the script changes.

Once you have this pattern, extensions are straightforward. Another `Skill`, another `Model`, same shape.

The models do the hard work. The code connects them and stays out of the way.
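The between-stage transformation mentioned above (truncating a description to 200 characters before image generation) could be sketched as follows. `truncate_prompt` is a hypothetical helper written for illustration, not part of yait_aichain, and the 200-character limit is simply the figure used in the text, not a documented provider limit.

```python
def truncate_prompt(text: str, limit: int = 200) -> str:
    """Return `text` shortened to at most `limit` characters.

    Hypothetical helper for illustration only. It prefers to cut at a
    word boundary and falls back to a hard cut if there is no space.
    """
    # Collapse newlines and repeated spaces so the prompt is one clean line.
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    # Look one character past the limit: if that character is a space,
    # the first `limit` characters already end on a whole word.
    head, sep, _ = text[: limit + 1].rpartition(" ")
    return head if sep else text[:limit]


if __name__ == "__main__":
    long_description = "The sun melted into the horizon. " * 10
    prompt = truncate_prompt(long_description, limit=200)
    print(len(prompt), prompt)
```

In the pipeline, this would sit between Stage 1 and Stage 2, with Stage 2's message becoming `{"role": "user", "parts": [truncate_prompt(description)]}`. Because stages are connected by ordinary variables, no other part of the script needs to change.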