Building a Multimodal AI Pipeline: Text → Image → Text Across Three Providers


Three providers, three modalities, under 55 lines of Python — and a PNG file on disk at the end. Claude writes a sunset description, an image generation model paints it, and Qwen Vision analyzes the result. Each model does one thing well; the script wires them together.

This article walks through building exactly that pipeline using yait_aichain's Skill and Model primitives. We'll go step by step: generate text with Claude, turn that text into an image, then feed the image to Qwen Vision for analysis.

The pipeline has three stages:

1. Text → text: Claude (claude-3-5-sonnet-20241022)
2. Text → image: image generation (imagine-image-pro)
3. Image → text: Qwen Vision (qwen-vl-max)

Each stage uses a different provider: Anthropic, xAI, and DashScope. The output of one stage becomes the input of the next.

You need three API keys, each set as an environment variable:

export ANTHROPIC_API_KEY="your-anthropic-key"
export XAI_API_KEY="your-xai-key"
export DASHSCOPE_API_KEY="your-dashscope-key"

Install the library:

pip install yait_aichain

No extra dependencies for image handling: Python's base64 and pathlib modules cover the file I/O. yait_aichain handles provider routing internally, so you won't need to install Anthropic, xAI, or DashScope SDKs separately.

**Model** represents a connection to a specific model at a specific provider. You pass the model name and an API key; there are no provider-specific client classes and no adapter patterns to memorize.

**Skill** is a single unit of work. It takes a Model, an input (structured as messages), and optionally an output configuration. Call .run() and it executes. The message format uses a parts list inside each message, which is how yait_aichain handles multimodal content uniformly: text, images, and mixed content all go through the same structure.

import os, sys, base64, pathlib
from yait_aichain import Model, Skill

text_skill = Skill(
    model = Model("claude-3-5-sonnet-20241022", api_key=os.environ["ANTHROPIC_API_KEY"]),
    input = {"messages": [{"role": "user", "parts": ["Describe a sunset in one sentence."]}]},
)

description = text_skill.run()
print(f"[text → text · Claude]\n{description}\n")

The input dictionary contains a messages list, similar in shape to what you'd see in a chat API. Each message has a role and a parts list. For plain text, parts is just a list of strings.

Notice the use of os.environ["KEY"] rather than os.getenv("KEY"). This is a deliberate choice I prefer for multi-provider scripts: os.getenv silently returns None when a key is missing, which pushes the error down to the provider's API where the message is far less useful. os.environ raises a KeyError immediately with the variable name. When you're juggling three different API keys for the first time, you want to know which one is missing.
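The difference is easy to see in isolation. The sketch below is not part of the pipeline. missing_keys is a hypothetical helper, added on the assumption that you would rather see every missing variable at once than fix them one KeyError at a time, and the probe variable name is made up.

```python
import os

REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "XAI_API_KEY", "DASHSCOPE_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the names in `required` that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


# os.getenv hides the problem; os.environ[...] names it.
probe = "YAIT_DEMO_UNSET_KEY"  # hypothetical name, removed to guarantee it is unset
os.environ.pop(probe, None)
assert os.getenv(probe) is None
try:
    os.environ[probe]
except KeyError as err:
    print(f"missing environment variable: {err}")

# Preflight check against a plain dict standing in for os.environ.
fake_env = {"ANTHROPIC_API_KEY": "k1", "DASHSCOPE_API_KEY": ""}
print(missing_keys(fake_env))
```

In the real script you could call missing_keys(os.environ) before building any Skill and exit early if the list is non-empty.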

text_skill.run() returns the model's response as a string. On a typical call, you'll get something like:

"The sun melted into the horizon, painting the sky in layered bands of amber, rose, and deep violet as the ocean mirrored its fading warmth."

That string becomes the input for Stage 2.

Why parts instead of content?

The parts list is the design decision that makes multimodal work without special-casing. A text-only message uses ["some string"]. A message with an image uses a dictionary inside parts. A message with both uses both. Same field, same structure, every modality.
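To make the three cases concrete, here is a small sketch of helpers that build each shape. The helpers (text_part, image_part, user_message) are hypothetical and not part of yait_aichain; the dictionary layout is copied from the image example later in this article.

```python
import base64


def text_part(text):
    """Typed text part, as used alongside images."""
    return {"type": "text", "text": text}


def image_part(data_b64, mime="image/png"):
    """Typed image part carrying base64-encoded image data."""
    return {"type": "image",
            "source": {"kind": "base64", "data": data_b64, "mime": mime}}


def user_message(*parts):
    """Wrap any mix of bare strings and typed parts in one user message."""
    return {"role": "user", "parts": list(parts)}


# Text only: bare strings are enough.
text_only = user_message("Describe a sunset in one sentence.")

# Image plus text: typed dictionaries in the same list.
fake_png = base64.b64encode(b"not a real image").decode("ascii")
mixed = user_message(image_part(fake_png), text_part("What do you see?"))

print(text_only)
print([p["type"] for p in mixed["parts"]])
```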

We take Claude's text output and pass it to the image generation model as a prompt:

image_skill = Skill(
    model  = Model("imagine-image-pro", api_key=os.environ["XAI_API_KEY"]),
    input  = {"messages": [{"role": "user", "parts": [description]}]},
    output = {"modalities": ["image"], "format": {"type": "image", "size": "1024x1024"}},
)

image    = image_skill.run()
img_path = pathlib.Path("output_sunset.png")
img_path.write_bytes(base64.b64decode(image["base64"]))
print(f"[text → image]\nsaved → {img_path}\n")

Two things to notice here.

The output configuration. This is the first time we specify how the response should come back. "modalities": ["image"] tells the Skill we expect an image. The "format" dictionary specifies the type and dimensions. Without this, the model might return text instead of an image.

The return value. When a Skill produces an image, .run() returns a dictionary with at least two keys: "base64" (the image data) and "mime_type" (e.g., "image/png"). We decode the base64 data and write it to disk.
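The script hardcodes a .png extension, which is fine as long as the model returns PNG. A slightly more defensive sketch picks the extension from "mime_type" instead. save_image and the EXTENSIONS table are hypothetical helpers, and the only assumption about the return value is the two keys described above.

```python
import base64
import pathlib
import tempfile

# Hypothetical mapping; extend it for whatever your provider returns.
EXTENSIONS = {"image/png": ".png", "image/jpeg": ".jpg", "image/webp": ".webp"}


def save_image(image, stem, directory="."):
    """Decode image["base64"] and write it to <directory>/<stem><ext>."""
    ext = EXTENSIONS.get(image.get("mime_type", ""), ".bin")
    # validate=True rejects stray non-base64 characters instead of skipping them.
    raw = base64.b64decode(image["base64"], validate=True)
    path = pathlib.Path(directory) / f"{stem}{ext}"
    path.write_bytes(raw)
    return path


# Stand-in for the dictionary returned by image_skill.run().
payload = b"\x89PNG\r\n\x1a\n fake image bytes"
fake_image = {"base64": base64.b64encode(payload).decode("ascii"),
              "mime_type": "image/png"}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(fake_image, "output_sunset", tmp)
    saved_name = saved.name
    round_trip_ok = saved.read_bytes() == payload
    print(saved_name, len(payload), "bytes")
```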

pathlib.Path("output_sunset.png") writes to the current working directory rather than using __file__. That's deliberate: __file__ is undefined in interactive environments like Jupyter notebooks or a REPL and raises a NameError. A relative path works consistently across all contexts.

"1024x1024" is a common default for image generation models. If you pass a size the model doesn't support, you'll get an error at runtime rather than a silently resized image. Check your provider's documentation for supported dimensions before you assume.
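Because an unsupported size only surfaces as a runtime error from the provider, it can be worth checking the string locally first. The sketch below is a hypothetical pre-check. SUPPORTED_SIZES is a placeholder set, not the real list for any model, so fill it in from your provider's documentation.

```python
import re

# Placeholder values for illustration only, not taken from any provider's docs.
SUPPORTED_SIZES = {"512x512", "1024x1024"}

_SIZE_RE = re.compile(r"(\d+)x(\d+)")


def parse_size(size):
    """Split a 'WIDTHxHEIGHT' string into a (width, height) tuple of ints."""
    match = _SIZE_RE.fullmatch(size)
    if match is None:
        raise ValueError(f"size must look like '1024x1024', got {size!r}")
    return int(match.group(1)), int(match.group(2))


def check_size(size, supported=SUPPORTED_SIZES):
    """Return `size` unchanged if it is well-formed and listed as supported."""
    parse_size(size)  # raises ValueError on malformed input
    if size not in supported:
        raise ValueError(f"{size} not in supported sizes {sorted(supported)}")
    return size


print(parse_size("1024x1024"))
print(check_size("1024x1024"))
```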

The image from Stage 2 goes into Qwen's vision-language model:

vision_skill = Skill(
    model = Model("qwen-vl-max", api_key=os.environ["DASHSCOPE_API_KEY"]),
    input = {
        "messages": [{
            "role": "user",
            "parts": [
                {"type": "image", "source": {"kind": "base64",
                                              "data": image["base64"],
                                              "mime": image["mime_type"]}},
                {"type": "text",  "text": "What do you see in this image?"},
            ],
        }]
    },
)

analysis = vision_skill.run()
print(f"[image → text · Qwen]\n{analysis}")

The parts list now contains two items:

An image part: a dictionary with "type": "image" and a "source" object. The source specifies "kind": "base64", the actual base64 data, and the MIME type, both pulled directly from Stage 2's output dictionary.

A text part: a dictionary with "type": "text" and the question.
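The same image part can be built from any file on disk, not only from Stage 2's output. A minimal sketch, assuming the part layout shown above: image_part_from_file is a hypothetical helper, and the MIME type is guessed from the file extension with the standard library's mimetypes module.

```python
import base64
import mimetypes
import pathlib
import tempfile


def image_part_from_file(path):
    """Read an image file and wrap it as a base64 image part."""
    path = pathlib.Path(path)
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"{path.name} does not look like an image file")
    data = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"type": "image",
            "source": {"kind": "base64", "data": data, "mime": mime}}


# Demonstration with a throwaway file; the bytes are not a real image.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "output_sunset.png"
    sample.write_bytes(b"placeholder bytes")
    part = image_part_from_file(sample)

parts = [part, {"type": "text", "text": "What do you see in this image?"}]
print(part["source"]["mime"], len(part["source"]["data"]))
```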

Same parts structure as Stage 1. The only difference is that instead of bare strings, we use typed dictionaries to describe each piece of content. The vision model receives the image and the question in a single message, and Qwen's response comes back as a plain string, something like:

"The image shows a vivid sunset over an ocean. The sky displays gradients of orange, pink, and purple. The sun is partially below the horizon, with its reflection stretching across calm water."

"""
Multimodal pipeline: Text → Image → Text, three different providers.

  1. text  → text   Claude  (claude-3-5-sonnet-20241022)
  2. text  → image           (imagine-image-pro)
  3. image → text   Qwen    (qwen-vl-max)

Required env vars:
    ANTHROPIC_API_KEY
    XAI_API_KEY
    DASHSCOPE_API_KEY
"""

import os, sys, base64, pathlib
from yait_aichain import Model, Skill

text_skill = Skill(
    model = Model("claude-3-5-sonnet-20241022", api_key=os.environ["ANTHROPIC_API_KEY"]),
    input = {"messages": [{"role": "user", "parts": ["Describe a sunset in one sentence."]}]},
)

try:
    description = text_skill.run()
except Exception as e:
    print(f"Stage 1 failed: {e}"); sys.exit(1)
print(f"[text → text · Claude]\n{description}\n")

image_skill = Skill(
    model  = Model("imagine-image-pro", api_key=os.environ["XAI_API_KEY"]),
    input  = {"messages": [{"role": "user", "parts": [description]}]},
    output = {"modalities": ["image"], "format": {"type": "image", "size": "1024x1024"}},
)

try:
    image = image_skill.run()
except Exception as e:
    print(f"Stage 2 failed: {e}"); sys.exit(1)
img_path = pathlib.Path("output_sunset.png")
img_path.write_bytes(base64.b64decode(image["base64"]))
print(f"[text → image]\nsaved → {img_path}\n")

vision_skill = Skill(
    model = Model("qwen-vl-max", api_key=os.environ["DASHSCOPE_API_KEY"]),
    input = {
        "messages": [{
            "role": "user",
            "parts": [
                {"type": "image", "source": {"kind": "base64",
                                              "data": image["base64"],
                                              "mime": image["mime_type"]}},
                {"type": "text",  "text": "What do you see in this image?"},
            ],
        }]
    },
)

try:
    analysis = vision_skill.run()
except Exception as e:
    print(f"Stage 3 failed: {e}"); sys.exit(1)
print(f"[image → text · Qwen]\n{analysis}")

Three providers. Two modality transitions. Each stage wrapped in its own try/except so a failure at Stage 2 tells you it was Stage 2, not a cryptic traceback from somewhere inside a provider SDK you didn't even know you were calling.
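The three try/except blocks follow one pattern, so they can be folded into a helper if you prefer. This is a sketch, not how the script above is written: run_stage and StageError are hypothetical names, and the stand-in callables take the place of calls such as text_skill.run so the example runs without API keys.

```python
class StageError(RuntimeError):
    """Raised when a pipeline stage fails; carries the stage name."""

    def __init__(self, stage, cause):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage


def run_stage(stage, fn):
    """Call fn() and re-raise any failure labeled with the stage name."""
    try:
        return fn()
    except Exception as exc:
        raise StageError(stage, exc) from exc


def broken_image_stage():
    # Stand-in for image_skill.run() hitting a provider error.
    raise ConnectionError("provider unreachable")


description = run_stage("Stage 1", lambda: "A sunset in one sentence.")
try:
    run_stage("Stage 2", broken_image_stage)
except StageError as err:
    message = str(err)
    print(message)
```

In the script, a call site would become description = run_stage("Stage 1", text_skill.run), with a single except StageError around the whole pipeline that prints the message and calls sys.exit(1).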

There's no special "chaining" API. The variable description (a string) goes directly into image_skill's input. The variable image (a dictionary) gets its fields plucked out for vision_skill's input. Regular Python variables carry data between stages.

When you need to transform data between stages — truncating a description to 200 characters before image generation, for instance — you write normal Python between the calls. No callbacks, no middleware, no pipeline DSL. This is actually one of the things I like about this approach: the "pipeline" is just a script.
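The truncation mentioned above might look like the sketch below. truncate_prompt is a hypothetical helper, and the 200-character default is the illustrative number from the paragraph above, not a documented limit of any model.

```python
def truncate_prompt(text, limit=200):
    """Collapse whitespace and cut `text` to at most `limit` characters,
    preferring to end on a word boundary."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    cut = text[:limit]
    if text[limit] == " ":
        return cut  # the cut already falls between two words
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut  # no space at all: fall back to a hard cut


description = ("The sun melted into the horizon, painting the sky in layered bands "
               "of amber, rose, and deep violet as the ocean mirrored its fading warmth.")
short = truncate_prompt(description, limit=60)
print(short)
```

The result would then replace description in image_skill's parts list; nothing else in the script needs to know the text was shortened.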

The parts list is what keeps the interface uniform across modalities:

Text only: "parts": ["your string here"]

Image only: "parts": [{"type": "image", "source": {...}}]

Image and text: "parts": [image_dict, text_dict]

One structure, every model, every modality.
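If your own code needs to inspect a parts list, it can help to normalize bare strings into typed text dictionaries first, so there is only one shape to handle. A sketch under that assumption: normalize_parts and modalities are hypothetical helpers, not yait_aichain functions.

```python
def normalize_parts(parts):
    """Return a new list where every bare string is a typed text part."""
    normalized = []
    for part in parts:
        if isinstance(part, str):
            normalized.append({"type": "text", "text": part})
        elif isinstance(part, dict) and "type" in part:
            normalized.append(part)
        else:
            raise TypeError(f"unsupported part: {part!r}")
    return normalized


def modalities(parts):
    """List the distinct part types in order of first appearance."""
    seen = []
    for part in normalize_parts(parts):
        if part["type"] not in seen:
            seen.append(part["type"])
    return seen


image_dict = {"type": "image",
              "source": {"kind": "base64", "data": "AAAA", "mime": "image/png"}}

print(modalities(["your string here"]))
print(modalities([image_dict, "What do you see in this image?"]))
```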

Notice what's absent from the Skill configurations: no Anthropic client initialization, no provider-specific headers, no DashScope SDK imports. The Model constructor takes a model name and an API key; provider routing happens internally. Swapping the image generation model means changing one string and one environment variable; nothing else in the script changes.
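One way to make that swap explicit is to keep model names and key names in a single table. The sketch below does not call yait_aichain at all: STAGES and resolve are hypothetical names, the replacement model string and OTHER_API_KEY are made up, and a plain dict stands in for os.environ.

```python
STAGES = {
    "text":   ("claude-3-5-sonnet-20241022", "ANTHROPIC_API_KEY"),
    "image":  ("imagine-image-pro",          "XAI_API_KEY"),
    "vision": ("qwen-vl-max",                "DASHSCOPE_API_KEY"),
}


def resolve(stage, env, stages=STAGES):
    """Return (model_name, api_key) for a stage, failing loudly on a missing key."""
    model_name, key_name = stages[stage]
    if key_name not in env:
        raise KeyError(f"{key_name} is not set (needed for the {stage} stage)")
    return model_name, env[key_name]


fake_env = {"ANTHROPIC_API_KEY": "a", "XAI_API_KEY": "x",
            "DASHSCOPE_API_KEY": "d", "OTHER_API_KEY": "o"}

# Swapping the image model: one string and one environment variable name.
swapped = dict(STAGES, image=("some-other-image-model", "OTHER_API_KEY"))

print(resolve("image", fake_env))
print(resolve("image", fake_env, swapped))
print(resolve("text", fake_env, swapped) == resolve("text", fake_env))
```

In the pipeline, the tuple would feed straight into the constructor shown earlier: name, key = resolve("image", os.environ), then Model(name, api_key=key).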

Once you have this pattern, extensions are straightforward.

Each extension is another Skill, another Model, same shape.

The models do the hard work. The code connects them and stays out of the way.

Source: dev.to (original article)