Three providers, three modalities, under 55 lines of Python, and a PNG file on disk at the end. Claude writes a sunset description, an image generation model paints it, and Qwen Vision analyzes the result. Each model does one thing well; the script wires them together.
This article walks through building exactly that pipeline using yait_aichain's Skill and Model primitives. We'll go step by step: generate text with Claude, turn that text into an image, then feed the image to Qwen Vision for analysis.
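The pipeline has three stages:
1. Text → text: Claude writes the description (claude-3-5-sonnet-20241022)
2. Text → image: an image generation model renders it (imagine-image-pro)
3. Image → text: Qwen Vision analyzes the image (qwen-vl-max)
Each stage uses a different provider: Anthropic, xAI, and DashScope. The output of one stage becomes the input of the next.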
You need three API keys, each set as an environment variable:
export ANTHROPIC_API_KEY="your-anthropic-key"
export XAI_API_KEY="your-xai-key"
export DASHSCOPE_API_KEY="your-dashscope-key"
Install the library:
pip install yait_aichain
No extra dependencies for image handling: Python's base64 and pathlib modules cover the file I/O. yait_aichain handles provider routing internally, so you won't need to install Anthropic, xAI, or DashScope SDKs separately.
**Model** represents a connection to a specific model at a specific provider. You pass the model name and an API key; there are no provider-specific client classes and no adapter patterns to memorize.
**Skill** is a single unit of work. It takes a Model, an input (structured as messages), and optionally an output configuration. Call .run() and it executes. The message format uses a parts list inside each message, which is how yait_aichain handles multimodal content uniformly: text, images, and mixed content all go through the same structure.
import os, sys, base64, pathlib
from yait_aichain import Model, Skill

text_skill = Skill(
    model=Model("claude-3-5-sonnet-20241022", api_key=os.environ["ANTHROPIC_API_KEY"]),
    input={"messages": [{"role": "user", "parts": ["Describe a sunset in one sentence."]}]},
)
description = text_skill.run()
print(f"[text → text · Claude]\n{description}\n")
The input dictionary contains a messages list, similar in shape to what you'd see in a chat API. Each message has a role and a parts list. For plain text, parts is just a list of strings.
Notice the use of os.environ["KEY"] rather than os.getenv("KEY"). This is a deliberate choice I prefer for multi-provider scripts: os.getenv silently returns None when a key is missing, which pushes the error down to the provider's API, where the message is far less useful. os.environ raises a KeyError immediately with the variable name. When you're juggling three different API keys for the first time, you want to know which one is missing.
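One way to push that fail-fast idea a little further is to check all three keys before the first API call, so a single run reports every missing variable at once. The sketch below uses only the standard library; the helper name missing_keys and the REQUIRED_KEYS tuple are illustrative assumptions, not part of yait_aichain.

```python
import os

# The three variables this article's pipeline reads (names from the setup step).
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "XAI_API_KEY", "DASHSCOPE_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Demonstration with a fake environment so the result does not depend on your shell.
fake_env = {"ANTHROPIC_API_KEY": "sk-example", "XAI_API_KEY": ""}
print(missing_keys(fake_env))  # ['XAI_API_KEY', 'DASHSCOPE_API_KEY']
```

In a real script you would call missing_keys() with no arguments and exit with a message listing the names whenever the result is non-empty.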
text_skill.run() returns the model's response as a string. On a typical call, you'll get something like:
"The sun melted into the horizon, painting the sky in layered bands of amber, rose, and deep violet as the ocean mirrored its fading warmth."
That string becomes the input for Stage 2.
Why parts Instead of content?
The parts list is the design decision that makes multimodal work without special-casing. A text-only message uses ["some string"]. A message with an image uses a dictionary inside parts. A message with both uses both. Same field, same structure, every modality.
We take Claude's text output and pass it to the image generation model as a prompt:
image_skill = Skill(
    model=Model("imagine-image-pro", api_key=os.environ["XAI_API_KEY"]),
    input={"messages": [{"role": "user", "parts": [description]}]},
    output={"modalities": ["image"], "format": {"type": "image", "size": "1024x1024"}},
)
image = image_skill.run()

img_path = pathlib.Path("output_sunset.png")
img_path.write_bytes(base64.b64decode(image["base64"]))
print(f"[text → image]\nsaved → {img_path}\n")
Two things to notice here.
The output configuration. This is the first time we specify how the response should come back. "modalities": ["image"] tells the Skill we expect an image. The "format" dictionary specifies the type and dimensions. Without this, the model might return a text response instead of an image.
The return value. When a Skill produces an image, .run() returns a dictionary with at least two keys: "base64" (the image data) and "mime_type" (e.g., "image/png"). We decode the base64 data and write it to disk.
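The decode-and-write step can be wrapped in a small helper that also uses the "mime_type" key to pick the file extension. This is a minimal sketch using only the standard library; the function name save_image and its parameters are hypothetical, and it assumes the result dictionary has exactly the two keys described above.

```python
import base64
import mimetypes
import pathlib


def save_image(image, stem="output_sunset", directory="."):
    """Decode an image dict and write it to disk; return the path written.

    Assumes `image` carries the two keys described above: "base64" and "mime_type".
    """
    for key in ("base64", "mime_type"):
        if key not in image:
            raise KeyError(f"image result is missing {key!r}")
    # Pick the extension from the MIME type; fall back to .bin for unknown types.
    suffix = mimetypes.guess_extension(image["mime_type"]) or ".bin"
    path = pathlib.Path(directory) / f"{stem}{suffix}"
    # validate=True rejects characters outside the base64 alphabet instead of skipping them.
    path.write_bytes(base64.b64decode(image["base64"], validate=True))
    return path
```

Checking the keys up front turns an unexpected response shape into a clear error before anything touches the disk.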
pathlib.Path("output_sunset.png") writes to the current working directory rather than using __file__. That's deliberate: __file__ is undefined in interactive environments like Jupyter notebooks or a REPL, and referencing it there raises a NameError. A relative path works consistently across all contexts.
"1024x1024" is a common default for image generation models. If you pass a size the model doesn't support, you'll get an error at runtime rather than a silently resized image. Check your provider's documentation for supported dimensions before relying on a particular size.
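If you would rather catch a bad size before the request leaves your machine, a local check is enough. The sketch below is an assumption-heavy illustration: SUPPORTED_SIZES is a hypothetical allow-list (real values depend on the provider), and parse_size and check_size are made-up helper names.

```python
import re

# Hypothetical allow-list for illustration only; real values depend on the provider.
SUPPORTED_SIZES = {(512, 512), (1024, 1024)}


def parse_size(size):
    """Parse a "WIDTHxHEIGHT" string into a (width, height) tuple of ints."""
    match = re.fullmatch(r"(\d+)x(\d+)", size)
    if match is None:
        raise ValueError(f"expected WIDTHxHEIGHT, got {size!r}")
    return int(match.group(1)), int(match.group(2))


def check_size(size, supported=SUPPORTED_SIZES):
    """Return size unchanged if it is in the allow-list, else raise ValueError."""
    if parse_size(size) not in supported:
        raise ValueError(f"size {size!r} is not in the supported list")
    return size
```

The checked string can then go straight into the "size" field of the output configuration.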
The image from Stage 2 goes into Qwen's vision-language model:
vision_skill = Skill(
    model=Model("qwen-vl-max", api_key=os.environ["DASHSCOPE_API_KEY"]),
    input={
        "messages": [{
            "role": "user",
            "parts": [
                {"type": "image", "source": {"kind": "base64",
                                             "data": image["base64"],
                                             "mime": image["mime_type"]}},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }]
    },
)
analysis = vision_skill.run()
print(f"[image → text · Qwen]\n{analysis}")
The parts list now contains two items:
1. An image part: a dictionary with "type": "image" and a "source" object. The source specifies "kind": "base64", the actual base64 data, and the MIME type, both pulled directly from Stage 2's output dictionary.
2. A text part: a dictionary with "type": "text" and the question.
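The same image part can also be built from a file already on disk, which is handy for testing the vision stage on its own. This is a sketch under the assumption that the part shape shown above is all that is required; image_part_from_file is a hypothetical helper, and it relies only on the standard library.

```python
import base64
import mimetypes
import pathlib


def image_part_from_file(path):
    """Build an image part (same shape as the one above) from a file on disk."""
    path = pathlib.Path(path)
    # Infer the MIME type from the file name, e.g. "image/png" for a .png file.
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path.name!r}")
    data = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"type": "image", "source": {"kind": "base64", "data": data, "mime": mime}}
```

The returned dictionary drops into a parts list next to a text part, exactly where the Stage 2 output was used.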
Same parts structure as Stage 1. The only difference is that instead of bare strings, we use typed dictionaries to describe each piece of content. The vision model receives the image and the question in a single message, and Qwen's response comes back as a plain string, something like:
"The image shows a vivid sunset over an ocean. The sky displays gradients of orange, pink, and purple. The sun is partially below the horizon, with its reflection stretching across calm water."
Here is the complete script, with each stage wrapped in its own error handling:
"""
Multimodal pipeline: Text → Image → Text, three different providers.

  1. text  → text   Claude  (claude-3-5-sonnet-20241022)
  2. text  → image          (imagine-image-pro)
  3. image → text   Qwen    (qwen-vl-max)

Required env vars:
  ANTHROPIC_API_KEY
  XAI_API_KEY
  DASHSCOPE_API_KEY
"""
import os, sys, base64, pathlib
from yait_aichain import Model, Skill

text_skill = Skill(
    model=Model("claude-3-5-sonnet-20241022", api_key=os.environ["ANTHROPIC_API_KEY"]),
    input={"messages": [{"role": "user", "parts": ["Describe a sunset in one sentence."]}]},
)
try:
    description = text_skill.run()
except Exception as e:
    print(f"Stage 1 failed: {e}"); sys.exit(1)
print(f"[text → text · Claude]\n{description}\n")

image_skill = Skill(
    model=Model("imagine-image-pro", api_key=os.environ["XAI_API_KEY"]),
    input={"messages": [{"role": "user", "parts": [description]}]},
    output={"modalities": ["image"], "format": {"type": "image", "size": "1024x1024"}},
)
try:
    image = image_skill.run()
except Exception as e:
    print(f"Stage 2 failed: {e}"); sys.exit(1)
img_path = pathlib.Path("output_sunset.png")
img_path.write_bytes(base64.b64decode(image["base64"]))
print(f"[text → image]\nsaved → {img_path}\n")

vision_skill = Skill(
    model=Model("qwen-vl-max", api_key=os.environ["DASHSCOPE_API_KEY"]),
    input={
        "messages": [{
            "role": "user",
            "parts": [
                {"type": "image", "source": {"kind": "base64",
                                             "data": image["base64"],
                                             "mime": image["mime_type"]}},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }]
    },
)
try:
    analysis = vision_skill.run()
except Exception as e:
    print(f"Stage 3 failed: {e}"); sys.exit(1)
print(f"[image → text · Qwen]\n{analysis}")
Three providers. Two modality transitions. Each stage wrapped in its own try/except so a failure at Stage 2 tells you it was Stage 2, not a cryptic traceback from somewhere inside a provider SDK you didn't even know you were calling.
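If the three near-identical try/except blocks start to feel repetitive, the same labeling can be factored into one helper. The sketch below is one possible shape, not something yait_aichain provides: StageError and run_stage are hypothetical names, and the two fake stage functions stand in for calls like text_skill.run so the example runs without API keys.

```python
class StageError(RuntimeError):
    """Raised when a pipeline stage fails; remembers which stage it was."""

    def __init__(self, stage, cause):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage


def run_stage(stage, run):
    """Call run() and re-raise any failure as a StageError naming the stage."""
    try:
        return run()
    except Exception as exc:
        raise StageError(stage, exc) from exc


# Stand-in callables so the sketch runs without API keys.
def fake_text_stage():
    return "a one-sentence sunset"


def fake_image_stage():
    raise TimeoutError("provider did not respond")


print(run_stage("Stage 1", fake_text_stage))  # a one-sentence sunset
try:
    run_stage("Stage 2", fake_image_stage)
except StageError as err:
    print(err)  # Stage 2 failed: provider did not respond
```

In the real script the call would look like run_stage("Stage 1", text_skill.run), with a single except StageError at the top level deciding how to exit.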
There's no special "chaining" API. The variable description (a string) goes directly into image_skill's input. The variable image (a dictionary) gets its fields plucked out for vision_skill's input. Regular Python variables carry data between stages.
When you need to transform data between stages (truncating a description to 200 characters before image generation, for instance), you write normal Python between the calls. No callbacks, no middleware, no pipeline DSL. This is actually one of the things I like about this approach: the "pipeline" is just a script.
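As a concrete version of that truncation step, the standard library's textwrap.shorten cuts on a word boundary rather than mid-word. This is a small sketch; clip_prompt is a hypothetical helper name and the 200-character limit is just the figure from the example above.

```python
import textwrap


def clip_prompt(text, limit=200):
    """Collapse whitespace and cut text to at most `limit` characters on a word boundary."""
    return textwrap.shorten(text, width=limit, placeholder="...")


long_description = "The sun melted into the horizon. " * 20
prompt = clip_prompt(long_description)
print(len(prompt) <= 200)  # True
```

The clipped string would then replace description in the parts list for the image stage.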
The parts list is what keeps the interface uniform across modalities:
- Text only: "parts": ["your string here"]
- Image only: "parts": [{"type": "image", "source": {...}}]
- Image and text: "parts": [image_dict, text_dict]
One structure, every model, every modality.
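Because every variant is plain dictionaries and lists, the shapes above can be produced by a few tiny builder functions. The helpers below (text_part, image_part, user_input) are hypothetical conveniences written for this article, not yait_aichain functions; they only assemble the structures already shown.

```python
def text_part(text):
    """Typed text part, as used alongside an image."""
    return {"type": "text", "text": text}


def image_part(data, mime):
    """Typed image part carrying base64 data and its MIME type."""
    return {"type": "image", "source": {"kind": "base64", "data": data, "mime": mime}}


def user_input(*parts):
    """Wrap parts (bare strings or typed dicts) in a single user message."""
    if not parts:
        raise ValueError("a message needs at least one part")
    return {"messages": [{"role": "user", "parts": list(parts)}]}


text_only = user_input("Describe a sunset in one sentence.")
mixed = user_input(image_part("aGVsbG8=", "image/png"),
                   text_part("What do you see in this image?"))
print(text_only["messages"][0]["parts"])  # ['Describe a sunset in one sentence.']
print(len(mixed["messages"][0]["parts"]))  # 2
```

Either result has the same shape as the input dictionaries passed to Skill in the stages above.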
Notice what's absent from the Skill configurations: no Anthropic client initialization, no provider-specific headers, no DashScope SDK imports. The Model constructor takes a model name and an API key; provider routing happens internally. Swapping the image generation model means changing one string and one environment variable, and nothing else in the script changes.
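One way to make that "one string and one environment variable" swap explicit is to keep both in a small table. The sketch below is an illustration only: StageConfig, STAGES, and resolve are hypothetical names, and the code stops short of constructing a Model so it runs without the library installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    model_name: str  # the string passed as the model name
    key_var: str     # the environment variable holding that provider's key


# Model names and variable names are the ones used throughout this article.
STAGES = {
    "text": StageConfig("claude-3-5-sonnet-20241022", "ANTHROPIC_API_KEY"),
    "image": StageConfig("imagine-image-pro", "XAI_API_KEY"),
    "vision": StageConfig("qwen-vl-max", "DASHSCOPE_API_KEY"),
}


def resolve(stage, env):
    """Return (model_name, api_key) for a stage; raises KeyError if the key is unset."""
    cfg = STAGES[stage]
    return cfg.model_name, env[cfg.key_var]


# Fake environment for the demonstration; a real script would pass os.environ.
name, key = resolve("image", {"XAI_API_KEY": "xai-example"})
print(name)  # imagine-image-pro
```

The pair would then feed a call of the form Model(name, api_key=key), so replacing a provider touches one entry in STAGES.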
Once you have this pattern, extensions are straightforward. Each new stage is another Skill, another Model, same shape.
The models do the hard work. The code connects them and stays out of the way.