{"slug": "building-a-multimodal-ai-pipeline-text-image-text-across-three-providers", "title": "Building a Multimodal AI Pipeline: Text → Image → Text Across Three Providers", "summary": "A developer built a multimodal AI pipeline in under 55 lines of Python using yait_aichain's Skill and Model primitives, chaining Claude, an image generation model, and Qwen Vision across three providers (Anthropic, xAI, DashScope). The pipeline generates a sunset description with Claude, creates an image from that text, and then analyzes the result with Qwen Vision, demonstrating unified multimodal handling through a parts list structure.", "body_md": "Three providers, three modalities, under 55 lines of Python, and a PNG file on disk at the end. Claude writes a sunset description, an image generation model paints it, and Qwen Vision analyzes the result. Each model does one thing well; the script wires them together.\n\nThis article walks through building exactly that pipeline using yait_aichain's `Skill` and `Model` primitives. We'll go step by step: generate text with Claude, turn that text into an image, then feed the image to Qwen Vision for analysis.\n\nThe pipeline has three stages:\n\n1. Text → text: Claude (`claude-3-5-sonnet-20241022`)\n2. Text → image: an image generation model (`imagine-image-pro`)\n3. Image → text: Qwen Vision (`qwen-vl-max`)\n\nEach stage uses a different provider: Anthropic, xAI, and DashScope. The output of one stage becomes the input of the next.\n\nYou need three API keys, each set as an environment variable:\n\n```\nexport ANTHROPIC_API_KEY=\"your-anthropic-key\"\nexport XAI_API_KEY=\"your-xai-key\"\nexport DASHSCOPE_API_KEY=\"your-dashscope-key\"\n```\n\nInstall the library:\n\n```\npip install yait_aichain\n```\n\nNo extra dependencies are needed for image handling: Python's `base64` and `pathlib` modules cover the file I/O. yait_aichain handles provider routing internally, so you won't need to install Anthropic, xAI, or DashScope SDKs separately.\n\n**Model** represents a connection to a specific model at a specific provider. 
You pass the model name and an API key; there are no provider-specific client classes and no adapter patterns to memorize.\n\n**Skill** is a single unit of work. It takes a `Model`, an `input` (structured as messages), and optionally an `output` configuration. Call `.run()` and it executes. The message format uses a `parts` list inside each message, which is how yait_aichain handles multimodal content uniformly: text, images, and mixed content all go through the same structure.\n\n```python\nimport os, sys, base64, pathlib\nfrom yait_aichain import Model, Skill\n\ntext_skill = Skill(\n    model = Model(\"claude-3-5-sonnet-20241022\", api_key=os.environ[\"ANTHROPIC_API_KEY\"]),\n    input = {\"messages\": [{\"role\": \"user\", \"parts\": [\"Describe a sunset in one sentence.\"]}]},\n)\n\ndescription = text_skill.run()\nprint(f\"[text → text · Claude]\\n{description}\\n\")\n```\n\nThe `input` dictionary contains a `messages` list, similar in shape to what you'd see in a chat API. Each message has a `role` and a `parts` list. For plain text, `parts` is just a list of strings.\n\nNotice the use of `os.environ[\"KEY\"]` rather than `os.getenv(\"KEY\")`. I prefer this for multi-provider scripts: `os.getenv` silently returns `None` when a key is missing, which pushes the error down to the provider's API where the message is far less useful. `os.environ` raises a `KeyError` immediately with the variable name. When you're juggling three different API keys for the first time, you want to know *which* one is missing.\n\n`text_skill.run()` returns the model's response as a string. 
On a typical call, you'll get something like:\n\n\"The sun melted into the horizon, painting the sky in layered bands of amber, rose, and deep violet as the ocean mirrored its fading warmth.\"\n\nThat string becomes the input for Stage 2.\n\nWhy `parts` instead of `content`?\n\nThe `parts` list is the design decision that makes multimodal work without special-casing. A text-only message uses `[\"some string\"]`. A message with an image uses a dictionary inside `parts`. A message with both uses both. Same field, same structure, every modality.\n\nWe take Claude's text output and pass it to the image generation model as a prompt:\n\n```python\nimage_skill = Skill(\n    model  = Model(\"imagine-image-pro\", api_key=os.environ[\"XAI_API_KEY\"]),\n    input  = {\"messages\": [{\"role\": \"user\", \"parts\": [description]}]},\n    output = {\"modalities\": [\"image\"], \"format\": {\"type\": \"image\", \"size\": \"1024x1024\"}},\n)\n\nimage    = image_skill.run()\nimg_path = pathlib.Path(\"output_sunset.png\")\nimg_path.write_bytes(base64.b64decode(image[\"base64\"]))\nprint(f\"[text → image]\\nsaved → {img_path}\\n\")\n```\n\nTwo things to notice here.\n\n**The output configuration.** This is the first time we specify how the response should come back. `\"modalities\": [\"image\"]` tells the Skill we expect an image. The `\"format\"` dictionary specifies the type and dimensions. Without this, the model might return a text description instead of an image.\n\n**The return value.** When a Skill produces an image, `.run()` returns a dictionary with at least two keys: `\"base64\"` (the image data) and `\"mime_type\"` (e.g., `\"image/png\"`). We decode the base64 data and write it to disk.\n\n`pathlib.Path(\"output_sunset.png\")` is relative to the current working directory rather than anchored to `__file__`. That's deliberate: `__file__` is undefined in interactive environments like Jupyter notebooks or a REPL, and referencing it there raises a `NameError`. 
A relative path works consistently across all contexts.\n\n`\"1024x1024\"` is a common default for image generation models. If you pass a size the model doesn't support, you'll get an error at runtime rather than a silently resized image. Check your provider's documentation for supported dimensions before relying on a particular size.\n\nThe image from Stage 2 goes into Qwen's vision-language model:\n\n```python\nvision_skill = Skill(\n    model = Model(\"qwen-vl-max\", api_key=os.environ[\"DASHSCOPE_API_KEY\"]),\n    input = {\n        \"messages\": [{\n            \"role\": \"user\",\n            \"parts\": [\n                {\"type\": \"image\", \"source\": {\"kind\": \"base64\",\n                                              \"data\": image[\"base64\"],\n                                              \"mime\": image[\"mime_type\"]}},\n                {\"type\": \"text\",  \"text\": \"What do you see in this image?\"},\n            ],\n        }]\n    },\n)\n\nanalysis = vision_skill.run()\nprint(f\"[image → text · Qwen]\\n{analysis}\")\n```\n\nThe `parts` list now contains two items:\n\n**An image part**: a dictionary with `\"type\": \"image\"` and a `\"source\"` object. The source specifies `\"kind\": \"base64\"`, the actual base64 data, and the MIME type; the data and MIME type are both pulled directly from Stage 2's output dictionary.\n\n**A text part**: a dictionary with `\"type\": \"text\"` and the question.\n\nSame `parts` structure as Stage 1. The only difference is that instead of bare strings, we use typed dictionaries to describe each piece of content. The vision model receives the image and the question in a single message, and Qwen's response comes back as a plain string, something like:\n\n\"The image shows a vivid sunset over an ocean. The sky displays gradients of orange, pink, and purple. The sun is partially below the horizon, with its reflection stretching across calm water.\"\n\nHere is the complete script:\n\n```python\n\"\"\"\nMultimodal pipeline: Text → Image → Text, three different providers.\n\n  1. 
text  → text   Claude  (claude-3-5-sonnet-20241022)\n  2. text  → image           (imagine-image-pro)\n  3. image → text   Qwen    (qwen-vl-max)\n\nRequired env vars:\n    ANTHROPIC_API_KEY\n    XAI_API_KEY\n    DASHSCOPE_API_KEY\n\"\"\"\n\nimport os, sys, base64, pathlib\nfrom yait_aichain import Model, Skill\n\n# ── 1. Text → Text (Claude) ──────────────────────────────────────────────────\ntext_skill = Skill(\n    model = Model(\"claude-3-5-sonnet-20241022\", api_key=os.environ[\"ANTHROPIC_API_KEY\"]),\n    input = {\"messages\": [{\"role\": \"user\", \"parts\": [\"Describe a sunset in one sentence.\"]}]},\n)\n\ntry:\n    description = text_skill.run()\nexcept Exception as e:\n    print(f\"Stage 1 failed: {e}\"); sys.exit(1)\nprint(f\"[text → text · Claude]\\n{description}\\n\")\n\n# ── 2. Text → Image ──────────────────────────────────────────────────────────\nimage_skill = Skill(\n    model  = Model(\"imagine-image-pro\", api_key=os.environ[\"XAI_API_KEY\"]),\n    input  = {\"messages\": [{\"role\": \"user\", \"parts\": [description]}]},\n    output = {\"modalities\": [\"image\"], \"format\": {\"type\": \"image\", \"size\": \"1024x1024\"}},\n)\n\ntry:\n    image = image_skill.run()\nexcept Exception as e:\n    print(f\"Stage 2 failed: {e}\"); sys.exit(1)\nimg_path = pathlib.Path(\"output_sunset.png\")\nimg_path.write_bytes(base64.b64decode(image[\"base64\"]))\nprint(f\"[text → image]\\nsaved → {img_path}\\n\")\n\n# ── 3. 
Image → Text (Qwen Vision) ────────────────────────────────────────────\nvision_skill = Skill(\n    model = Model(\"qwen-vl-max\", api_key=os.environ[\"DASHSCOPE_API_KEY\"]),\n    input = {\n        \"messages\": [{\n            \"role\": \"user\",\n            \"parts\": [\n                {\"type\": \"image\", \"source\": {\"kind\": \"base64\",\n                                              \"data\": image[\"base64\"],\n                                              \"mime\": image[\"mime_type\"]}},\n                {\"type\": \"text\",  \"text\": \"What do you see in this image?\"},\n            ],\n        }]\n    },\n)\n\ntry:\n    analysis = vision_skill.run()\nexcept Exception as e:\n    print(f\"Stage 3 failed: {e}\"); sys.exit(1)\nprint(f\"[image → text · Qwen]\\n{analysis}\")\n```\n\nThree providers. Two modality transitions. Each stage is wrapped in its own `try/except`, so a failure at Stage 2 tells you it was Stage 2, not a cryptic traceback from somewhere inside a provider SDK you didn't even know you were calling.\n\nThere's no special \"chaining\" API. The variable `description` (a string) goes directly into `image_skill`'s input. The variable `image` (a dictionary) gets its fields plucked out for `vision_skill`'s input. Regular Python variables carry data between stages.\n\nWhen you need to transform data between stages (truncating a description to 200 characters before image generation, for instance), you write normal Python between the calls. No callbacks, no middleware, no pipeline DSL. 
This is one of the things I like about this approach: the \"pipeline\" is just a script.\n\nThe `parts` list is what keeps the interface uniform across modalities:\n\n- Text only: `\"parts\": [\"your string here\"]`\n- Image only: `\"parts\": [{\"type\": \"image\", \"source\": {...}}]`\n- Image and text: `\"parts\": [image_dict, text_dict]`\n\nOne structure, every model, every modality.\n\nNotice what's absent from the Skill configurations: no Anthropic client initialization, no provider-specific headers, no DashScope SDK imports. The `Model` constructor takes a model name and an API key; provider routing happens internally. Swapping the image generation model means changing one string and one environment variable; nothing else in the script changes.\n\nOnce you have this pattern, extensions are straightforward: another `Skill`, another `Model`, same shape.\n\nThe models do the hard work. The code connects them and stays out of the way.", "url": "https://wpnews.pro/news/building-a-multimodal-ai-pipeline-text-image-text-across-three-providers", "canonical_source": "https://dev.to/yait/building-a-multimodal-ai-pipeline-text-image-text-across-three-providers-22oo", "published_at": "2026-06-26 11:02:00+00:00", "updated_at": "2026-06-26 11:33:43.187580+00:00", "lang": "en", "topics": ["artificial-intelligence", "generative-ai", "large-language-models", "developer-tools", "computer-vision"], "entities": ["Claude", "Qwen Vision", "Anthropic", "xAI", "DashScope", "yait_aichain", "imagine-image-pro", "claude-3-5-sonnet-20241022"], "alternates": {"html": "https://wpnews.pro/news/building-a-multimodal-ai-pipeline-text-image-text-across-three-providers", "markdown": "https://wpnews.pro/news/building-a-multimodal-ai-pipeline-text-image-text-across-three-providers.md", "text": "https://wpnews.pro/news/building-a-multimodal-ai-pipeline-text-image-text-across-three-providers.txt", "jsonld": "https://wpnews.pro/news/building-a-multimodal-ai-pipeline-text-image-text-across-three-providers.jsonld"}}