AI-Orchestrated 3D Asset Pipeline: From JPEG to Game-Ready GLB Without Touching Blender

A developer built an AI-orchestrated 3D asset pipeline that converts JPEG images into game-ready GLB files without manual Blender use. The system uses an AI agent operating Blender through the Model Context Protocol (MCP), with a vision model validating each step by analyzing viewport screenshots. After rigging six animated models for a Godot 4 project, the developer found that the key pattern is teaching the AI agent to handle failures through a vision feedback loop rather than writing perfect scripts.

TL;DR: I built a pipeline where an AI agent operates Blender through MCP (Model Context Protocol), while a vision model validates every step by looking at screenshots. I never opened Blender's GUI for modeling. Here's what worked, what broke, and the patterns that emerged after rigging 6+ animated models for a Godot 4 project.

I needed animated 3D fish for a virtual aquarium in Godot 4. I don't know Blender. Instead of learning it, I built a pipeline where AI does the work and I supervise.

The stack: Blender 4.0+, BlenderMCP, a Qwen vision model, and Godot 4.x (exact versions are listed at the end of the article).

The architecture:

Human instructions → AI Agent (generates bpy code) → MCP Protocol (JSON-RPC over stdio) → Blender Addon (socket :9876, executes Python) → Viewport Screenshot → Vision Model (validates result) → AI Agent (adjusts or proceeds) → Export GLB → Godot

The human speaks problems. The AI translates them into Blender Python. The vision model confirms whether the result looks correct. Nobody clicks anything in Blender.

Traditional 3D pipeline: learn Blender (weeks), model manually (hours per asset), rig by hand (more hours), debug in Godot (pain).

AI-orchestrated pipeline: describe what you want, AI executes, vision model validates, iterate until correct. First model takes a couple of hours of prompt debugging. By the tenth model, you're done in 10 minutes.

The key insight: you don't automate Blender by writing a perfect script once. You automate it by teaching an AI agent to handle failures through a vision feedback loop.

This is the most important pattern. Everything else depends on it.

1. AI executes ONE Blender operation
2. Take a screenshot of the viewport
3. Vision model checks the result
4. If OK → next step. If FAIL → undo → try a different approach.

Why not batch operations? If the AI executes 6 bone extrusions in sequence and something breaks at step 2, neither the AI nor you can tell where it went wrong. One action per cycle means deterministic rollback.

Why vision validation? Blender's Python API doesn't always tell you the truth about visual results.
A bone might report correct coordinates but visually overlap with another bone. Weights might be "assigned" but produce garbage deformation. The viewport screenshot is ground truth.

Anti-stuck rule: if the same approach fails 3 times in a row, the AI must switch strategy. Extrude not working? Try moving the bone directly. Auto-weights failing? Switch to manual Gaussian assignment.

A naive prompt to a vision model produces naive answers. "Look at this Blender screenshot" gets you "I see some orange lines." You need structured, domain-specific prompts.

Bad: "Check the skeleton"

Good: "You are a rigging tech lead. Count the bones in the armature. Check: 1) All bone heads connect to previous bone tails? 2) Last bone reaches the end of the mesh? Answer strictly: bones=N|chain_ok=true/false|tail_reach=true/false"

Three prompt templates that cover 90% of validation (fields lost from the source are marked with …):

| Mode | Prompt format |
|---|---|
| Skeleton check | `bones=N\|chain_ok=true/false\|tail_reach=true/false` |
| Rigging check | `weights_painted=true/false\|only_tip_deforms=true/false\|…` |
| State check | `mode=EDIT/POSE/OBJECT\|selected=Bone.006\|…` |

Critical tips: force a viewport redraw before every screenshot with `bpy.ops.wm.redraw_timer(type='DRAW_WIN_SWAP', iterations=1)`. Without this, the screenshot captures a stale frame.

Blender retains actions, armature data, and mesh data even after deleting objects from the scene. If you rig Fish A, then import Fish B without cleaning, Fish A's bone animations leak into Fish B's export.

Real incident: Koi bone names appeared in Pterophyllum's GLB export, causing "Animation target not found" warnings in Godot.
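
A leak like the Koi incident can be caught mechanically before export by comparing the bones that animation channels refer to against the bones of the current armature. The following is a minimal sketch in plain Python, not part of the original pipeline. It assumes channels are identified by Blender-style F-curve data paths such as `pose.bones["Tail1"].rotation_quaternion`; the function name `leaked_bone_targets` is hypothetical, and collecting the data paths and bone names from Blender is left out.

```python
import re
from typing import Iterable, Set

# Assumed input format: F-curve data paths for pose bones, e.g.
#   pose.bones["Tail1"].rotation_quaternion
_BONE_PATH = re.compile(r'pose\.bones\["([^"]+)"\]')


def leaked_bone_targets(data_paths: Iterable[str], bone_names: Iterable[str]) -> Set[str]:
    """Return bone names that animation channels reference but the armature lacks.

    Hypothetical helper: `data_paths` are channel paths gathered from all
    actions that would be exported, `bone_names` are the bones of the model
    currently being rigged.
    """
    known = set(bone_names)
    referenced = set()
    for path in data_paths:
        match = _BONE_PATH.search(path)
        if match:  # paths that do not target a pose bone are ignored
            referenced.add(match.group(1))
    return referenced - known
```

An empty result means every animated bone exists in the current armature. Any name that comes back points at data left over from a previous model.
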
Mandatory cleanup script before each new model:

```python
import bpy

# Delete all scene objects
for obj in list(bpy.context.scene.objects):
    bpy.data.objects.remove(obj, do_unlink=True)

# Purge all orphan data blocks
bpy.ops.outliner.orphans_purge(do_local_ids=True, do_linked_ids=False, do_recursive=True)

# Verify: everything should be zero
print(f"Objects: {len(bpy.data.objects)}, "
      f"Actions: {len(bpy.data.actions)}, "
      f"Armatures: {len(bpy.data.armatures)}, "
      f"Meshes: {len(bpy.data.meshes)}")
```

Rule: one model at a time. Import → rig → weight → test → export → clean. Only then start the next one.

Blender's `ARMATURE_AUTO` weight assignment calculates distance from each bone to each vertex. This works for simple meshes. For thin geometry (fins, veils, tails), all bones appear "close" to all vertices, and the algorithm produces garbage.

What works instead: manual Gaussian weight assignment.

```python
import math

# Assumes these are already set up: mesh (mesh object), arm (armature object),
# bone_head (bone head position in armature space), group (target vertex group).
sigma = 0.03  # adjust per bone size
arm_inv = arm.matrix_world.inverted()
for v in mesh.data.vertices:
    # Vertex position expressed in armature space
    v_local = arm_inv @ mesh.matrix_world @ v.co
    d = (v_local - bone_head).length
    if d < sigma * 3:
        w = math.exp(-d * d / (2 * sigma * sigma))
        if w > 0.05:
            group.add([v.index], w, 'REPLACE')
```

Follow with normalization and smoothing (`vertex_group_smooth(factor=0.3, repeat=1)`). Then validate with the vision model.

Another common trap: a neutral bone or Root eating all weights. If a bone sits at the origin with `use_deform=True`, auto-weights assign it to everything. Fix: `bone.use_deform = False` for utility bones, then re-bind.

Many things that work in Blender break silently in Godot. These cost the most debugging time.

Blender defaults to Quaternion for armatures after GLB import. If your AI writes `bone.rotation_euler.x = -0.5`, nothing happens. The bone ignores Euler when in Quaternion mode. Fix: always set `bone.rotation_mode = 'XYZ'` before animating with Euler, or work in Quaternion throughout.

If a bone's rest pose isn't aligned to world axes, Godot applies animation offsets relative to a non-identity transform.
Result: the jaw nods the entire head instead of opening the mouth. Fix: in Edit Mode, align all bones strictly along X/Y/Z axes. Set roll = 0 for every bone. After posing, clear all transforms: the mesh should not move. If it moves, the rest pose is wrong.

Godot 4.x sometimes ignores bone scale if the rest pose doesn't match the skeleton rest. Gill breathing animated via `scale.x` on a bone worked in Blender but did nothing in Godot. Fix: use Shape Keys (blend shapes) instead of bone scale for facial/gill animation. Shape Keys work deterministically in both Blender and Godot. Bone animation is only for rotation-based movement (swimming, tail wagging).

Godot doesn't understand Blender constraints (Copy Rotation, etc.). They must be baked before export: `bpy.ops.nla.bake(frame_start=1, frame_end=60, visual_keying=True, clear_constraints=True, bake_types={'POSE'})`. Here `visual_keying=True` bakes the constraint results and `clear_constraints=True` removes the constraints from the export.

Body axis in Blender is X, in Godot is -Z. All models need a 90° rotation on import. Apply transforms before export: `bpy.ops.object.transform_apply(location=True, rotation=True, scale=True)`.

Blender animation at 30 FPS plays at half speed in Godot's 60 FPS physics. Set `AnimationPlayer.speed_scale = 2.0` or bake at 60 FPS from the start.

The coding AI cannot handle multi-step instructions reliably. "Animate Tail1, Tail2, Tail3 and both pectoral fins" produces `bpy.ops.pose.select_all` and breaks everything. Fix: one bone per call. Animate Tail1 → vision check → animate Tail2 → vision check → ... → bake all together at the end.

Blender's API is context-sensitive. Most `bpy.ops` calls fail with "poll failed, context is incorrect" if you're in the wrong mode. Rules the AI must follow:

- `mode_set(mode='POSE')` → set active = armature
- `mode_set(mode='WEIGHT_PAINT')` → set active = mesh
- `mode_set(mode='EDIT')` for armature → first go to OBJECT, then set active, then EDIT
- `select_all(action='DESELECT')` only works in OBJECT mode

After 3 failed attempts with the same approach, force a strategy change.
This must be an explicit rule in the agent's instructions, not a hope.

After each model, document what broke and how you fixed it. This creates a growing knowledge base (called PSP in the rest of this article) that makes each subsequent model faster. Format:

- Symptom: what you observed
- Cause: root cause
- Fix: code or procedure
- Applies to: which model types

Examples from real production:

| # | Symptom | Cause | Fix |
|---|---|---|---|
| 1 | `rotation_euler` has no effect | `rotation_mode='QUATERNION'` | Set `rotation_mode='XYZ'` first |
| 2 | Entire body moves when rotating fin | `use_connect=True` on fin bone | Set `use_connect=False`, parent to Spine1 |
| 3 | Orphan animations in exported GLB | Previous model's data not purged | Full cleanup script between models |
| 4 | Jaw nods the head in Godot | Rest pose not identity | Align bones to world axes, roll=0 |
| 5 | Gills don't animate in Godot | Scale on bones ignored by Godot 4 | Use Shape Keys instead of bone scale |
| 6 | Vision model says FAIL but code says PASS | Wrong viewport angle | Set camera to RIGHT/FRONT view before screenshot |

After ~10 models, PSP becomes your real pipeline. The AI reads it before starting each new model and avoids known pitfalls. First model: 3 hours. Tenth model: 20 minutes.

The most powerful pattern that emerged: using the vision model as a test framework.

```python
def assert_vision(question, expected_answer):
    # vlm_ask() and screenshot() are the pipeline's own helpers (not shown here).
    result = vlm_ask(screenshot(), question)
    # Compare the start of the reply: a plain substring test would let a
    # one-letter answer such as "A" match almost any text.
    if not result.strip().lower().startswith(expected_answer.lower()):
        raise AssertionError(
            f"Vision assert failed: expected '{expected_answer}', got '{result}'"
        )

# Usage:
# After rigging
assert_vision("Tail3 rotated 45°. What bent? A) Only tip B) Whole tail C) Entire body", "A")
# After weight painting
assert_vision("Head changed position?", "NO")
# After animation bake
assert_vision("Frame 1 and frame 60. Same pose?", "YES")
# After export and Godot import
assert_vision("Skeleton visible? Tail bends?", "YES")
```

This is CI/CD for 3D. If you change weights tomorrow, run the assert suite. If anything breaks, you know immediately.

1. Clean Blender scene (purge orphans)
2. Import GLB from Meshy.ai
3. Orient body along X axis (rotate Z -90°, apply transforms)
4. Decimate to target polycount (ratio 0.15-0.3)
5. Create armature: spine chain + fins + jaw
6. Parent mesh to armature with empty vertex groups
7. Assign weights: Gaussian for each bone, normalize, smooth
8. Vision check: rotate each bone → "only target deforms?"
9. Selective zero: remove weight leaks from body to face bones
10. Vision check: jaw/gills move independently?
11. Create swim animation: sin wave on spine chain, 60 frames
12. Vision check: frame 1 = frame 60? Natural motion?
13. Bake action: `visual_keying=True`, `clear_constraints=True`
14. Export GLB with animations and Shape Keys
15. Import in Godot, verify animation plays correctly
16. Clean Blender scene for next model

Between steps 7-10, expect 2-5 iterations per bone. This is normal. The feedback loop (AI executes → vision validates → AI adjusts) converges quickly once PSP covers common failure modes.

| Metric | First model | After PSP (latest models) |
|---|---|---|
| Time to rigged GLB | ~2 hours | ~10 minutes |
| Manual Blender work | Occasional weight painting | Zero |
| Vision checks per model | 15-20 | 3-5 |
| Export failures | 3-4 attempts | Usually first try |

The bottleneck shifted from "learning Blender" to "debugging AI prompts." When the AI makes a mistake, 90% of the time it's because the vision model gave bad feedback. Fix one line in the VLM prompt and the entire system gets smarter.

An important optimization emerged during the project. The initial architecture used a small local vision model (Qwen3VL-4B) purely for validation, while a separate coding AI generated the Blender Python. This meant two models, two contexts, two sets of prompts, and a manual bridge between them.

Later, I switched to a larger Qwen model accessed through MCP that could both see the viewport and write code. One model that understands what it's looking at AND knows how to fix it.
The feedback loop collapsed from "AI writes code → screenshot → VLM checks → human relays feedback → AI adjusts" to "AI writes code → looks at result → adjusts itself." This cut iteration time significantly. The patterns in this article still apply (one action per check, structured prompts, PSP), but the architecture becomes simpler when vision and coding live in the same model.

- One action, one check. Never let the AI chain operations blindly. Deterministic rollback requires deterministic steps.
- Vision validation is non-negotiable. Code can report success while the viewport shows garbage. The screenshot is ground truth.
- Auto-weights fail on thin geometry. Plan for manual Gaussian assignment on fins, veils, and facial features.
- Blender and Godot speak different languages. Rest pose identity, quaternion rotation, Shape Keys over bone scale, baked constraints: learn these once, document them in PSP, never debug them again.
- PSP is the real product. The pipeline isn't the code. It's the accumulated knowledge of what breaks and how to fix it. Each model teaches the system.
- The human role is supervisor, not operator. You describe problems in natural language. The AI translates to code. The VLM validates visually. You make decisions when the system gets stuck.

The same architecture (AI agent + MCP tool + vision validation) applies beyond Blender. Any GUI-heavy professional tool that exposes an API can be orchestrated this way. The patterns (one action/one check, structured VLM prompts, PSP accumulation) are universal.

The agents aren't replacing 3D artists. They're making 3D accessible to people who have ideas but not the specialized skills to execute them. The quality ceiling is still set by human judgment, but the floor has risen dramatically.
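
The one-action, one-check loop and the anti-stuck rule described in this article can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: each strategy, the check, and the undo are plain callables, and the names `run_step` and `MAX_ATTEMPTS_PER_STRATEGY` are hypothetical. In a real pipeline the check would wrap a viewport screenshot plus a vision-model query, and the undo would roll back the last Blender operation.

```python
from typing import Callable, Sequence

# Anti-stuck threshold taken from the article: switch after 3 failures in a row.
MAX_ATTEMPTS_PER_STRATEGY = 3


def run_step(
    strategies: Sequence[Callable[[], None]],
    check: Callable[[], bool],
    undo: Callable[[], None],
) -> int:
    """Run one pipeline step and return the index of the strategy that passed.

    Every strategy performs exactly ONE operation. After each attempt the
    result is checked; a failed attempt is undone before the next try.
    When a strategy fails three times in a row, the next one is used.
    """
    for index, strategy in enumerate(strategies):
        for _ in range(MAX_ATTEMPTS_PER_STRATEGY):
            strategy()       # one action
            if check():      # one check
                return index
            undo()           # deterministic rollback
    # Nothing worked: this is where the human supervisor decides.
    raise RuntimeError("all strategies failed; human decision needed")
```

A step such as "extend the tail chain" might pass `[extrude_bone, move_bone_directly]` as strategies (both hypothetical). Keeping the threshold in code rather than only in the prompt makes the strategy switch a guarantee instead of a request.
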
- Tested on: Linux Mint 22.3, Blender 4.0+, Godot 4.x, NVIDIA RTX 5060 Ti (eGPU via Thunderbolt 4)
- MCP Server: BlenderMCP 1.27.1
- Vision Models: Qwen3VL-4B (local, llama.cpp) → later a larger Qwen model (unified vision+coding via MCP)
- Author: Aleksandr Kossarev, Jõgeva, Estonia
- Project: Arche Iscrin https://archiscrin.bandcamp.com

This article is based on 2300+ lines of production notes from rigging 6 animated fish models for a Godot virtual aquarium, using an AI-orchestrated pipeline without manual Blender operation.