{"slug": "brick-composer-using-mllms-for-assembly-with-diverse-bricks", "title": "Brick-Composer: Using MLLMs for Assembly with Diverse Bricks", "summary": "Researchers have developed Brick-Composer, a learning framework that teaches multimodal large language models (MLLMs) to assemble objects from diverse building blocks, improving brick selection accuracy by over three times and raising strict step-level assembly success from less than 1% to around 15%. The framework, introduced alongside the BC-Bench benchmark, trains MLLMs using human demonstrations, physical feedback, and synthetic experience to overcome current models' struggles with fine-grained brick selection and precise pose estimation. After training, a Qwen-3-8B model correctly composed up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.", "body_md": "arXiv:2606.05445v1\nAbstract: We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation. 
To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B model can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.", "url": "https://wpnews.pro/news/brick-composer-using-mllms-for-assembly-with-diverse-bricks", "canonical_source": "https://arxiv.org/abs/2606.05445", "published_at": "2026-06-06 04:00:00+00:00", "updated_at": "2026-06-06 04:18:22.037029+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "computer-vision", "robotics"], "entities": ["Brick-Composer", "BC-Bench", "MLLMs", "Human Design Sparks", "World Feedback", "Synthetic Experience"], "alternates": {"html": "https://wpnews.pro/news/brick-composer-using-mllms-for-assembly-with-diverse-bricks", "markdown": "https://wpnews.pro/news/brick-composer-using-mllms-for-assembly-with-diverse-bricks.md", "text": "https://wpnews.pro/news/brick-composer-using-mllms-for-assembly-with-diverse-bricks.txt", "jsonld": "https://wpnews.pro/news/brick-composer-using-mllms-for-assembly-with-diverse-bricks.jsonld"}}