AI 3D tools need product evals, not benchmark faith

wpnews.pro

If you are building AI-generated 3D tooling, treat public benchmarks as lead signals, not product truth. A model can score well on an OpenSCAD-style benchmark and still be dangerous inside your app, because your product is not grading text against a reference file. It is asking users to trust generated geometry, measurements, layout intent, and downstream editability.

That changes the bar completely. The real question is not "which model topped the benchmark?" It is "what errors can this model make inside my workflow, and how cheaply can I catch them before the user pays for them?"

For CAD-like tools, room planners, parametric builders, scene generators, and layout systems, that question matters more than leaderboard position. Benchmarks are still useful. They help you narrow candidates and avoid obvious dead ends. But if you ship based on benchmark scores alone, you are outsourcing product judgment to someone else’s task design.

A benchmark usually tells you something real. It can reveal whether a model follows structured prompts, emits syntactically valid code, and handles a certain family of geometry tasks better than its peers. That is valuable.

What it does not tell you is whether the model is good at your failure boundary.

A benchmark can reward the wrong thing for a production tool:

That last point matters most. In 3D tooling, the average result is often less important than the ugly 5 percent. If the model occasionally creates self-intersecting meshes, non-manifold solids, overlapping walls, impossible clearances, or silently wrong measurements, the benchmark score stops being comforting.

A practical rule: use public benchmarks to choose what to test, not what to trust.

If a model performs well on an OpenSCAD benchmark, that is a reason to include it in your eval set. It is not a reason to expose generated geometry directly to paying users.

Most teams make the same mistake here. They evaluate the model at the prompt layer, but their product risk lives at the artifact layer.

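If your product accepts a natural-language request like "make a 4x6 meter room with a centered 900mm door and a 1.2 meter window on the east wall," your eval should not stop at "did the model produce plausible code?" It should verify whether the generated result satisfies the actual contract: for example, that the room dimensions are right, the door is the requested width and actually centered, the window is on the requested wall at the requested size, openings do not overlap, and the geometry is valid.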

That means your eval dataset needs to be product-specific.

A good internal eval set usually includes 30 to 100 tasks before you scale further. The point is not dataset size. The point is coverage of the decisions your product actually makes.

For a room-layout tool, that might include:

For a parametric CAD assistant, include:

The key is that each case should have a machine-checkable success condition where possible.

{
  "id": "room-door-window-01",
  "prompt": "Create a 4m x 6m room with a 900mm centered door on the south wall and a 1200mm window on the east wall, 1m from the northeast corner.",
  "checks": {
    "room_width_mm": 4000,
    "room_length_mm": 6000,
    "door_width_mm": 900,
    "door_centered_on_wall": true,
    "window_width_mm": 1200,
    "window_offset_from_ne_corner_mm": 1000,
    "no_opening_overlap": true,
    "manifold_geometry": true
  },
  "severity_if_wrong": "high"
}

That structure is already more useful than a generic prompt-response benchmark, because it tells you what failure means in your product.
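As a rough sketch of how a "checks" block like the one above could be graded automatically, the function below compares expected values against values measured from the generated artifact. The name `run_checks`, the `measured` dictionary, and the 0.5 mm tolerance are assumptions made for illustration; how you measure the artifact depends on your own pipeline.

```python
from typing import Any


def run_checks(expected: dict[str, Any], measured: dict[str, Any],
               tolerance_mm: float = 0.5) -> list[str]:
    """Compare measured artifact properties against an eval case's checks.

    Returns one message per failed check; an empty list means every check passed.
    """
    failures = []
    for name, want in expected.items():
        if name not in measured:
            failures.append(f"{name}: not measured")
            continue
        got = measured[name]
        # Test for bool first: bool is a subclass of int in Python.
        if isinstance(want, bool):
            ok = got is want
        elif isinstance(want, (int, float)):
            # Numeric checks pass within a small tolerance (hypothetical default).
            ok = (isinstance(got, (int, float)) and not isinstance(got, bool)
                  and abs(got - want) <= tolerance_mm)
        else:
            ok = got == want
        if not ok:
            failures.append(f"{name}: expected {want!r}, got {got!r}")
    return failures


expected = {"room_width_mm": 4000, "door_centered_on_wall": True}
measured = {"room_width_mm": 4000.2, "door_centered_on_wall": False}
print(run_checks(expected, measured))
# ['door_centered_on_wall: expected True, got False']
```

A missing measurement counts as a failure on purpose: if the pipeline cannot measure something the case requires, the case has not passed.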

Not all errors are equal. A mislabeled material is annoying. A wrong cutout dimension can ruin fabrication. A sofa overlapping a wall is ugly. A staircase with impossible rise/run values is unsafe.

So weight your evals accordingly.

A good scoring model usually separates:

That last category is where benchmark worship really breaks down. A fluent-looking result that is dimensionally wrong is much worse than an obvious failure, because users trust it longer.
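One way to make severity show up in the number you track is to weight each eval case by its "severity_if_wrong" value. This is a minimal sketch; the weights and the "low" and "medium" labels are hypothetical, and you should set them from what a failure actually costs your users.

```python
# Hypothetical weights: tune them to the real cost of each failure class.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 3.0, "high": 10.0}


def weighted_pass_rate(results: list[tuple[str, bool]]) -> float:
    """results holds one (severity, passed) pair per eval case."""
    total = sum(SEVERITY_WEIGHT[severity] for severity, _ in results)
    if total == 0:
        return 0.0
    earned = sum(SEVERITY_WEIGHT[severity] for severity, passed in results if passed)
    return earned / total


results = [("low", True), ("low", True), ("high", False)]
plain = sum(passed for _, passed in results) / len(results)
print(round(plain, 3))                        # 0.667
print(round(weighted_pass_rate(results), 3))  # 0.167
```

The same three results look acceptable unweighted and poor once the single high-severity miss carries its real weight.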

In 3D generation, pretty demos hide the expensive bugs. You should assume the model can produce syntactically valid output that is still operationally broken.

That is why your evals need geometry-aware checks, not just text-level scoring.

For CAD-like and layout tools, these are usually the ones that matter:

You do not need a perfect automated judge for all of these on day one. But you do need to stop pretending that valid text output is a sufficient proxy.
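To make "geometry-aware" concrete, here is a small sketch of one such check for triangle meshes: in a closed manifold mesh, every edge is shared by exactly two triangles. Passing this test is necessary but not sufficient for a valid solid, and the function name and input format (vertex-index triples) are assumptions for illustration.

```python
from collections import Counter


def edge_manifold_problems(
    triangles: list[tuple[int, int, int]],
) -> list[tuple[int, int]]:
    """Return the edges that are not shared by exactly two triangles.

    An empty result is a necessary condition for a closed manifold mesh,
    not a full proof of validity (it does not detect self-intersections).
    """
    counts: Counter = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            # Store edges undirected so (1, 2) and (2, 1) count as the same edge.
            counts[(min(u, v), max(u, v))] += 1
    return sorted(edge for edge, n in counts.items() if n != 2)


# A closed tetrahedron: four faces, six edges, each edge used twice.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(edge_manifold_problems(tetra))      # []
# Drop one face and its three edges become open boundary edges.
print(edge_manifold_problems(tetra[:3]))  # [(0, 2), (0, 3), (2, 3)]
```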

The most practical architecture is usually LLM plus verifier, not LLM alone.

If the model emits OpenSCAD, CAD parameters, or scene JSON, run deterministic checks after generation and before surfacing the result. Use the model for synthesis; use code for trust.

from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    errors: list[str]
    score: float

def validate_room(spec, artifact) -> EvalResult:
    """Check a generated room against the expected values of an eval case."""
    errors: list[str] = []

    # Dimensional checks: a wrong size is never acceptable.
    if artifact.room.width_mm != spec["room_width_mm"]:
        errors.append("room width mismatch")

    if artifact.room.length_mm != spec["room_length_mm"]:
        errors.append("room length mismatch")

    # Geometry checks run on the artifact itself, not on the generated text.
    if not artifact.geometry.is_manifold():
        errors.append("non-manifold geometry")

    if artifact.openings.overlap():
        errors.append("opening overlap")

    if artifact.units != "mm":
        errors.append("unexpected units")

    # These errors block the artifact outright; the others only lower the score.
    hard_failures = {
        "room width mismatch",
        "room length mismatch",
        "non-manifold geometry",
    }
    hard_fail = any(msg in hard_failures for msg in errors)

    return EvalResult(
        passed=not hard_fail,
        errors=errors,
        score=max(0.0, 1 - 0.25 * len(errors)),
    )

This is unglamorous, and that is exactly the point. If your product depends on geometry being right, you need boring validators in front of user trust.
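A validator only protects users if it sits between generation and display. The sketch below shows one possible gate: call the model, validate, retry a bounded number of times, and return nothing rather than an unverified artifact. The `generate` and `validate` callables are placeholders for your own model call and checks, and the retry count is an arbitrary assumption.

```python
from typing import Callable, Optional, TypeVar

A = TypeVar("A")


def generate_verified(
    generate: Callable[[str], A],
    validate: Callable[[A], list[str]],
    prompt: str,
    max_attempts: int = 3,
) -> tuple[Optional[A], list[list[str]]]:
    """Call generate until validate reports no errors or attempts run out.

    Returns (artifact, error_log). The artifact is None when every attempt
    failed, so the caller can refuse instead of showing an unverified result.
    """
    error_log: list[list[str]] = []
    for _ in range(max_attempts):
        artifact = generate(prompt)
        errors = validate(artifact)
        if not errors:
            return artifact, error_log
        error_log.append(errors)
    return None, error_log


# Stand-in "model" that gets the room width wrong once, then right.
attempts = iter([3900, 4000])
artifact, log = generate_verified(
    generate=lambda prompt: next(attempts),
    validate=lambda width: [] if width == 4000 else ["room width mismatch"],
    prompt="make a 4m wide room",
)
print(artifact, log)  # 4000 [['room width mismatch']]
```

Keeping the error log even on success is deliberate: attempts that failed before one passed are useful eval material.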

Official references such as the OpenSCAD documentation help when your generation target is code-based, because you can often parse, render, and inspect outputs deterministically. That is much safer than evaluating only by screenshot quality.
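If your target is OpenSCAD, one deterministic check is simply whether the generated file renders from the command line. This sketch assumes the `openscad` binary is on your PATH and uses its `-o` output option; the function names, the timeout, and the decision to return None when OpenSCAD is missing are illustrative choices, not a prescribed setup.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def build_render_command(scad_path: Path, out_path: Path) -> list[str]:
    # `openscad -o <output> <input>` renders to a file without opening the GUI.
    return ["openscad", "-o", str(out_path), str(scad_path)]


def render_or_none(scad_path: Path, out_path: Path,
                   timeout_s: int = 60) -> Optional[bool]:
    """Return True or False for render success, or None if openscad is absent.

    Raises subprocess.TimeoutExpired if rendering exceeds timeout_s.
    """
    if shutil.which("openscad") is None:
        return None
    proc = subprocess.run(
        build_render_command(scad_path, out_path),
        capture_output=True, text=True, timeout=timeout_s,
    )
    # Treat a non-zero exit code or a missing output file as a failed render.
    return proc.returncode == 0 and out_path.exists()


print(build_render_command(Path("room.scad"), Path("room.stl")))
# ['openscad', '-o', 'room.stl', 'room.scad']
```

A successful render only proves the code compiles to geometry. The dimensional and manifold checks still have to run on the exported result.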

The fastest way to hurt trust is to present generated geometry as if it were authoritative.

The safer rollout path is staged.

In the first version, the model should propose, not decide.

Good early-product patterns:

That product framing matters. Users are much more forgiving of a "generated draft" than a "done model" that later proves wrong.

This is especially important for iterative editing workflows. If a user asks, "make the countertop 300mm deeper but keep the sink centered," they are not asking for a fresh hallucination. They are asking for constraint-preserving transformation. Those are different jobs, and they should have different guardrails.
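Here is a minimal sketch of what "constraint-preserving" could mean for the countertop example. It assumes the sink is centered front to back and that the model stores an explicit offset; the `Countertop` fields and function names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Countertop:
    depth_mm: float
    sink_depth_mm: float   # sink size, front to back
    sink_offset_mm: float  # front edge of countertop to front edge of sink


def sink_is_centered(c: Countertop, tol_mm: float = 0.5) -> bool:
    expected_offset = (c.depth_mm - c.sink_depth_mm) / 2
    return abs(c.sink_offset_mm - expected_offset) <= tol_mm


def deepen(c: Countertop, delta_mm: float) -> Countertop:
    """Apply the edit, then re-derive the dependent value from the constraint."""
    new_depth = c.depth_mm + delta_mm
    edited = replace(
        c,
        depth_mm=new_depth,
        sink_offset_mm=(new_depth - c.sink_depth_mm) / 2,
    )
    # Guardrail: verify the invariant instead of trusting the edit.
    if not sink_is_centered(edited):
        raise ValueError("edit broke the centering constraint")
    return edited


before = Countertop(depth_mm=600, sink_depth_mm=400, sink_offset_mm=100)
after = deepen(before, 300)
print(after.depth_mm, after.sink_offset_mm)           # 900 250.0
# A naive edit changes only the depth and silently breaks the constraint.
print(sink_is_centered(replace(before, depth_mm=900)))  # False
```

The same invariant check works whether the edit comes from deterministic code, as here, or from a model proposing new values.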

A strong 3D tool does not only ask, "can the model generate this?" It asks, "when the model is wrong, can the system recover cheaply?"

That means storing enough structure to support repairs:

If you reduce everything to one final text blob, every correction becomes a full regeneration. That is fragile.

A better pattern is intermediate representation first, generated artifact second. Let the model fill a schema, validate the schema, then compile to the final representation.

type LayoutIntent = {
  room: { widthMm: number; lengthMm: number };
  openings: Array<{
    kind: "door" | "window";
    wall: "north" | "south" | "east" | "west";
    widthMm: number;
    offsetMm: number;
  }>;
  furniture: Array<{
    kind: string;
    xMm: number;
    yMm: number;
    rotationDeg: number;
  }>;
};

That schema gives you something you can validate, diff, repair, and version. The generated scene or CAD code becomes a compilation target, not the only source of truth.
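As an illustration of the "validate the schema" step, the sketch below checks a layout intent, passed as a plain dictionary with the same field names as the type above, before anything is compiled. It assumes north and south walls run along `widthMm` and east and west walls along `lengthMm`, and that `offsetMm` is measured along the wall from one corner. Those conventions are assumptions, not something the schema itself states.

```python
def validate_layout(intent: dict) -> list[str]:
    """Return structural errors in a layout intent; empty means it can compile."""
    errors = []
    room = intent["room"]
    if room["widthMm"] <= 0 or room["lengthMm"] <= 0:
        errors.append("room dimensions must be positive")

    # Assumption: north/south walls span widthMm, east/west walls span lengthMm.
    wall_len = {"north": room["widthMm"], "south": room["widthMm"],
                "east": room["lengthMm"], "west": room["lengthMm"]}

    spans_by_wall: dict[str, list[tuple[float, float, int]]] = {}
    for i, op in enumerate(intent["openings"]):
        start, end = op["offsetMm"], op["offsetMm"] + op["widthMm"]
        if op["widthMm"] <= 0 or start < 0 or end > wall_len[op["wall"]]:
            errors.append(f"opening {i} does not fit on the {op['wall']} wall")
        spans_by_wall.setdefault(op["wall"], []).append((start, end, i))

    # After sorting by start, any overlap shows up between some adjacent pair.
    for wall, spans in spans_by_wall.items():
        spans.sort()
        for (_, e1, i1), (s2, _, i2) in zip(spans, spans[1:]):
            if s2 < e1:
                errors.append(f"openings {i1} and {i2} overlap on the {wall} wall")
    return errors


intent = {
    "room": {"widthMm": 4000, "lengthMm": 6000},
    "openings": [
        {"kind": "door", "wall": "south", "widthMm": 900, "offsetMm": 1550},
        {"kind": "window", "wall": "east", "widthMm": 1200, "offsetMm": 1000},
    ],
    "furniture": [],
}
print(validate_layout(intent))  # []
```

Because the checks run on the intent rather than on generated code, a failure can be repaired by changing one field instead of regenerating the whole scene.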

Offline evals are necessary, but they are not enough. Once real users start pushing the tool, they will discover edge cases your synthetic set missed.

The correct move is to build a feedback loop that turns production failures back into eval cases.

When a generation fails, capture more than the prompt:

That gives you a real source of truth for future evals. Otherwise you end up debugging vibes instead of failures.
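One possible shape for that loop is sketched below: log a structured failure record, then convert it into a case with the same layout as the eval JSON shown earlier. Every field name on `FailureRecord` is hypothetical, and treating the user's manual correction as the expected values is an assumption that only holds when you trust the correction.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    prompt: str
    model_id: str                # hypothetical: which model produced the artifact
    artifact_text: str           # the generated code or scene that failed
    validator_errors: list[str]
    user_correction: dict        # values the user fixed by hand


def to_eval_case(record: FailureRecord, case_id: str, severity: str) -> dict:
    """Turn a logged production failure into a regression eval case."""
    return {
        "id": case_id,
        "prompt": record.prompt,
        # Assumption: the user's corrected values are the ground truth.
        "checks": dict(record.user_correction),
        "severity_if_wrong": severity,
    }


record = FailureRecord(
    prompt="Create a 4m x 6m room with a 900mm centered door on the south wall.",
    model_id="candidate-a",
    artifact_text="cube([4000, 6000, 2400]);",
    validator_errors=[],
    user_correction={"door_width_mm": 900, "door_centered_on_wall": True},
)
case = to_eval_case(record, case_id="prod-0001", severity="high")
print(case["checks"])  # {'door_width_mm': 900, 'door_centered_on_wall': True}
```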

A useful internal taxonomy is simple:

gen_valid_user_accepted

gen_valid_user_repaired

gen_invalid_blocked_by_validator

gen_invalid_escaped_to_user

gen_refused_correctly

Now you can measure whether the system is improving in ways that matter.

The metric I would care about most is not public benchmark position. It is failure escape rate: how often a materially wrong artifact reaches the user as if it were usable.

That metric aligns with product trust.
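Given outcome labels from the taxonomy above, the metric is a few lines of code. This sketch assumes the denominator is all generations; dividing by invalid generations only is an equally defensible choice and is shown as a separate catch rate. Both function names are illustrative.

```python
from collections import Counter


def escape_rate(outcomes: list[str]) -> float:
    """Share of all generations where an invalid artifact reached the user."""
    if not outcomes:
        return 0.0
    return Counter(outcomes)["gen_invalid_escaped_to_user"] / len(outcomes)


def validator_catch_rate(outcomes: list[str]) -> float:
    """Share of invalid generations that the validator blocked."""
    counts = Counter(outcomes)
    blocked = counts["gen_invalid_blocked_by_validator"]
    invalid = blocked + counts["gen_invalid_escaped_to_user"]
    # With no invalid generations there was nothing to miss.
    return blocked / invalid if invalid else 1.0


outcomes = (
    ["gen_valid_user_accepted"] * 6
    + ["gen_invalid_blocked_by_validator"] * 2
    + ["gen_invalid_escaped_to_user"]
    + ["gen_refused_correctly"]
)
print(escape_rate(outcomes))                     # 0.1
print(round(validator_catch_rate(outcomes), 3))  # 0.667
```

Tracking both numbers separates two questions: how often users see bad geometry, and how much of the bad geometry your validators are catching.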

If benchmark score improves by 8 percent but escape rate barely moves, you probably improved syntax, not safety. If benchmark score stays flat but invalid geometry reaching users drops sharply, that is real progress.

This is the contrarian part builders need to accept: the best model for your product may not be the benchmark winner. It may be the one that works best with your validators, preserves constraints more reliably, degrades more honestly, or produces artifacts your pipeline can safely repair.

If I were building an AI-powered 3D or CAD-adjacent tool today, I would use public benchmarks only to shortlist candidate models. Then I would build a product eval set with strict constraint checks, geometry validation, and severity-weighted scoring. I would ship proposal mode first, keep structured intermediate representations, and block any artifact that fails deterministic validation.

I would also assume that some failures will still escape, so I would log enough evidence to turn production mistakes into new eval cases every week.

That is slower than posting a benchmark chart and declaring victory. It is also how you avoid shipping a tool that looks intelligent in demos and becomes expensive in real use.

The practical decision rule is simple: never trust a 3D generation model more than your validators trust the artifact it produced. In this category, benchmarks help you start. They should not decide when you are safe to ship.

Read the full post on QCode: https://qcode.in/how-to-build-ai-generated-3d-tools-without-trusting-benchmarks/

source & further reading

dev.to (original article)
