JSON-Schema masks can block needed tool calls

A study reveals that JSON-Schema grammar-based token masks can silently block tool calls in LLM agents, preventing function invocation even when the model intends to call a tool. Researchers propose a lightweight two-pass inference hack that drops the mask in a second pass, restoring tool invocation rates from 0% to 100% without retraining. The fix adds latency and has not been tested on closed-source models or complex multi-tool workflows.

Grammar‑based token masks can silently block the very function calls an LLM agent must emit. A lightweight two‑pass inference hack sidesteps the problem without retraining the model.

Before this work, engineers routinely combined JSON‑Schema output constraints with tool‑calling APIs, assuming the two constraints coexisted harmlessly. Existing agents simply turned on the schema validator and let the model decide when to invoke a tool.

The suppression stems from the way schemas are enforced: “JSON Schema constraints are compiled into grammar‑based token masks that render tool‑call tokens unreachable during decoding” [1] (https://arxiv.org/abs/2606.25605). The mask prunes every token that would start a function call, so decoding never reaches a valid tool‑call token even though the rest of the response complies with the schema.

Running the model in a transparent two‑pass mode eliminates the dead end. In the second pass the mask is dropped, allowing the model to emit the missing call, and the paper reports that “Tool Invocation Rate increased from 0% to 100%” [1] (https://arxiv.org/abs/2606.25605). The fix preserves full schema compliance while recovering every required tool activation.

The study evaluates open‑weight model families in a production pipeline but does not assess closed‑source models or more complex multi‑tool workflows, leaving it unclear whether they suffer the same mask‑induced deadlock. Moreover, the extra decoding pass adds latency, which may be a concern for real‑time agents.

These limits suggest a need for smarter mask designs that exclude only truly illegal tokens rather than bluntly cutting off all call prefixes. If the two‑pass pattern holds across deployments, any benchmark that measures tool use under schema constraints should be re‑run with the mask disabled in a second pass.
More importantly, production agents can adopt the transparent two‑pass strategy as a default safety net, ensuring that tool calls are never silently lost while still guaranteeing structured output.
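
To make the mechanism concrete, here is a toy Python sketch of how a schema mask can hide a tool-call token and how a second, unmasked pass can surface it. It is not the paper's implementation. The vocabulary, the `toy_scores` stand-in for a model, the `schema_mask` function, and the `<tool_call>` token are all hypothetical, chosen only to illustrate the idea under simplified assumptions (greedy decoding, a tool call represented by a single token).

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical toy vocabulary; "<tool_call>" stands in for any token that opens a tool call.
TOOL_CALL = "<tool_call>"
EOS = "<eos>"
VOCAB: List[str] = ["{", '"answer"', ":", '"42"', "}", TOOL_CALL, EOS]
JSON_SCRIPT: List[str] = ["{", '"answer"', ":", '"42"', "}", EOS]


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a model: it prefers to open with a tool call,
    and otherwise walks through a fixed JSON object."""
    scores = {tok: 0.0 for tok in VOCAB}
    if not prefix:
        scores[TOOL_CALL] = 2.0  # the model's first choice is a tool call
        scores["{"] = 1.0        # the schema-compliant answer is second
    elif prefix[0] == TOOL_CALL:
        scores[EOS] = 1.0        # toy assumption: the call is one token long
    else:
        nxt = JSON_SCRIPT[min(len(prefix), len(JSON_SCRIPT) - 1)]
        scores[nxt] = 1.0
    return scores


def schema_mask(prefix: List[str]) -> Set[str]:
    """Simplified stand-in for a compiled grammar mask: every token except
    the tool-call token is allowed, so the call is unreachable."""
    return set(VOCAB) - {TOOL_CALL}


def decode(mask: Optional[Callable[[List[str]], Set[str]]] = None,
           max_steps: int = 8) -> List[str]:
    """Greedy decoding; when a mask is given, disallowed tokens are skipped."""
    out: List[str] = []
    for _ in range(max_steps):
        scores = toy_scores(out)
        allowed = mask(out) if mask else set(VOCAB)
        tok = max((t for t in VOCAB if t in allowed), key=lambda t: scores[t])
        if tok == EOS:
            break
        out.append(tok)
    return out


def two_pass() -> Tuple[List[str], List[str]]:
    """Pass 1 decodes under the mask; pass 2 drops the mask and keeps any tool call."""
    structured = decode(mask=schema_mask)
    unmasked = decode(mask=None)
    tool_calls = [t for t in unmasked if t == TOOL_CALL]
    return structured, tool_calls


if __name__ == "__main__":
    structured, tool_calls = two_pass()
    print("masked pass:", structured)
    print("tool calls recovered in unmasked pass:", tool_calls)
```

In this toy, the masked pass alone never produces the tool call even though it is the model's top-scoring first token, which mirrors the silent suppression described above. The second pass doubles the decoding work, which is the latency cost the article mentions.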