Six Bugs Only a Live Model Could Teach Us

AgentOps Debugger is an agentic application for investigating environmental-compliance history in Peru. The idea is simple: you ask in Spanish or English about companies regulated by OEFA, Peru’s environmental regulator, and the system retrieves public sanction records and regulatory documents, builds cited answers, drafts structured reports behind a human-approval step, and shows the complete trace of how the answer was produced.

The architecture: a Coordinator plans typed tasks, specialist Qwen agents execute them, and every step is stored in an audit ledger.

From the beginning, I designed the project as offline-first. The full system can run without API keys: seed records, lexical BM25 retrieval, deterministic no-LLM agents, and a local demo from docker compose up. That helped a lot because all 315 of our tests run without network calls. The app is testable, reproducible, and easy to demo.

Live mode then swaps the deterministic agents for real Qwen agents behind the same interfaces. The idea was solid: keep the same typed boundaries, use zod contracts, and validate every structured output.

But when we deployed to Alibaba Cloud and started using real qwen-plus, the real lesson appeared:

Offline tests are necessary, but they cannot catch the most important failures in an agentic system, because many failures come from the model output distribution, not from your code.

We ran the same flows several times against the live model, and six different issues appeared. All tests were green, but the live behavior still broke in ways that only real model output could expose.

Our schema expected:

    status: "completed" | "failed" | "needs_user_input"

But live qwen-plus returned values like:

    "success"
    "done"
    "in_progress"

Sometimes it also skipped the required summary. The strict parser rejected the whole task, even when the answer itself was useful.

Fix: I added tolerant preprocessors. They normalize status synonyms and derive fallback summaries when needed.
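A minimal TypeScript sketch of the tolerant preprocessing described above, assuming the raw model output arrives as an untyped object. The names (STATUS_ALIASES, normalizeStatus, deriveSummary, preprocessTaskResult) and the answer field are hypothetical, not the project's actual code, and the synonym table is only an illustration. In a zod-based codebase, logic like this would typically run before the strict schema validates. The sketch deliberately leaves an in-progress status unmapped, since the post does not say how that value was treated.

```typescript
// Hypothetical names throughout: the post does not show the real preprocessor.
type TaskStatus = "completed" | "failed" | "needs_user_input";

interface TaskResult {
  status: TaskStatus;
  summary: string;
}

// Assumed synonym table; in practice it would be grown from live traces.
const STATUS_ALIASES = new Map<string, TaskStatus>([
  ["completed", "completed"],
  ["complete", "completed"],
  ["success", "completed"],
  ["done", "completed"],
  ["failed", "failed"],
  ["error", "failed"],
  ["needs_user_input", "needs_user_input"],
]);

// Lowercase, trim, and unify spaces/hyphens to underscores before lookup.
function normalizeStatus(raw: unknown): TaskStatus | null {
  if (typeof raw !== "string") return null;
  const key = raw.trim().toLowerCase().replace(/[\s-]+/g, "_");
  return STATUS_ALIASES.get(key) ?? null;
}

// Fallback summary: the first sentence of the answer, capped in length.
function deriveSummary(answer: string, maxLength = 160): string {
  const text = answer.trim();
  const end = text.search(/[.!?](\s|$)/);
  const first = end === -1 ? text : text.slice(0, end + 1);
  return first.length > maxLength ? first.slice(0, maxLength) : first;
}

// Returns a normalized result, or null when nothing usable can be salvaged.
function preprocessTaskResult(raw: unknown): TaskResult | null {
  if (typeof raw !== "object" || raw === null) return null;
  const obj = raw as Record<string, unknown>;

  const status = normalizeStatus(obj["status"]);
  if (status === null) return null;

  const given = obj["summary"];
  const answer = obj["answer"]; // assumed field holding the free-text answer
  const summary =
    typeof given === "string" && given.trim() !== ""
      ? given.trim()
      : deriveSummary(typeof answer === "string" ? answer : "");

  return summary === "" ? null : { status, summary };
}
```

With this shape, a label such as "done" or "needs user input" is normalized instead of rejected, while an unknown label or an output with no usable text still fails validation.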
The lesson here is simple: rejecting a correct answer because of a label mismatch is usually the wrong trade-off.

Sometimes the planner returned no tasks and no clarification question. The output was not useful, but the app still tried to convert it into a normal response, which produced a misleading canned answer.

Fix: I added a pure plan interpreter that detects degenerate plans, retries once, and then falls back to an honest localized message saying that the system could not derive a plan. Better to be transparent than to pretend the agent understood something it did not.

The citation schema expected fields like:

    documentTitle
    passage
    confidence

But the model returned variants like:

    title
    text
    high

Also, some confidence labels came in English even when the contract expected Spanish-style values.

Fix: I added alias mapping and per-item citation salvage. Each citation is validated independently. If one citation is malformed, we drop that one and keep the valid citations. One bad citation should not destroy four good ones.

One flow asked the user to approve saving a report, but the planner had not created the report draft first. So the user approved the action, and the system correctly answered:

There is no report draft to save.

The logic was safe, but the user experience was broken.

Fix: The plan interpreter now detects an unpaired save task and inserts the missing draft task before it. The prompt also explains the expected three-task recipe, but the code no longer assumes the model followed it. This is one of the most important lessons: prompt contracts help, but code must still protect the workflow.

The app can list sanctioned entities as clickable cards. The user clicks one company, and the run resumes with the selected RUC. In English, the model sometimes received the selected entity and still asked the same clarification question again.

Fix: Once the user answers a clarification, the system never asks that same clarification again.
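The two plan-interpreter fixes described above (detecting a degenerate plan, and repairing an unpaired save task) can be sketched as one pure function. This is an illustration under assumptions: the task kinds, the Plan shape, and the interpretPlan name are hypothetical, since the post does not show the real typed tasks.

```typescript
// Hypothetical task and plan shapes; the project's real types are not shown.
type TaskKind = "retrieve" | "draft_report" | "save_report";

interface PlannedTask {
  kind: TaskKind;
}

interface Plan {
  tasks: PlannedTask[];
  clarification?: string;
}

type Interpretation =
  | { kind: "run"; tasks: PlannedTask[] }
  | { kind: "clarify"; question: string }
  | { kind: "degenerate" };

// Pure: no model calls, no I/O. It only inspects and repairs the plan.
function interpretPlan(plan: Plan): Interpretation {
  const question = plan.clarification?.trim();

  // No tasks: either a real clarification or a plan with nothing in it.
  if (plan.tasks.length === 0) {
    return question ? { kind: "clarify", question } : { kind: "degenerate" };
  }

  // Walk the tasks and make sure every save has a draft before it.
  const repaired: PlannedTask[] = [];
  let draftAvailable = false;
  for (const task of plan.tasks) {
    if (task.kind === "draft_report") {
      draftAvailable = true;
    }
    if (task.kind === "save_report") {
      if (!draftAvailable) {
        repaired.push({ kind: "draft_report" }); // insert the missing draft
      }
      draftAvailable = false; // each save consumes one draft
    }
    repaired.push(task);
  }
  return { kind: "run", tasks: repaired };
}
```

Under this design the caller owns the retry policy: on a "degenerate" verdict it would ask the planner once more, and if the second plan is also degenerate it shows the localized "could not derive a plan" message instead of a canned answer.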
From that point, entity resolution is computed from the records. The model can write the narrative, but it does not control whether the entity was resolved.

[Figure: The listing hardened in round 5: entities as clickable candidates. The resolution is deterministic, so it does not depend on the model’s mood.]

When asked for a report on Minera Las Bambas S.A., the model claimed the entity was ambiguous, even though the legal name matched one unique record. Because of that, it produced no data, and the save step failed after approval.

Fix: I added full-name resolution in the entity heuristic. Now every model-claimed ambiguity is verified against the data. If the data resolves to one entity, the system answers. If the ambiguity is real, the system builds the candidate list itself. The model can suggest ambiguity, but the data decides whether ambiguity exists.

The main principle became clear:

Let the LLM narrate, but do not let it own structured outcomes.

After these fixes, the important structured parts are deterministic.

Qwen is still very useful. It understands the analyst’s intent, plans the work, and writes the legal narrative in Spanish and English using the right Peruvian regulatory terminology. But the system does not ask the model to be the source of truth for things that should come from the records. Even the mandatory disclaimer in the regulatory report is a z.literal. The model cannot rephrase it because the model never owns that part.

This is also why the project is called AgentOps Debugger. Every model call and tool call goes into an append-only trace ledger with token counts, latency, and attribution. When the next live issue appears, the trace shows exactly what happened.

[Figure: The Traceability sheet on the live deploy: qwen-plus model calls with tokens and latency, plus tool calls with attribution.]
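To make the append-only trace ledger concrete, here is a small in-memory sketch. The entry fields and the TraceLedger class are hypothetical names chosen for illustration; the project's real ledger schema and storage are not shown in this post. The only properties the sketch tries to capture are the ones named above: entries can be added but never changed, and each one carries attribution, latency, and (for model calls) token counts.

```typescript
// Hypothetical entry shape; field names are illustrative only.
type TraceEntry =
  | {
      kind: "model_call";
      agent: string; // attribution: which agent made the call
      model: string;
      inputTokens: number;
      outputTokens: number;
      latencyMs: number;
    }
  | {
      kind: "tool_call";
      agent: string;
      tool: string;
      latencyMs: number;
    };

class TraceLedger {
  private readonly entries: Readonly<TraceEntry>[] = [];

  // Append is the only write operation; stored entries are frozen copies.
  append(entry: TraceEntry): number {
    this.entries.push(Object.freeze({ ...entry }));
    return this.entries.length - 1; // position in the ledger
  }

  // Readers get a copy of the list, so they cannot reorder or truncate it.
  list(): readonly Readonly<TraceEntry>[] {
    return this.entries.slice();
  }

  // Aggregate view: total tokens across model calls, total latency overall.
  totals(): { tokens: number; latencyMs: number } {
    let tokens = 0;
    let latencyMs = 0;
    for (const entry of this.entries) {
      latencyMs += entry.latencyMs;
      if (entry.kind === "model_call") {
        tokens += entry.inputTokens + entry.outputTokens;
      }
    }
    return { tokens, latencyMs };
  }
}
```

A production ledger would persist entries rather than hold them in memory, but the same append-only contract applies: debugging a live failure means reading the entries back in order, not editing them.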
A few implementation details that were important:

@ai-sdk/openai-compatible, not @ai-sdk/openai

For DashScope’s /compatible-mode/v1, the OpenAI-compatible provider worked better. The regular OpenAI provider can classify non-OpenAI model ids as reasoning models, send a developer role, and target the Responses API. DashScope rejects those assumptions.

DashScope requires the literal word json in the messages before it honors:

    response_format: { type: "json_object" }

One line in the system prompt fixed that.

The prompt reduces errors, but it does not eliminate them. The parser must still be defensive. If you rely only on the prompt, the system will break. If you rely only on the parser, the model will drift more often. You need both.

For our deployment, the international endpoint was the right choice:

    dashscope-intl.aliyuncs.com

Also, for this type of bulk agent workload, qwen-plus was a better trade-off than qwen-max: capable enough and much cheaper.

One non-model bug was very easy to miss. crypto.randomUUID worked locally, but failed on the plain HTTP demo IP because it only exists in secure contexts. So “New investigation” worked on localhost and broke on the deployed URL. Live browser testing matters.

[Figure: The result: a cited answer with charts, evidence chips, and suggested next steps, running live on Alibaba Cloud ECS with real Qwen.]

An agentic system becomes more trustworthy when every conclusion can be traced back to the records, documents, and decisions that produced it.

Qwen Cloud gave us a model strong enough to plan and narrate in two languages over a technical legal domain. But the engineering lesson of this hackathon was not only how to use the model. The real lesson was deciding what the model should not own. For this kind of regulatory workflow, the model can help with intent, planning, and language. But the final structured outcome must be computed, validated, and traceable.
Project: AgentOps Debugger: OEFA Environmental Compliance
Hackathon: Qwen Cloud Hackathon, Track 3: Agent Society
Code: https://github.com/GinoLlerena/agentops-debugger-architecture
License: MIT
Stack: Qwen on Qwen Cloud, DashScope, Mastra, AI SDK v5, Hono, React, Docker, Alibaba Cloud ECS.

This post was written with AI assistance (Claude Code), the same assistant we pair-programmed with during the hackathon. The bugs, fixes, and lessons are from our real build log; fittingly, the project it describes is about never trusting unverified AI output.