{"slug": "six-bugs-only-a-live-model-could-teach-us", "title": "Six Bugs Only a Live Model Could Teach Us", "summary": "AgentOps Debugger, an agentic application for investigating environmental-compliance history in Peru, revealed six critical bugs only exposed by live model output, not offline tests. The system, designed as offline-first with deterministic agents, failed in production when real Qwen agents returned unexpected status values, malformed citations, and incomplete plans. The developer implemented tolerant preprocessors, alias mapping, and plan interpreters to handle these failures, emphasizing that prompt contracts are insufficient without robust code safeguards.", "body_md": "**AgentOps Debugger** is an agentic application for investigating environmental-compliance history in Peru.\n\nThe idea is simple: you ask in Spanish or English about companies regulated by OEFA, Peru’s environmental regulator, and the system retrieves public sanction records and regulatory documents, builds cited answers, drafts structured reports behind a human-approval step, and shows the complete trace of how the answer was produced.\n\nThe stack is:\n\n*The architecture: a Coordinator plans typed tasks, specialist Qwen agents execute them, and every step is stored in an audit ledger.*\n\nFrom the beginning, I designed the project as **offline-first**.\n\nThe full system can run without API keys: seed records, lexical BM25 retrieval, deterministic no-LLM agents, and a local demo from `docker compose up`.\n\nThat helped a lot because all 315 of our tests run without network calls. The app is testable, reproducible, and easy to demo. 
Then live mode swaps the deterministic agents for real Qwen agents behind the same interfaces.\n\nThe idea was solid: keep the same typed boundaries, use `zod` contracts, and validate every structured output.\n\nBut when we deployed to Alibaba Cloud and started using real `qwen-plus`, the real lesson appeared:\n\nOffline tests are necessary, but they cannot catch the most important failures in an agentic system, because many failures come from the model output distribution, not from your code.\n\nWe ran the same flows several times against the live model, and six different issues appeared. All tests were green, but the live behavior still broke in ways that only real model output could expose.\n\nOur schema expected:\n\n```\nstatus: \"completed\" | \"failed\" | \"needs_user_input\"\n```\n\nBut live `qwen-plus` returned values like:\n\n```\n\"success\"\n\"done\"\n\"in_progress\"\n```\n\nSometimes it also skipped the required `summary`.\n\nThe strict parser rejected the whole task, even when the answer itself was useful.\n\n**Fix:** I added tolerant preprocessors. They normalize status synonyms and derive fallback summaries when needed.\n\nThe lesson here is simple: rejecting a correct answer because of a label mismatch is usually the wrong trade-off.\n\nSometimes the planner returned no tasks and no clarification question.\n\nTechnically, the output was not useful, but the app still tried to convert it into a normal response. 
That created a misleading canned answer.\n\n**Fix:** I added a pure plan interpreter that detects degenerate plans, retries once, and then falls back to an honest localized message saying that the system could not derive a plan.\n\nBetter to be transparent than to pretend the agent understood something it did not.\n\nThe citation schema expected fields like:\n\n```\ndocumentTitle\npassage\nconfidence\n```\n\nBut the model returned variants like:\n\n```\ntitle\ntext\nhigh\n```\n\nAlso, some confidence labels came in English even when the contract expected Spanish-style values.\n\n**Fix:** I added alias mapping and per-item citation salvage.\n\nEach citation is validated independently. If one citation is malformed, we drop that one and keep the valid citations.\n\nOne bad citation should not destroy four good ones.\n\nOne flow asked the user to approve saving a report, but the planner had not created the report draft first.\n\nSo the user approved the action, and the system correctly answered:\n\nThere is no report draft to save.\n\nThe logic was safe, but the user experience was broken.\n\n**Fix:** The plan interpreter now detects an unpaired save task and inserts the missing draft task before it.\n\nThe prompt also explains the expected three-task recipe, but the code no longer assumes the model followed it.\n\nThis is one of the most important lessons: prompt contracts help, but code must still protect the workflow.\n\nThe app can list sanctioned entities as clickable cards. The user clicks one company, and the run resumes with the selected RUC.\n\nIn English, the model sometimes received the selected entity and still asked the same clarification question again.\n\n**Fix:** Once the user answers a clarification, the system never asks that same clarification again.\n\nFrom that point, entity resolution is computed from the records. 
The model can write the narrative, but it does not control whether the entity was resolved.\n\n*The listing hardened in round 5: entities as clickable candidates. The resolution is deterministic, so it does not depend on the model’s mood.*\n\nWhen asked for a report on:\n\nMinera Las Bambas S.A.\n\nThe model claimed the entity was ambiguous, even though the legal name matched one unique record.\n\nBecause of that, it produced no data, and the save step failed after approval.\n\n**Fix:** I added full-name resolution in the entity heuristic.\n\nNow every model-claimed ambiguity is verified against the data. If the data resolves to one entity, the system answers. If the ambiguity is real, the system builds the candidate list itself.\n\nThe model can suggest ambiguity, but the data decides whether ambiguity exists.\n\nThe main principle became clear:\n\nLet the LLM narrate, but do not let it own structured outcomes.\n\nAfter these fixes, the important structured parts are deterministic:\n\nQwen is still very useful. It understands the analyst’s intent, plans the work, and writes the legal narrative in Spanish and English using the right Peruvian regulatory terminology.\n\nBut the system does not ask the model to be the source of truth for things that should come from the records.\n\nEven the mandatory disclaimer in the regulatory report is a `z.literal`. The model cannot rephrase it because the model never owns that part.\n\nThis is also why the project is called **AgentOps Debugger**. Every model call and tool call goes into an append-only trace ledger with token counts, latency, and attribution. 
When the next live issue appears, the trace shows exactly what happened.\n\n*The Traceability sheet on the live deploy: qwen-plus model calls with tokens and latency, plus tool calls with attribution.*\n\nA few implementation details that were important:\n\n`@ai-sdk/openai-compatible`, not `@ai-sdk/openai`\n\nFor DashScope’s `/compatible-mode/v1`, the OpenAI-compatible provider worked better.\n\nThe regular OpenAI provider can classify non-OpenAI model ids as reasoning models, send a `developer` role, and target the Responses API. DashScope rejects those assumptions.\n\nDashScope requires the literal word `json` in the messages before it honors:\n\n```\nresponse_format: { type: \"json_object\" }\n```\n\nOne line in the system prompt fixed that.\n\nThe prompt reduces errors, but it does not eliminate them.\n\nThe parser must still be defensive:\n\nIf you only rely on the prompt, the system will break. If you only rely on the parser, the model will drift more often. You need both.\n\nFor our deployment, the international endpoint was the right choice:\n\n```\ndashscope-intl.aliyuncs.com\n```\n\nAlso, for this type of bulk agent workload, `qwen-plus` was a better trade-off than `qwen-max`: capable enough and much cheaper.\n\nOne non-model bug was very easy to miss.\n\n`crypto.randomUUID` worked locally, but failed on the plain HTTP demo IP because it only exists in secure contexts.\n\nSo “New investigation” worked on localhost and broke on the deployed URL.\n\nLive browser testing matters.\n\n*The result: a cited answer with charts, evidence chips, and suggested next steps, running live on Alibaba Cloud ECS with real Qwen.*\n\nAn agentic system becomes more trustworthy when every conclusion can be traced back to the records, documents, and decisions that produced it.\n\nQwen Cloud gave us a model strong enough to plan and narrate in two languages over a technical legal domain.\n\nBut the engineering lesson of this hackathon was not only how to 
use the model. The real lesson was deciding what the model should not own.\n\nFor this kind of regulatory workflow, the model can help with intent, planning, and language.\n\nBut the final structured outcome must be computed, validated, and traceable.\n\nProject: **AgentOps Debugger — OEFA Environmental Compliance**\n\nHackathon: **Qwen Cloud Hackathon, Track 3 — Agent Society**\n\nCode: [github.com/GinoLlerena/agentops-debugger-architecture](https://github.com/GinoLlerena/agentops-debugger-architecture)\n\nLicense: MIT\n\nStack: Qwen on Qwen Cloud, DashScope, Mastra, AI SDK v5, Hono, React, Docker, Alibaba Cloud ECS.\n\n*This post was written with AI assistance (Claude Code) — the same assistant we pair-programmed with during the hackathon. The bugs, fixes, and lessons are from our real build log; fittingly, the project it describes is about never trusting unverified AI output.*", "url": "https://wpnews.pro/news/six-bugs-only-a-live-model-could-teach-us", "canonical_source": "https://dev.to/ginollerena/six-bugs-only-a-live-model-could-teach-us-57k5", "published_at": "2026-07-04 05:00:01+00:00", "updated_at": "2026-07-04 05:18:49.814262+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "developer-tools", "ai-products", "natural-language-processing"], "entities": ["AgentOps Debugger", "OEFA", "Qwen", "Alibaba Cloud", "Peru"], "alternates": {"html": "https://wpnews.pro/news/six-bugs-only-a-live-model-could-teach-us", "markdown": "https://wpnews.pro/news/six-bugs-only-a-live-model-could-teach-us.md", "text": "https://wpnews.pro/news/six-bugs-only-a-live-model-could-teach-us.txt", "jsonld": "https://wpnews.pro/news/six-bugs-only-a-live-model-could-teach-us.jsonld"}}