Building a Self-Verifying FTIR Agent with Qwen Function Calling

A developer built ChemSpectra Agent, an FTIR spectral analysis system using Qwen-3.7-Max function calling, for the Qwen Cloud Hackathon. The agent autonomously selects from five analysis tools, cross-validates evidence, and triggers self-verification when confidence is low or conflicts are detected. The system demonstrates an agent that checks its own work, catching errors that single-pass analysis misses.

Built for Track 4: Autopilot Agent, QwenCloudHackathon

Most AI "agents" are API wrappers with a system prompt. Upload data, call one endpoint, return the result. No verification, no reasoning about what went wrong, no ability to self-correct.

For the Qwen Cloud Hackathon, I built ChemSpectra Agent, an FTIR spectral analysis system where Qwen-3.7-Max autonomously selects tools, cross-validates evidence across multiple results, and triggers self-verification when confidence is low. The key insight: an agent that checks its own work catches errors that single-pass analysis misses.

The agent has access to 5 analysis tools, each hitting a different endpoint of the FTIR.fun spectral library (130,000+ reference spectra):

| Tool | Purpose |
|---|---|
| `identify_material` | Match spectrum against reference library, return ranked candidates |
| `explain_peaks` | Explain what chemical bond vibration each peak represents |
| `assign_functional_groups` | Map peaks to functional groups (C=O, O-H, N-H, etc.) |
| `match_library_topk` | Rapid top-K screening without deep analysis |
| `search_public_results` | Search publicly shared analysis cases via MCP |

Instead of hardcoding which tools to call, I define these as Qwen Function Calling schemas and let the model decide:

```python
from dashscope import Generation

AGENT_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "identify_material",
            "description": "Match spectrum against 130,000+ reference spectra...",
            "parameters": {
                "type": "object",
                "properties": {
                    "top_k": {"type": "integer", "default": 10},
                    "sample_type": {"type": "string"},
                },
            },
        },
    },
    # ... 4 more tools
]

# DASHSCOPE_API_KEY and messages are defined elsewhere in the agent.
response = Generation.call(
    api_key=DASHSCOPE_API_KEY,
    model="qwen3.7-max",
    messages=messages,
    tools=AGENT_TOOLS,  # Qwen decides which to call
    result_format="message",
)
```

The result: different questions trigger different tool combinations.

- "What is this material?" → `identify_material` + `explain_peaks`
- "Deformulate this sample" → all three analytical tools
- "Quick screening" → just `match_library_topk`

The LLM decides, not the developer.
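Once the model has picked its tools, something still has to run them. Here is a minimal sketch of that dispatch step. It assumes the assistant message carries an OpenAI-style `tool_calls` list with JSON-encoded arguments; the handler functions, `TOOL_HANDLERS`, and `dispatch_tool_calls` are hypothetical stand-ins of mine, not the project's actual code, and the handlers return canned data instead of calling FTIR.fun.

```python
import json

# Hypothetical local handlers; real ones would call the FTIR.fun endpoints.
def identify_material(top_k=10, sample_type=None):
    return {"tool": "identify_material", "top_k": top_k, "sample_type": sample_type}

def match_library_topk(top_k=5):
    return {"tool": "match_library_topk", "top_k": top_k}

TOOL_HANDLERS = {
    "identify_material": identify_material,
    "match_library_topk": match_library_topk,
}

def dispatch_tool_calls(assistant_message):
    """Run every tool the model asked for and build 'tool' messages for the next turn."""
    tool_messages = []
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        handler = TOOL_HANDLERS.get(name)
        if handler is None:
            # Report unknown tools back to the model instead of crashing.
            result = {"error": f"unknown tool: {name}"}
        else:
            # Assumption: arguments arrive as a JSON-encoded string.
            args = json.loads(call["function"].get("arguments") or "{}")
            result = handler(**args)
        tool_messages.append(
            {"role": "tool", "name": name, "content": json.dumps(result)}
        )
    return tool_messages

# A message shaped like a function-calling reply (assumed shape).
example_message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"function": {"name": "identify_material", "arguments": '{"top_k": 3}'}},
    ],
}
tool_outputs = dispatch_tool_calls(example_message)
```

The returned messages get appended to the conversation so the model can read the results on its next turn. Sending unknown tool names back as an error payload gives the model a chance to recover on its own.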
The agent runs a Think → Act → Observe loop, up to 6 iterations. In each iteration Qwen emits tool calls: which tools to invoke and with what parameters. In practice, most analyses complete in 2-3 iterations. Qwen's `enable_thinking=True` mode shows the full chain-of-thought reasoning, so you can see why it chose each tool.

After the ReAct loop, the agent doesn't just return results. It runs two automated checks.

Confidence estimation: calculated from match scores, candidate score gaps, and functional group coverage:

```python
def estimate_confidence(self, session):
    scores = []
    id_result = session.tool_results.get("identify_material", {})
    if id_result.get("matches"):
        top_sim = id_result["matches"][0].get("similarity", 0)
        scores.append(top_sim)
        if len(id_result["matches"]) >= 2:
            gap = top_sim - id_result["matches"][1].get("similarity", 0)
            scores.append(min(1.0, gap * 5))  # larger gap = more confident
    # ... more signals from other tools
```

Evidence conflict detection: compares outputs across tools. If `identify_material` says "PET" but `assign_functional_groups` found no ester groups, that's a contradiction:

```python
expected_groups = {
    "pet": ["ester", "c=o", "aromatic"],
    "nylon": ["amide", "n-h", "c=o"],
    "polyethylene": ["c-h", "ch2", "methylene"],
    "silicone": ["si-o", "si-c", "siloxane"],
}
# If 2+ expected groups are missing → conflict
```

When confidence < 0.75 or conflicts are detected, the agent automatically triggers a verification round. Qwen is told exactly what went wrong:

```text
ISSUES DETECTED:
- functional_group_mismatch: material="pet", missing=["ester", "aromatic"]
- low_confidence: 0.62 (threshold: 0.75)
```

Qwen then autonomously calls additional tools to investigate. After verification, confidence is recalculated. In testing, I've seen confidence traces like [0.62, 0.84], a 35% relative improvement from one verification round.
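The two checks and the verification trigger fit in a few lines. The sketch below is an illustration under assumptions, not the project's code: `detect_conflicts`, `needs_verification`, and `format_issues` are hypothetical names, and exact string matching on group labels is my simplification. The 0.75 threshold, the "2+ missing groups" rule, and the issue format come from the description above.

```python
import json

EXPECTED_GROUPS = {
    "pet": ["ester", "c=o", "aromatic"],
    "nylon": ["amide", "n-h", "c=o"],
    "polyethylene": ["c-h", "ch2", "methylene"],
    "silicone": ["si-o", "si-c", "siloxane"],
}
CONFIDENCE_THRESHOLD = 0.75

def detect_conflicts(material, found_groups):
    """Flag a mismatch when 2+ expected groups are absent from the findings."""
    key = material.lower()
    found = {g.lower() for g in found_groups}
    missing = [g for g in EXPECTED_GROUPS.get(key, []) if g not in found]
    if len(missing) >= 2:
        return [{"type": "functional_group_mismatch", "material": key, "missing": missing}]
    return []

def needs_verification(confidence, conflicts):
    """Verification runs on low confidence OR any detected conflict."""
    return confidence < CONFIDENCE_THRESHOLD or bool(conflicts)

def format_issues(confidence, conflicts):
    """Build the issue report that is handed back to the model."""
    lines = ["ISSUES DETECTED:"]
    for c in conflicts:
        lines.append(
            f'- {c["type"]}: material="{c["material"]}", missing={json.dumps(c["missing"])}'
        )
    if confidence < CONFIDENCE_THRESHOLD:
        lines.append(f"- low_confidence: {confidence:.2f} (threshold: {CONFIDENCE_THRESHOLD})")
    return "\n".join(lines)

conflicts = detect_conflicts("PET", ["c=o"])
report = format_issues(0.62, conflicts)

# Relative gain for the trace [0.62, 0.84]: (0.84 - 0.62) / 0.62, about 35%.
relative_gain = (0.84 - 0.62) / 0.62
```

Keeping the trigger as a plain boolean over two independent signals means a high-similarity match still gets a second look when the functional-group evidence disagrees with it.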
When Qwen's structured JSON output fails to parse (it happens: LLMs sometimes wrap JSON in markdown code blocks), the error and original output are sent back to Qwen with context:

```python
repair_messages = messages + [
    {"role": "assistant", "content": raw},
    {"role": "user", "content": f"Parse error: {raw[:200]!r}\nReturn ONLY valid JSON."},
]
raw_retry = self._call_qwen(repair_messages)
```

Near-100% recovery rate. No silent failures.

In regulated industries (pharmaceutical QC under FDA 21 CFR Part 11, forensic substance identification, environmental contaminant detection), an AI that returns wrong results without flagging uncertainty is dangerous. ChemSpectra Agent's self-verification turns "AI that gives answers" into "AI that checks its work." The confidence trace provides an audit trail that fits existing compliance frameworks.

All LLM reasoning (tool selection, synthesis, verification, self-repair, follow-up chat, report generation) runs through Alibaba Cloud's `dashscope` SDK with `qwen3.7-max`. Six distinct call sites, one provider.
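Returning to the JSON self-repair step described earlier, here is one way the parse-then-repair flow could be wired together. Treat it as a sketch under assumptions: `strip_markdown_fence` and `parse_with_repair` are hypothetical helpers, `call_model` is any callable that takes a message list and returns raw text (standing in for the agent's own Qwen call), and stripping a markdown fence locally before asking for a retry is my addition, not something the original describes.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so no literal fence appears in this example
FENCED_BLOCK = re.compile(FENCE + r"[a-zA-Z]*\s*(.*?)\s*" + FENCE, re.DOTALL)

def strip_markdown_fence(raw):
    """Return the body of the first fenced block, or the trimmed text if none."""
    match = FENCED_BLOCK.search(raw)
    return match.group(1) if match else raw.strip()

def parse_with_repair(raw, messages, call_model, max_retries=1):
    """Parse model output as JSON, asking the model to fix it on failure."""
    for attempt in range(max_retries + 1):
        try:
            return json.loads(strip_markdown_fence(raw))
        except json.JSONDecodeError as exc:
            if attempt == max_retries:
                raise  # surface the failure rather than hiding it
            repair_messages = messages + [
                {"role": "assistant", "content": raw},
                {
                    "role": "user",
                    "content": f"Parse error: {exc}. Output began: {raw[:200]!r}\n"
                               "Return ONLY valid JSON.",
                },
            ]
            raw = call_model(repair_messages)

# Usage with a stand-in model that always returns clean JSON.
calls = []

def fake_model(msgs):
    calls.append(msgs)
    return '{"material": "pet"}'

fenced = FENCE + 'json\n{"a": 1}\n' + FENCE
parsed_fenced = parse_with_repair(fenced, [], fake_model)
parsed_repaired = parse_with_repair("not json", [], fake_model)
```

Handling the fenced case locally avoids a round trip for the most common failure, and re-raising after the final attempt keeps unrecoverable failures visible.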