Built for Track 4: Autopilot Agent — #QwenCloudHackathon
Most AI "agents" are API wrappers with a system prompt. Upload data, call one endpoint, return the result. No verification, no reasoning about what went wrong, no ability to self-correct.
For the Qwen Cloud Hackathon, I built ChemSpectra Agent — an FTIR spectral analysis system where Qwen-3.7-Max autonomously selects tools, cross-validates evidence across multiple results, and triggers self-verification when confidence is low. The key insight: an agent that checks its own work catches errors that single-pass analysis misses.
The agent has access to 5 analysis tools, each hitting a different endpoint of the FTIR.fun spectral library (130,000+ reference spectra):

| Tool | Purpose |
|---|---|
| `identify_material` | Match spectrum against reference library, return ranked candidates |
| `explain_peaks` | Explain what chemical bond vibration each peak represents |
| `assign_functional_groups` | Map peaks to functional groups (C=O, O-H, N-H, etc.) |
| `match_library_topk` | Rapid top-K screening without deep analysis |
| `search_public_results` | Search publicly shared analysis cases (via MCP) |

Instead of hardcoding which tools to call, I define these as Qwen Function Calling schemas and let the model decide:

    from dashscope import Generation

    AGENT_TOOLS = [
        {
            "type": "function",
            "function": {
                "name": "identify_material",
                "description": "Match spectrum against 130,000+ reference spectra...",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "top_k": {"type": "integer", "default": 10},
                        "sample_type": {"type": "string"},
                    },
                },
            },
        },
    ]

    response = Generation.call(
        api_key=DASHSCOPE_API_KEY,
        model="qwen3.7-max",
        messages=messages,
        tools=AGENT_TOOLS,  # Qwen decides which to call
        result_format="message",
    )

The result: different questions trigger different tool combinations. "What is this material?" → `identify_material` + `explain_peaks`. "Deformulate this sample" → all three analytical tools. "Quick screening" → just `match_library_topk`. The LLM decides, not the developer.
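
Once the model has picked its tools, something still has to run them. Here is a minimal sketch of that dispatch step, assuming tool calls arrive in the OpenAI-style shape (a `function` object with a `name` and JSON-encoded `arguments`). The handler bodies and the `TOOL_REGISTRY` name are hypothetical stand-ins, not the project's actual code.

```python
import json
from typing import Any, Callable, Dict, List


# Hypothetical handlers: stand-ins for the real FTIR.fun endpoint calls.
def identify_material(top_k: int = 10, sample_type: str = "") -> Dict[str, Any]:
    return {"tool": "identify_material", "top_k": top_k, "sample_type": sample_type}


def match_library_topk(top_k: int = 5) -> Dict[str, Any]:
    return {"tool": "match_library_topk", "top_k": top_k}


# Assumed registry mapping tool names (as declared in the schemas) to handlers.
TOOL_REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {
    "identify_material": identify_material,
    "match_library_topk": match_library_topk,
}


def dispatch_tool_calls(tool_calls: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Run each requested tool and collect the results keyed by tool name."""
    results: Dict[str, Dict[str, Any]] = {}
    for call in tool_calls:
        fn = call["function"]
        name = fn["name"]
        # Assumption: arguments are a JSON string, possibly empty.
        args = json.loads(fn.get("arguments") or "{}")
        handler = TOOL_REGISTRY.get(name)
        if handler is None:
            # Report unknown tools instead of raising, so the model can react.
            results[name] = {"error": f"unknown tool: {name}"}
            continue
        results[name] = handler(**args)
    return results
```

Returning an error entry for an unknown tool name, rather than raising, keeps the loop alive and gives the model a chance to choose a different tool on the next turn.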
The agent runs a Think → Act → Observe loop, up to 6 iterations:
- `tool_calls`: which tools to invoke and with what parameters

In practice, most analyses complete in 2-3 iterations. Qwen's `enable_thinking=True` mode shows the full chain-of-thought reasoning, so you can see why it chose each tool.
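
The loop itself is short. Below is a rough sketch of a capped Think → Act → Observe loop, with the model and the tool runner passed in as plain callables so it runs without any API key. The function name `run_react_loop`, the reply shape (`content` plus optional `tool_calls`), and the message format are assumptions for illustration, not the project's real interfaces.

```python
from typing import Any, Callable, Dict, List

MAX_ITERATIONS = 6  # iteration cap mentioned above


def run_react_loop(
    call_model: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
    run_tool: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    question: str,
) -> Dict[str, Any]:
    """Alternate model turns and tool runs until the model stops asking for tools."""
    messages: List[Dict[str, Any]] = [{"role": "user", "content": question}]
    tool_results: Dict[str, Any] = {}
    for iteration in range(1, MAX_ITERATIONS + 1):
        # Think: the model either requests tools or gives a final answer.
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply.get("content", "")})
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return {
                "answer": reply.get("content", ""),
                "iterations": iteration,
                "tool_results": tool_results,
            }
        for call in tool_calls:
            # Act: run the requested tool with the model's parameters.
            result = run_tool(call["name"], call.get("arguments", {}))
            tool_results[call["name"]] = result
            # Observe: feed the result back for the next turn.
            messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    # Cap reached without a final answer.
    return {"answer": None, "iterations": MAX_ITERATIONS, "tool_results": tool_results}
```

Passing the model in as a callable also makes the loop easy to test with a scripted fake model.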
After the ReAct loop, the agent doesn't just return results. It runs two automated checks:
Confidence estimation — calculated from match scores, candidate score gaps, and functional group coverage:

    def _estimate_confidence(self, session):
        scores = []
        id_result = session.tool_results.get("identify_material", {})
        if id_result.get("matches"):
            top_sim = id_result["matches"][0].get("similarity", 0)
            scores.append(top_sim)
            if len(id_result["matches"]) >= 2:
                gap = top_sim - id_result["matches"][1].get("similarity", 0)
                scores.append(min(1.0, gap * 5))  # larger gap = more confident

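
The excerpt above stops before the functional group coverage term and the final aggregation. To show how the three signals could combine, here is a standalone sketch. Two things in it are assumptions rather than the project's actual logic: the `groups` key on the `assign_functional_groups` result, and the plain mean used to aggregate the scores.

```python
from typing import Any, Dict, List


def estimate_confidence(tool_results: Dict[str, Any], expected_groups: List[str]) -> float:
    """Combine top match score, score gap, and group coverage into one value in [0, 1]."""
    scores: List[float] = []

    matches = tool_results.get("identify_material", {}).get("matches", [])
    if matches:
        top_sim = matches[0].get("similarity", 0.0)
        scores.append(top_sim)
        if len(matches) >= 2:
            gap = top_sim - matches[1].get("similarity", 0.0)
            # Same scaling as the excerpt above, clamped at 0 as a safeguard.
            scores.append(min(1.0, max(0.0, gap * 5)))

    # Assumption: found groups live under a "groups" key as plain strings.
    found = [g.lower() for g in tool_results.get("assign_functional_groups", {}).get("groups", [])]
    if expected_groups:
        covered = sum(1 for g in expected_groups if g in found)
        scores.append(covered / len(expected_groups))

    # Assumption: unweighted mean; the real weighting is not shown in the post.
    return round(sum(scores) / len(scores), 3) if scores else 0.0
```

A weighted mean or a minimum over the signals would be equally plausible; the point is only that each signal is normalized to the same 0-1 range before combining.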
Evidence conflict detection: compares outputs across tools. If `identify_material` says "PET" but `assign_functional_groups` found no ester groups, that's a contradiction:

    expected_groups = {
        "pet": ["ester", "c=o", "aromatic"],
        "nylon": ["amide", "n-h", "c=o"],
        "polyethylene": ["c-h", "ch2", "methylene"],
        "silicone": ["si-o", "si-c", "siloxane"],
    }

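
A lookup table like this turns conflict detection into a set comparison. The sketch below is one possible version. The function name `find_conflicts` and the rule that any missing expected group counts as a conflict are assumptions; the real check may be looser, for example treating `ch2` and `methylene` as synonyms.

```python
from typing import Dict, List

EXPECTED_GROUPS: Dict[str, List[str]] = {
    "pet": ["ester", "c=o", "aromatic"],
    "nylon": ["amide", "n-h", "c=o"],
    "polyethylene": ["c-h", "ch2", "methylene"],
    "silicone": ["si-o", "si-c", "siloxane"],
}


def find_conflicts(material: str, found_groups: List[str]) -> List[Dict[str, object]]:
    """Return a conflict record when expected groups for the material are absent."""
    key = material.strip().lower()
    expected = EXPECTED_GROUPS.get(key)
    if expected is None:
        return []  # no rule for this material, so nothing to contradict
    found = {g.strip().lower() for g in found_groups}
    # Assumption: every expected group must appear; any gap is reported.
    missing = [g for g in expected if g not in found]
    if not missing:
        return []
    return [{"type": "functional_group_mismatch", "material": key, "missing": missing}]
```

Calling `find_conflicts("PET", ["C=O"])` reports `ester` and `aromatic` as missing, while a nylon result that contains amide, N-H, and C=O groups yields no conflicts.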
When confidence < 0.75 or conflicts are detected, the agent automatically triggers a verification round. Qwen is told exactly what went wrong:

    ISSUES DETECTED:
    - functional_group_mismatch: material="pet", missing=["ester", "aromatic"]
    - low_confidence: 0.62 (threshold: 0.75)

Qwen then autonomously calls additional tools to investigate. After verification, confidence is recalculated. In testing, I've seen confidence traces like `[0.62, 0.84]`: a relative improvement of about 35% from one verification round.
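
The trigger condition, the issue report, and the improvement figure can all be written down in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the conflict records are assumed to be dicts with `type`, `material`, and `missing` keys.

```python
import json
from typing import Dict, List

CONFIDENCE_THRESHOLD = 0.75  # threshold stated above


def needs_verification(confidence: float, conflicts: List[Dict]) -> bool:
    """Verify when confidence is below the threshold or any conflict exists."""
    return confidence < CONFIDENCE_THRESHOLD or bool(conflicts)


def format_issues(confidence: float, conflicts: List[Dict]) -> str:
    """Render the issue list that gets sent to the model for the verification round."""
    lines = ["ISSUES DETECTED:"]
    for c in conflicts:
        missing = json.dumps(c["missing"])  # renders as ["ester", "aromatic"]
        lines.append(f'- {c["type"]}: material="{c["material"]}", missing={missing}')
    if confidence < CONFIDENCE_THRESHOLD:
        lines.append(f"- low_confidence: {confidence:.2f} (threshold: {CONFIDENCE_THRESHOLD})")
    return "\n".join(lines)


def relative_gain(trace: List[float]) -> float:
    """Relative change from the first to the last confidence value in a trace."""
    return (trace[-1] - trace[0]) / trace[0]
```

For the trace `[0.62, 0.84]`, `relative_gain` gives (0.84 - 0.62) / 0.62 ≈ 0.355, which is where the roughly 35% figure comes from. In absolute terms the gain is 0.22.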
When Qwen's structured JSON output fails to parse (it happens — LLMs sometimes wrap JSON in markdown code blocks), the error and original output are sent back to Qwen with context:

    repair_messages = messages + [
        {"role": "assistant", "content": raw},
        {"role": "user", "content": f"Parse error: {raw[:200]!r}\nReturn ONLY valid JSON."},
    ]
    raw_retry = self._call_qwen(repair_messages)

Near-100% recovery rate. No silent failures.
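
Many of these failures can be fixed locally before spending another model call. The sketch below first strips a wrapping markdown code fence, then falls back to a repair round like the one above. `parse_with_repair` and its `call_model` parameter are hypothetical names, and including the parser's error message in the repair prompt is my own variation, not necessarily what the project does.

```python
import json
import re
from typing import Any, Callable, Dict, List

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
FENCE_RE = re.compile(
    r"^\s*" + re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE) + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(raw: str) -> str:
    """Return the payload inside a wrapping code fence, or the trimmed input."""
    m = FENCE_RE.match(raw)
    return m.group(1) if m else raw.strip()


def parse_with_repair(
    raw: str,
    messages: List[Dict[str, str]],
    call_model: Callable[[List[Dict[str, str]]], str],
    max_retries: int = 1,
) -> Any:
    """Parse JSON, asking the model to repair its output up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return json.loads(strip_code_fence(raw))
        except json.JSONDecodeError as exc:
            if attempt == max_retries:
                raise  # surface the failure rather than hiding it
            messages = messages + [
                {"role": "assistant", "content": raw},
                {
                    "role": "user",
                    "content": f"Parse error: {exc}\nOutput was: {raw[:200]!r}\nReturn ONLY valid JSON.",
                },
            ]
            raw = call_model(messages)
```

If the final attempt still fails, the `JSONDecodeError` propagates to the caller, which keeps the failure visible.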
In regulated industries — pharmaceutical QC under FDA 21 CFR Part 11, forensic substance identification, environmental contaminant detection — an AI that returns wrong results without flagging uncertainty is dangerous. ChemSpectra Agent's self-verification turns "AI that gives answers" into "AI that checks its work." The confidence trace provides an audit trail that fits existing compliance frameworks.
All LLM reasoning (tool selection, synthesis, verification, self-repair, follow-up chat, report generation) runs through Alibaba Cloud's `dashscope` SDK with `qwen3.7-max`. Six distinct call sites, one provider.