Sean Du brings a reasoning-model hallucination detector to ICML 2026

wpnews.pro

Sean Xuefeng Du (@xuefeng_du) will present new hallucination-detection work at ICML 2026 in Seoul on July 9, bringing an academic answer to a commercial problem every AI agent company is now trying to price: how to know when a reasoning model is wrong before a user acts on it.

The presentation is for "Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping", a paper by Jianxiong Zhang, Bing Guo, Yuming Jiang, Haobo Wang, Bo An and Xuefeng Du. Du said he will present the work in person on Thursday, July 9, from 2:30 to 4:15 PM KST, in Hall A, Poster #2904 at COEX in Seoul. ICML 2026 runs July 6 to July 11, with the main conference scheduled from July 7 through July 9. Code is available on GitHub.

This is not a startup launch, a financing, or a commercial product announcement. It is a research presentation from Du, an assistant professor at Nanyang Technological University's College of Computing and Data Science, where he leads RADIO Lab, short for Responsible, Aligned, Deployable Intelligence for human gOod.

That distinction matters because the work sits upstream of the product market. The companies building AI search, coding agents, enterprise assistants and research copilots are all leaning harder on models that generate long chains of reasoning. ARS, the method the paper proposes, asks whether those chains carry a signal about failure that can be extracted without retraining the model or repeatedly sampling answers at inference time.

Du's own research program has been pointed at that question for years. His homepage says his work focuses on reliable machine learning, foundation model reliability and the blind spots of LLMs and multimodal LLMs, including hallucination detection and mitigation. His CV lists a Ph.D. from the University of Wisconsin-Madison, advised by Sharon Li, and an undergraduate degree in electronic engineering from Xi'an Jiaotong University. It also lists a 2024 Microsoft Research internship on malicious prompt detection for vision-language models, a 2022 Google Research stint on open-vocabulary object detection and a prior line of work under the phrase "Teach AI What It Doesn't Know."

The bet: the reasoning trace is not enough

The ARS paper starts from a problem that has become more visible as reasoning models have moved into production-like workflows: a model can produce a fluent, coherent-looking reasoning trace and still land on a false final answer. For a user, the trace can make the wrong answer feel more trustworthy. For an operator, it creates a detection problem because the text of the chain of thought can vary in style and length, while the thing that matters is whether the final answer is stable and correct.

ARS, short for Answer-agreement Representation Shaping, tries to use the model's own internal state to test that stability. The method perturbs the hidden representation at the boundary between the reasoning trace and the final answer, generates counterfactual answers, checks whether those answers agree with the original answer, and trains a lightweight mapping that pulls agreement states together while pushing disagreement states apart. The authors say the resulting shaped embeddings can be plugged into embedding-based hallucination detectors and do not require human hallucination labels during training.
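The shaping step described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the Gaussian perturbation, the hypothetical `toy_decode` stand-in for answer generation, and the InfoNCE-style loss are not taken from the paper, whose exact objective may differ.

```python
import math
import random


def perturb(hidden, scale, rng):
    """Add Gaussian noise to a hidden-state vector (assumed perturbation)."""
    return [h + rng.gauss(0.0, scale) for h in hidden]


def split_by_agreement(hidden, decode, n, scale, rng):
    """Perturb `hidden` n times and group the results by whether the
    counterfactual answer matches the original answer."""
    original = decode(hidden)
    agree, disagree = [], []
    for _ in range(n):
        candidate = perturb(hidden, scale, rng)
        (agree if decode(candidate) == original else disagree).append(candidate)
    return agree, disagree


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def agreement_loss(anchor, agree, disagree, temperature=0.1):
    """InfoNCE-style loss (an assumption): small when agreeing states sit
    near the anchor and disagreeing states sit far from it."""
    if not agree:
        return 0.0  # nothing to pull together in this toy setting
    pos = [math.exp(cosine(anchor, z) / temperature) for z in agree]
    neg = [math.exp(cosine(anchor, z) / temperature) for z in disagree]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)


def toy_decode(hidden):
    """Hypothetical stand-in for the model's answer generation."""
    return "yes" if hidden[0] > 0 else "no"


rng = random.Random(0)
boundary_state = [0.05, 1.0]  # made-up hidden state at the trace boundary
agree, disagree = split_by_agreement(boundary_state, toy_decode, 16, 0.5, rng)
loss = agreement_loss(boundary_state, agree, disagree)
```

In a real system the loss would be applied to the output of a small trainable mapping and minimized over many prompts; the sketch only evaluates it on raw states to show which direction the objective pushes.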

That is the practical core of the paper. ARS is not asking a second model to grade every final answer at inference time, and it is not sampling many outputs every time a user submits a query. According to the paper, perturbations are used during training to shape the representation; at test time, the shaped embedding can be fed into detectors. If the claim holds up outside benchmark conditions, the method points toward a detector layer that could sit near reasoning agents without adding the full cost of repeated generation.

The numbers are promising, but still benchmark numbers

The headline result in the paper is on TruthfulQA using Qwen3-8B. The authors report that ARS with a CCS detector reached 86.64% AUROC, which they describe as a 19.79 percentage-point improvement over vanilla reasoning-model embeddings. The paper also evaluates on TriviaQA, GSM8K and MATH-500, and includes Qwen3 and DeepSeek-R1-Distill model families.
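For readers less familiar with the metric, AUROC is the probability that a detector scores a randomly chosen hallucinated answer higher than a randomly chosen correct one. The sketch below computes it by direct pairwise comparison on made-up scores, and also works out the baseline implied by the two reported figures, assuming the improvement is measured on the same TruthfulQA and Qwen3-8B setup.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (hallucinated, correct) pairs in which
    the hallucinated answer gets the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic detector scores; label 1 means the answer was a hallucination.
toy_scores = [0.9, 0.8, 0.35, 0.4, 0.1]
toy_labels = [1, 1, 1, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)  # 5 of 6 pairs ranked correctly

# Baseline implied by the article's figures (an inference, not a quoted number).
reported_ars = 86.64
reported_gain = 19.79
implied_baseline = round(reported_ars - reported_gain, 2)  # 66.85
```

On that reading, the vanilla embeddings would sit at roughly 66.85% AUROC, where 50% corresponds to random ranking.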

Those figures should be read as research results, not product guarantees. AUROC is useful for comparing detectors on a benchmark, but it does not answer the deployment questions that matter inside a company: how a threshold behaves on a narrow customer domain, what happens when the model distribution shifts, whether the detector fails silently on adversarial prompts, and how much engineering work is needed to make hidden-state access available in a hosted-model stack.
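The threshold point can be made concrete with a small synthetic example. AUROC depends only on how scores are ranked, so a detector can keep the same AUROC on a new domain while a threshold tuned on the benchmark starts flagging correct answers. The scores, the shift of 0.25 and the threshold of 0.5 below are all invented for illustration.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (true-positive rate, false-positive rate) when every score
    above `threshold` is flagged as a hallucination (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg


def ranking(scores):
    """Indices sorted by score; AUROC depends only on this ordering."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


labels = [1, 1, 1, 0, 0, 0]
benchmark = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # synthetic benchmark scores
shifted = [s + 0.25 for s in benchmark]     # same ranking, shifted domain

threshold = 0.5  # hypothetical value tuned on the benchmark
bench_rates = rates_at_threshold(benchmark, labels, threshold)
shift_rates = rates_at_threshold(shifted, labels, threshold)
same_ranking = ranking(benchmark) == ranking(shifted)
```

Here the ranking, and therefore the AUROC, is identical in both cases, yet the false-positive rate at the fixed threshold rises from zero to two in three.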

The paper's comparison table positions ARS against methods including Semantic Entropy, HaloScope, EigenScore, CCS, TSV and G-Detector, as well as supervised probing. The authors' differentiator is where they intervene. Many hallucination-detection approaches operate on outputs, repeated samples, text similarity, verbalized uncertainty, or supervised probes. ARS intervenes at the latent trace boundary and uses answer agreement as the organizing signal.

That boundary is the interesting choice. The paper argues that mid-trace perturbations can scramble the reasoning path before the model has committed to an answer, while perturbing after the answer has started can produce superficial edits. The trace boundary, just before answer generation, is treated as the place where the model has absorbed the reasoning but still has room to change the final answer.
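The three candidate perturbation sites can be pictured on a tokenized output. The sketch below assumes the reasoning trace is closed by a `</think>` marker, as in the chat formats used by Qwen3 and DeepSeek-R1 style models; the paper may define the boundary differently, and the helper names and offsets are hypothetical.

```python
def trace_boundary_index(tokens, end_marker="</think>"):
    """Return the position of the marker that closes the reasoning trace.
    The hidden state here has seen the full trace and none of the answer."""
    for i, tok in enumerate(tokens):
        if tok == end_marker:
            return i
    raise ValueError("no reasoning-trace end marker found")


def perturbation_sites(tokens, end_marker="</think>"):
    """Pick one illustrative position for each of the three options:
    mid-trace, trace boundary, and after the answer has started."""
    boundary = trace_boundary_index(tokens, end_marker)
    return {
        "mid_trace": boundary // 2,
        "boundary": boundary,
        "post_answer": min(boundary + 2, len(tokens) - 1),
    }


# Toy token sequence, not real tokenizer output.
tokens = ["<think>", "2", "+", "2", "=", "4", "</think>",
          "The", "answer", "is", "4"]
sites = perturbation_sites(tokens)
```

In this toy sequence the boundary falls at position 6, the mid-trace site lands inside the arithmetic, and the post-answer site lands after the answer text has already begun.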

RADIO Lab's first ICML signal

The timing is also worth anchoring. The paper itself is not new: arXiv shows the first version was submitted on January 24, 2026, and the current v3 was posted on May 5, 2026. OpenReview lists it as an ICML 2026 regular paper. The current news hook is Du's upcoming in-person poster presentation in Seoul, not the initial release of the research.

RADIO Lab is young and small by public academic-lab standards. Its page lists two Ph.D. students, four master's students, one visiting student, and four alumni. Du's CV lists research funding that is academic rather than venture-backed: a S$150,000 NTU start-up grant for reliable foundation models, a S$100,000 MOE AcRF Tier 1 Seed grant for calibrated confidence estimation, S$6,450 in Google Cloud Research Credits and a S$7,400 NSCC Young Investigator Seed Project.

The author group reaches beyond NTU. Zhang's OpenReview profile identifies him as a Ph.D. student at Sichuan University and an NTU intern. Haobo Wang is an assistant professor and Ph.D. supervisor at Zhejiang University's School of Software Technology, with research interests that include open-world learning, large language models and tabular data analysis. Bo An is a President's Chair Professor at NTU, head of the Division of Artificial Intelligence and director of NTU's Centre of AI-for-X.

Why operators will watch the next step

For AI infrastructure and application founders, the commercial read-through is straightforward. The market has been moving from short-answer chatbots toward agents that plan, call tools, write code, browse, reason and then act. In that setting, a wrong answer is no longer just an embarrassing response; it can become a bad database change, a wrong legal summary, a broken deployment or a faulty research conclusion. ARS is one answer to a narrow but important reliability question: can the system detect answer instability from inside the model rather than relying only on the final text? The work does not prove that reasoning traces are trustworthy. It argues almost the opposite: traces can be persuasive while hiding fragility, and detectors need to focus on whether the answer survives small changes in the model's internal state.

That is why this poster is more than an ICML schedule item. The startup ecosystem is full of products that wrap frontier models in workflow, memory, retrieval and tool use. Reliability claims in that market often arrive as product language before they arrive as measurable methods. Du's ARS work is a reminder that the hard part is not making the model explain itself. It is building systems that can tell when the explanation is carrying a wrong answer.

source & further reading

Original article: runtimewire.com
