A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

Source: arxiv.org · Published: 2026-05-26T04:00Z · Topic: machine-learning

A multi-probe audit of clinical-interview depression detection benchmarks reports that a lightweight hybrid text-plus-LLM-score model reaches a macro-F1 of 0.723 on E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation, which the authors describe as the highest reported under that protocol to their knowledge. Across a sweep of 96 model configurations, rankings on E-DAIC's official test split align only moderately with cross-validation rankings: the best cross-validation configuration ranks twentieth on the official test, and the official-test winner ranks forty-first by cross-validation. Strong public baselines on CMDC and ANDROIDS achieve near-ceiling in-domain performance, but zero-shot transfer to external corpora is substantially weaker. In stress tests on E-DAIC, text-model scores rise sharply on symptom-dense interview slices while audio-model scores remain nearly flat, which suggests the two modalities respond differently to symptom content.
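To make the subject-disjoint protocol concrete, here is a minimal sketch of leave-one-subject-out evaluation with macro-F1 computed over pooled out-of-fold predictions. It is not the paper's pipeline: the one-dimensional score, the midpoint-threshold learner, the toy data, and every function name are assumptions made for illustration, and the sketch assumes one interview per subject.

```python
from statistics import mean

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    per_class = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return mean(per_class)

def fit_threshold(scores, labels):
    """Toy learner (assumption): midpoint between the two class means."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (mean(pos) + mean(neg)) / 2

def loso_macro_f1(subjects):
    """subjects: list of (subject_id, score, label), one row per subject."""
    y_true, y_pred = [], []
    for held_out_id, score, label in subjects:
        # Fit only on the other subjects, so the held-out subject is unseen.
        train = [(s, y) for sid, s, y in subjects if sid != held_out_id]
        thr = fit_threshold([s for s, _ in train], [y for _, y in train])
        y_true.append(label)
        y_pred.append(int(score > thr))
    # Each fold holds a single subject, so F1 is computed once on the
    # pooled out-of-fold predictions rather than averaged per fold.
    return macro_f1(y_true, y_pred)

# Hypothetical toy data: (subject id, model score, binary label).
toy = [("s1", 0.9, 1), ("s2", 0.8, 1), ("s3", 0.4, 1),
       ("s4", 0.6, 0), ("s5", 0.2, 0), ("s6", 0.1, 0)]
print(round(loso_macro_f1(toy), 3))
```

Pooling matters here because a fold containing one subject cannot yield a meaningful per-fold F1; the score only makes sense once every subject has received an out-of-fold prediction.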

arXiv:2605.23977v1

Abstract: This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH.

First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 (the highest reported under this protocol, to our knowledge), providing a conservative out-of-fold reference point that does not depend on the privileged official holdout.

Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps.

Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker.

Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.
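The ranking diagnostics named in the second probe (rank of one protocol's winner under the other, top-k overlap, and how often the apparent winner stays rank-1 across subject bootstraps) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the configuration names, score tables, and predictions are hypothetical, accuracy stands in for macro-F1 to keep the block short, and ties are broken alphabetically for determinism.

```python
import random

def ranks(scores):
    """Map config name -> rank (1 = best); ties broken by name."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: i + 1 for i, name in enumerate(ordered)}

def top_k_overlap(scores_a, scores_b, k=3):
    """Number of configs that are in the top k under both score tables."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    top_a = {name for name, r in ra.items() if r <= k}
    top_b = {name for name, r in rb.items() if r <= k}
    return len(top_a & top_b)

def accuracy(y_true, y_pred):
    """Stand-in metric (assumption); any metric(y_true, y_pred) works."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rank1_frequency(y_true, preds, n_boot=1000, seed=0, metric=accuracy):
    """Fraction of subject-level bootstrap resamples in which each config wins."""
    rng = random.Random(seed)
    n = len(y_true)
    wins = {name: 0 for name in preds}
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping predictions aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        scores = {name: metric(yt, [p[i] for i in idx])
                  for name, p in preds.items()}
        winner = min(scores, key=lambda name: (-scores[name], name))
        wins[winner] += 1
    return {name: w / n_boot for name, w in wins.items()}

# Hypothetical score tables for six configs under the two protocols.
cv_scores = {"A": 0.72, "B": 0.70, "C": 0.69, "D": 0.65, "E": 0.60, "F": 0.58}
test_scores = {"A": 0.61, "B": 0.60, "C": 0.62, "D": 0.70, "E": 0.71, "F": 0.69}
print(ranks(test_scores)["A"])                  # test rank of the CV winner
print(top_k_overlap(cv_scores, test_scores))    # shared top-3 configs

# Hypothetical per-subject labels and test predictions for three configs.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
preds = {"A": [1, 0, 1, 0, 1, 0, 0, 1],
         "B": [1, 0, 1, 0, 0, 1, 1, 0],
         "C": [0, 1, 1, 0, 1, 0, 1, 1]}
freq = rank1_frequency(y_true, preds)
print(freq)
```

A rank-1 frequency well below 1.0 for the leaderboard winner signals that the ordering at the top depends heavily on which subjects happen to be in the test split.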

source & further reading

Read original on arxiv.org → arxiv.org/abs/2605.23977
