Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

wpnews.pro

[ARTICLE · art-27537] src=arxiv.org ↗ pub=2026-06-15T04:00Z topic=large-language-models verified=true sentiment=↑ positive

A new study finds that task-specific post-training, namely supervised fine-tuning (SFT) and reinforcement learning (RL), substantially improves large language models at ICD coding. This challenges the common finding, drawn mostly from prompting-only evaluations, that LLMs are weak medical coders. The researchers also introduce PHI, a diagnostic curriculum that extends GRPO to target missed-code cases, and release their code, data splits, and checkpoints.

1 min read · published Jun 15, 2026

arXiv:2606.13940v1

Abstract: Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at https://github.com/AlexandreWANG915/LLM4ICD.
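The abstract distinguishes code-set prediction from "macro-level performance" without spelling out the metrics. As a rough illustration only, the sketch below computes micro- and macro-averaged F1 over predicted code sets, the usual pair in multi-label ICD coding. It is not taken from the paper or the LLM4ICD repository: the function names, the toy notes, and the choice to average macro F1 over codes that appear in either the gold or the predicted sets are all assumptions made for this example.

```python
from collections import defaultdict


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: list[set[str]], pred: list[set[str]]) -> tuple[float, float]:
    """Return (micro F1, macro F1) for per-note gold and predicted code sets.

    Assumption: macro F1 averages over codes seen in gold or pred, not over
    the full ICD taxonomy. Real evaluation protocols may differ.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # code -> [tp, fp, fn]
    for g, p in zip(gold, pred):
        for code in g & p:
            counts[code][0] += 1  # correctly predicted
        for code in p - g:
            counts[code][1] += 1  # predicted but not in gold
        for code in g - p:
            counts[code][2] += 1  # missed code
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)  # pools all decisions, so frequent codes dominate
    macro = sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0
    return micro, macro


# Toy data (illustrative only): one rare code is missed in the third note.
gold = [{"I10", "E11.9"}, {"I10"}, {"I10", "J45.909"}]
pred = [{"I10", "E11.9"}, {"I10"}, {"I10"}]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
# prints: micro F1 = 0.889, macro F1 = 0.667
```

In this toy case a single missed rare code barely moves micro F1 but removes a third of macro F1, which is one plausible reading of why a curriculum aimed at missed-code cases would show up mainly in macro-level metrics.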

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.13940
