# Study Advances GI Cancer Risk Prediction with ML

> Source: <https://letsdatascience.com/news/study-advances-gi-cancer-risk-prediction-with-ml-2082625b>
> Published: 2026-06-04 17:54:46.807347+00:00

A peer-reviewed study published in **JMIR Medical Informatics** evaluates machine learning methods for predicting **gastrointestinal (GI) cancer** risk in a prospective cohort from South Korea, where GI cancers are a major health burden. Analyzing 7,652 participants with 156 incident GI cancer cases (about 2%) over 14 years of follow-up, the authors tackle the severe class imbalance that makes rare-disease prediction difficult. They introduce a patient-centered undersampling technique (**PCUSTe**) modeled on frequency-matched case-control design and benchmark it against SMOTE, ADASYN, and hybrid resampling. An incrementally trained stochastic gradient descent model on PCUSTe data reached a sensitivity of 0.77, a specificity of 0.65, and an AUC of 0.77, while logistic regression without resampling produced more balanced results (sensitivity 0.70, specificity 0.71, AUC 0.75). The authors frame the models as tools for earlier risk stratification and targeted screening.

### What the study did

In a paper published in JMIR Medical Informatics (2026), researchers Daina Baublyte, Jeonghee Lee, Madhawa Gunathilake, and Jeongseon Kim evaluated machine learning approaches for predicting gastrointestinal (GI) cancer risk in a South Korean prospective cohort. GI cancers are a significant health concern in South Korea, and the team focused on noninvasive and minimally invasive predictors tied to modifiable behavioral and metabolic risk factors.

### The data challenge

The cohort included 7,652 individuals, of whom only 156 (about 2%) developed a GI cancer over a 14-year follow-up. According to the study, this rarity creates severe class imbalance that pushes standard models toward the majority 'healthy' class at the expense of clinical sensitivity, the metric that matters most for catching true cases early.
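The arithmetic behind that imbalance can be sketched in a few lines. The cohort counts come from the article; the "label everyone healthy" baseline is a hypothetical illustration, not a model from the study.

```python
# Cohort figures quoted in the article.
n_total = 7652
n_cases = 156
n_controls = n_total - n_cases

# Cumulative incidence over the follow-up period.
incidence = n_cases / n_total

# Hypothetical baseline: a classifier that labels every participant "healthy".
true_positives = 0
true_negatives = n_controls

accuracy = (true_positives + true_negatives) / n_total
sensitivity = true_positives / n_cases      # share of real cases that are flagged
specificity = true_negatives / n_controls   # share of non-cases correctly cleared

print(f"incidence   = {incidence:.4f}")
print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.2f}")
print(f"specificity = {specificity:.2f}")
```

The baseline scores very high on accuracy while catching no cases at all, which is why the study emphasizes sensitivity rather than accuracy.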

### Method

To address imbalance while preserving population structure, the authors developed a patient-centered undersampling technique (PCUSTe) based on the logic of frequency-matched case-control studies. They compared it against widely used resampling methods, including SMOTE (Synthetic Minority Oversampling Technique), ADASYN (Adaptive Synthetic Sampling), and SMOTE with edited nearest neighbors, across six classifiers in both batch and incremental forms. They also applied probability correction to account for the shift in class prevalence introduced by resampling. Models were evaluated on a held-out test set using thresholds tied to the observed cumulative incidence.
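The article does not describe PCUSTe's internals, so the following is only a generic sketch of frequency-matched undersampling and a standard prior-shift probability correction, not the authors' algorithm. The function names, the `age_band` and `sex` strata, the 2:1 control ratio, and the toy records are all assumptions made for illustration. The correction formula assumes controls are dropped uniformly at random, which is a simplification when sampling is stratified.

```python
import random
from collections import defaultdict


def frequency_matched_undersample(records, stratum_key, controls_per_case=1, seed=0):
    """Keep every case; per stratum, sample up to controls_per_case controls per case."""
    rng = random.Random(seed)
    cases, controls = defaultdict(list), defaultdict(list)
    for rec in records:
        bucket = cases if rec["case"] == 1 else controls
        bucket[stratum_key(rec)].append(rec)

    sample = []
    for stratum, stratum_cases in cases.items():
        sample.extend(stratum_cases)
        pool = controls.get(stratum, [])
        k = min(len(pool), controls_per_case * len(stratum_cases))
        sample.extend(rng.sample(pool, k))  # sampling without replacement
    return sample


def correct_probability(p_sampled, beta):
    """Map a probability learned on undersampled data back to the original prevalence.

    beta is the fraction of controls retained by undersampling.
    """
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1.0)


# Toy data (hypothetical): 1,000 people, 20 cases, strata defined by age band and sex.
records = [
    {
        "id": i,
        "age_band": "50+" if i % 2 == 0 else "<50",
        "sex": "F" if i % 4 < 2 else "M",
        "case": 1 if i % 50 == 0 else 0,
    }
    for i in range(1000)
]

sample = frequency_matched_undersample(
    records, stratum_key=lambda r: (r["age_band"], r["sex"]), controls_per_case=2
)

n_controls_total = sum(1 for r in records if r["case"] == 0)
n_controls_kept = sum(1 for r in sample if r["case"] == 0)
beta = n_controls_kept / n_controls_total

# A prediction equal to the sample prevalence (1/3 here) maps back to the
# original prevalence (20 / 1000) after correction.
print(len(sample), round(correct_probability(1 / 3, beta), 4))
```

Matching within strata keeps the covariate mix of the sampled controls aligned with that of the cases, which is the case-control logic the article alludes to. The correction step matters when predicted probabilities are later compared with an incidence-based threshold.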

### Results

The study reports that an incrementally trained stochastic gradient descent model on PCUSTe data delivered the strongest overall performance, with a sensitivity of 0.77 (95% CI 0.64-0.89), specificity of 0.65, and AUC of 0.77 (95% CI 0.70-0.84). Logistic regression, by contrast, achieved balanced performance without any resampling (sensitivity 0.70, specificity 0.71, AUC 0.75). The authors note that PCUSTe mainly improved sensitivity in more complex models, and that in some cases adjusting the decision threshold alone matched or beat resampling.
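To illustrate how the decision threshold alone can shift the balance between sensitivity and specificity, here is a minimal sketch on made-up scores. The data, the helper name `sensitivity_specificity`, and the two thresholds are assumptions for illustration; none of the numbers come from the study.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are flagged as cases."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if y == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): 4 cases and 16 non-cases with predicted risks.
y_true = [1] * 4 + [0] * 16
scores = [
    0.70, 0.35, 0.25, 0.10,            # cases
    0.60, 0.30, 0.22, 0.15, 0.12, 0.10, 0.08, 0.08,
    0.05, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02, 0.01,  # non-cases
]

# Threshold tied to the observed incidence in the toy data (4 / 20).
incidence = sum(y_true) / len(y_true)

for threshold in (0.5, incidence):
    sens, spec = sensitivity_specificity(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.4f}")
```

With a rare outcome, few predicted risks exceed the conventional 0.5 cutoff, so lowering the threshold toward the incidence trades some specificity for a large gain in sensitivity.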

### Why this matters

The authors conclude that combining epidemiological principles, such as covariate frequency matching and incidence-based thresholds, with machine learning can improve minority-class detection and support personalized risk stratification and targeted screening for rare cancers.

### Editorial analysis

Class imbalance is a recurring obstacle whenever machine learning is applied to rare clinical outcomes, and this work illustrates a broader pattern in which domain knowledge, not only algorithmic complexity, drives gains. Because this is an early-stage modeling study on a single cohort, external validation would be a typical next step before any clinical use.

## Scoring Rationale

Applied ML research addressing a significant regional disease burden; useful for clinicians and researchers but not a foundational model or industry-shaking result.

