A field guide to machine learning in healthcare: tasks, data, evaluation, and deployment

1) Core task families in healthcare ML

Healthcare ML spans multiple task types, often with different failure modes and evaluation pitfalls.

1.1 Medical imaging (radiology, pathology, ophthalmology, ultrasound)

Classification
Examples:
- Diabetic retinopathy grading from fundus photos.
- Pneumonia detection from chest radiographs.

Detection / localization
Examples:
- Pulmonary nodules in CT.
- Polyps in colonoscopy video.

Segmentation
Examples:
- Tumor segmentation in MRI.
- Organ segmentation for radiotherapy planning.

A classic reference for imaging deep learning is Litjens et al. (2017) [1].

1.2 EHR/tabular + time series (ICU monitors, vitals, labs, medications)

Risk prediction
Examples:
- 30-day readmission.
- In-hospital mortality.
- Deterioration or need for ventilation.
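
Risk-prediction tasks like these begin with careful label construction. A minimal sketch of deriving a 30-day readmission label from admission records (the record layout and field order are illustrative assumptions, not a standard schema):

```python
from datetime import datetime, timedelta

# Hypothetical stays: (patient_id, admit, discharge), one tuple per stay.
stays = [
    (1, datetime(2023, 1, 1), datetime(2023, 1, 5)),
    (1, datetime(2023, 1, 20), datetime(2023, 1, 25)),
    (2, datetime(2023, 2, 1), datetime(2023, 2, 3)),
    (2, datetime(2023, 6, 1), datetime(2023, 6, 5)),
]

def readmit_30d_labels(stays):
    """Label each stay True if the same patient is admitted again
    within 30 days of that stay's discharge."""
    stays = sorted(stays, key=lambda s: (s[0], s[1]))
    labels = []
    for i, (pid, _, discharge) in enumerate(stays):
        # First later stay belonging to the same patient, if any.
        nxt = next((s for s in stays[i + 1:] if s[0] == pid), None)
        labels.append(nxt is not None
                      and nxt[1] - discharge <= timedelta(days=30))
    return labels

labels = readmit_30d_labels(stays)
```

Even this toy version surfaces real design decisions: transfers vs. true readmissions, deaths censoring the label, and index stays near the end of the data window.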

Early warning systems
Example:
- Sepsis alerts (contentious, and often affected by label definition and dataset shift).

Harutyunyan et al. (2019) released benchmark tasks for ICU prediction from MIMIC-III data [2].

1.3 Clinical NLP (notes, discharge summaries, radiology reports)

Phenotyping and cohort discovery (e.g., identifying diabetes complications from notes)
Information extraction (medications, diagnoses, symptoms)
Summarization (risk: factuality errors)
Clinical decision support (CDS), which requires careful validation and governance

A classic dataset for clinical NLP is MIMIC-III (with notes) [3].

1.4 Omics + genomics

Variant interpretation
Polygenic risk scores / prediction
Drug response prediction


1.5 Operational / systems / public health

Scheduling and capacity forecasting
Outbreak forecasting
Resource allocation


2) Data realities that dominate outcomes

2.1 Labels are often noisy, delayed, and proxy-based

- ICD codes can be incomplete or influenced by billing.
- Clinical outcomes can be confounded by treatment decisions.
- Imaging labels might be derived from reports rather than pixel-level truth.

This is a key reason a model with high AUROC can still fail clinically.

2.2 Dataset shift is the rule, not the exception

- Different hospitals use different scanners, protocols, populations, and EHR systems.
- Clinical practice changes over time.

A broad discussion of dataset shift in medical AI appears in Finlayson et al. (2021) [4].

2.3 Missingness is informative

- Labs are ordered because clinicians suspect something.
- Missing values are not random; they encode workflow and suspicion.
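
A common baseline is to encode missingness explicitly rather than impute it silently: keep an indicator of whether each lab was ordered alongside an imputed value, so the model can use the ordering pattern itself as signal. A minimal sketch (the feature names and mean imputation are illustrative choices):

```python
def add_missingness_indicators(rows, features):
    """For each feature, add a '<name>_measured' 0/1 flag and replace
    missing values with the feature's observed mean.
    `rows` is a list of dicts mapping feature name to value or None."""
    means = {}
    for f in features:
        vals = [r[f] for r in rows if r.get(f) is not None]
        means[f] = sum(vals) / len(vals) if vals else 0.0
    out = []
    for r in rows:
        new = dict(r)
        for f in features:
            new[f + "_measured"] = int(r.get(f) is not None)
            if r.get(f) is None:
                new[f] = means[f]
        out.append(new)
    return out

# Toy example: the middle patient never had a lactate drawn.
rows = [{"lactate": 2.0}, {"lactate": None}, {"lactate": 4.0}]
rows = add_missingness_indicators(rows, ["lactate"])
```

The indicator often carries more predictive weight than the imputed value, which is exactly the point of this subsection: the decision to measure is informative.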

3) Evaluation: what is different from "general ML"

3.1 Beyond AUROC

- Calibration (reliability of probabilities)
- Decision-curve analysis (clinical utility across thresholds)
- Subgroup performance (age, sex, ethnicity, device type)
- Prospective validation and workflow integration

Saito & Rehmsmeier (2015) show that precision-recall curves are more informative than ROC curves when classes are imbalanced [5].
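
To make these distinctions concrete, the sketch below scores a simulated low-prevalence task with three complementary metrics; the data are synthetic and the effect sizes arbitrary, so only the qualitative gap between AUROC and AUPRC is the point:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)

rng = np.random.default_rng(0)

# Simulated screening task: ~2% prevalence, modestly informative scores.
n = 20_000
y = (rng.random(n) < 0.02).astype(int)
scores = np.clip(0.02 + 0.10 * y + 0.10 * rng.standard_normal(n), 0.0, 1.0)

auroc = roc_auc_score(y, scores)            # rank-based; blind to prevalence
auprc = average_precision_score(y, scores)  # random baseline = prevalence
brier = brier_score_loss(y, scores)         # penalizes miscalibrated probabilities
```

With 2% prevalence, an AUROC that looks respectable coexists with an AUPRC far below it, which is what a clinician drowning in false alerts actually experiences.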

3.2 Ground truth and "label leakage"

A model can accidentally learn from information that would not be available at prediction time (e.g., post-event labs, documentation created after diagnosis). Leakage is common in EHR modeling.
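
A simple structural guard is to assemble features only from data timestamped strictly before the prediction time. A minimal sketch (the event-tuple layout is an illustrative assumption):

```python
from datetime import datetime

def features_at(events, prediction_time):
    """Keep only events charted strictly before the prediction time.
    `events` is a list of (timestamp, name, value) tuples."""
    return [(t, name, v) for (t, name, v) in events if t < prediction_time]

events = [
    (datetime(2023, 3, 1, 8, 0), "lactate", 1.1),
    (datetime(2023, 3, 1, 14, 0), "lactate", 4.2),  # drawn after deterioration
]
pred_time = datetime(2023, 3, 1, 12, 0)
usable = features_at(events, pred_time)
```

The subtlety in practice is which timestamp to filter on: chart time, not event time, since results are often backdated in the EHR.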

3.3 External validation and site generalization

Internal test splits are often optimistic. External validation across hospitals/scanners is a major differentiator in credible healthcare ML studies.
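
One way to approximate site generalization offline is to split by site rather than by row, so no hospital contributes to both train and test. A sketch using scikit-learn's GroupShuffleSplit (the hospital labels and feature matrix are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative: six admissions across three hospitals; hold out whole sites.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
hospital = np.array(["A", "A", "B", "B", "C", "C"])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=hospital))

train_sites = set(hospital[train_idx])
test_sites = set(hospital[test_idx])
# No hospital appears on both sides of the split.
```

This is still weaker than true external validation on an unseen system, but it at least removes the within-site leakage that makes internal splits optimistic.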

4) Deployment: the last mile is most of the work

4.1 Human factors and workflow

- When does a prediction arrive?
- Who sees it?
- What action is expected?
- What happens on disagreement?

4.2 Monitoring and drift

- Data drift (inputs) and concept drift (outcomes)
- Performance auditing by time and subgroup
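
A common input-drift heuristic is the Population Stability Index (PSI), which compares a feature's current distribution to a baseline; values above roughly 0.25 are often read as a major shift. A self-contained sketch (the bin count and the small-bin floor are conventional choices, not a standard):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of a numeric
    feature. Bins are derived from the expected (baseline) sample."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bin index
        n = len(sample)
        # Small floor avoids log(0) when a bin is empty.
        return [max(c / n, 1e-4) for c in counts]

    p, q = frac(expected), frac(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

baseline = [i / 100 for i in range(100)]       # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]  # mass moved to the upper half
```

PSI only watches inputs; outcome (concept) drift still requires auditing model performance against arriving labels, by time window and by subgroup.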

4.3 Regulatory + governance

Depending on jurisdiction and use, ML tools may be considered medical devices and require evidence and quality systems.

References

  1. Litjens G, et al. "A survey on deep learning in medical image analysis." Medical Image Analysis (2017). https://doi.org/10.1016/j.media.2017.07.005
  2. Harutyunyan H, et al. "Multitask learning and benchmarking with clinical time series data." Scientific Data (2019). https://doi.org/10.1038/s41597-019-0103-9
  3. Johnson AEW, et al. "MIMIC-III, a freely accessible critical care database." Scientific Data (2016). https://doi.org/10.1038/sdata.2016.35
  4. Finlayson SG, et al. "The clinician and dataset shift in artificial intelligence." NEJM (2021). https://doi.org/10.1056/NEJMc2104626
  5. Saito T, Rehmsmeier M. "The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets." PLOS ONE (2015). https://doi.org/10.1371/journal.pone.0118432