ML-readiness across 3 data sources: Preclinical research, clinical research, and clinical practice

How ML-ready a dataset is varies widely across data types and depends primarily on its source. The key distinction is between experimental data collected in preclinical or clinical research and real-world clinical data.

Data collected within an experimental context, such as a drug discovery assay or a clinical trial, often comes with a high level of ML-readiness. This is a function of the inherent control over the experimental design, the characteristics of subjects and samples, and potential confounders.

Real-world clinical data, on the other hand, is provided as is. Good luck convincing a physician to write a clinical note in a more standardized, structured manner, or getting a technician to change the way a routine CT image is acquired. This translates to a low level of control.

Readily annotated data enables supervised learning. In experiments, labels are defined a priori and are therefore available from day one. In a real-world setting, labels are created by different stakeholders in different locations at different times.

Constructing a single annotated training sample from such scattered labels therefore requires multiple glue layers. It may start with a patient’s MRI from the radiology PACS, which is then linked to an oncologist’s note in the EHR and coupled with a pathologist’s cancer grade from the LIMS.
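As a rough illustration of one such glue layer, the sketch below joins three per-system exports into annotated training samples. It rests on simplifying assumptions that real data rarely satisfies: every system shares a common patient_id key, and each system holds at most one record per patient. The field names (mri_path, note_text, cancer_grade) and the toy records are hypothetical and do not correspond to any real PACS, EHR, or LIMS schema.

```python
from dataclasses import dataclass


@dataclass
class TrainingSample:
    patient_id: str
    mri_path: str      # image reference from the (hypothetical) PACS export
    note_text: str     # oncologist's note from the (hypothetical) EHR export
    cancer_grade: int  # label: pathologist's grade from the (hypothetical) LIMS export


def build_samples(pacs_rows, ehr_rows, lims_rows):
    """Join three per-system exports on patient_id, skipping incomplete patients."""
    # Index the EHR and LIMS exports by patient for constant-time lookup.
    notes = {r["patient_id"]: r["note_text"] for r in ehr_rows}
    grades = {r["patient_id"]: r["cancer_grade"] for r in lims_rows}

    samples = []
    for row in pacs_rows:
        pid = row["patient_id"]
        # A sample is usable only if all three sources contributed a record.
        if pid in notes and pid in grades:
            samples.append(
                TrainingSample(pid, row["mri_path"], notes[pid], grades[pid])
            )
    return samples


# Made-up records, purely for illustration.
pacs = [
    {"patient_id": "P1", "mri_path": "mri/P1.dcm"},
    {"patient_id": "P2", "mri_path": "mri/P2.dcm"},
]
ehr = [{"patient_id": "P1", "note_text": "Follow-up visit."}]
lims = [
    {"patient_id": "P1", "cancer_grade": 2},
    {"patient_id": "P2", "cancer_grade": 3},
]

samples = build_samples(pacs, ehr, lims)
print(len(samples))  # P2 is dropped because it has no EHR note
```

Even in this toy case, one of the two patients is lost because a single source is missing. In practice the join would also have to handle mismatched identifiers and multiple records per patient taken at different times.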

Working with ML-ready data enables data professionals to spend more time on the actual modeling, storytelling, and visualization. At the other end of the spectrum, considerable data engineering effort is required before any ML work can begin.