006

The triangular spectrum of research data teams

Data teams in preclinical research operate within a triangular spectrum spanning three core areas: infrastructure, machine learning, and data science. ML often takes most of the glory, but it is only one part of the story.

First, the infrastructure core engineers cloud data storage and compute services. In parallel, this core also builds and maintains "glue layers" that connect to lab equipment (through a LIMS), external CROs, and data labeling services, ideally through some form of API.

These glue layers give scientists on the team direct access to data, whether it was acquired in-house or externally. This core also provides utilities for provisioning compute resources (i.e., cloud instances), allowing other team members to focus solely on running experiments.

Next, we have the machine learning core, the more fundamental research arm of the team. This arm provides a gateway into current state-of-the-art practices by reviewing the literature, identifying appropriate methods, and implementing them for data scientists to use.

This core also manages and maintains benchmarking datasets to track model performance over time, and handles other ML technicalities such as model training bottlenecks and GPU acceleration.
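As a rough illustration of what tracking performance over time on a benchmark can look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `BenchmarkResult` record, the `find_regressions` helper, the dataset name, and the scores are all hypothetical, and a real team would likely use an experiment tracker rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkResult:
    """One evaluation of a model version on a fixed benchmark dataset (hypothetical schema)."""
    model_version: str
    run_date: date
    dataset: str
    score: float  # assumed: higher is better, e.g. accuracy


def find_regressions(results, tolerance=0.01):
    """Return results that score more than `tolerance` below the best
    earlier score recorded on the same dataset."""
    best_so_far = {}  # dataset name -> best score seen so far
    regressions = []
    for r in sorted(results, key=lambda r: r.run_date):
        best = best_so_far.get(r.dataset)
        if best is not None and r.score < best - tolerance:
            regressions.append(r)
        best_so_far[r.dataset] = r.score if best is None else max(best, r.score)
    return regressions


# Illustrative, made-up history on a hypothetical held-out dataset.
history = [
    BenchmarkResult("v1", date(2024, 1, 1), "held_out_assay", 0.80),
    BenchmarkResult("v2", date(2024, 2, 1), "held_out_assay", 0.85),
    BenchmarkResult("v3", date(2024, 3, 1), "held_out_assay", 0.82),
]

for r in find_regressions(history):
    print(f"{r.model_version} regressed on {r.dataset}: {r.score:.2f}")
```

The key design choice is that the benchmark dataset stays fixed while model versions change, so a drop in score points at the model rather than at the data.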

Finally, the data science core works on extracting value and insights from data. This core is deeply integrated within different programs and works very closely with biologists. With new data coming in every week, this core is responsible for analysis and interpretation.

Data science essentially manages the last mile, the final leg of the data journey. By combining storytelling and visualization, this core communicates findings to the program team, which then decides what the next iteration of experiments and data should look like.

Members of smaller teams will start by working across these cores, juggling them as needs evolve. As teams grow, members become more specialized, gravitating toward one of these cores.