Why Computational Biologists Make Great Data Scientists
This article was written in collaboration with fellow computational biologist, Dr. Benu Atri.
Modern computational biologists develop algorithms and models to understand biological systems and relationships. This entails collecting and analyzing extensive sets of data across a wide range of data types. The problems they work to solve in biology have a lot in common with, and are often harder than, business-focused data science problems. That experience enables them to quickly become top-performing data scientists across a wide range of business problems.
Modern-day biology and business data science have a lot in common.
Big Data is really big in Computational Biology
The advent of the post-genomic era (marked by the completion of the genome sequencing of humans and several model organisms) brought with it massive amounts of genomic sequencing data, proteomic data, and electronic medical records data.
Computational biologists are trained to work efficiently with terabytes of biological information, so big data is not a novel challenge. For example, a single patient’s genome contains 3 billion base pairs, and analyzing it requires handling ~200 gigabytes of DNA sequence fragments, or reads, and then performing assembly, mapping, and quantification. The 1000 Genomes Project, a resource widely used by biologists, contains the genomes of 1,700 participants and amounts to 200 terabytes of data. Few non-bioinformatics data scientists have had to deal with datasets this large on a regular basis.
A trained computational biologist is already comfortable with the storage, accession, and analyses of huge volumes of diverse data and understands how to utilize computation time and resources efficiently.
Data is incredibly noisy in every Computational Biology problem
Biology is inherently messy and full of noisy and confounding variables. Variations across experimental measurements resulting from technical limitations and biological variability further confound data analysis and modeling. Computational biologists wade through this chaos by using the scientific method to generate well-reasoned hypotheses that can rein in some of the noise by accounting for the limitations and assumptions around the data and variables. Analyzing noisy data requires high levels of rigor and reproducibility, and computational biologists are trained to achieve this by enforcing statistical constraints, measures of confidence, and appropriate controls. These skills are directly translatable and can be applied to parse noisy business data for actionable insights.
Businesses regularly come across ‘dirty’ data that is rife with errors such as spelling and punctuation mistakes, inaccuracies and inconsistencies, and outdated or duplicated entries. These sources of noise reduce overall business efficiency and can misguide key decisions in subtle ways. A computational biologist immediately understands the nuances of these problems and how they affect interpretations downstream.
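To make the idea of cleaning dirty records concrete, here is a minimal sketch in Python. The record layout (a `name` and an ISO-formatted `updated` date) and the helper names `normalize` and `deduplicate` are assumptions made for illustration, not part of any particular pipeline. It handles punctuation, capitalization, stray whitespace, duplicates, and outdated entries; catching genuine spelling mistakes would need fuzzy matching on top of this.

```python
import re

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def deduplicate(records):
    """Keep only the most recently updated record for each normalized name.

    Assumes 'updated' is an ISO date string (YYYY-MM-DD), so plain string
    comparison orders the dates correctly.
    """
    latest = {}
    for rec in records:
        key = normalize(rec["name"])
        if key not in latest or rec["updated"] > latest[key]["updated"]:
            latest[key] = rec
    return list(latest.values())

# Hypothetical customer records with inconsistent punctuation and spacing.
records = [
    {"name": "Acme Corp.", "updated": "2019-03-01"},
    {"name": "acme  corp", "updated": "2020-06-15"},
    {"name": "Globex, Inc.", "updated": "2018-11-30"},
]
clean = deduplicate(records)
```

Here the two spellings of the same company collapse into a single entry, and the newer of the two records is the one that survives.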
Biologists always have to work with incomplete data
Even though advancements in biology push the limits of our knowledge every day, current biological information is far from complete. It follows that we may never fully understand everything, and, pardon the philosophy, we may not even know that which we do not know.
In biology, knowledge about a given organism or pathway does not always generalize and translate to other systems. Biologists understand that this incompleteness is a part of the package.
When scientists publish results, they often omit data for “failed” experiments. This paints an incomplete picture. Unfortunately, this can lead to conclusions that apply to a particular niche and are not generalizable. Biologists are largely unable to learn from others’ mistakes, failed experiments, or poor research design. They learn to make educated guesses but the output of their models must still make accurate and reliable predictions when applied to real data.
Even if data from every past experiment were to be available, the amount of incomplete knowledge around cause and effect in biological systems is enormous. Computational biologists work well around missing data and incomplete knowledge.
Biological data is complex and unstructured
Computational scientists build complex representations to capture information, look for patterns, and make predictive models.
Extracting meaningful phrases from unstructured text to simplify complex statements such as those found in a medical record (Credit: Benu Atri).
In ecology, evolution, genetics, genomics, or microbiology, unorganized data is relatively standard. Unstructured text can be anything from recent important research to a healthcare provider’s patient notes. Developing data structures to represent this data is critical for making inferences and new discoveries.
Biologists love ontologies and hierarchies where individual entities (ideas or concepts) are related, usually in a non-linear fashion, and ranking is essential. One example is evaluating evolutionary relationships among biological entities — often species, individuals or genes (Fig. 2). Representing these related entities requires building hierarchies. Confidently clustering connected and non-linear data while accounting for confounders is a standard challenge for a computational biologist.
A standard tree-like representation of the hierarchical relationships between different clades of organisms. Credit: Benu Atri
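As a rough illustration of how such a hierarchy can be built from pairwise comparisons, the sketch below uses average-linkage (UPGMA-style) agglomerative clustering in plain Python. The function name `average_linkage_tree` and the distance values are made up for this example and are not real evolutionary distances; a production analysis would rely on an established phylogenetics or clustering library.

```python
from itertools import combinations

def average_linkage_tree(labels, dist):
    """Build a tree by repeatedly merging the two closest clusters.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Returns the tree as nested tuples of leaf labels.
    """
    # Each cluster is a pair: (subtree, list of leaf labels it contains).
    clusters = [(label, [label]) for label in labels]

    def cluster_distance(c1, c2):
        # Average distance over every pair of leaves across the two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Illustrative (invented) pairwise distances between four organisms.
labels = ["human", "chimp", "mouse", "zebrafish"]
dist = {
    frozenset({"human", "chimp"}): 1.0,
    frozenset({"human", "mouse"}): 5.0,
    frozenset({"chimp", "mouse"}): 5.0,
    frozenset({"human", "zebrafish"}): 10.0,
    frozenset({"chimp", "zebrafish"}): 10.0,
    frozenset({"mouse", "zebrafish"}): 10.0,
}
tree = average_linkage_tree(labels, dist)
print(tree)  # ('zebrafish', ('mouse', ('human', 'chimp')))
```

The nested tuples mirror the branching of a tree diagram: the closest pair is grouped first, and more distant entities join at higher levels.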
Business data is often unstructured and finding creative ways to extract, model, and then represent this data is key to leveraging it toward new insights.
Biologists are also used to working with small data
Despite the move towards more massive datasets in biology, small sample sizes are also regularly encountered. Since collecting patient data requires volunteers to follow through for the entire duration of the study, clinical studies often end up with sample sizes as small as 20. Regardless of the size of the data set, real patient data is invaluable.
Computational biologists are comfortable with small sample sizes, and regularly build robust statistical pipelines around those data to extract meaning and extrapolate to the “big picture” by connecting the dots and making educated predictions. They have been trained to make the most of very little.
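One common way to quantify uncertainty with so few observations is the bootstrap, which resamples the data with replacement instead of assuming a particular distribution. The sketch below is only an illustration of that general technique: the function name `bootstrap_ci`, its parameters, and the 20 measurements are all hypothetical.

```python
import random
import statistics

def bootstrap_ci(sample, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of a sample."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical measurements from a 20-patient study.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 3.9,
          5.0, 4.4, 5.8, 4.6, 5.2, 4.8, 5.6, 4.3, 5.4, 5.0]
lower, upper = bootstrap_ci(sample)
print(f"mean = {statistics.fmean(sample):.2f}, "
      f"95% CI = ({lower:.2f}, {upper:.2f})")
```

Reporting an interval rather than a single number makes it explicit how much a conclusion drawn from 20 data points can be trusted.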
The amount of this noisy, unstructured data increases exponentially
With new instruments and discoveries, biological data is continually expanding and updating. This ever-changing data requires continually improving databases and methods to keep models recent and relevant. With enough “patterns,” machines can be trained to consider many unexpected outcomes to create a more reliable pipeline.
The ever-changing nature of data is a challenge faced by business and computational biology alike. At a company that monitors trends, if consumer preferences change, so will the data, making it hard to establish absolute rules.
At MDS, we love turning computational biologists into elite business data scientists. Transitioning towards a business-oriented data science company is the perfect segue for a computational biologist because, like in biology, noisy, incomplete, big, complex, and ever-changing data is widespread in business.
Benu Atri is a Data Scientist with Mercury Data Science, with more than 10 years of bioinformatics research experience. She holds a Ph.D. in Quantitative and Computational Biosciences from Baylor College of Medicine. When she is not at work, she enjoys tutoring middle school kids and avidly follows all space missions.
Angela Wilkins is the co-founder and managing director at Mercury Data Science and works with companies to identify machine learning solutions for complex data problems. Prior to MDS, Angela was a member of faculty research at Baylor College of Medicine and led projects at the policy think tank, Center of Science and Law. She developed her machine learning knowledge in the biomedical field as part of IBM’s Watson AI and DARPA Simplex Project. Angela received her M.S. and Ph.D. from Lehigh University, both in Theoretical Physics.