Insights

Natural Language Processing in Life Sciences and Healthcare: Create more predictive ML models

Many organizations have valuable unstructured data that goes unused. Learn how NLP can extract hidden meaning, uncover relationships, and improve predictive ML models.

Leveraging text-based data within machine learning models is a major challenge for many biotech and healthcare companies. This text-based data can be found in clinical trial notes, medical records, messages between study site coordinators and study participants, and claims documentation. Even supposedly structured data, such as lists of co-morbidities, medications, or procedures, contains such a vast number of choices with so few patterns that it more closely resembles unstructured text. This data typically contains a wealth of information but often goes unused due to its variability, lack of structure, and the inherent difficulty of translating it into numerical form. Here, we present how natural language processing can unlock the potential of your text-based data using techniques available from open-source, pre-trained large language models (LLMs).

Generally, when people think about natural language processing (NLP), they imagine algorithms that interpret and generate human-like language, sentiment analysis on product reviews, or chatbots taking over customer service. To organizations focused on training ML models, NLP can feel esoteric, specialized, and not applicable to the primary objective. However, there are straightforward NLP use cases that can be used to create much more useful and predictive ML models. Given the significant value they add, we include these NLP techniques in most of our ML projects across domains and modeling approaches.

You’ve collected categorical data but it’s too variable and messy to use

Categorical information, data stored as text-based categories, represents a significant opportunity to transform your raw data using NLP and capitalize on unrealized data value. Let’s say you are trying to predict health outcomes for your study participants. You have collected their job titles, what they eat for breakfast every day, their pre-existing medical conditions, and a brief description of how they felt that day. Now, you want to use this data to predict who is at risk for disease or negative health outcomes. Even if you limit these categories to 1-3 words, you will end up with tens of thousands of discrete categories or unique terms. The data you collected isn’t usable in its raw form because it lacks the uniformity needed to identify patterns.

In this scenario, you’ll end up with job titles like “secretary”, “administrative assistant”, “office manager”, and “project coordinator”. Without NLP, each of these becomes its own category, and your ML model will struggle to learn across categories. You therefore need an approach that automatically condenses similar responses into concise buckets and then goes further to learn the similarity between those buckets.

Using NLP, we can condense messy text into concise buckets and then use word vectors to calculate the similarity between discrete words and phrases. Given a single word, its vector provides deep context for what that term means, how it is used, and which other terms it is related to. Word vectors provide a path to identifying synonymous or highly related entries, thereby allowing the aggregation of items that belong to similar classes and mitigating the challenges presented by high-cardinality categorical labels.
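
As a rough illustration, the sketch below scores how related a handful of free-text entries are using cosine similarity over pre-trained embeddings. The library, model name, and example terms are illustrative choices, not a prescribed stack.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative open-source embedding model; any pre-trained model that
# produces word or sentence vectors could stand in here.
model = SentenceTransformer("all-MiniLM-L6-v2")

terms = ["secretary", "administrative assistant", "office manager", "mechanic"]
vectors = model.encode(terms)

# Pairwise cosine similarity: the administrative titles score much higher
# against each other than against the unrelated profession.
scores = util.cos_sim(vectors, vectors)
print(scores)
```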

Since these vectors come pre-trained as part of open-source LLMs, any organization can quickly augment categorical data and draw value from text fragments, adding contextual depth and richness. For instance, the job titles listed above can be automatically grouped into “Administrative” and kept separate from other professions such as “waitress” or “mechanic”. Breakfast choices of bananas, oatmeal, and pears can be grouped separately from donuts and kolaches. From the chaos of free-entry text and diverse categories, NLP can transform the data into actionable groups and patterns, thereby supporting much more useful and predictive ML models.
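
A minimal sketch of that grouping step, assuming the same illustrative embedding model and a small set of made-up entries: embed each response, cluster the vectors, and use the cluster label as the new, lower-cardinality category.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Made-up free-entry responses; in practice these come from your raw
# categorical fields.
entries = [
    "secretary", "administrative assistant", "office manager",
    "waitress", "line cook",
    "mechanic", "auto technician",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
vectors = model.encode(entries)

# Cluster the embeddings; entries that land in the same cluster form one
# bucket (e.g., an "Administrative" group) for downstream ML features.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for label, entry in sorted(zip(labels, entries)):
    print(label, entry)
```

The cluster count here is a hand-picked placeholder; in practice it would be tuned or replaced with a method that chooses the number of groups automatically.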

Automatically distilling sentences and paragraphs down to key terms and major ideas

But what if you have longer, more complicated blocks of text? You want to get value from clinician notes, participant status updates written in multiple sentences by your study coordinator, or short descriptions of an incident or event. You want the ability to reduce this text down to the key points, words, terms, or topics.

If you had a limited number of known terms of interest, you could simply run a search for those terms using something like regular expressions (RegEx); for instance, does a patient file use the word “cancer”? A baseline look-up of that kind might look like the snippet below. However, NLP allows you to go beyond simple look-ups in several powerful ways, outlined after the snippet.
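
A minimal sketch of that baseline, assuming Python and an invented note string:

```python
import re

# Invented clinician note used only to illustrate the baseline look-up.
note = "Family history of cancer; biopsy scheduled for next week."

# Simple case-insensitive keyword search with a regular expression.
if re.search(r"\bcancer\b", note, flags=re.IGNORECASE):
    print("Term of interest found")
```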

  1. Create a more complex look-up algorithm by relying on Named Entity Recognition (NER) or vector similarity, searching for ideas and topics rather than precise terms. For example, instead of simply looking for “cancer”, you can automatically expand your search to all terms related to cancer, such as “carcinoma”, “biopsy”, or “chemotherapy”.
  2. Define a list of concepts or terms with the most predictive and usable value, without needing to know those terms in advance. As a real-world example, which terms in a surgeon’s notes are most predictive of complications during recovery? Alternatively, which terms from a study coordinator’s notes are most predictive of continued participation? Layering additional ML techniques on top of NLP can make for an even more powerful classification approach. For instance, dimensionality reduction of word vectors followed by unsupervised clustering is a quick way to find actionable patterns and term subgroups in unstructured text, as shown in the sketch after this list.
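
The second point can be sketched in a few lines: embed a set of candidate terms, reduce the dimensionality of the vectors, and cluster the result to surface term subgroups. The terms, model, component count, and cluster count below are all illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Illustrative terms; in practice these would be extracted from clinician
# or coordinator notes.
terms = [
    "carcinoma", "biopsy", "chemotherapy", "radiation",
    "hypertension", "blood pressure", "statin",
    "missed appointment", "withdrew consent", "schedule conflict",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
vectors = model.encode(terms)

# Reduce the high-dimensional embeddings, then cluster the reduced vectors
# to surface candidate term subgroups for review and downstream modeling.
reduced = PCA(n_components=5, random_state=0).fit_transform(vectors)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

for label, term in sorted(zip(labels, terms)):
    print(label, term)
```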

Bottom Line

Many organizations are sitting on a gold mine of unstructured text and categorical data that they fail to use effectively, often because they simply don’t realize it’s possible to do so. Using NLP, structure can be imposed on unwieldy blocks of text and hidden meaning can be extracted from text fragments, providing order to your data, uncovering previously hidden relationships, and supporting more predictive ML models.

Written by:
Jonathan Gallion
VP of AI/ML
Jenna Cicardo
Senior Data Scientist
Published On:
August 1, 2022