Success in pursuit of discovery and development begins with a thorough understanding of what has been discovered, what is actively being explored, what has worked, what has failed, and what questions remain unanswered.
The solution: Literature review.
The problem: An unwieldy and ever-growing mountain of literature rife with unreplicable results and dead ends.
In an era of extraordinarily rapid scientific innovation, exacerbated by the pressures of a world-wide “publish or perish” academic culture, the rate of publication is skyrocketing. Trying to merely keep up in one’s own area of expertise is becoming a Sisyphean task. Scouring the literature in related fields for novel connections and insights that could lead to a Eureka finding is even harder.
The traditional solution is to hire an army of research assistants to plumb the depths of PubMed and Google Scholar in search of informational gold. Or just ignore prior work until it is time to write a paper (but then it is too late to learn from past research). A faster, cheaper, more accurate, and more efficient solution would be an AI research assistant capable of navigating the technical language inherent in your field and (almost) instantly reviewing the entire corpus of up-to-date literature, revealing connections related to your terms of interest, and summarizing major themes and take-home messages. Such an AI assistant could even assist your human assistants and make them more efficient!
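There are three features in particular that upgrade our NLP (natural language processing) platform from just another data mining resource to a research assistant AI (RAAI): an ontology-enriched technical vocabulary, the extraction of directional and higher-order relationships between terms, and a summary of the major themes found in the literature.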
Most “out of the box” NLP platforms and models are trained on an enormous corpus of written language, typically from the web or from news articles. These make for excellent tools when it comes to common language applications, but fall far short when it comes to highly technical documents. While good scientific language models exist (and we make use of them), this still leaves something to be desired when it comes to having a comprehensive technical vocabulary for “named entity recognition” (NER), or the ability of the application to annotate terms of interest and “understand” what they refer to.
NLP language models encode information about the “meanings” of words in vectors, and to some extent, these vectors contain some information about relationships between words. In the most famous example, king - man + woman = queen, it is clear that the underlying word vectors of king and queen describe a gender relationship. With more technical terminology, however, the relationships in the vectors are less clear, and it is difficult to extract these relationships.
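The vector arithmetic behind the king/queen example can be illustrated with a minimal sketch. The three-dimensional vectors below are hand-made toy values (an assumption for illustration only; real language models learn hundreds of dimensions from text), and the names `TOY_VECTORS`, `cosine`, and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors: (royalty, maleness, femaleness).
# These values are assumptions for illustration, not learned embeddings.
TOY_VECTORS = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "princess": [0.8, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Return the word whose vector is closest to a - b + c, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))  # prints "queen" with these toy vectors
```

With everyday words the gender direction is easy to see in the toy numbers; for technical terms no such clean axis is usually available, which is the difficulty described above.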
To overcome these hurdles, we leverage the curated and highly navigable knowledge found in ontologies. Ontologies significantly expand the scope of vocabulary that RAAI can search, and preserve the hierarchical relationships between terms. The benefits of these two features in an NLP application are numerous. First, synonymous terms can be pooled and tied back to a single “canonical” form of the term, rather than each being treated as a different conceptual entity. Second, the connections found in the literature are enriched with relational information found in the ontologies.
To use a biological example, let’s consider the gene angiotensin I converting enzyme 2 (ACE2), which is known to play a role in the regulation of cardiovascular and renal function, and is a functional receptor for the spike glycoprotein of the human coronavirus SARS-CoV-2. Incorporating the HUGO Gene Nomenclature Committee database, RAAI knows that “ACE2”, “angiotensin I converting enzyme 2”, and “peptidyl-dipeptidase A” are synonyms, and mentions in the literature of any of these synonyms would be marked as a mention of ACE2.
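A minimal sketch of this kind of synonym normalization is shown here. It uses only the three ACE2 names mentioned above; the table layout and the `build_matcher` and `annotate` helpers are hypothetical and are not a description of how RAAI is actually implemented.

```python
import re

# Toy synonym table built from the three names given in the text.
# A real vocabulary would be loaded from an ontology or nomenclature database.
SYNONYMS = {
    "ACE2": [
        "ACE2",
        "angiotensin I converting enzyme 2",
        "peptidyl-dipeptidase A",
    ],
}


def build_matcher(synonyms):
    """Compile one case-insensitive pattern and a surface-form -> canonical lookup."""
    lookup = {}
    for canonical, names in synonyms.items():
        for name in names:
            lookup[name.lower()] = canonical
    # Try longer names first so multi-word synonyms are matched whole.
    alternatives = sorted(lookup, key=len, reverse=True)
    body = "|".join(re.escape(name) for name in alternatives)
    pattern = re.compile(r"(?<!\w)(" + body + r")(?!\w)", re.IGNORECASE)
    return pattern, lookup


def annotate(text, pattern, lookup):
    """Return (surface form, canonical term) pairs in order of appearance."""
    return [(m.group(1), lookup[m.group(1).lower()]) for m in pattern.finditer(text)]


pattern, lookup = build_matcher(SYNONYMS)
sentence = "Peptidyl-dipeptidase A (ACE2) is discussed in this abstract."
print(annotate(sentence, pattern, lookup))
```

Both surface forms in the toy sentence resolve to the same canonical entity, so downstream counts treat them as two mentions of ACE2 rather than as two unrelated terms.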
Additional ontologies provide information that can aid discovery. For instance, adding the Gene Ontology (GO) to RAAI associates ACE2 with a number of known biological processes and molecular functions, including virus receptor activity, regulation of cytokine production, angiotensin maturation, regulation of systemic arterial blood pressure by renin-angiotensin, and endopeptidase activity. Using this database, we also know that ACE2 is part of the angiotensin-converting enzyme PANTHER protein family (PTHR10514), which is a subclass of metalloproteases. Each of these GO terms and PANTHER protein families, in turn, is associated with a number of other GO terms, genes and gene homologs, and higher-level protein families. This hierarchical and lateral information enriches the connections we find in the literature.
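The hierarchical and lateral lookups described above can be sketched as a walk over parent links. The tiny graph below encodes only the ACE2 family relationship stated in the text; `GENE_X` is a hypothetical placeholder for another member of the same family, and the function names are assumptions for illustration.

```python
from collections import deque

FAMILY = "PTHR10514 (angiotensin-converting enzyme family)"

# Child -> parents. Only the ACE2 links come from the text above;
# GENE_X is a hypothetical placeholder used to show a lateral connection.
PARENTS = {
    "ACE2": [FAMILY],
    "GENE_X": [FAMILY],
    FAMILY: ["metalloprotease"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links, nearest first."""
    seen = set()
    order = []
    queue = deque(parents.get(term, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(parents.get(node, []))
    return order


def shared_ancestors(a, b, parents=PARENTS):
    """Return the ancestors that two terms have in common (a lateral connection)."""
    b_ancestors = set(ancestors(b, parents))
    return [t for t in ancestors(a, parents) if t in b_ancestors]


print(ancestors("ACE2"))
print(shared_ancestors("ACE2", "GENE_X"))
```

Two terms that never co-occur in a sentence can still be linked this way, because they meet at a shared family or higher-level class in the ontology.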
RAAI finds relationships between terms and returns a measure of confidence that these terms are connected. At the most basic level, a relationship between terms of interest can be established by looking at how often they co-occur in the same sentences in the literature. While not particularly sophisticated, this method performs surprisingly well given a large enough corpus of text. A higher level of sophistication comes with examining subject-verb-object triplets within sentences, wherein the subject and object are “named entities” (recognized words from our vocabulary). The verbs that describe the relationships between terms can be categorized and relationships extracted: “stimulates”, “inhibits”, “treats”, “prevents”, “is a”, “augments”, etc. This upgrades the co-occurrence relationship of “is somehow related to” to more specific, directional, and meaningful connections.
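Both levels can be sketched in a few lines. The example below counts sentence-level co-occurrence and then pulls out simple subject-verb-object triplets. It is a deliberately naive stand-in: the entity names and corpus are made-up placeholders, a regular expression replaces a real sentence splitter, and a verb found between two adjacent entities replaces a proper dependency parse. How raw counts are turned into a confidence score is not specified here and is left out.

```python
import re
from collections import Counter
from itertools import combinations

# Placeholder vocabulary and verbs (assumptions for illustration).
ENTITIES = ["compound a", "protein b", "disease c"]
RELATION_VERBS = {"stimulates", "inhibits", "treats", "prevents", "augments"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence):
    """Return (start, end, entity) hits sorted by position in the sentence."""
    hits = []
    for entity in ENTITIES:
        pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
        for m in re.finditer(pattern, sentence, re.IGNORECASE):
            hits.append((m.start(), m.end(), entity))
    return sorted(hits)


def cooccurrence_counts(text):
    """Count how many sentences mention each unordered pair of entities."""
    counts = Counter()
    for sentence in split_sentences(text):
        present = sorted({entity for _, _, entity in find_entities(sentence)})
        counts.update(combinations(present, 2))
    return counts


def extract_triplets(text):
    """Emit (subject, verb, object) when exactly one known verb sits between two entities."""
    triplets = []
    for sentence in split_sentences(text):
        hits = find_entities(sentence)
        for (_, end1, subj), (start2, _, obj) in zip(hits, hits[1:]):
            between = sentence[end1:start2].lower().split()
            verbs = [w for w in between if w in RELATION_VERBS]
            if len(verbs) == 1:
                triplets.append((subj, verbs[0], obj))
    return triplets


CORPUS = (
    "Compound A inhibits protein B in vitro. "
    "Protein B stimulates disease C progression. "
    "Compound A was tested alongside protein B."
)
print(cooccurrence_counts(CORPUS))
print(extract_triplets(CORPUS))
```

The third toy sentence adds to the co-occurrence count but yields no triplet, which shows the difference between “is somehow related to” and a directional relationship.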
We are currently developing more cutting-edge features for RAAI relationship extraction, including inter-sentence relationship extraction, expanded relationships based on ontological connections between terms, and second- and third-order connections from the literature. These features will shift relationship extraction from summarizing findings already in the literature to generating hypotheses about putative relationships not yet in print.
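One simple way to picture a second-order connection: if A is linked to B and B is linked to C in the literature, but A and C are never linked directly, then A and C form a candidate hypothesis. The sketch below is only an illustration of that idea on placeholder data; the function name and edge format are assumptions, and it is not a description of the features under development.

```python
from collections import defaultdict


def second_order_candidates(edges):
    """Map each unlinked pair (a, c) to the bridging terms that connect them."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    candidates = defaultdict(set)
    for bridge, linked in neighbors.items():
        for a in linked:
            for c in linked:
                # Keep each pair once (a < c) and only if no direct link exists.
                if a < c and c not in neighbors[a]:
                    candidates[(a, c)].add(bridge)
    return {pair: sorted(bridges) for pair, bridges in candidates.items()}


# Placeholder direct relationships, e.g. as produced by co-occurrence or triplets.
DIRECT_EDGES = [
    ("compound a", "protein b"),
    ("protein b", "disease c"),
]
print(second_order_candidates(DIRECT_EDGES))
```

Here the pair (“compound a”, “disease c”) is returned with “protein b” as the bridge; third-order connections would extend the same walk by one more step.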
Finally, it is not enough for an NLP platform to simply return literature related to a search, along with a list of extracted relationships. This would require the user to still read through a heap of text. RAAI uses topic modeling of literature abstracts to deduce major shared themes, and returns the top 5 representative sentences directly from the literature for each theme, along with the references. This summarization gives the reader an “at a glance” summary of the abstracts, and subsets of references that can be further pursued based on interest in the theme. It’s a kind of “too long; didn’t read” summary of a list of literature that is likely too long for any one person to efficiently read and really synthesize. Future developments in our “literature gestalt extractor” include the ability of our RAAI to de novo compile a summary paragraph for each theme; in essence, writing the beginnings of a review article based on a user query.
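The “representative sentences per theme” step can be sketched as follows. This example assumes the topic model has already run upstream and produced a list of top terms for a theme; it then ranks sentences by the share of their tokens that belong to that theme. The scoring rule, function names, and toy sentences are all assumptions for illustration, not the actual RAAI method.

```python
import re


def tokenize(text):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def representative_sentences(sentences, theme_terms, top_n=5):
    """Return up to top_n sentences ranked by the fraction of tokens in the theme."""
    terms = set(theme_terms)
    scored = []
    for index, sentence in enumerate(sentences):
        tokens = tokenize(sentence)
        if not tokens:
            continue
        score = sum(token in terms for token in tokens) / len(tokens)
        if score > 0:
            # Negative score sorts best-first; index keeps ties in original order.
            scored.append((-score, index, sentence))
    return [sentence for _, _, sentence in sorted(scored)[:top_n]]


# Toy abstract sentences and a toy theme (placeholders, not real results).
SENTENCES = [
    "Receptor binding enables viral entry.",
    "Blood pressure is regulated by many hormones.",
    "Viral entry depends on receptor expression levels.",
]
THEME_TERMS = ["viral", "entry", "receptor"]
print(representative_sentences(SENTENCES, THEME_TERMS, top_n=2))
```

In a fuller pipeline each returned sentence would carry its reference, so the reader can jump from the theme summary to the subset of papers behind it.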
Given how important and laborious a literature review can be, this pain point of the research industries is ripe for innovation. Existing NLP tools and platforms somewhat reduce the friction, but what is lacking is a customized AI research assistant. Our platform seeks to close this performance gap with an enriched ontologically based technical vocabulary, the extraction of directional and higher-order relationships, and a summary of themes found in the literature.
We invite you to test and view the public beta release at https://www.covid19research.ai/.
Connect with us.
With expertise in NLP, Machine Learning, Bioinformatics, and Video/Voice Analytics and a passion for cutting-edge data science, our team is always looking for ways to enhance discoveries and accelerate your potential. If you have questions about our app, have a data science question, or would like to discuss your AI strategy, we’d love to hear from you! Reach out today at firstname.lastname@example.org, on Twitter at @mercurydatasci, or on LinkedIn.