More effective applications of AI in drug development will come from a more integrated understanding of the complex links between genes, proteins, and disease. A substantial amount of crucial biological information is currently locked within more than 34 million scientific publications.
Large language models (LLMs) offer a new way to understand the complex scientific relationships that are important for modeling disease biology. Although the most powerful LLMs are not specifically designed for scientific language, they are trained on vast datasets that include much of the publicly available research literature. Our objective was to explore whether we could effectively tap into the knowledge latent within an LLM and generate outputs that are both useful and factually accurate.
We first tested whether an out-of-the-box GPT model can correctly interpret ambiguous terms in biomedical abstracts, identifying which entities (targets, diseases, drugs, etc.) are being referred to even when they are not explicitly defined.
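To illustrate the kind of disambiguation task involved, the sketch below frames an ambiguous biomedical mention as a question for the model. The prompt wording, function name, and example abstract are illustrative assumptions, not our actual evaluation harness.

```python
# Hypothetical sketch of the disambiguation task posed to an
# out-of-the-box GPT model; names and prompt text are illustrative.

def build_disambiguation_prompt(abstract: str, mention: str) -> str:
    """Frame entity disambiguation as a question the model can answer."""
    return (
        "In the biomedical abstract below, decide whether the term "
        f"'{mention}' refers to a target, a disease, or a drug, "
        "and give its most likely canonical name.\n\n"
        f"Abstract: {abstract}"
    )

# 'ER' is a classic ambiguity: endoplasmic reticulum vs. estrogen
# receptor. The model must use the surrounding context to pick the
# right sense for each occurrence.
abstract = (
    "ER stress was induced in MCF-7 cells; ER status was also "
    "assessed by immunohistochemistry."
)
prompt = build_disambiguation_prompt(abstract, "ER")
```

The returned answer would then be compared against a manually resolved reading of the abstract to judge accuracy.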
We then fine-tuned GPT-NeoX, an open-source GPT model, on the large set of biomedical entities and known synonyms that we have curated for our ERGO platform.
This process also taught the model to provide outputs in a structured, consistent format.
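As a sketch of what such structured output might look like, the example below pairs a sentence with a machine-parseable entity annotation. The field names and the canonical labels shown are assumptions for illustration, not the actual ERGO fine-tuning format.

```python
import json

# Illustrative shape of a fine-tuning example: free text paired with
# a structured annotation the model learns to emit. Field names and
# canonical labels are assumptions, not the platform's real schema.
example = {
    "text": "Imatinib inhibits BCR-ABL1 in chronic myeloid leukemia.",
    "entities": [
        {"mention": "Imatinib", "type": "drug",
         "canonical": "imatinib"},
        {"mention": "BCR-ABL1", "type": "target",
         "canonical": "BCR-ABL1"},
        {"mention": "chronic myeloid leukemia", "type": "disease",
         "canonical": "chronic myeloid leukemia"},
    ],
}

# Serialized as one JSON line per example, the target output is
# itself structured and can be validated by simply parsing it back.
line = json.dumps(example)
parsed = json.loads(line)
```

Because the model's target output is valid JSON, downstream code can check well-formedness automatically rather than scraping free text.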
We are now using this fine-tuned model to improve the quality and accuracy of the data in our ERGO platform, producing a more complete and accurate graph representation of biology that deepens our understanding of disease.
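Conceptually, structured extractions fold into a graph as nodes and labeled edges. The minimal sketch below uses an in-memory adjacency map with invented example triples; a production platform would use a proper graph store, and these relations are not ERGO data.

```python
from collections import defaultdict

# Invented example triples standing in for model extractions;
# (subject, relation, object) is one common way to represent edges.
triples = [
    ("imatinib", "inhibits", "BCR-ABL1"),
    ("BCR-ABL1", "drives", "chronic myeloid leukemia"),
]

# Build a simple adjacency map: node -> list of (relation, neighbor).
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

# A two-hop walk (drug -> target -> disease) shows why a graph view
# helps: it connects facts that never co-occur in a single sentence.
diseases = [obj
            for _, target in graph["imatinib"]
            for rel, obj in graph[target]
            if rel == "drives"]
```

The payoff of the graph view is exactly this kind of traversal: linking a drug to a disease through an intermediate target.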