Anton Yuryev, Biology Director
Ted Slater, Senior Director Product Management
We describe the Elsevier Biology knowledge graph, the biggest bioinformatics knowledge graph in the world, constructed from data that Elsevier NLP extracts from biomedical literature. The knowledge graph connects more than 1.5 million biomedical concepts, entities, and molecules with more than 14 million edges, supported by more than 60 million semantic triples derived from both abstracts and full-text articles. Users can access the Elsevier Biology knowledge graph through the Pathway Studio interface, or download any subset of the data via the API as a flat file in a variety of formats, directly into a Python NetworkX MultiDiGraph representation, or into a Neo4j graph database.
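The relationship between edges and their supporting triples can be sketched as a small directed multigraph. This is a minimal illustration in plain Python, not the actual data model or API: the entity names, relation types, and reference identifiers below are hypothetical, and the nested-dictionary layout is only an assumption about how parallel typed edges between the same pair of nodes might be held.

```python
from collections import defaultdict

# Hypothetical extracted triples: (subject, relation, object, reference id).
# Names and ids are illustrative only, not records from the knowledge graph.
triples = [
    ("gene_A", "Regulation", "protein_B", "doc1:s4"),
    ("gene_A", "Regulation", "protein_B", "doc2:s9"),
    ("protein_B", "Binding", "gene_A", "doc3:s1"),
    ("gene_A", "Binding", "protein_B", "doc4:s2"),
]

# edges[(subject, object)][relation] -> list of supporting references.
# Several relations may connect the same ordered pair (a directed multigraph),
# and each edge may be supported by several triples.
edges = defaultdict(lambda: defaultdict(list))
for subj, rel, obj, ref in triples:
    edges[(subj, obj)][rel].append(ref)


def edge_count(edge_map):
    """Count distinct (subject, relation, object) edges."""
    return sum(len(relations) for relations in edge_map.values())


def support(edge_map, subj, rel, obj):
    """Number of triples supporting one directed, typed edge."""
    return len(edge_map.get((subj, obj), {}).get(rel, []))


print(edge_count(edges))                                     # distinct edges
print(support(edges, "gene_A", "Regulation", "protein_B"))   # supporting triples
```

Here four triples collapse into three edges, which mirrors on a small scale why the triple count exceeds the edge count.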
We will show how we augment the NLP data with data from other publicly available databases and ontologies to develop a collection of manually curated biochemical pathways and models for diseases and biological processes, and for omics data analysis.
Next, we will demonstrate how the knowledge graph can be applied to biomedical research through several recent use cases: supporting lipidomics research at MD Anderson Cancer Center, finding new treatments for pediatric brain cancer patients at children's hospitals in Philadelphia and Zurich, and finding new compounds for the skin-care cosmetics industry.
We will also describe new work resulting in EpiMap, a knowledge base built upon the Elsevier Biology Knowledge Graph that gives oncology researchers information about epigenetic modifications in their broader biomedical context, as extracted by NLP from biomedical literature. We will show how we’re applying machine learning to perform link prediction over the graph, and we’ll describe our current and future roadmap for FAIR data in biomedical literature and knowledge graphs.
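To make the idea of link prediction concrete, the sketch below ranks unconnected node pairs by the Jaccard similarity of their neighborhoods. This is a simple heuristic baseline chosen for illustration, not the machine learning method referred to above, which the abstract does not specify. The graph, node names, and function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical undirected adjacency sets; illustrative only.
adjacency = {
    "drug_X": {"protein_A", "protein_B"},
    "disease_Y": {"protein_A", "protein_B", "protein_C"},
    "protein_A": {"drug_X", "disease_Y"},
    "protein_B": {"drug_X", "disease_Y"},
    "protein_C": {"disease_Y"},
}


def jaccard(adj, u, v):
    """Shared neighbors divided by all neighbors of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def rank_missing_links(adj):
    """Score every pair that is not yet connected, best candidates first."""
    candidates = [
        (u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]
    ]
    return sorted(
        ((jaccard(adj, u, v), u, v) for u, v in candidates), reverse=True
    )


for score, u, v in rank_missing_links(adjacency):
    print(f"{u} - {v}: {score:.2f}")
```

In this toy graph the pair drug_X and disease_Y scores highly because the two nodes share most of their neighbors, which is the kind of indirect evidence a link-prediction model tries to exploit.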
Predictive retrosynthesis is an area of intense research that aims to help organic chemists discover new or improved processes for synthesizing organic molecules of interest, predict optimal reaction conditions, or predict potential “forward” reactions (including reaction products and conditions) from a starting organic molecule.
Two main approaches to this problem have emerged. The first is a rule-based approach, which uses known organic transformations (and their chemical environment) as a rule set to determine viable synthetic pathways. The second uses machine learning algorithms that are trained and refined on sets of diverse chemical reaction types.
For either approach, a limiting factor in improving the predictive methodology is the quality of the available reaction data. Aspects that make a reaction record high quality include atom mapping between starting materials and products; reaction yield, time, temperature, pressure, pH, and other vital reaction conditions (e.g., use of microwaves, irradiation methods such as UV lamps, mechanical processes, etc.); all reagents used; reaction categorization by either reaction type or name reaction; and a reported experimental procedure.
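The quality aspects listed above can be expressed as a simple completeness check on a reaction record. The sketch below is an assumption-laden illustration: the field names are hypothetical and do not reflect the Reaxys schema, only a subset of the listed conditions is modeled, and the example reaction and its yield are invented for demonstration.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ReactionRecord:
    """Hypothetical record layout; not the actual Reaxys data structure."""
    atom_mapped_smiles: Optional[str] = None
    yield_percent: Optional[float] = None
    time_hours: Optional[float] = None
    temperature_celsius: Optional[float] = None
    reagents: List[str] = field(default_factory=list)
    reaction_class: Optional[str] = None
    procedure_text: Optional[str] = None


def missing_fields(record):
    """Names of quality-relevant fields that are empty or absent."""
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) in (None, [], "")
    ]


def completeness(record):
    """Fraction of quality-relevant fields that are filled in."""
    total = len(fields(record))
    return (total - len(missing_fields(record))) / total


# Illustrative record: an atom-mapped esterification with an invented yield.
record = ReactionRecord(
    atom_mapped_smiles=(
        "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6]"
        ">>[CH3:1][C:2](=[O:3])[O:6][CH3:5]"
    ),
    yield_percent=85.0,
)
print(missing_fields(record))
print(round(completeness(record), 2))
```

A filter of this kind could be used to select only sufficiently complete records when assembling a training set, with the threshold chosen per application.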
In this overview we will describe Elsevier’s Reaxys reactions database, which contains approximately 57 million reactions spanning more than 140 years of reported organic and inorganic reactions from the scientific and patent literature. Examples of data structure and quality will be presented.
This reaction dataset is currently being employed by both academic and corporate research teams to create training sets that are used in predictive retrosynthesis models.