Automatic Information Extraction from Materials Scholarly Literature

Project Description

Materials science researchers report on material structures, synthesis methods, and other experimental data in scholarly literature. This knowledge can play a critical role in data-driven materials discovery. Unfortunately, it is significantly underutilized because it remains buried in text, which is unstructured and not machine-understandable. The challenge is exacerbated by the fact that human researchers cannot feasibly read every article in their fields: there are millions of publications, and the number is still growing exponentially. In this project, students will work with researchers in Drexel University’s Metadata Research Center, connected with the NSF/ID4 (Institute for Data Driven Dynamical Design) project. The focus will be on investigating the use of natural language processing techniques to extract key knowledge entities and their relationships from unstructured text. We seek to develop robust deep learning models that enable automatic knowledge extraction and, ultimately, the construction of knowledge graphs from the scholarly corpus.

Research Goals

  • Pre-train language models for downstream NLP tasks in materials science
  • Develop different deep learning models to improve extraction performance
  • Construct solid external knowledge sources (e.g., taxonomies, ontologies) for future research

Learning Goals

  • Gain knowledge of deep learning frameworks such as PyTorch
  • Learn how to generate language representations as features for deep learning models
  • Obtain a better understanding of the complete workflow of information extraction (named entity recognition and relation extraction)
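
To make the information extraction workflow concrete, the following is a minimal sketch of its three stages: named entity recognition, relation extraction, and knowledge graph construction. It is an illustration only, not the project's method. A small hand-written lexicon and a trigger-word rule stand in for the deep learning models, and every name in it (LEXICON, find_entities, extract_triples, build_graph, the "synthesized_by" relation) is a hypothetical choice made for this example.

```python
import re
from typing import NamedTuple

# Hypothetical toy lexicon; a real system would use a trained NER model instead.
LEXICON = {
    "MATERIAL": ["TiO2", "ZnO", "graphene"],
    "METHOD": ["sol-gel", "chemical vapor deposition"],
}


class Entity(NamedTuple):
    text: str
    label: str
    start: int


def find_entities(sentence):
    """Stage 1 (NER): tag lexicon terms found in the sentence."""
    entities = []
    for label, terms in LEXICON.items():
        for term in terms:
            for match in re.finditer(re.escape(term), sentence):
                entities.append(Entity(match.group(), label, match.start()))
    return sorted(entities, key=lambda e: e.start)


def extract_triples(sentence):
    """Stage 2 (relation extraction): link materials to methods.

    Assumption: the trigger word "synthesized" signals the relation.
    A learned relation classifier would replace this rule.
    """
    if "synthesized" not in sentence:
        return []
    entities = find_entities(sentence)
    materials = [e for e in entities if e.label == "MATERIAL"]
    methods = [e for e in entities if e.label == "METHOD"]
    return [(m.text, "synthesized_by", p.text) for m in materials for p in methods]


def build_graph(triples):
    """Stage 3: store (subject, relation, object) triples as an adjacency map."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, []).append((relation, obj))
    return graph


sentence = "TiO2 was synthesized by the sol-gel method."
triples = extract_triples(sentence)
graph = build_graph(triples)
print(triples)
print(graph)
```

In this sketch each stage consumes the output of the previous one, which is the same shape a model-based pipeline would have; only the internals of find_entities and extract_triples would change when the rules are replaced by trained models.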

Groups Conducting Research