Using natural language processing techniques to extract information on the properties and functionalities of energetic materials from large text corpora

Daniel C. Elton 1 , Dhruv Turakhia 1 , Nischal Reddy 1 , Zois Boukouvalas 1 , Ruth M. Doherty 2 , Mark D. Fuge 1 , Peter W. Chung 1

1 University of Maryland, College Park, College Park, USA
2 Energetics Technology Center, Indian Head, USA

Abstract. The number of scientific journal articles and reports published each year about energetic materials is growing exponentially, so extracting relevant information and actionable insights from the latest research is becoming a considerable challenge. In this work we explore how techniques from natural language processing (NLP) and machine learning can be used to automatically extract chemical insights from large collections of documents. We first describe how to download and process documents from a variety of sources: journal articles, conference proceedings (including NTREM), the US Patent & Trademark Office, and the Defense Technical Information Archive on archive.org. We then present a custom NLP pipeline that uses open source NLP tools to identify the names of chemical compounds and relates them to function words (“underwater”, “rocket”, “pyrotechnic”) and property words (“elastomer”, “non-toxic”). Relationships are obtained by doing computations with word vectors. After explaining how word embeddings work, we compare the utility of two popular word embeddings, word2vec and GloVe. We show that word embeddings capture latent information about energetic materials, so that related materials appear close together in the word embedding space. We also present analytics on the common compounds and topics in NTREM and other proceedings, and on how they have changed over time.

Keywords: natural language processing; machine learning; formulations; energetic materials
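
As a rough illustration of what "doing computations with word vectors" can mean, the sketch below ranks words by cosine similarity to a query word. It is not the pipeline described in the abstract: the three-dimensional vectors are invented toy values (embeddings trained with word2vec or GloVe typically have far more dimensions), and the names `toy_vectors`, `cosine_similarity`, and `rank_by_similarity` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# They are NOT taken from any trained word2vec or GloVe model.
toy_vectors = {
    "RDX": [0.9, 0.1, 0.2],
    "HMX": [0.8, 0.2, 0.1],
    "rocket": [0.1, 0.9, 0.3],
    "ammonium_perchlorate": [0.2, 0.8, 0.4],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors):
    """Rank every other word by cosine similarity to `query`, highest first."""
    q = vectors[query]
    scores = [
        (word, cosine_similarity(q, vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Print the neighbours of a function word in the toy space.
    for word, score in rank_by_similarity("rocket", toy_vectors):
        print(f"{word}: {score:.3f}")
```

With these toy values, the nearest neighbour of a compound name is the other compound placed near it, and the nearest neighbour of the function word "rocket" is the compound whose vector was deliberately placed close to it. In a real embedding, that closeness would have to come from how the words co-occur in the corpus rather than from hand-chosen numbers.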


ID: 102, Contact: Peter W. Chung, pchung15@umd.edu, NTREM 2019