Using natural language processing methods developed over the past 30 years, computer programs are starting to pull out information from plain text, including research articles in scientific disciplines such as genomics and biomedicine. This capability can enable scientists to rapidly identify publications relevant to their own research as well as make scientific discoveries by scouring hundreds of research papers for associations and connections (such as between drugs and side effects, or genes and disease pathways) that humans reading each paper individually might not notice. Up to now, the use of NLP technologies has required considerable skill in the field; however, recent development of environments for constructing customizable NLP applications has opened the door for scientists to exploit NLP technologies for discovering and mining information from massive bodies of scientific publications such as those found in PubMed, PLoS, Web of Science, etc.
The Language Applications (LAPPS) Grid (http://www.lappsgrid.org) provides an infrastructure for rapid development of natural language processing (NLP) applications, with the Galaxy platform as its workflow engine. The LAPPS Grid has integrated a wide range of NLP tools and resources into Galaxy, including popular public tools such as StanfordNLP, OpenNLP, NLTK, LingPipe, etc., and provided for using them interoperably in a “plug-and-play” workflow environment. The LAPPS Grid is an ideal platform to support mining scientific literature. The Galaxy interface and the interoperability among our tools together provide an intuitive and easy-to-use platform. Users can experiment with and exploit NLP tools and resources without the need to determine which are suited to a particular task, and without the need for significant computer expertise. In addition, because Galaxy already includes powerful analytic and visualization software for genomics research, information extracted from texts using the LAPPS Grid may be fed to these tools without leaving the platform. Finally, the LAPPS Grid is open source and free for use by anyone, and can be run from the web, on a user’s laptop or desktop, in the cloud, or as a self-contained Docker image when it is necessary to protect sensitive data.
This tutorial will first introduce the student to information retrieval and extraction methods for discovering and mining scientific literature, and then provide an overview of the LAPPS Grid, as a background for two hands-on sessions.
Students will have access to several major scientific publication databases (PubMed, PLoS, Web of Science, etc.) stored in the cloud on the LAPPS Grid’s JetStream instance, and will learn how to enable queries with Apache Solr for data discovery and mining. They will learn how to develop out-of-the-box workflows for information and relation extraction and adapt them to data for specific disciplines, for example by providing means to rapidly bootstrap custom dictionaries and gazetteers.
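As a rough illustration of the kind of Solr query involved in data discovery, the sketch below builds a request URL for Solr's standard /select handler. It is not taken from the LAPPS Grid itself: the host, the core name "pubmed", the field name "abstract", and the helper build_solr_query_url are all assumptions made for the example, and the actual index layout used in the tutorial may differ.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query_url(base_url, core, terms, field="abstract", rows=10):
    """Build a URL for Solr's standard /select request handler.

    The core and field names are hypothetical; substitute the ones
    defined in the schema of the index being queried.
    """
    # Require every term to occur in the chosen field, each as a quoted phrase.
    query = " AND ".join(f'{field}:"{term}"' for term in terms)
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


# Hypothetical local Solr server and core, looking for a drug/side-effect pair.
url = build_solr_query_url(
    "http://localhost:8983", "pubmed", ["warfarin", "bleeding"], rows=5
)
print(url)

# Decode the query string again to inspect the Solr query that was encoded.
print(parse_qs(urlsplit(url).query)["q"][0])
```

The resulting URL can be opened in a browser or fetched with any HTTP client. Keeping the query construction separate from the request makes it easy to inspect the query before sending it.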
Nancy Ide is Professor of Computer Science at Vassar College in Poughkeepsie, New York, USA. She has been in the field of computational linguistics for over 30 years and made significant contributions to research in word sense disambiguation, computational lexicography, discourse analysis, and the use of semantic web technologies for language data. She is founder of the Text Encoding Initiative (TEI), the first major standard for representing electronic language data, and later developed the XML Corpus Encoding Standard (XCES). More recently, she co-developed the ISO LAF/GrAF representation format for linguistically annotated data. She has also developed major corpora for American English, including the Open American National Corpus (OANC) and the Manually Annotated Sub-Corpus (MASC), and has been a pioneer in efforts to foster tool and resource interoperability and open data and resources. Her most recent major project is the US National Science Foundation-funded Language Applications (LAPPS) Grid, a platform including fully interoperable NLP tools from many projects that are commonly used in NLP and providing a workflow engine for rapid development of customized NLP applications. Professor Ide is Co-Editor-in-Chief of the journal Language Resources and Evaluation and Editor of the Springer book series Text, Speech, and Language Technology.