This tutorial introduces participants to Hadoop and Spark, popular open-source frameworks for reliable and scalable distributed processing of big data. MapReduce, the programming paradigm at the core of distributed batch processing on Hadoop, is described and explained through some frequent NLP-specific tasks. Participants then take a tour of the Hadoop ecosystem, with short stops at HDFS, the MapReduce API, and relational-style data processing tools such as Apache Pig and Apache Hive. The final stop is Apache Spark, a general-purpose compute engine that can run on top of Hadoop. With a clear picture of these distributed computing frameworks in place, the presentation concludes with several project ideas and approaches for processing repositories of biomedical research articles.
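As a rough illustration of the MapReduce paradigm applied to a common NLP task, the sketch below counts word occurrences across a few documents. It is a single-process, in-memory approximation written in plain Python, not the Hadoop MapReduce API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the whitespace tokenization are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, imitating what a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: sum all partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for article text.
    print(word_count(["gene expression", "Gene therapy"]))
```

In a real cluster the map and reduce functions would run in parallel on different nodes and the grouping step would be handled by the framework; here everything runs sequentially so that the data flow is easy to follow.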
Participants will solve specific tasks involving textual data from the biomedical scientific literature, using the frameworks described in the tutorial (running in a virtual machine on their personal computers or on AWS EMR).
Mihaela Breabăn obtained her PhD in Computer Science in 2011 from the Alexandru Ioan Cuza University of Iasi. Her doctoral and postdoctoral research was in the field of Data Mining, with a focus on unsupervised analysis. She is currently an Associate Professor at the Faculty of Computer Science of the Alexandru Ioan Cuza University, where she teaches Databases, Data Mining, Big Data Analytics and Natural Computing.