Big data meets high performance computing: Genomics and natural language processing as case studies

Author:
  1. Abuín Mosquera, José Manuel

Supervised by:
  1. Tomás F. Pena (Director)
  2. Juan Carlos Pichel Campos (Co-director)

Defence university: Universidade de Santiago de Compostela

Defence date: 29 November 2017

Committee:
  1. Ramón Doallo (Chair)
  2. Luis Felipe Romero Gómez (Secretary)
  3. Tandy Warnow (Committee member)

Type: Thesis

Teseo: 518853

Abstract

In recent years, Big Data technologies have experienced a major boom both in industry and in research across many areas. This is due in part to their ability to process large volumes of data in parallel in a way that is simple, efficient and completely transparent to the user. In other words, Big Data technologies have brought classic distributed-memory parallel programming to a much wider audience with relative ease of use, and that is part of their success (a minimal illustration is sketched after this abstract). On the other hand, in the field of High Performance Computing (HPC), there is an inter-agency race to enter the Exascale era, where Exascale refers to computer systems capable of at least one exaflop, that is, 10^18 floating-point operations per second. According to several experts in the area, HPC must converge with certain Big Data ideas in order to reach Exascale.

The main objective of this thesis is to clarify a path towards the convergence of these two worlds. To that end, a computational study of the application of Big Data and HPC technologies to two real-world scientific problems is carried out: sequence alignment in genomics and natural language processing. A priori, both problems fit well in either world, and the results obtained in the works presented in this thesis help to clarify, to some extent, this road to convergence between HPC and Big Data.

With respect to sequence alignment, the work covers the alignment of single-end and paired-end short reads, as well as multiple sequence alignment. Within many genomics and bioinformatics pipelines, alignment is one of the most expensive tasks in terms of computing time, while at the same time being one of the most basic and fundamental steps. Tools that perform this alignment in an efficient and scalable way are therefore an indispensable requirement. In addition, these problems consume and generate large amounts of data, which makes them ideal candidates for both HPC and Big Data (see the alignment sketch below).

With respect to natural language processing (NLP), one of the major problems of these techniques is their high computational cost and their difficulty to scale, which makes them unfeasible for the analysis of large volumes (gigabytes or even terabytes) of documents. The use of High Performance Computing therefore becomes indispensable to significantly reduce computation times, to improve system scalability, and to address problems of an even larger size. In this thesis, parallelization/optimization techniques and Big Data technologies have been applied to one of the NLP modules available in Linguakit, in order to measure the scalability of the system and to verify whether Big Data technologies are really adequate for this task. The module selected for this case study was Named Entity Recognition and Classification (NERC); a parallelization sketch follows below.

In summary, the work presented in this thesis tries to clarify the path to convergence between Big Data and High Performance Computing and, in doing so, to open a way towards Exascale. It also contributes new tools for very important tasks in two current scientific areas. These tools, which work in an efficient and scalable way, represent a significant improvement for researchers in the aforementioned areas, as they can now perform their daily work faster and more efficiently.
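To make the claim about the accessibility of distributed-memory parallelism concrete, the following is a minimal sketch. The abstract does not name a specific framework; Apache Spark is assumed here as a representative Big Data platform, and the word-count task and HDFS paths are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Minimal Spark job: the framework partitions the input and schedules the
// work across the cluster; the user writes no explicit message passing.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val counts = spark.sparkContext
      .textFile("hdfs:///data/corpus.txt")   // illustrative input path
      .flatMap(_.split("\\s+"))              // tokenise each line
      .map(word => (word, 1))                // emit (word, 1) pairs
      .reduceByKey(_ + _)                    // aggregate counts in parallel
    counts.saveAsTextFile("hdfs:///data/counts")
    spark.stop()
  }
}
```

In an equivalent MPI program, the programmer would have to handle data distribution, communication and failure recovery explicitly; here the framework takes care of all three, which is the ease-of-use argument the abstract makes.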
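For the alignment case study, the sketch below shows one plausible pattern for distributing a short-read alignment workload on a Big Data framework: since reads can be aligned independently, pre-split FASTQ chunks are mapped across workers and an external aligner is invoked per chunk. The use of bwa (assumed installed on every worker node), the chunking scheme and all paths are assumptions for illustration, not the thesis's actual implementation.

```scala
import org.apache.spark.sql.SparkSession
import scala.sys.process._

// Hypothetical sketch: align pre-split FASTQ chunks in parallel by invoking
// a command-line aligner ("bwa", assumed available on every worker node).
object DistributedAlignment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DistributedAlignment").getOrCreate()

    // Each RDD element is the path of one FASTQ chunk on shared storage.
    val chunks = spark.sparkContext.parallelize(Seq(
      "/shared/chunks/part-000.fastq",
      "/shared/chunks/part-001.fastq"
    ))

    // Reads in different chunks have no data dependencies, so each chunk
    // is aligned independently on whichever executor receives it.
    val samFiles = chunks.map { chunk =>
      val out = chunk.replace(".fastq", ".sam")
      (s"bwa mem /shared/ref/genome.fa $chunk" #> new java.io.File(out)).!
      out
    }
    samFiles.collect().foreach(println)
    spark.stop()
  }
}
```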
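For the NLP case study, the abstract reports that Linguakit's NERC module was parallelized with Big Data techniques in order to measure scalability. The sketch below shows that general pattern only; the `annotate` function is a trivial placeholder (it tags capitalised tokens), not Linguakit's NERC logic, and the corpus path is illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: run a named-entity annotator over a large document
// collection in parallel. `annotate` stands in for a real NERC module.
object ParallelNerc {
  // Placeholder annotator: tag every capitalised token as an entity.
  def annotate(line: String): Seq[(String, String)] =
    line.split("\\s+")
      .filter(_.headOption.exists(_.isUpper))
      .map(token => (token, "ENT"))
      .toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParallelNerc").getOrCreate()
    val lines = spark.sparkContext.textFile("hdfs:///corpus/*.txt")
    // Lines are annotated independently, so throughput scales with the
    // number of executors, addressing the cost problem noted above.
    val entities = lines.flatMap(annotate)
    entities.take(20).foreach(println)
    spark.stop()
  }
}
```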