Performance optimisation of biological pathway data storage, retrieval, analysis and its interactive visualisation

Fabregat Mundo, Antonio

Supervised by:

Pablo Marín García Director
Vicente Arnau Llombart Co-director

Defence university: Universitat de València

Defence date: 27 July 2018

Committee:

Amparo Alonso Betanzos Chair
Pedro Morillo Tena Secretary
Alfonso Jaramillo Rosales Committee member

Type: Thesis

Teseo: 566433

Abstract

The aim of this research was to optimise the performance of the storage, retrieval, analysis and interactive visualisation of biomolecular pathway data. This was achieved by adopting new technologies and a variety of highly optimised data structures, algorithms and strategies across the different layers of the software. The first challenge was the creation of a long-lasting, large-scale web application for pathway navigation: the Pathway Browser. This tool had to aggregate different modules so that users could browse pathway content and use their own data to perform pathway analysis. Another challenge was the development of a high-performance pathway analysis tool able to analyse genome-wide datasets within seconds. Once developed, it was integrated into the Pathway Browser, allowing interactive exploration and analysis of high-throughput data. The Pathways Overview layout and widget were created to represent the complex parent-child relationships present in the hierarchical organisation of pathways. This module provides a means to overlay analysis results so that users can easily distinguish the most significant areas of biology represented in their data. Although an existing force-directed layout algorithm was initially used for the graphical representation, it did not achieve the expected results and a custom radial layout algorithm was developed instead. A new version of the pathway Diagram Viewer was engineered to load and render 97% of the target diagrams in less than 1 second. Combining a multi-layer HTML5 Canvas strategy with a space partitioning data structure minimised CPU workload, enabling the introduction of new features that further enhance the user experience.
On the server side, the work focused on the adoption of a graph database (Neo4j) and the creation of the new Content Service (REST API) that provides access to these data. The Neo4j graph database and its query language, Cypher, enabled efficient access to the complex pathway data model, facilitating easy traversal and knowledge discovery. The adoption of this technology greatly improved query efficiency, reducing the average query time by 93%.
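
As an illustration of the space partitioning idea mentioned for the Diagram Viewer, the following is a minimal sketch and not the thesis implementation. The abstract does not name the data structure used; a quadtree is assumed here, and the names Box, QuadTree and query_point are hypothetical. Python is used only for brevity. The sketch shows how a pointer position can be matched against diagram nodes without testing every node on each mouse move.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; `name` is a hypothetical node identifier."""
    x: float
    y: float
    w: float
    h: float
    name: str = ""

    def contains(self, px, py):
        # Edges are treated as inside the box.
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h

    def intersects(self, other):
        return not (other.x > self.x + self.w or other.x + other.w < self.x
                    or other.y > self.y + self.h or other.y + other.h < self.y)


class QuadTree:
    """Assumed structure: each region splits into four quadrants when full."""

    def __init__(self, bounds, capacity=4, depth=0, max_depth=8):
        self.bounds = bounds
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth  # guards against endless splitting
        self.items = []
        self.children = []

    def insert(self, box):
        if not self.bounds.intersects(box):
            return False
        if not self.children:
            self.items.append(box)
            if len(self.items) > self.capacity and self.depth < self.max_depth:
                self._split()
            return True
        # A box spanning several quadrants is stored in each of them.
        for child in self.children:
            child.insert(box)
        return True

    def _split(self):
        b = self.bounds
        hw, hh = b.w / 2, b.h / 2
        self.children = [
            QuadTree(Box(b.x + dx, b.y + dy, hw, hh),
                     self.capacity, self.depth + 1, self.max_depth)
            for dx in (0, hw) for dy in (0, hh)
        ]
        items, self.items = self.items, []
        for item in items:
            for child in self.children:
                child.insert(item)

    def query_point(self, px, py):
        """Return the boxes under a point, visiting only matching quadrants."""
        if not self.bounds.contains(px, py):
            return []
        if not self.children:
            return [b for b in self.items if b.contains(px, py)]
        hits = []
        for child in self.children:
            for b in child.query_point(px, py):
                if b not in hits:  # a box may live in several quadrants
                    hits.append(b)
        return hits
```

In a canvas-based viewer, a lookup of this kind would typically be called from the mouse-move handler, so that only the element under the pointer is redrawn on a dedicated layer rather than the whole diagram.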