Fault tolerance configuration and management for HPC applications using RADIC architecture

Author:
  1. Villamayor Leguizamón, Jorge Luis

Supervised by:
  1. Dolores Isabel Rexachs del Rosario (Supervisor)

Defending university: Universitat Autònoma de Barcelona

Defense date: 30 November 2018

Examination committee:
  1. Juan Touriño (Chair)
  2. Andrés Gómez Tato (Secretary)
  3. Jan Kwiatkowski (Member)

Type: Thesis

Teseo: 575447 · DIALNET · TDX (open access)

Abstract

High Performance Computing (HPC) systems keep growing exponentially in the number and density of their components in order to deliver the demanded computational power. At the same time, cloud computing is becoming popular as key features such as scalability, pay-per-use and availability continue to evolve, and it is becoming a competitive platform for running parallel HPC applications thanks to the increasing performance of virtualized, highly available instances. However, increasing the number of components to build larger systems tends to increase the frequency of failures in both clusters and cloud environments. Nowadays, HPC systems have a failure rate of around 1000 failures per year, i.e., approximately one failure every 8 hours. Most parallel distributed applications are built on top of the Message Passing Interface (MPI). MPI implementations follow a default fail-stop semantic, which aborts the execution when a host fails in a cluster. In that case, the application owner has to restart the execution, which increases the wall-clock time and also the cost, since computing resources must be acquired for longer periods of time. Fault Tolerance (FT) techniques therefore need to be applied to MPI parallel executions in both cluster and cloud environments; with FT techniques, high availability is ensured for parallel applications. Some FT solutions require administrator privileges to be installed on the cluster nodes, and when failures occur, human intervention is required to recover the application. A solution that minimizes user and administrator intervention is preferable.

A contribution of this thesis is a Fault Tolerance Manager (FTM) for coordinated checkpointing, which provides application users with automatic recovery from failures when computing nodes are lost. It takes advantage of node-local storage to save checkpoints and distributes copies of them among all the compute nodes, avoiding the bottleneck of a central stable storage. We also extend the FTM to support uncoordinated and semi-coordinated rollback-recovery protocols. In this contribution, FTM is implemented at the application layer. Furthermore, a dynamic resource controller is added to the FTM; it monitors the resources used for FT protection and takes actions to maintain an acceptable level of protection.

Another contribution addresses the configuration of the FT protection and recovery tasks, for which two models are introduced. The first, the First Protection Point (FPP) model, determines the starting point at which FT protection should be introduced, reducing the total execution time in the presence of failures by removing unnecessary checkpoints. The second model improves the FT resource configuration for the recovery task.

Regarding cloud environments, we propose Resilience as a Service (RaaS), a fault-tolerant framework for HPC applications which uses FTM. RaaS provides clouds with a highly available, distributed and scalable fault-tolerance service. It redesigns traditional HPC protection and recovery mechanisms to natively leverage cloud capabilities and the multiple alternatives they offer for implementing FT tasks.

In summary, this thesis contributes a multi-platform resilience manager suitable for traditional bare-metal clusters and for public and private clouds. The presented solution provides FT in an automatic, distributed and transparent manner at the application and user levels, according to the requirements of users, applications and the runtime. A minimal, illustrative sketch of the node-local checkpoint replication described above is given below.
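The abstract gives no implementation details, so the following is only a minimal sketch, under stated assumptions, of the idea of saving checkpoints to node-local storage and distributing copies among the compute nodes. The use of Python with mpi4py, the function name protect, the ring-style peer selection and the /tmp/ftm_ckpt path are all illustrative assumptions, not the thesis implementation.

    # Illustrative sketch only: each MPI rank writes its checkpoint to
    # node-local storage and exchanges a copy with a peer rank, so no
    # central stable storage is needed and a failed node's state can be
    # restored from the peer that holds its copy.
    import os
    import pickle
    from mpi4py import MPI   # assumes mpi4py is installed

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    LOCAL_DIR = "/tmp/ftm_ckpt"            # hypothetical node-local directory
    os.makedirs(LOCAL_DIR, exist_ok=True)

    def protect(step, app_state):
        """Save a checkpoint locally, then replicate it to a peer rank."""
        blob = pickle.dumps(app_state)
        with open(f"{LOCAL_DIR}/ckpt_{rank}_{step}.bin", "wb") as f:
            f.write(blob)                             # node-local checkpoint
        peer = (rank + 1) % size                      # assumed ring-style peer choice
        src, s, copy = comm.sendrecv((rank, step, blob), dest=peer, sendtag=7,
                                     source=(rank - 1) % size, recvtag=7)
        with open(f"{LOCAL_DIR}/copy_{src}_{s}.bin", "wb") as f:
            f.write(copy)                             # hold a peer's copy for recovery

    # Example use inside the application's iteration loop:
    #     if step % CKPT_INTERVAL == 0:
    #         protect(step, state)

In a real deployment the replication target would be chosen so that the copy lives on a different physical node; that placement is what allows recovery without a central stable storage bottleneck.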
The solution also gives users critical FT information, allowing them to trade off cost and protection while keeping the Mean Time To Repair (MTTR) within acceptable ranges; a rough, illustrative view of this trade-off is sketched after this paragraph. Several experimental environments, including bare-metal clusters and public and private clouds running different parallel applications, were used for the experimental validation. The experiments verify the functionality and the improvements of the contributions, and they also show that the MTTR is bounded within acceptable ranges.
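As a purely illustrative view of that cost/protection trade-off, the sketch below computes the kind of figures a user could weigh: the protection overhead paid during failure-free execution versus the expected repair time when a node fails. The numbers (checkpoint cost, restart cost) and the simple half-interval model of lost work are assumptions made for the example, not measurements or models taken from the thesis.

    # Back-of-the-envelope trade-off with assumed numbers (not thesis results):
    # more frequent checkpoints cost more overhead, but bound the work lost,
    # and therefore the repair time, when a node fails.
    ckpt_cost = 30.0        # seconds to write and replicate one checkpoint (assumed)
    restart_cost = 60.0     # seconds to restore a checkpoint on a spare node (assumed)
    mtbf_hours = 8.0        # roughly one failure every 8 hours, as cited in the abstract

    for interval_min in (5, 15, 60):
        interval = interval_min * 60.0
        overhead = ckpt_cost / interval          # fraction of run time spent on protection
        lost_work = interval / 2.0               # expected recomputation after a failure
        mttr = restart_cost + lost_work          # simple repair-time estimate (seconds)
        repair_per_hour = mttr / mtbf_hours      # expected seconds of repair per hour of run
        print(f"checkpoint every {interval_min:>2} min: overhead {overhead:6.2%}, "
              f"expected MTTR ~{mttr / 60:4.1f} min, "
              f"~{repair_per_hour:4.1f} s of repair per hour of execution")

More frequent checkpoints raise the failure-free overhead but shrink the expected repair time; this is the kind of trade-off between cost and protection that the abstract refers to.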