Efficient communication management in cloud environments

  1. Espínola Brítez, Laura María
Supervised by:
  1. Daniel Franco Puntes Director

Defence university: Universitat Autònoma de Barcelona

Defence date: 30 November 2018

Committee:
  1. María Inmaculada García Fernández Chair
  2. Juan Touriño Secretary
  3. Jan Kwiatkowski Committee member

Type: Thesis

Teseo: 575448

Abstract

Scientific applications with High Performance Computing (HPC) requirements are migrating to cloud environments because of the facilities these environments offer. Cloud computing plays a major role given the compute power it provides while avoiding the cost of maintaining a physical cluster. With features such as elasticity and pay-per-use, it helps to reduce researchers' procurement risk. Most HPC applications are implemented using the Message Passing Interface (MPI), which is a key component in common and distributed computing tasks. However, the major drawback of running this kind of application in a cloud environment is the loss of execution performance, caused by the virtualized network, which affects communication latency and bandwidth. Using a cloud environment for scientific applications of this kind therefore requires low-latency communication mechanisms. Network topology details are not available to users in virtualized environments, which makes it difficult to apply the existing optimizations based on network topology information that are used in bare-metal cluster environments. In some cases, cloud providers can migrate virtual machines, which impacts the efficiency of routing optimizations and placement algorithms. Moreover, if resource isolation is not guaranteed, resource sharing can lead to variable bandwidth and unstable performance. This thesis presents Dynamic MPI Communication Balance and Management (DMCBM), which addresses the communication challenge of HPC applications in the cloud. DMCBM is implemented as middleware between the user's application and the execution environment. It improves message communication latency in cloud-based systems and helps users to detect mapping and parallel implementation issues. Our solution dynamically rebalances communication flows at higher levels of the virtualized HPC stack, e.g. at the MPI communication layer, to remove communication hot-spots and congestion in the underlying layers.
DMCBM abstracts the communication state between application processes based on latency measurements. The middleware characterizes the underlying network topology and analyzes the behavior of parallel applications in the cloud. This makes it possible to detect network congestion and to optimize communications either by selecting alternative communication paths between processes or by leveraging live migration of virtual machines in cloud environments. These options are analyzed in real time and selected according to the type of congestion (link or destination). DMCBM achieves lower application execution times when congestion occurs, obtaining better performance in clouds. Finally, experiments are presented that verify the functionality and improvements of DMCBM with MPI applications in public and private clouds. The experiments were carried out by measuring execution and communication times. The NAS Parallel Benchmarks and NBody, a real application for dynamic particle simulation, are used, obtaining an improvement of up to 10% in execution time and a communication time reduction of about 40% in congestion scenarios.
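The distinction between link and destination congestion mentioned above can be illustrated with a minimal Python sketch. It is not DMCBM's actual algorithm, which the abstract does not detail. The function name `classify_congestion`, the `factor` and `share` thresholds, and the rule itself are assumptions made for illustration: a process pair counts as congested when its measured latency exceeds a multiple of its baseline, and the congestion is attributed to the destination when more than a given share of the senders to that destination are affected.

```python
from collections import defaultdict


def classify_congestion(latency_us, baseline_us, factor=2.0, share=0.5):
    """Label congested (src, dst) pairs as 'link' or 'destination'.

    Hypothetical heuristic, not taken from the thesis:
    - a pair is congested if latency > factor * baseline;
    - it is 'destination' congestion if more than `share` of all
      senders to that destination are congested, otherwise 'link'.
    """
    slow = [pair for pair, value in latency_us.items()
            if value > factor * baseline_us[pair]]

    senders = defaultdict(set)       # dst -> every src that sends to it
    slow_senders = defaultdict(set)  # dst -> srcs with elevated latency
    for src, dst in latency_us:
        senders[dst].add(src)
    for src, dst in slow:
        slow_senders[dst].add(src)

    result = {}
    for src, dst in slow:
        widespread = (len(senders[dst]) > 1 and
                      len(slow_senders[dst]) / len(senders[dst]) > share)
        result[(src, dst)] = "destination" if widespread else "link"
    return result


# Illustrative latencies in microseconds, keyed by (source rank, destination rank).
baseline = {(0, 2): 50.0, (1, 2): 50.0, (3, 2): 50.0, (0, 1): 50.0, (3, 1): 50.0}
measured = {(0, 2): 180.0, (1, 2): 170.0, (3, 2): 60.0, (0, 1): 55.0, (3, 1): 200.0}
report = classify_congestion(measured, baseline)
```

With these made-up values, two of the three senders to rank 2 are slow, so both pairs are labeled as destination congestion, while the single slow pair towards rank 1 is labeled as link congestion. Which remedy (an alternative path or live migration) applies to each type is left out because the abstract does not specify it.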