Evolutionary computation methods for protein structure prediction and for the computational modeling of the protein folding process

Varela, Daniel

Evolutionary computation methods for protein structure prediction and for the computational modeling of the protein folding process

Varela, Daniel

Supervised by:

José Santos Reyes Director

Defence university: Universidade da Coruña

Fecha de defensa: 29 November 2019

Committee:

María Camino Rodríguez Vela Chair
Richard J. Duro Fernández Secretary
Martín Diéguez Lodeiro Committee member

Department:

Computer Science and Information Technologies

Type: Thesis

Teseo: 608015 DIALNET RUC editor

Abstract

This thesis was focused on the use of hybrid evolutionary computation methods for the Protein Structure Prediction (PSP) problem, as well as a first attempt to model protein folding with machine learning, using again evolutionary computation as an optimization method to infer the folding model. Given the increasing sequence/structure gap in protein knowledge, the computational methods of PSP are crucial to tackle the problem. The work developed in this thesis is included in the “ab initio” prediction, which is the most challenging since it relies only on the information of the protein sequence to determine the native structure. In this case of ab initio, the energy landscapes/ search spaces associated with protein conformational representation models are high dimensional and rugged, in atomic models and even in simplified lattice models of protein representation. Therefore, the research has been focused on the use of metaheuristics in order to sample the vast conformational space, metaheuristics which usually are included in the bio-inspired or natural computation field. In this thesis a first combination between the global search of a robust evolutionary algorithm (Differential Evolution - DE) and local search techniques was used for the PSP problem. This hybrid version was initially defined for a detailed lattice model for protein representation (Face Centered Cubic lattice - FCC). The defined local greedy search selects, in each move between two consecutive amino acids, the move that minimizes the energy of the consequent protein conformation. That local search can be used to refine the trial conformations provided by the DE genetic operators as well as the conformations of the genetic population, following a classic Lamarckian combination that can minimize the number of fitness evaluations in the search for optimized solutions (protein conformations). Next, the previous combination was extended to tackle the same problem in an atomic model, using the coarse-grained representation of the Rosetta system, one of the most successful software environments for protein design. The extension of the previous work with the FCC model considers the adaptation of the DE optimization to the new protein representation with dihedral angles. The Rosetta’s fragment insertion technique was used to implement a local search procedure, technique which can locally refine the dihedral angles of a protein conformation. Therefore, with the same elements as in the previous case (FCC lattice model), the defined hybrid version follows the same ideas with a Lamarckian combination between DE and the local search. The results with resolved proteins from the Protein Data Bank show that the hybrid version obtains lower energy conformations in comparison with previous works and with Rosetta ab initio protocol and under the same number of fitness/energy evaluations. Moreover, the same hybrid DE version was used to take into account the information of cryo-electron microscopy (cryo-EM) maps. In this case, the adapted hybrid version allows the refinement of protein conformations using the information of cryo-EM and in order to obtain closer structures to the native one. Nevertheless, in atomic models, a problem can appear when the energy landscape is deceptive. In the Rosetta case, it means that the global minimum in the high dimensional energy landscape does not necessarily correspond to the native structure. We approach the problem with a possibility in evolutionary computation that was not directly considered in this application. The possibility is the use of niching methods in a multimodal energy landscape, methods that can locate the individuals of the population in the most promising areas (niches) of the fitness landscape. Thus, the hybrid DE algorithm was combined with classic evolutionary computation niching methods (crowding, fitness sharing and speciation), with the aim of obtaining protein conformations that correspond to decoys in different niches and with different folds. The developed methods allow obtaining potential solutions at the final generations with a diverse set of folds with different distances (RMSD) from the real native conformation. The last part of the thesis is related to the protein folding process and how to model it from a pure machine learning approach. For the modeling, the folding process was considered as an emergent process, result of the interactions through time between the protein components. Therefore, we relied on classic tools like Cellular Automata (CA) to model such an emergent process. The classic CA rule set was extended since it was implemented with Artificial Neural Networks (ANNs), and the ANNs were automatically obtained by an evolutionary algorithm. The optimized CA/ANNs define the local changes in a protein conformation through time until a final conformation is reached. The proposed methods were implemented with lattice models, using again the detailed FCC model and, next, all the methodology was extended to the coarse-grained atomic representation of Rosetta. Limitations appear in such a modeling, but even with the simplifications and limitations found, the work performed can be considered a first attempt of using machine learning to automatically obtain a model of the folding process.