Information Fusion and Ensembles in Machine Learning

Seijo Pardo, Borja

Supervised by:

Amparo Alonso Betanzos (Supervisor)
Verónica Bolón-Canedo (Co-supervisor)

University of defense: Universidade da Coruña

Date of defense: 17 December 2019

Examination committee:

Gavin Brown (Chair)
Elena Hernández-Pereira (Secretary)
João Gama (Member)

Department:

Computer Science and Information Technologies

Type: Thesis

Teseo: 611951

Abstract

Traditionally, machine learning methods have used a single learning model to solve a particular problem. However, the idea of combining multiple models instead of relying on a single one has its rationale in the old proverb “Two heads are better than one”. The approach constructs a set of hypotheses using several different models, which are then combined to obtain better performance than learning a single hypothesis with a single method. Several studies have shown that these combined models usually obtain better accuracy than individual methods, owing to the diversity of the approaches and the control of the variance: they take advantage of the strengths of the individual methods while overcoming their weak points. These combinations of models are called “committees” or, more recently, “ensembles”. Ensemble learning algorithms have become very popular in the machine learning literature, as they achieve performance that was not possible some years ago, and have thus become a “winning horse” in many applications.

Moreover, in recent years the size of the datasets used in machine learning has grown considerably. Dimensionality reduction has therefore become a must in almost every case. Among these preprocessing methods, feature selection (FS) has become an essential step for many data mining applications: it eliminates irrelevant and redundant information, thus reducing storage requirements and the computational time needed by the machine learning algorithms. Several studies have also demonstrated that feature selection can greatly improve the performance of subsequent classification methods.

One of the main points addressed in this thesis is the application of the ensemble learning idea to the feature selection process, with the aim of introducing diversity and increasing the regularity of the process.
Regularity is the ability of the ensemble approach to obtain acceptable results regardless of the dataset under study and its particular properties. Ensemble approaches also have the added benefit of releasing the user from the task of selecting the most adequate method for each dataset, and thus from the obligation of knowing technical details about the existing algorithms. In this way, more user-friendly FS methods are also emerging. Ensembles for feature selection are a recent proposal, and not many works can be found in the literature.

Several steps need to be addressed when creating an ensemble for FS:

1. Create a set of different feature selectors, each one providing its own output. To create diversity, several methods can be used, such as using different samples of the training dataset, using different feature selection methods, or a combination of both.
2. Aggregate the results obtained by the single models. Several measures can be used in this step, such as majority voting or weighted voting. It is important to choose an adequate aggregation method, one that preserves the diversity of the individual base models while maintaining accuracy.

In this thesis, we have designed several approaches for the first step: (i) a homogeneous approach, that is, using the same feature selection method with different training data and distributing the dataset over several nodes (or several partitions); and (ii) a heterogeneous approach, i.e., using different feature selection methods with the same training data. Regarding the second step, we have also studied different methods for combining the results obtained from the individual methods.
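
As an illustration of the two steps above in the homogeneous setting, the following minimal sketch applies one ranker to disjoint partitions of the training data and then aggregates the resulting rankings. It rests on assumptions that are not taken from the thesis: the ranker (absolute Pearson correlation with the label), the aggregator (mean rank position), the toy data and all function names are hypothetical choices made only for this example.

```python
import random
from statistics import mean

def abs_correlation(xs, ys):
    """Absolute Pearson correlation between one feature column and the labels."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5

def rank_features(X, y):
    """Return feature indices ordered from most to least relevant."""
    n_features = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])

def homogeneous_ensemble(X, y, n_partitions=5, seed=0):
    """Step 1: same ranker on disjoint partitions. Step 2: mean-rank aggregation."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    partitions = [indices[k::n_partitions] for k in range(n_partitions)]
    rankings = [
        rank_features([X[i] for i in part], [y[i] for i in part])
        for part in partitions
    ]
    n_features = len(X[0])
    # Average position of each feature across the individual rankings.
    mean_pos = [mean(r.index(j) for r in rankings) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: mean_pos[j])

# Toy data (assumption): feature 0 follows the label, features 1-3 are pure noise.
data_rng = random.Random(42)
y = [data_rng.randint(0, 1) for _ in range(200)]
X = [
    [label + data_rng.gauss(0, 0.3), data_rng.gauss(0, 1),
     data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
    for label in y
]
final_ranking = homogeneous_ensemble(X, y)
```

A heterogeneous variant would keep the aggregation step unchanged and replace the loop over partitions with a loop over different rankers applied to the full training set.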
Besides, when the chosen individual selectors are rankers, we needed at some point to establish a threshold to retain only the relevant features, and to combine the rankings obtained by the different methods that make up the ensemble. In this sense, we have analyzed two different proposals, depending on whether thresholding was performed before or after combination. Finally, a third novelty in this work is related to the need to establish an adequate threshold: we propose a methodology for establishing automatic thresholds based on measurements of data complexity.

The adequacy of the methods proposed throughout this thesis was checked on a variety of datasets of different types: synthetic, real “classical” (more samples than features) and real DNA microarray datasets (more features than samples). In a first step, synthetic datasets were used to perform the first tests and check the performance of the newly implemented methods. In a second step, real datasets (both classical and microarray) were used to check the adequacy of the new methods for real-world problems, allowing us to carry out a performance comparison and to extract a series of final conclusions.

Finally, missing data are nowadays common in real-world problems, and the proposed feature selection ensembles, like any other machine learning method, are likely to face them. Traditionally, the common way to deal with this situation was to delete the samples that contained missing data, but this is not possible when the percentage of missing data is high, and thus imputation has become the common approach. However, imputation before FS can lead to false positives: features that are not associated with the target become dependent on it as a result of imputation.
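
The difference between thresholding before and after combination can be sketched on a toy example. Everything here is an assumption made for illustration rather than the thesis's procedure: a fixed top-k cut stands in for the threshold, mean rank position and majority voting stand in for the combiners, and the three rankings and function names are hypothetical.

```python
from collections import Counter

def combine_then_threshold(rankings, k):
    """Merge the full rankings by mean position, then keep the top-k features."""
    features = rankings[0]
    mean_pos = {f: sum(r.index(f) for r in rankings) / len(rankings) for f in features}
    # Ties in mean position are broken alphabetically (an arbitrary choice).
    merged = sorted(features, key=lambda f: (mean_pos[f], f))
    return merged[:k]

def threshold_then_combine(rankings, k, min_votes=None):
    """Cut each ranking to its top-k, then keep features chosen by a majority."""
    if min_votes is None:
        min_votes = len(rankings) // 2 + 1
    votes = Counter(f for r in rankings for f in r[:k])
    return sorted(f for f, v in votes.items() if v >= min_votes)

# Three hypothetical rankings of four features, most relevant first.
rankings = [
    ["f1", "f2", "f3", "f4"],
    ["f3", "f4", "f1", "f2"],
    ["f2", "f3", "f1", "f4"],
]

after = combine_then_threshold(rankings, k=2)   # ['f3', 'f1']
before = threshold_then_combine(rankings, k=2)  # ['f2', 'f3']
```

On these rankings the two orders of operation select different subsets: thresholding first discards f1 in two of the three rankings before its consistently moderate position can count, whereas combining first lets it compete on its average position.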
In this exploratory work we use causal graphs to illustrate the notion of structural bias, and develop a modified t-statistic test to analyze the bias that may arise. Our conclusion is that it is more advisable to devise feature selection methods that are “robust” to the presence of missing data than to impute them. In this regard, the development of ensemble feature selection in this scenario remains a line of future work.
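
One way imputation can make an irrelevant feature look relevant is sketched below. This is a toy illustration only, not the causal-graph analysis or the modified t-statistic developed in the thesis: it assumes class-conditional mean imputation (an imputer that uses the target), values missing completely at random, and a standard pooled-variance two-sample t-statistic. All names are hypothetical.

```python
import random
from statistics import mean

def t_statistic(a, b):
    """Standard pooled-variance two-sample t-statistic."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

rng = random.Random(0)
n = 100
labels = [i % 2 for i in range(n)]
feature = [rng.gauss(0, 1) for _ in range(n)]      # independent of the labels
observed = [rng.random() > 0.5 for _ in range(n)]  # roughly half the values missing

# t-statistic computed on the observed values only.
obs_by_class = {
    c: [x for x, lab, o in zip(feature, labels, observed) if lab == c and o]
    for c in (0, 1)
}
t_observed = t_statistic(obs_by_class[0], obs_by_class[1])

# Class-conditional mean imputation: missing values take their class's observed mean.
class_mean = {c: mean(values) for c, values in obs_by_class.items()}
imputed = [x if o else class_mean[lab] for x, lab, o in zip(feature, labels, observed)]
imp_by_class = {
    c: [x for x, lab in zip(imputed, labels) if lab == c]
    for c in (0, 1)
}
t_imputed = t_statistic(imp_by_class[0], imp_by_class[1])
```

Under these assumptions the difference between class means is unchanged by imputation, while the within-class variance shrinks and the sample size grows, so the magnitude of `t_imputed` exceeds that of `t_observed` even though the feature carries no information about the label.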