Evolutionary Computation for Overlapping Community Detection in Social and Graph-based Information
- Gema Bello Orgaz
- David Camacho Fernández Director/a
Universidad de defensa: Universidad Autónoma de Madrid
Fecha de defensa: 26 de junio de 2017
- Amparo Alonso Betanzos Presidenta
- Antonio González Pardo Secretario/a
- Paulo Novais Vocal
- Óscar Cordón García Vocal
- Sancho Salcedo Sanz Vocal
Tipo: Tesis
Resumen
The Community Detection Problem (CDP) in Social Networks has been widely studied from different areas such as Data Mining, Graph Mining or Social Network Analysis, amongst others. Nowadays, this problem has become highly relevant due to the growing interest in Social Networks, and its possible application in several disciplines such as sociology, biology, neuroscience, or computer science, whose information can be easily represented as networks or graphs. This problem can be formulated as a graph-clustering problem, where the nodes belonging to a graph should be partitioned into groups according to the network topology. The partitioning of a graph is usually a division of the graph where each node belongs to only one community. However, a common feature observed in real-world networks is the existence of overlapping communities, where a particular node can belong to more than one community. When the Community Detection Algorithms (CDA) are applied to process social-based information, it is quite usual for a node of the network to belong to different groups, or communities. Therefore, it is quite interesting that this type of algorithms can deal with this feature, detecting communities of nodes with overlaps. This dissertation has been focused on the design of an evolutionary approach, which can provide new methods to find overlapping and stable communities in a graph, improving the currently existing techniques in the state of the art. For this purpose, several encoding approaches have been designed, as well as several fitness functions that guide the searching process using some measures related to the graph theory. The different algorithms designed and implemented in this theses have been divided in two generations. In order to combine the basic idea of classical clustering methods with the graph clustering methods, a first generation of algorithms have been designed. These algorithms adopt an evolutionary approach to optimize the search of communities, using different fitness functions which combine distances metrics based on features of the nodes, with measures related to the network topology of the graph (GCF). Analysing the experimental results obtained by these algorithms, it can be noticed that they present the same problem as the classical CDAs, related to their dependence on the graph structure. Therefore, to handle with this problem, these algorithms have been extended trying to promote solution diversity, in a new generation of algorithms. The second generation of algorithms designed are based on Multi-Objective Genetic Algorithms (MOGA-OCD). The fitness function implemented in MOGA-OCD algorithm optimizes two classical objective functions in CDP; the first one is used to maximize the internal connectivity of the communities, whereas the second one is used to minimize the external connections to the rest of the graph. To select the most appropriate metrics for these objective functions, a comparative assessment of several connectivity metrics has been carried out using a real-world network. Finally, the algorithm has been evaluated against other well-known algorithms from the state of the art in CDP. The experimental results show that the proposed algorithms obtain good results for the different types of graphs with different structures. In addition, they improve overall the accuracy and quality of current approaches in CDP, showing its effectiveness as a new method for detecting overlapping communities. Finally, and in order to evaluate the CDA algorithms designed in a real domain, this dissertation also includes the application of them to identify opinion communities related to Public Healthcare topics in Social Networks. Vaccines have contributed to dramatically decrease mortality from infectious diseases in the 20th century. However, several social discussion groups related to vaccines have emerged, influencing the opinion of the population about vaccination for the past 20 years. These communities discussing about vaccines have taken advantage of social media to effectively disseminate their theories. Nowadays, recent outbreaks of preventable diseases such as measles, polio, or influenza, have shown the effect of a decrease in vaccination rates. Social Networks are one of the most important sources of Big Data. Specifically, Twitter generates over 400 million tweets every day. This thesis shows how CDAs can be applied to discover and track discussion communities on vaccination using information gathered from Twitter. In addition, this fact provides useful information that Public Healthcare Organizations may try to use for avoiding or mitigating new outbreaks of eradicated diseases.