A framework for Linked Data quality based on data profiling and RDF Shape induction
- MIHINDUKULASOORIYA, NANDANA SAMPATH
- Raúl García Castro Director
- Asunción Gómez Pérez Director
University of defense: Universidad Politécnica de Madrid
Date of defense: 22 June 2020
- Oscar Corcho García Chair
- Víctor Rodríguez Doncel Secretary
- Riccardo Albertoni Committee member
- Nieves R. Brisaboa Committee member
- Mariano Fernández López Committee member
Type: Thesis
Abstract
In the era of digital transformation, where most decision-making and artificial intelligence (AI) applications are becoming data-driven, data is becoming an essential asset. Linked Data, published in structured, machine-readable formats, with explicit semantics using Semantic Web standards, and with links to other data, is even more useful. The Linked (Open) Data cloud is growing by millions of new triples each year. Nevertheless, as we discuss in this thesis, such vast amounts of data bring several new challenges in ensuring the quality of Linked Data.
The main goal of this thesis is to propose novel and scalable methods for automatic quality assessment and repair of Linked Data. The motivation is to significantly reduce the manual effort required by current quality assessment and repair, and to propose novel methods suitable for large-scale Linked Data sources such as DBpedia or Wikidata. The main hypothesis of this work is that data profiling metrics and automatic RDF Shape induction can be used to develop scalable and automatic quality assessment and repair methods.
In this context, the following main contributions are delivered in this thesis:
• LDQM, a Linked Data Quality Model for representing Linked Data quality in a standard manner, and LD Sniffer, a tool based on LDQM for validating the accessibility of Linked Data. LDQM contains 15 quality characteristics, 89 base measures, 23 derived measures, and 124 quality indicators.
• Loupe, a framework for Linked Data profiling that includes the Loupe Extended Dataset Description Model and a suite of Linked Data profiling tools. The model consists of 84 Linked Data profiling metrics useful for quality assessment and repair tasks. Loupe tools have been used to evaluate 26 thousand datasets containing 34 billion triples, and Loupe contributed to the winning system of the ISWC Semantic Web Challenge 2017. The Loupe Web portal has been visited more than 40,000 times by about 3,000 unique visitors from 87 countries.
• An automatic RDF Shape induction method that follows a data-driven approach to induce integrity constraints using data profiling metrics as features. The proposed method achieved an F1 of 98.81% in deriving maximum cardinality constraints, an F1 of 97.30% in deriving minimum cardinality constraints, and an F1 of 95.94% in deriving range constraints.
• Four methods for automatic quality assessment and repair using RDF Shapes and data profiling metrics. They are motivated by several practical use cases that cover both the Linked Data generation process and its output, and both public and enterprise data. The four methods are (a) a method for detecting inconsistent mappings, (b) a method for detecting and eliminating noisy triples produced by open information extraction tools, (c) a method to repair links in RDF data, and (d) a method to complete type information in Linked Data. Each method achieves high performance (about 90% or above) on its respective task.
Several research projects, such as 4V (TIN2013-46238-C4-2-R), LIDER (FP7-610782), 3Cixty (EIT Digital 14523), BNE (http://datos.bne.es/), and MappingPedia, have already exploited the contributions of the thesis.
In conclusion, we show that Linked Data research problems can learn from older paradigms, such as relational data. Through validating nine hypotheses related to the objectives of this thesis, we demonstrate that data profiling metrics can be used to develop scalable automatic methods for Linked Data quality assessment and repair with high accuracy.
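To make the idea of inducing integrity constraints from data profiling metrics more concrete, the following is a minimal sketch and not the method developed in the thesis. It assumes a toy list of (subject, property, object) triples, computes one simple profiling metric per property (the share of subjects with exactly one value), and proposes a maximum cardinality of 1 when that share passes a fixed threshold. The function names, the `ex:` identifiers, and the 0.95 threshold are all hypothetical; the abstract describes using profiling metrics as features rather than a hand-set threshold.

```python
from collections import Counter, defaultdict


def cardinality_profile(triples):
    """Return property -> subject -> number of distinct values (a toy profiling metric)."""
    counts = defaultdict(Counter)
    for subject, prop, _obj in set(triples):  # RDF graphs are sets, so drop duplicates
        counts[prop][subject] += 1
    return counts


def induce_max_cardinality(triples, threshold=0.95):
    """Propose max cardinality 1 for properties where most subjects have a single value.

    The threshold rule is an illustrative assumption, not the thesis's induction method.
    Returns property -> 1 (constraint proposed) or None (no constraint proposed).
    """
    proposals = {}
    for prop, per_subject in cardinality_profile(triples).items():
        single_valued = sum(1 for n in per_subject.values() if n == 1)
        ratio = single_valued / len(per_subject)
        proposals[prop] = 1 if ratio >= threshold else None
    return proposals


# Hypothetical example data.
triples = [
    ("ex:alice", "ex:birthDate", "1990-01-01"),
    ("ex:bob", "ex:birthDate", "1985-05-05"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:bob", "ex:knows", "ex:alice"),
]
shapes = induce_max_cardinality(triples)
```

Here `ex:birthDate` is single-valued for every subject, so a maximum cardinality of 1 is proposed, while `ex:knows` is multi-valued for one of two subjects and gets no proposal. On noisy real-world data a fixed threshold is fragile, which is one reason a learned, feature-based approach like the one the abstract describes may be preferable.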