Web-scale, Schema-Agnostic, End-to-End Entity Resolution
- George Papadakis
- Themis Palpanas
Entity Resolution lies at the core of data integration, with a large body of research focusing on both its effectiveness and its time efficiency. Initially, most relevant works were crafted for structured, relational data that are described by a schema of well-known quality and meaning. With the advent of Big Data, though, these early schema-based approaches became inapplicable, as Entity Resolution is now applied to Web Data collections, which abound in semi-structured, inherently voluminous and highly heterogeneous information.
To address the inherent challenges of Web Data, recent works on Entity Resolution adopt a novel, schema-agnostic functionality that emphasizes scalability and robustness to noise. In this tutorial, we take a close look at this line of research, organizing the state of the art in the field into a scalable, schema-agnostic, end-to-end workflow that consists of four steps.
The first two steps focus on improving time efficiency through blocking, while the last two are dedicated to effectiveness: (i) Block Building clusters similar entities into blocks so as to reduce the originally quadratic complexity to comparing only those pairs of entities that are highly likely to match. (ii) Block Processing further cuts down on the computational cost by discarding pairwise comparisons that are repeated or that lack sufficient evidence of involving duplicates. (iii) Entity Matching carries out all comparisons in the final set of blocks, creating a similarity graph with a node for every entity and a weighted edge for every pair of compared entities. (iv) Entity Clustering partitions the nodes of the similarity graph into equivalence clusters such that every cluster contains all entities that correspond to the same real-world object.
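The four steps can be illustrated with a minimal Python sketch. It is not the toolbox presented in the hands-on session, and every concrete choice in it is an assumption made for illustration: token blocking stands in for Block Building, a threshold on the number of shared blocks stands in for Block Processing, Jaccard similarity over tokens stands in for Entity Matching, and connected components stand in for Entity Clustering. The function names, the parameters min_common_blocks and threshold, and the sample entities are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(entity):
    """Schema-agnostic signature: all tokens of all attribute values,
    ignoring the attribute names entirely."""
    return {tok for value in entity.values() for tok in str(value).lower().split()}


def build_blocks(entities):
    """(i) Block Building (assumed: token blocking, one block per token)."""
    blocks = defaultdict(set)
    for eid, entity in entities.items():
        for tok in tokens(entity):
            blocks[tok].add(eid)
    # A block with a single entity yields no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def process_blocks(blocks, min_common_blocks=2):
    """(ii) Block Processing (assumed: keep each pair once, and only if
    the two entities co-occur in at least min_common_blocks blocks)."""
    counts = defaultdict(int)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_common_blocks}


def match(entities, pairs):
    """(iii) Entity Matching (assumed: Jaccard similarity of token sets).
    Returns the similarity graph as {(a, b): weight}."""
    graph = {}
    for a, b in pairs:
        ta, tb = tokens(entities[a]), tokens(entities[b])
        graph[(a, b)] = len(ta & tb) / len(ta | tb)
    return graph


def cluster(entities, graph, threshold=0.5):
    """(iv) Entity Clustering (assumed: connected components over the
    edges whose weight reaches the threshold, via union-find)."""
    parent = {eid: eid for eid in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), weight in graph.items():
        if weight >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for eid in entities:
        groups[find(eid)].add(eid)
    return list(groups.values())


def resolve(entities, min_common_blocks=2, threshold=0.5):
    """Run the four steps end to end."""
    blocks = build_blocks(entities)
    pairs = process_blocks(blocks, min_common_blocks)
    graph = match(entities, pairs)
    return cluster(entities, graph, threshold)


# Hypothetical, heterogeneous entities: no two share an attribute name.
ENTITIES = {
    "e1": {"name": "Alice Smith", "city": "Berlin"},
    "e2": {"fullName": "alice smith", "location": "berlin germany"},
    "e3": {"title": "Bob Jones", "town": "Paris"},
    "e4": {"person": "bob jones paris"},
    "e5": {"label": "Berlin Airport"},
}

clusters = sorted(sorted(c) for c in resolve(ENTITIES))
print(clusters)
```

On this toy input, e5 shares only the "berlin" block with e1 and e2, so both of its candidate comparisons are pruned in step (ii) before any similarity is computed, and the sketch groups e1 with e2 and e3 with e4 while leaving e5 on its own.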
The tutorial concludes with a hands-on session that involves our publicly available reference toolbox for Entity Resolution, demonstrating the relative performance of the main state-of-the-art techniques. Thus, the participants will put into practice all the topics discussed in theory.