Data integration has been a focus of research for many years. At the data level, entity resolution (also known as record deduplication) aims at "cleaning" a database by identifying tuples that represent the same entity. The need for data integration stems from the heterogeneity of data (arriving from multiple sources), the lack of sufficient semantics to fully understand the meaning of data, and errors introduced by incorrect data insertion and modification (e.g., typos and deletions). With a body of research spanning multiple decades, entity resolution offers a wealth of formal models of integration, algorithmic solutions for efficient and effective integration, and a collection of systems, benchmarks, and competitions that allow comparative empirical analysis of solutions.

The research group at the Technion focuses on efficient ways to perform blocking, i.e., partitioning the dataset so that entity resolution can be carried out efficiently. We also focus on uncertainty analysis of the entity resolution process, with an emphasis on data veracity.

people

- Avigdor Gal

- Roee Shraga

- Bar Genossar

- Jonathan Svirsky - Alumni

- Batya Kenig - Alumni

papers


B. Kenig, A. Gal - MFIBlocks: An Effective Blocking Algorithm for Entity Resolution. Information Systems. 38(6):908-926, September 2013

Entity resolution is the process of discovering groups of tuples that correspond to the same real-world entity. Blocking algorithms separate tuples into blocks that are likely to contain matching pairs. Tuning is a major challenge in the blocking process and in particular, high expertise is needed in contemporary blocking algorithms to construct a blocking key, based on which tuples are assigned to blocks. In this work, we introduce a blocking approach that avoids selecting a blocking key altogether, relieving the user from this difficult task. The approach is based on maximal frequent itemsets selection, allowing early evaluation of block quality based on the overall commonality of its members. A unique feature of the proposed algorithm is the use of prior knowledge of the estimated size of duplicate sets in enhancing the blocking accuracy. We report on a thorough empirical analysis, using common benchmarks of both real-world and synthetic datasets to exhibit the effectiveness and efficiency of our approach.
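The maximal-frequent-itemset idea behind MFIBlocks can be illustrated with a toy sketch (the function names, q-gram size, and support parameters below are illustrative assumptions, not the paper's actual implementation): each record is tokenized into q-grams, frequent itemsets of q-grams are enumerated, and records supporting the same maximal frequent itemset land in the same candidate block.

```python
from itertools import combinations

def qgrams(s, q=3):
    """Split a string into overlapping q-grams (lowercased, spaces removed)."""
    s = s.lower().replace(" ", "")
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def mfi_blocks(records, q=3, min_support=2, max_size=2):
    """Toy maximal-frequent-itemset blocking: records that share a
    maximal frequent itemset of q-grams form a candidate block.
    (The real MFIB algorithm uses a proper MFI miner and block
    quality scoring; this brute-force enumeration is for illustration.)"""
    token_sets = {rid: qgrams(text, q) for rid, text in records.items()}
    vocab = sorted(set().union(*token_sets.values()))
    # An itemset is frequent if at least min_support records contain it.
    frequent = []
    for size in range(1, max_size + 1):
        for itemset in combinations(vocab, size):
            support = frozenset(rid for rid, toks in token_sets.items()
                                if set(itemset) <= toks)
            if len(support) >= min_support:
                frequent.append((frozenset(itemset), support))
    # Keep only maximal itemsets (no frequent proper superset).
    return {its: sup for its, sup in frequent
            if not any(its < other for other, _ in frequent)}
```

Running this on a tiny input such as `{"r1": "John Smith", "r2": "Jon Smith", "r3": "Alice Jones"}` groups the two Smith records into the same block via their shared q-grams.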

@article{KENIG2013,
author = {Batya Kenig and Avigdor Gal},
title = {MFIBlocks: An effective blocking algorithm for entity resolution},
journal = {Inf. Syst.},
volume = {38},
number = {6},
year = {2013},
pages = {908-926},
}

B. Kenig, A. Gal, O. Strichman - A New Class of Lineage Expressions over Probabilistic Databases Computable in P-time. Proceedings of the 7th International Conference on Scalable Uncertainty Management (SUM'2013). Washington DC, USA, September 16-18, 2013.

@inproceedings{KENIG2013SUM,
author = {Batya Kenig and Avigdor Gal and Ofer Strichman},
title = {A New Class of Lineage Expressions over Probabilistic Databases Computable in P-time},
booktitle = {Proceedings of the 7th International Conference on Scalable Uncertainty Management (SUM'2013)},
year = {2013},
address = {Washington DC, USA},
month = sep,
}

We study the problem of query evaluation over tuple-independent probabilistic databases. We define a new characterization of lineage expressions called disjoint branch acyclic, and show this class to be computed in P-time. Specifically, this work extends the class of lineage expressions for which evaluation can be performed in PTIME. We achieve this extension with a novel usage of junction trees to compute the probability of these lineage expressions.
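As a rough illustration of probability computation over a tuple-independent database, the sketch below evaluates read-once lineage expressions, where the independence of sub-expressions makes conjunction a product and disjunction a complement of products. Read-once is a simpler class than the disjoint branch acyclic expressions studied in the paper, which require the junction-tree machinery; the nested-tuple expression encoding here is an assumption for illustration only.

```python
def lineage_prob(expr, probs):
    """Probability of a read-once lineage expression over a
    tuple-independent database. expr is a nested tuple:
    ("var", name) | ("and", e1, e2, ...) | ("or", e1, e2, ...).
    Because each tuple variable occurs at most once, sub-expressions
    are independent: AND multiplies probabilities, OR multiplies
    complements. (Disjoint branch acyclic expressions, the broader
    class in the paper, cannot be evaluated this simply.)"""
    op = expr[0]
    if op == "var":
        return probs[expr[1]]
    parts = [lineage_prob(e, probs) for e in expr[1:]]
    if op == "and":
        p = 1.0
        for x in parts:
            p *= x
        return p
    if op == "or":
        q = 1.0
        for x in parts:
            q *= (1.0 - x)
        return 1.0 - q
    raise ValueError(f"unknown operator {op!r}")
```

For example, with independent tuples t1 and t2 each present with probability 0.5, the lineage t1 OR t2 evaluates to 1 - 0.5 * 0.5 = 0.75.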


presentations


Uncertain Entity Resolution: VLDB'2014 Tutorial

Entity Resolution in the Big Data Era: Probabilistic DB Support to Entity Resolution

code

The code implements the MFIBlocks algorithm (MFIB) in two stages.
The first stage transforms the source data into a dataset of q-gram ids.
The second stage uses the output of the first stage to execute the blocking algorithm.
Detailed instructions and parameters for running the algorithm are available on the Wiki page of the project's Github repository.
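The first stage of the pipeline (mapping record text to q-gram ids) might look roughly like the following sketch; the function name and encoding are illustrative assumptions, not the repository's actual interface.

```python
def to_qgram_ids(records, q=3):
    """Stage-one sketch: encode each record as a sorted list of integer
    q-gram ids, using a shared dictionary so the same q-gram gets the
    same id across records. (Illustrative only; see the project's
    GitHub wiki for the real tool's parameters and file formats.)"""
    gram_ids = {}   # q-gram string -> integer id
    encoded = {}    # record id -> sorted list of q-gram ids
    for rid, text in records.items():
        s = text.lower().replace(" ", "")
        grams = {s[i:i + q] for i in range(len(s) - q + 1)}
        row = set()
        for g in grams:
            if g not in gram_ids:
                gram_ids[g] = len(gram_ids)
            row.add(gram_ids[g])
        encoded[rid] = sorted(row)
    return encoded, gram_ids
```

Identical strings map to identical id lists, which is what lets the second stage treat records as itemsets over a common q-gram vocabulary.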

datasets


Real-world datasets: the SecondString toolkit

Synthetic datasets were generated using FEBRL.