Big Data Integration


Data integration has been a focus of research for many years, motivated by data heterogeneity (data arriving from multiple sources), the lack of sufficient semantics to fully understand the meaning of data, and errors that may stem from incorrect data insertion and modification (e.g., typos and omissions).

At the heart of the data integration process is a matching problem whose outcome is a collection of correspondences between different representations of the same real-world construct. Schema matching, process model matching, ontology alignment, and Web service composition are all examples of such problems. With a body of research that spans multiple decades, data integration has a wealth of formal models of integration, algorithmic solutions for efficient and effective integration, and a body of systems, benchmarks, and competitions that allow comparative empirical analysis of integration solutions.
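To make the notion of a correspondence concrete, here is a minimal sketch of a name-based schema matcher. It is an illustration only, not a method described on this page: the function names, the example attribute names, and the 0.6 threshold are all assumptions, and string similarity from Python's standard difflib module stands in for the far richer matchers used in practice.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two attribute names."""
    def normalize(s: str) -> str:
        # Ignore case, underscores, and spaces (an illustrative choice).
        return s.lower().replace("_", "").replace(" ", "")

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(source, target, threshold=0.6):
    """Return (source_attr, target_attr, score) correspondences.

    Every attribute pair is scored; pairs at or above the (hypothetical)
    threshold are kept, sorted from most to least similar.
    """
    correspondences = []
    for s in source:
        for t in target:
            score = name_similarity(s, t)
            if score >= threshold:
                correspondences.append((s, t, round(score, 2)))
    return sorted(correspondences, key=lambda c: c[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical attribute names from two customer tables.
    source = ["cust_name", "zip", "phone_no"]
    target = ["CustName", "PostalCode", "Phone"]
    for s, t, score in match_schemas(source, target):
        print(f"{s} -> {t} (similarity {score})")
```

Note that each correspondence carries a score rather than a yes/no decision, and that a purely name-based matcher will miss pairs such as zip and PostalCode, whose relationship is semantic rather than lexical.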

In recent years, data integration has been facing new challenges as a result of the evolution of data accumulation, management, analytics, and visualization (a phenomenon known as big data). Big data is mainly about collecting large volumes of data at increasing velocity from a variety of sources, each with its own level of veracity. Big data encompasses technological advancements such as the Internet of Things (accumulation), cloud computing (management), and deep learning (analytics), packaging them all together while providing an exciting arena for a new and challenging research agenda.

Data integration, being mainly an offline process controlled by human experts, is ill-equipped to handle constantly changing data sources, with new sources being introduced frequently. Pay-as-you-go approaches to data integration allow partial integration of data sources, which means integration becomes a continuous, incremental process rather than a one-time, well-crafted one.

The research into big data integration, performed in the research lab of Professor Avigdor Gal, provides interesting insights into the role of humans in matching, the use of non-binary evaluation measures, and the use of state-of-the-art machine learning to overcome semantic and cognitive challenges in the matching process.