Solving matching problems in computer science entails generating alignments between structured data. Well known examples are schema matching, process model matching, ontology alignment, and Web service composition. Design of software systems aimed at solving these problems, and reﬁnement of interim results, are aided by solution quality evaluation measures.
We base our exploration in the schema matching domain. Schema matching is the task of providing correspondences between concepts describing the meaning of data in various heterogeneous, distributed data sources (e.g. attributes in database schemas, tags in XML DTDs, ﬁelds in HTML forms, etc.) Schema matching was recognized to be one of the basic operations required by the process of data and schema integration but has since been adopted by a wide range of applications as a basic method for matching various representations of data.
Schema matching research at the Technion starts focuses on introducing new matching theories based on which new and better heuristics for schema matching can be developed. We invest in the development of new evaluation measures for schema matching, as well as developing a model for predicting the performance of machine and human matchers.
OntoBuilder is a matching tool supported by the research group at the Technion. OntoBuilder provides ORE, OntoBuilder Research Environment, which allows researchers access to common schema matching heuristics as well as datasets for performing benchmark empirical evaluation
T. Sagi, A. Gal - Schema Matching Prediction with Applications to Data Source Discovery and Dynamic Ensembling. VLDB Journal, 22(5):689-710, September 2013: Web-scale data integration involves fully automated efforts which lack knowledge of the exact match between data descriptions. In this paper, we introduce schema matching prediction, an assessment mechanism to support schema matchers in the absence of an exact match. Given attribute pair-wise similarity measures, a predictor predicts the success of a matcher in identifying correct correspondences. We present a comprehensive framework in which predictors can be defined, designed, and evaluated. We formally define schema matching evaluation and schema matching prediction using similarity spaces and discuss a set of four desirable properties of predictors, namely correlation, robustness, tunability, and generalization. We present a method for constructing predictors, supporting generalization, and introduce prediction models as means of tuning prediction toward various quality measures. We define the empirical properties of correlation and robustness and provide concrete measures for their evaluation. We illustrate the usefulness of schema matching prediction by presenting three use cases: We propose a method for ranking the relevance of deep Web sources with respect to given user needs. We show how predictors can assist in the design of schema matching systems. Finally, we show how prediction can support dynamic weight setting of matchers in an ensemble, thus improving upon current state-of-the-art weight setting methods. An extensive empirical evaluation shows the usefulness of predictors in these use cases and demonstrates the usefulness of prediction models in increasing the performance of schema matching.
A. Gal - Uncertain Schema Matching. Morgan & Claypool Publishers, 2011:Schema matching is the task of providing correspondences between concepts describing the meaning of data in various heterogeneous, distributed data sources. Schema matching is one of the basic operations required by the process of data and schema integration, and thus has a great effect on its outcomes, whether these involve targeted content delivery, view integration, database integration, query rewriting over heterogeneous sources, duplicate data elimination, or automatic streamlining of workflow activities that involve heterogeneous data sources. Although schema matching research has been ongoing for over 25 years, more recently a realization has emerged that schema matchers are inherently uncertain. Since 2003, work on the uncertainty in schema matching has picked up, along with research on uncertainty in other areas of data management. This lecture presents various aspects of uncertainty in schema matching within a single unified framework. We introduce basic formulations of uncertainty and provide several alternative representations of schema matching uncertainty. Then, we cover two common methods that have been proposed to deal with uncertainty in schema matching, namely ensembles, and top-K matchings, and analyze them in this context. We conclude with a set of real-world applications.Table of Contents: Introduction / Models of Uncertainty / Modeling Uncertain Schema Matching / Schema Matcher Ensembles / Top-K Schema Matchings / Applications / Conclusions and Future Work.
A. Gal, T. Sagi - Tuning the Ensemble Selection Process of Schema Matchers. Information Systems, 35(8):845-859. 2010:Schema matching is the task of providing correspondences between concepts describing the meaning of data in various heterogeneous, distributed data sources. It is recognized to be one of the basic operations required by the process of data and schema integration and its outcome serves in many tasks such as targeted content delivery and view integration. Schema matching research has been going on for more than 25 years now. An interesting research topic, that was largely left untouched involves the automatic selection of schema matchers to an ensemble, a set of schema matchers. To the best of our knowledge, none of the existing algorithmic solutions offer such a selection feature. In this paper we provide a thorough investigation of this research topic. We introduce a new heuristic, Schema Matcher Boosting (SMB). We show that SMB has the ability to choose among schema matchers and to tune their importance. As such, SMB introduces a new promise for schema matcher designers. Instead of trying to design a perfect schema matcher, a designer can instead focus on finding better than random schema matchers. For the effective utilization ofSMB, we propose a complementary approach to the design of new schema matchers. We separate schema matchers into first-line and second-line matchers. First-line schema matchers were designed by-and-large as applications of existing works in other areas (e.g., machine learning and information retrieval) to schemata. Second-line schema matchers operate on the outcome of other schema matchers to improve their original outcome. SMB selects matcher pairs, where each pair contains a first-line matcher and a second-line matcher. We run a thorough set of experiments to analyze SMB ability to effectively choose schema matchers and show that SMB performs better than other, state-of-the-art ensemble matchers.
C. Domshlak, A. Gal, H. Roitman - Rank Aggregation for Automatic Schema Matching. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(4):538-553, 2007:Schema matching is a basic operation of data integration, and several tools for automating it have been proposed and evaluated in the database community. Research in this area reveals that there is no single schema matcher that is guaranteed to succeed in finding a good mapping for all possible domains and, thus, an ensemble of schema matchers should be considered. In this paper, we introduce schema metamatching, a general framework for composing an arbitrary ensemble of schema matchers and generating a list of best ranked schema mappings. Informally, schema metamatching stands for computing a "consensus" ranking of alternative mappings between two schemata, given the "individual" graded rankings provided by several schema matchers. We introduce several algorithms for this problem, varying from adaptations of some standard techniques for general quantitative rank aggregation to novel techniques specific to the problem of schema matching, and to combinations of both. We provide a formal analysis of the applicability and relative performance of these algorithms and evaluate them empirically on a set of real-world schemata.
A. Gal - Why is Schema Matching Tough and What Can We Do About It? SIGMOD Record, 35(4):2-5, 2007:In this paper we analyze the problem of schema matching, explain why it is such a "tough" problem and suggest directions for handling it effectively. In particular, we present the monotonicity principle and see how it leads to the use of top-K mappings rather than a single mapping.
A. Gal - Managing Uncertainty in Schema Matching with Top-K Schema Mappings. Journal onData Semantics, 6:90-114, 2006:In this paper, we propose to extend current practice in schema matching with the simultaneous use of top-K schema mappings rather than a single best mapping. This is a natural extension of existing methods (which can be considered to fall into the top-1 category), taking into account the imprecision inherent in the schema matching process. The essence of this method is the simultaneous generation and examination of K best schema mappings to identify useful mappings. The paper discusses efficient methods for generating top-K methods and propose a generic methodology for the simultaneous utilization of top-K mappings. We also propose a concrete heuristic that aims at improving precision at the cost of recall. We have tested the heuristic on real as well as synthetic data and anlyze the emricial results.
The novelty of this paper lies in the robust extension of existing methods for schema matching, one that can gracefully accommodate less-than-perfect scenarios in which the exact mapping cannot be identified in a single iteration. Our proposal represents a step forward in achieving fully automated schema matching, which is currently semi-automated at best.
A. Gal, A. Segev, C. Tatsiopoulos, K. Sidiropoulos, P. Georgiades - Agent Oriented Data Integration. Lecture Notes in Computer Science, Springer 2005:Data integration is the process by which data from heterogeneous data sources are conceptually integrated into a single cohesive data set. In recent years agents have been increasingly used in information systems to promote performance. In this work we propose a modeling framework for agent oriented data integration to demonstrate how agents can support this process. We provide a systematic analysis of the process using real world scenarios, taken from email messages from citizens in a local government, and demonstrate two agent oriented data integration tasks, email routing and opinion analysis.
A. Gal, A. Anaby-Tavor, A. Trombetta, D. Montesi - A Framework for Modeling and Evaluating Automatic Semantic Reconciliation. VLDB Journal, 14(1):50-67, 2005:The introduction of the Semantic Web vision and the shift toward machine understandable Web resources has unearthed the importance of automatic semantic reconciliation. Consequently, new tools for automating the process were proposed. In this work we present a formal model of semantic reconciliation and analyze in a systematic manner the properties of the process outcome, primarily the inherent uncertainty of the matching process and how it reflects on the resulting mappings. An important feature of this research is the identification and analysis of factors that impact the effectiveness of algorithms for automatic semantic reconciliation, leading, it is hoped, to the design of better algorithms by reducing the uncertainty of existing algorithms. Against this background we empirically study the aptitude of two algorithms to correctly match concepts. This research is both timely and practical in light of recent attempts to develop and utilize methods for automatic semantic reconciliation.
A. Gal, G. Modica. H. Jamil, A. Eyal - Automatic Ontology Matching using Application Semantics. AI Magazine. 26(1):21-32, 2005:We propose the use of application semantics to enhance the process of semantic reconciliation. Application semantics involves those elements of business reasoning that affect the way concepts are presented to users: their layout, and so on. In particular, we pursue in this article the notion of precedence, in which temporal constraints determine the order in which concepts are presented to the user. Existing matching algorithms use either syntactic means (such as term matching and domain matching) or model semantic means, the use of structural information that is provided by the specific data model to enhance the matching process. The novelty of our approach lies in proposing a class of matching techniques that takes advantage of ontological structures and application semantics. As an example, the use of precedence to reflect business rules has not been applied elsewhere, to the best of our knowledge. We have tested the process for a variety of web sites in domains such as car rentals and airline reservations, and we share our experiences with precedence and its limitations.