Exploiting context analysis for combining multiple entity resolution systems


Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra

-Ramu Bandaru


Table of Contents

  • Introduction

  • Related Work

  • Problem Definition

  • Context based Framework

  • Experimental Evaluation

  • Conclusion


Introduction

  • This paper proposes a new Entity Resolution (ER) Ensemble framework.

  • Real-world raw datasets are often imperfect and contain errors and uncertainty, which complicate analysis.

  • The task of ER Ensemble is to combine the results of multiple base-level ER systems into a single solution, with the goal of increasing the quality of ER.

  • The goal of ER is to identify and group references that co-refer. The output of an ER system is a clustering of references, where each cluster is supposed to represent one distinct entity.


  • Diverse ER approaches have been proposed to address the entity resolution challenge. This motivates the problem of creating a single combined ER solution: an ER ensemble.

  • Different ER solutions perform better in different contexts. No single solution is the best in terms of quality in every context.

  • ER ensemble actively uses the notion of the correct clustering A+ for D.

  • Each system Si is applied to the dataset being processed to produce its clustering A(i).

  • Clustering A(i) corresponds to what system Si believes A+ should be.


  • The goal of ER ensemble is to build a clustering A* of the highest possible quality, that is, one as close to A+ as possible.

  • Difference between the two ensemble problems, for the case where n−1 of the base-level clusterings are of poor quality and only A(n) is good:

    • The objective of a cluster ensemble is to find a clustering that agrees the most with all n clusterings, including the n−1 poor-quality ones.

    • The goal of the ER ensemble would be to simply output A(n) as its A* and completely ignore the rest of the clusterings.

    • Let K be the true number of clusters:

      if K is large, then high threshold

      if K is small, then low threshold


  • Two novel ER ensemble approaches.

    • A predictive way of selecting context features that relies on estimating the unknown parameters of the data being processed.

    • An extensive empirical evaluation of the proposed solution.


Related Work

  • Entity Resolution: Many of the approaches have focused on exploiting various similarity metrics, relational similarities, and probabilistic models.

  • Clustering Ensembles: The final clustering is usually the one that agrees the most with the given clusterings.

  • Combining Classifiers:

    • Majority voting: a popular rule.

    • Bagging and boosting: combine classifiers of the same type.

    • Stacking: performs cross-validation on the base-level dataset and creates a meta-level dataset from the predictions of the base-level classifiers.


Problem Definition

Entity Resolution

  • Two references ri and rj are said to co-refer, denoted ri ~ rj, if they refer to the same object, that is, if o(ri) = o(rj), where o(r) is the object that reference r refers to.

  • The co-reference set Cr for r is the set of all references that co-refer with r, i.e., Cr = {q ∈ R : q ~ r}.

  • The output of an ER system S applied to dataset D is a set of clusters A = {A1, A2, ..., A|A|}.

  • To assess the quality of the resulting clustering, we evaluate how different the clustering A is from the ground truth A+.
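
One way to make "how different A is from A+" concrete is a pairwise F1 score over co-referring pairs. The slides do not name the metrics actually used, so this is only a minimal sketch of one common choice; the function names and the toy data are hypothetical.

```python
from itertools import combinations


def coref_pairs(clustering):
    """Return the set of unordered reference pairs that share a cluster."""
    pairs = set()
    for cluster in clustering:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(predicted, truth):
    """Pairwise F1 between a predicted clustering A and the ground truth A+."""
    pred_pairs = coref_pairs(predicted)
    true_pairs = coref_pairs(truth)
    if pred_pairs == true_pairs:
        return 1.0  # identical pair sets, including the all-singletons case
    correct = len(pred_pairs & true_pairs)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_pairs)
    recall = correct / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Toy data (assumed): A+ groups r1, r2, r3; the system splits r3 off wrongly.
truth = [{"r1", "r2", "r3"}, {"r4"}]
predicted = [{"r1", "r2"}, {"r3", "r4"}]
score = pairwise_f1(predicted, truth)
```

Here one of the two predicted pairs is correct (precision 1/2) and one of the three true pairs is recovered (recall 1/3), so the score is 0.4.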


ER Ensemble

  • The base-level systems S1, S2, ..., Sn can be applied to dataset D to cluster the set of references R. Each system Si will produce its set of clusters A(i) = {A(i)1, A(i)2, ..., A(i)|A(i)|}.

  • The goal is to combine the resulting clusterings A(1), A(2), ..., A(n) into a single clustering A* such that the quality of A* is higher than that of A(i), for i = 1, 2, ..., n.

  • The task of ER ensemble is, for each edge ej ∈ E, to provide a mapping dj → a*j, where dj holds the decisions of the base-level systems on ej. Here a*j ∈ {−1, 1} is the prediction of the combined algorithm for edge ej, where a*j = 1 if the overall algorithm believes a+j = 1, and a*j = −1 otherwise.
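
The edge view can be sketched as follows: each pair of references is an edge, and its decision vector holds one +1/−1 entry per base-level system. This is an illustrative sketch; `edge_decisions` and the toy clusterings are assumptions, not names from the paper.

```python
from itertools import combinations


def edge_decisions(references, clusterings):
    """Map each edge (ri, rj) to a tuple of +1/-1 decisions, one per system.

    A system's decision is +1 if it places both references in one cluster.
    """
    decisions = {}
    for a, b in combinations(sorted(references), 2):
        vector = []
        for clustering in clusterings:
            same = any(a in c and b in c for c in clustering)
            vector.append(1 if same else -1)
        decisions[(a, b)] = tuple(vector)
    return decisions


# Toy data (assumed): two base-level systems over three references.
refs = ["r1", "r2", "r3"]
A1 = [{"r1", "r2"}, {"r3"}]   # clustering produced by S1
A2 = [{"r1", "r2", "r3"}]     # clustering produced by S2
d = edge_decisions(refs, [A1, A2])
```

Both systems agree on the edge (r1, r2), while they disagree on the two edges that touch r3; those disagreements are what the ensemble has to resolve.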


Problem Definition

  • Using Blocking to Improve Efficiency: Virtual Connected Subgraph (VCS).

  • Blocking is important for improving the efficiency and also the quality of the algorithms.

  • Naive solution: Voting, which counts the base-level predictions.

  • A more advanced voting scheme is Weighted Voting (WV).

  • The problem with voting is that it is a static scheme that does not adapt to the context.
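
The two voting baselines can be sketched on a single edge's decision vector. How the weights for WV are obtained is not stated in the slides; the weights below are made-up values standing in for some per-system reliability score, and ties are broken toward −1 as an arbitrary choice.

```python
def majority_vote(decision):
    """Return +1 if positive votes outnumber negative ones, else -1."""
    return 1 if sum(decision) > 0 else -1


def weighted_vote(decision, weights):
    """Return +1 if the weighted sum of the +1/-1 votes is positive, else -1."""
    score = sum(w * d for w, d in zip(weights, decision))
    return 1 if score > 0 else -1


# Toy data (assumed): S1 says "co-refer", S2 and S3 say "do not co-refer".
decision = (1, -1, -1)
weights = (0.9, 0.3, 0.3)  # hypothetical fixed per-system weights

plain = majority_vote(decision)
weighted = weighted_vote(decision, weights)
```

Plain voting follows the two negative votes, while the weighted scheme follows the single highly weighted system. In both cases the rule is fixed once and applied identically to every edge, which is the non-adaptive behavior the last bullet criticizes.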


Context Based Framework

  • Context Features : Effectiveness and Generality

  • Number of Clusters: This is derived from the expected number of clusters per VCS as well as the number of clusters generated by each base-level system.

  • Node Fanout: It exploits the nodes adjacent to the edges at the edge level.

  • The overall context feature vector fj is computed as a linear combination of the feature vectors

  • For m features,

  • Adding more context features can increase the robustness and accuracy of the system
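
The formulas for the context features did not survive in this transcript, so the following is only a rough sketch of the raw quantities the two bullets mention: how many clusters each system produces inside a block, and how many references each system attaches to the endpoints of an edge. The function names, the way the two quantities are combined into one list, and the toy data are all assumptions.

```python
def clusters_in_block(clustering, block):
    """Count the clusters of one system that contain a reference from the block."""
    return sum(1 for cluster in clustering if cluster & block)


def node_fanout(node, clustering):
    """Number of other references that the system groups with `node`."""
    for cluster in clustering:
        if node in cluster:
            return len(cluster) - 1
    return 0


def raw_context(edge, block, clusterings):
    """Per system: cluster count in the block, then summed fanout of the edge's nodes."""
    u, v = edge
    features = []
    for clustering in clusterings:
        features.append(clusters_in_block(clustering, block))
        features.append(node_fanout(u, clustering) + node_fanout(v, clustering))
    return features


# Toy data (assumed): one block of three references, two base-level systems.
block = {"r1", "r2", "r3"}
A1 = [{"r1", "r2"}, {"r3"}]
A2 = [{"r1", "r2", "r3"}]
f = raw_context(("r1", "r3"), block, [A1, A2])
```

For the edge (r1, r3) this yields [2, 1, 1, 4]: S1 sees two clusters and a combined fanout of 1, S2 sees one cluster and a combined fanout of 4.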


Overview of the Approach


Meta-level Classification

  • Two algorithms for meta-level classification:

  • Context-Extended Classification: uses regression to estimate K* and Nv*, then computes fji.

  • It is then possible to learn a mapping from the features dj × fj to edge predictions a*j by training a classifier on the ground-truth labels a+j.
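
The idea of learning from dj × fj can be sketched by concatenating the decision vector with the context features and training any classifier on the result. The slides do not say which classifier is used; a 1-nearest-neighbour rule stands in here purely for illustration, and the training rows are invented.

```python
def extend(decision, context):
    """Concatenate the base-level decisions d_j with the context features f_j."""
    return tuple(decision) + tuple(context)


class NearestNeighbour:
    """Minimal 1-NN classifier used as a stand-in meta-level classifier."""

    def fit(self, rows, labels):
        self.rows = [tuple(r) for r in rows]
        self.labels = list(labels)
        return self

    def predict(self, row):
        def squared_distance(other):
            return sum((x - y) ** 2 for x, y in zip(row, other))

        best = min(range(len(self.rows)),
                   key=lambda i: squared_distance(self.rows[i]))
        return self.labels[best]


# Toy training edges (assumed): two systems plus one context feature in [0, 1].
# S1 happens to be right when the feature is low, S2 when it is high.
rows = [
    extend((1, -1), (0.1,)),
    extend((-1, 1), (0.1,)),
    extend((1, -1), (0.9,)),
    extend((-1, 1), (0.9,)),
]
labels = [1, -1, -1, 1]  # ground-truth edge labels a+
model = NearestNeighbour().fit(rows, labels)

low_context = model.predict(extend((1, -1), (0.2,)))
high_context = model.predict(extend((1, -1), (0.8,)))
```

The same decision vector (1, −1) is mapped to +1 in the low context and to −1 in the high context, which a context-free voting rule could not do.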


  • Context-Weighted Classification: The idea of this approach is to learn for each base-level system Si a model Mi that would predict how accurate Si tends to be in a given context.
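
A minimal sketch of this idea: estimate, per system and per context, how often the system is right, then let each system vote with a weight derived from that estimate. The real models Mi are presumably learned classifiers or regressors; the bucketed accuracy table, the `bucket` function, and the weight 2·accuracy − 1 are simplifying assumptions made here.

```python
from collections import defaultdict


def fit_accuracy_models(training, n_systems, bucket):
    """Estimate each system's accuracy per context bucket.

    `training` is a list of (decision, context, truth) triples.
    """
    hits = [defaultdict(int) for _ in range(n_systems)]
    totals = [defaultdict(int) for _ in range(n_systems)]
    for decision, context, truth in training:
        key = bucket(context)
        for i, d in enumerate(decision):
            totals[i][key] += 1
            hits[i][key] += int(d == truth)
    return [{k: hits[i][k] / totals[i][k] for k in totals[i]}
            for i in range(n_systems)]


def context_weighted_vote(decision, context, models, bucket):
    """Vote with weight 2*accuracy - 1, so unreliable systems count negatively."""
    key = bucket(context)
    score = 0.0
    for d, model in zip(decision, models):
        accuracy = model.get(key, 0.5)  # unseen context: no information
        score += (2 * accuracy - 1) * d
    return 1 if score > 0 else -1


def bucket(context):
    """Hypothetical discretisation of a single context feature."""
    return "high" if context[0] >= 0.5 else "low"


# Toy training edges (assumed): S1 is right in low contexts, S2 in high ones.
training = [
    ((1, -1), (0.1,), 1),
    ((-1, 1), (0.2,), -1),
    ((1, -1), (0.9,), -1),
    ((-1, 1), (0.8,), 1),
]
models = fit_accuracy_models(training, 2, bucket)
low = context_weighted_vote((1, -1), (0.3,), models, bucket)
high = context_weighted_vote((1, -1), (0.7,), models, bucket)
```

With this toy data S1 gets accuracy 1.0 in the low bucket and 0.0 in the high bucket (and S2 the reverse), so the same votes produce +1 in one context and −1 in the other.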


Creating Final Clusters

  • Correlation Clustering (CC) generates the partitioning of data points that optimizes a score function.

  • CC was specifically designed for the situation where there are conflicts in edge labeling.

  • Input: A graph where nodes correspond to references and each edge (u, v) is labeled, in the simplest case, with +1 if there is evidence that u and v co-refer, and with −1 otherwise.

  • CC finds a clustering that has the most agreement with the edge labeling, therefore limiting the conflicts as much as possible.
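
Finding the optimal correlation clustering is hard in general, so heuristics are common. The sketch below uses a simple pivot heuristic with a fixed pivot order and counts the resulting disagreements; it is not necessarily the CC algorithm used in the paper, and the function names and toy labels are assumptions.

```python
def pivot_correlation_clustering(nodes, labels):
    """Greedy pivot heuristic: take a pivot, group it with its +1 neighbours.

    `labels` maps frozenset({u, v}) to +1 or -1; missing edges count as -1.
    """
    remaining = list(nodes)
    clusters = []
    while remaining:
        pivot = remaining[0]
        cluster = {pivot}
        for other in remaining[1:]:
            if labels.get(frozenset((pivot, other)), -1) == 1:
                cluster.add(other)
        clusters.append(cluster)
        remaining = [n for n in remaining if n not in cluster]
    return clusters


def disagreements(clusters, labels):
    """Count edges whose label conflicts with the clustering."""
    where = {n: i for i, c in enumerate(clusters) for n in c}
    bad = 0
    for pair, label in labels.items():
        u, v = tuple(pair)
        together = where[u] == where[v]
        if together != (label == 1):
            bad += 1
    return bad


# Toy graph (assumed): a-b and a-c are positive, but b-c is negative (a conflict).
labels = {
    frozenset(("a", "b")): 1,
    frozenset(("a", "c")): 1,
    frozenset(("b", "c")): -1,
    frozenset(("a", "d")): -1,
    frozenset(("b", "d")): -1,
    frozenset(("c", "d")): -1,
}
clusters = pivot_correlation_clustering(["a", "b", "c", "d"], labels)
conflicts = disagreements(clusters, labels)
```

The result groups a, b, and c and leaves d alone, with one unavoidable disagreement on the conflicting edge b-c.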


Experimental Evaluation

  • Set Up: Dataset, Evaluation Metrics, Baseline Methods and Classifiers.

  • Experiments on Web Domain: Base-level Algorithms, Comparison of various combination algorithms.

  • The main differences among these algorithms are (a) the similarity metrics chosen to compare pairs of references and (b) the clustering algorithms that generate the final clusters based on the similarities.


  • Efficiency: The running time of the ER ensemble algorithm consists of several parts: running the base-level ER systems and loading their decisions on the edges, running the two regression classifiers to derive the context features, applying meta-level classification to predict the edge classes, and creating the final clusters.

  • Effects of Blocking: Blocking is important for improving the efficiency of the algorithms.

  • It also has a great impact on the quality of the ER ensemble, since the VCS-level context features rely on the blocking.
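
The slides do not spell out how a Virtual Connected Subgraph is formed. The sketch below assumes that two references fall into the same block whenever at least one base-level system links them, and computes the blocks as connected components with union-find; treat both the rule and the names as assumptions.

```python
def virtual_connected_subgraphs(references, decisions):
    """Group references into blocks connected by edges that any system marks +1.

    `decisions` maps an edge (u, v) to a tuple of +1/-1 base-level decisions.
    """
    parent = {r: r for r in references}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), decision in decisions.items():
        if any(d == 1 for d in decision):
            parent[find(u)] = find(v)

    blocks = {}
    for r in references:
        blocks.setdefault(find(r), set()).add(r)
    return sorted(sorted(block) for block in blocks.values())


# Toy data (assumed): only (r1, r2) and (r3, r4) get a positive vote from anyone.
refs = ["r1", "r2", "r3", "r4"]
decisions = {
    ("r1", "r2"): (1, -1),
    ("r1", "r3"): (-1, -1),
    ("r3", "r4"): (-1, 1),
}
blocks = virtual_connected_subgraphs(refs, decisions)
```

Later stages then only need to consider edges inside each block, which is where the efficiency gain comes from, and block-level quantities such as the number of clusters are computed per block.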


Conclusion

  • This paper develops a novel ER ensemble framework for combining multiple base-level entity resolution systems.

  • The Context-Extended and Context-Weighted Ensemble approaches for meta-level classification are also proposed; they enable the framework to adapt to the dataset being processed based on the local context.

  • This study demonstrates the superiority of the Context-Weighted Ensemble approach.


Future Work of the Paper

  • To explore fundamentally different ways to create context features, and a different classification scheme that trades quality for efficiency.


Thank You!!!

