Exploiting Context Analysis for Combining Multiple Entity Resolution Systems

Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra

University of California, Irvine

ACM SIGMOD 2009 Conference, Providence, RI, USA, June 30 – July 2, 2009

© 2009 Dmitri V. Kalashnikov


Information Quality

[Figure: data processing flow from (raw) data to analysis to decisions; quality of data drives quality of analysis, which drives quality of decisions]

  • Quality of data is critical

  • $1 Billion market

    • Estimated by Forrester Group



Entity Resolution

Entity Resolution (ER)

  • One of the Information Quality challenges

  • Disambiguating uncertain references to objects (in raw data)

Lookup

  • List of all objects is given

  • Match references to objects

Grouping

  • No list of objects is given

  • Group references that corefer



Example of Analysis on Bad Data: CiteSeer

Unexpected Entries

  • Let's check two people in DBLP

    • “A. Gupta”

    • “L. Zhang”

  • Analysis on bad data can lead to incorrect results

  • Fix errors before analysis

[Figure: the data processing flow with a data quality engine between raw data and analysis, shown next to CiteSeer's list of top-k most cited authors and the corresponding DBLP entries]


Motivating ER Ensembles

  • Many ER solutions exist

  • No single ER solution is consistently the best

    • In terms of quality

  • Different ER solutions perform better in different contexts

  • Example:

    • Let K be the true number of clusters

      • K is part of context

    • Assume that we use Agglomerative Clustering (Merging)

      if (K is large) then use Solution1: high threshold

      if (K is small) then use Solution2: low threshold

    • Observe that K is unknown beforehand in this case!


Graphical View of ER Problem

  • Virtual Connected Subgraph (VCS)

    • Use simple techniques to create similarity edges (or connect all refs.)

    • Similarity edges form VCSs

  • VCS properties

    1. Virtual

      • Contains only similarity edges

    2. Connected

      • A path exists between any 2 nodes

    3. Subgraph

      • Subgraphs of the ER graph

    4. Complete

      • Adding more nodes/edges would violate (1) or (2)

[CKM: JCDL 2007]

Logically, the goal of ER is to partition each VCS correctly
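Since a VCS is a maximal connected piece of the similarity-edge graph, it can be computed as a connected component over similarity edges only. Below is a minimal illustrative sketch in Python; the function name, input format, and example edges are assumptions, not the authors' code.

    from collections import defaultdict

    def find_vcss(references, similarity_edges):
        """Group references into VCSs: connected components over similarity edges only."""
        adj = defaultdict(set)
        for u, v in similarity_edges:
            adj[u].add(v)
            adj[v].add(u)

        seen, vcss = set(), []
        for ref in references:
            if ref in seen:
                continue
            # Depth-first walk over similarity edges collects one complete VCS
            component, stack = set(), [ref]
            while stack:
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(adj[node] - component)
            seen |= component
            vcss.append(component)
        return vcss

    # Hypothetical example: 7 references, two groups of similarity edges -> two VCSs
    print(find_vcss(list("ABCDEFG"),
                    [("A", "B"), ("B", "E"), ("E", "F"), ("C", "D"), ("D", "G")]))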


Problem Definition

  • Base-level ER systems are treated as black boxes

    • Apply each to the dataset

  • Represent each output as a graph:

    • one node per reference

    • edges connect each pair of references

  • For each edge ej, system Si makes a decision dji ∈ {−1, +1}

  • Goal: combine dj1, dj2, …, djn into the final decision aj* for ej, such that the final clustering is as close to the ground truth as possible

[Figure: framework overview: the raw dataset is fed to base-level ER systems S1, S2, …, SN; the output of each Si is combined by the ensemble techniques into the final result]
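To make the black-box setup concrete, the sketch below (Python, hypothetical names) shows the only interface the ensemble needs from each base system: a +1 (merge) or −1 (do not merge) decision per edge, collected into a decision matrix.

    def collect_decisions(edges, systems):
        """systems: callables mapping an edge (u, v) to +1 (merge) or -1 (split).
        Returns the matrix d[j][i]: decision of system Si on edge ej."""
        return [[system(edge) for system in systems] for edge in edges]

    # The ensemble then maps each row (dj1, ..., djn) to a final label aj* in {-1, +1}
    # so that the resulting clustering is as close to the ground truth as possible.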



Toy Example: Notation

[Figure: a toy ER graph over references A–G forming two VCSs (VCS1 and VCS2), together with the outputs of ER systems S1 and S2 on that graph]


Naïve Solutions: Voting and Weighted Voting

Voting

  • For each edge ej, sum the decisions dji made by each Si:

      if (sum ≥ 0) then ej is positive (+1), else ej is negative (−1)

  • Notice: if (n − 1) systems perform poorly and only one performs well, the majority will still win…

Weighted Voting

  • Assign a weight wi to each system Si

  • For ej, sum the weighted decisions wi · dji made by the Si's

  • Proceed as in voting
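As an illustration only (not the authors' code), both naive combiners amount to a signed sum over the per-edge decisions:

    def vote(decisions_for_edge):
        """Plain voting: dj1 + ... + djn >= 0 means merge."""
        return +1 if sum(decisions_for_edge) >= 0 else -1

    def weighted_vote(decisions_for_edge, weights):
        """Weighted voting: the sum of wi * dji decides, with fixed per-system weights."""
        s = sum(w * d for w, d in zip(weights, decisions_for_edge))
        return +1 if s >= 0 else -1

    # Hypothetical example with three systems voting on one edge:
    print(vote([+1, -1, +1]))                             # -> +1
    print(weighted_vote([+1, -1, +1], [0.2, 0.9, 0.3]))   # -> -1 (the heavily weighted system wins)

Either way, the weights are fixed per system, which is exactly the static, context-insensitive behavior the next slides address.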


Limitations of Weighted Voting

  • No matter how we choose the weights, in our running example Accuracy ≤ 56%

  • Problem: WV is static and non-adaptive to the context


Choosing Context Features

  • Error Features

    • Measure how far Si's prediction of a parameter deviates from the estimated true value of that parameter

    • The larger the error, the more likely it is that Si's solution is off

  • Combining Features

  • Number of Clusters (K)

    • K+ can help (merging ex.)

      • But, K+ is unknown!

    • Use regression to predict

      • K1, K2, …, Kn → K*

      • Ki is # of clusters by Si

      • Features for edge ej:

  • Node Fanout

    • Nv+ is # of pos. edges of v

      • Also unknown

    • Use regression to predict

      • Nv1, Nv2, …, Nvn → Nv*

      • Nvi is # according to Si

      • Features for edge ej:

Effectiveness – should capture which ER systems work well in the given context

Generality – should be generic, not present in just a few datasets
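As a rough sketch of the regression step for the number-of-clusters feature (scikit-learn, made-up toy numbers, and the specific error feature |Ki − K*| are all assumptions for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training: each row holds K1..Kn reported by the base systems for one training
    # case; y holds the true number of clusters (known only on training data).
    K_train = np.array([[3, 4, 3], [8, 10, 9], [2, 2, 3]])
    K_true = np.array([3, 9, 2])
    reg = LinearRegression().fit(K_train, K_true)

    # Testing: the true K is unknown, so use the regression estimate K* instead,
    # and measure how far each system's Ki is from it (one error feature per Si).
    K_test = np.array([[5, 7, 6]])
    K_star = reg.predict(K_test)[0]
    error_features = np.abs(K_test[0] - K_star)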



Training & Testing

[Figure: training and testing setup; ground truth is available during training only]



Approach 1: Context-Extended Classification

[Figure: example decision tree that splits on context feature f2 (threshold 0.9) and on the base decisions d1 and d2 (−1 vs. +1), with leaves predicting C = −1 or C = +1]

  • Three Methods

    • Method1: learn

    • Method2:

    • Method3: 2n features → n

    • Confidence in “merge”

    • Learn

Context features:
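The slide's exact formulas are not preserved in this transcript. As a hedged sketch of the overall idea behind context-extended classification (per-edge decisions and context features concatenated into one extended feature vector, fed to a single classifier such as the decision tree pictured above), using scikit-learn and made-up toy data:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def extend(decisions, context_features):
        """Extended feature vector per edge: [dj1..djn, fj1..fjm]."""
        return np.hstack([decisions, context_features])

    # Toy training data: 2 base systems, 1 context feature, 4 labeled edges.
    D_train = np.array([[+1, -1], [+1, +1], [-1, -1], [-1, +1]])
    F_train = np.array([[0.95], [0.80], [0.10], [0.92]])
    y_train = np.array([+1, +1, -1, -1])

    clf = DecisionTreeClassifier(max_depth=3).fit(extend(D_train, F_train), y_train)

    # Final decision aj* for a new edge, from its decisions and its context features.
    a_star = clf.predict(extend(np.array([[+1, -1]]), np.array([[0.88]])))[0]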


Approach 2: Context-Weighted Classification

  • Idea

    • For each Si learn model Mi of how well Si performs in context

    • Learn fj → cj

  • Algorithm

    • Apply Si, get dj and fj for ej

    • Apply Mi on fj, get c*ji and pji

      • pji is confidence in c*ji

    • vji = dji · c*ji · pji; vj = (vj1, vj2, …, vjn)

      • May reverse some decisions

    • Learn/Use vj → a*j mapping
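Following the algorithm above, here is a minimal sketch (scikit-learn, hypothetical names and toy data) of how a per-system correctness model Mi can re-weight, and possibly reverse, each system's decision before the meta-level combination:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def weighted_decisions(d_j, f_j, models):
        """d_j: decisions (+1/-1) of the n systems for edge ej; f_j: its context features.
        models[i] predicts whether Si is correct (class 1) in this context."""
        v_j = []
        for d_ji, M_i in zip(d_j, models):
            proba = M_i.predict_proba([f_j])[0]           # [P(Si wrong), P(Si correct)]
            c_star = +1 if proba[1] >= proba[0] else -1   # predicted correctness c*ji
            p_ji = max(proba)                             # confidence pji in c*ji
            v_j.append(d_ji * c_star * p_ji)              # may reverse Si's decision
        return np.array(v_j)

    # Toy setup: 2 systems, 1 context feature, models trained on "was Si correct?" labels.
    F_train = np.array([[0.1], [0.5], [0.9], [0.7]])
    models = [LogisticRegression().fit(F_train, np.array([1, 1, 0, 0])),   # S1 good for small f
              LogisticRegression().fit(F_train, np.array([0, 0, 1, 1]))]   # S2 good for large f
    v = weighted_decisions([+1, -1], [0.85], models)
    # A meta-classifier is then trained on (vj -> aj*) pairs to produce the final decision.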


Clustering

  • Correlation Clustering

    • Once a*j{-1,+1} are known, we need to cluster

    • CC is designed to handle conflicts in labeling

    • Finds clustering that agrees the most with the labeling

    • CC can behave as Agglom. Clustering

      • Set params. accordingly

      • More generic scheme

  • Example

    • Simple merging will merge

    • CC will not

      • 2 negative vs. 1 positive
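The slide does not spell out which correlation-clustering solver is used. As one standard stand-in, here is a sketch of the randomized pivot heuristic for correlation clustering (illustrative only, not necessarily the variant used in the paper):

    import random

    def pivot_correlation_clustering(nodes, labels, seed=0):
        """labels: dict from frozenset({u, v}) to +1 (merge) or -1 (split)."""
        rng = random.Random(seed)
        remaining = list(nodes)
        clusters = []
        while remaining:
            pivot = remaining.pop(rng.randrange(len(remaining)))
            # Cluster the pivot with every unassigned node it has a positive edge to.
            cluster = [pivot] + [v for v in remaining
                                 if labels.get(frozenset((pivot, v)), -1) == +1]
            for v in cluster[1:]:
                remaining.remove(v)
            clusters.append(cluster)
        return clusters

    # Mirroring the slide's point (2 negative edges vs. 1 positive): the positive
    # pair is merged, but the third node stays separate instead of being merged in.
    labels = {frozenset(("x", "y")): +1, frozenset(("y", "z")): -1, frozenset(("x", "z")): -1}
    print(pivot_correlation_clustering(["x", "y", "z"], labels))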


Experimental Setup

  • Dataset

    • Web domain: [WWW’05]

    • Publication domain: RealPub [TODS’05]

  • Baseline Algorithms

    • BestBase - Si that produces the best overall result

    • Majority Voting

    • Weighted Voting

    • Three clustering-aggregation algorithms from [GMT05]

    • Standard ER ensemble [ZR05]

  • Base-level Systems Si

    • TF-IDF + merging, with different merging thresholds

    • Feature+relationship+Correlation Clustering

    • Etc.



Experiment 1: “Sanity Check”

  • Introduce one “perfect” base-level system that always gets perfect results

    • Does not exist in practice

    • Utilizes the ground truth (unknown, of course)

  • As expected, the algorithms were able to learn to use that “perfect” system, and to ignore the results of other base-level systems


Comparing Various Aggregation Algorithms

  • WeightedERE is #1

  • ExtendedERE is #2

  • Both are statistically better

    • According to a t-test with α = 0.05

  • Consistent improvement

    • 5 → 10 → 20

  • Measures: FP, FB, F1

  • Num. systems: 5, 10, 20

  • MajorVot < BestBase

    • Many base algorithms do not perform well


Detailed Results for 20 Systems and FP

  • None of the baselines is consistently better

    • See “BestIndiv”

    • That is why ER Ensemble outperforms the rest


Results on RealPub

  • Results are similar to those on WePS data


Comparing Different Combinations of Baseline Systems on RealPub

  • Combination 1

    • 1 Context, 3 RelER (t=0.05;0.01;0.005), and 1 RelAA (t=0.1)

  • Combination 2

    • 3 RelER (t=0.0005;0.0001;0.00005) and 2 RelAA (t=0.01;0.001)

  • W_ERE #1, E_ERE #2, Comb2 > Comb1


Efficiency Issues

  • Running time consists of

    • Running (in parallel) base-level systems

      • To get decision features

    • Running (in parallel) two regression classifiers

      • To get context features

    • Applying meta-classifier

      • Depends on the type of classifier

      • Usually not a bottleneck (1-5 sec on 5K to 50K data)

    • Applying correlation clustering

      • Not a bottleneck (under a second)

  • Blocking

    • 1–2 orders of magnitude improvement


Future Work

  • Efficiency

    • How to determine which base-level systems to run

      • And on which parts of data

    • Trade efficiency for quality

  • Features

    • Look into more feature types

    • Improve the quality of predictions

      • Apply framework iteratively



Questions?

  • Stella Chen

Sharad Mehrotra

www.ics.uci.edu/~sharad

GDF Project

www.ics.uci.edu/~dvk/GDF

Dmitri V. Kalashnikov

www.ics.uci.edu/~dvk

