Self-tuning in Graph-Based Reference Disambiguation


Rabia Nuray-Turan

Dmitri V. Kalashnikov

Sharad Mehrotra

Computer Science Department

University of California, Irvine

Overview
  • Intro to Data Cleaning
    • Entity resolution
  • RelDC Framework
    • Past work
  • Adapting to data
    • The new part
    • Reduction to an Optimization problem
      • Linear programming
  • Experiments

DASFAA 2007, Bangkok, Thailand

Data Cleaning

Analysis of bad data leads to wrong conclusions

Example of the problem: CiteSeer top-K

Suspicious entries

  • Let's go to the DBLP website
    • which stores bibliographic entries of many CS authors
  • Let's check two people
    • “A. Gupta”
    • “L. Zhang”

CiteSeer: the top-k most cited authors

DBLP


Two Most Common Entity-Resolution Challenges

Fuzzy lookup

  • reference disambiguation
  • match references to objects
    • list of all objects is given

Fuzzy grouping

  • group together object representations that correspond to the same object

Standard Approach to Entity Resolution

Overview
  • Intro to Data Cleaning
  • RelDC Framework
    • Past work
  • Adapting to data
    • The new part
    • Reduction to an Optimization problem
      • Linear programming
  • Experiments

RelDC Framework

RelDC Framework
  • Past work
    • SDM’05, TODS’06
  • Domain-independent framework
    • Views the dataset as an entity-relationship graph
    • Analyzes paths in this graph
  • Solid theoretical foundation
    • Optimization problem
  • Scales to large datasets
  • Robust under uncertainty
  • High disambiguation quality
  • No self-tuning
    • This paper solves this challenge

Entity-Relationship Graph
  • Choice node
    • For uncertain references
    • To encode options/possibilities $y_{r1}, \ldots, y_{rN}$
  • Among options $y_{r1}, \ldots, y_{rN}$
    • Pick the most strongly connected one
      • CAP principle
    • Analyze paths in G
      • that exist between $x_r$ and $y_{rj}$, for all j
    • Use a model to measure connection strength
  • “Connection strength” model
    • c(u,v), for nodes u and v in G
      • how strongly u and v are connected in G
    • RandomWalk-based
      • Fixed
      • Based on intuition!
    • This paper, instead, learns such a model from data.
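
The "analyze paths in G" step can be sketched as a depth-limited search for simple paths between a reference's context node and each of its options. This is a minimal illustration, not the RelDC implementation: the adjacency-list layout, the node names, the edge-type labels, and the `max_len` parameter are all assumptions made for the example.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical graph layout: node -> list of (edge_type, neighbor).
Graph = Dict[str, List[Tuple[str, str]]]


def add_edge(graph: Graph, u: str, v: str, edge_type: str) -> None:
    """Add an undirected, typed edge between u and v."""
    graph.setdefault(u, []).append((edge_type, v))
    graph.setdefault(v, []).append((edge_type, u))


def simple_paths(graph: Graph, source: str, target: str,
                 max_len: int) -> Iterator[List[str]]:
    """Yield each simple path from source to target that has at most
    max_len edges, represented as its list of edge types."""

    def walk(node: str, visited: set, edge_types: List[str]):
        if node == target:
            yield list(edge_types)
            return
        if len(edge_types) == max_len:
            return  # length limit reached without hitting the target
        for edge_type, nxt in graph.get(node, []):
            if nxt not in visited:  # simple path: never revisit a node
                visited.add(nxt)
                edge_types.append(edge_type)
                yield from walk(nxt, visited, edge_types)
                edge_types.pop()
                visited.remove(nxt)

    yield from walk(source, {source}, [])


# Invented toy data: one paper whose uncertain reference has two options.
graph: Graph = {}
add_edge(graph, "paper1", "author_A", "writes")
add_edge(graph, "paper1", "author_B", "writes")
add_edge(graph, "paper2", "author_B", "writes")
add_edge(graph, "paper2", "candidate_1", "writes")
add_edge(graph, "author_A", "uni_X", "affiliated")
add_edge(graph, "candidate_2", "uni_X", "affiliated")

paths_to_1 = list(simple_paths(graph, "paper1", "candidate_1", 5))
paths_to_2 = list(simple_paths(graph, "paper1", "candidate_2", 5))
```

The paths found for each option are the raw material that a connection-strength model then scores.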

Overview
  • Intro to Data Cleaning
  • RelDC Framework
    • Past work
  • Adapting to data
    • The new part
    • Reduction to an Optimization problem
      • Linear programming
  • Experiments

Adaptive Solution
  • Classify the found paths in the graph into a finite set of path types

$S_T = \{T_1, T_2, \ldots, T_N\}$

  • If paths p1 and p2 are of the same type then they are treated as identical.
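  • We can represent the connection between nodes u and v with a path-type count vector, where $c_i$ is the number of paths of type $T_i$ between u and v:

$T_{uv} = \{c_1, c_2, \ldots, c_N\}$

  • If each path type $T_i$ can be associated with a weight $w_i$, then the connection strength will be:

$c(u, v) = \sum_{i=1}^{N} c_i w_i$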
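
A minimal sketch of this idea, assuming the connection strength is the weighted sum of the path-type counts (the natural reading of a count vector paired with one weight per path type). The function name and the numbers are invented for illustration.

```python
from typing import Sequence


def connection_strength(counts: Sequence[int],
                        weights: Sequence[float]) -> float:
    """Weighted sum of path-type counts: sum over i of c_i * w_i."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    return sum(c * w for c, w in zip(counts, weights))


# Invented example: three path types between u and v.
counts_uv = [2, 0, 1]        # two paths of type T1, none of T2, one of T3
weights = [0.5, 1.0, 0.25]   # hypothetical learned weights
strength = connection_strength(counts_uv, weights)
```

Because the strength is linear in the weights, comparing two options reduces to comparing two dot products, which is what makes a linear-programming formulation possible.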

Problems to Answer
  • How will we classify the paths?
  • How will we associate each path type with a weight?

Classifying Paths
  • Path Type Model (PTM):
    • Views each path as a sequence of edges
    • Each edge $e_i$ has a type $E_i$ associated with it
    • Thus, each path p can be associated with a string of edge types
    • Different strings correspond to different path types
    • A weight is associated with each string
  • Different models are also possible
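
The path type model can be sketched as follows: each path becomes a string built from its edge types, and paths with the same string are counted together. This is an illustration under assumptions; the `-` separator, the function names, and the sample paths are invented for the example.

```python
from collections import Counter
from typing import List, Sequence


def path_type(edge_types: Sequence[str]) -> str:
    """Map a path (its sequence of edge types) to a path-type string."""
    return "-".join(edge_types)


def count_vector(paths: Sequence[Sequence[str]],
                 known_types: Sequence[str]) -> List[int]:
    """Count how many of the given paths fall into each known path type."""
    counts = Counter(path_type(p) for p in paths)
    return [counts[t] for t in known_types]


# Invented example: three paths between two nodes, two sharing a type.
paths = [
    ["e1", "e3", "e1"],
    ["e1", "e1", "e3"],
    ["e1", "e3", "e1"],
]
known_types = ["e1-e3-e1", "e1-e1-e3", "e1-e2-e2-e3"]
vector = count_vector(paths, known_types)
```

Note that edge order matters here: `e1-e3-e1` and `e1-e1-e3` are distinct path types even though they contain the same edges.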

Learning Path Weights: Optimization Problem
  • CAP Principle states that:
    • the right option will be better connected
  • Linear programming
  • Learn the path-type weights $w_i$.

Final Solution
  • The value of $c(x_r, y_{rj}) - c(x_r, y_{rl})$ should be maximized for all $r$ and all $l \neq j$
  • Then final solution:

Example: Graph

P1 = e1-e3-e1
P2 = e1-e1-e3
P3 = e1-e2-e2-e3
P4 = e1-e2-e3-e2-e3

Example: Solution
  • w1 =1
  • w3 = w4 = 0
  • w2 can be anything between 0 and 1.
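
A solution of this shape (some weights pinned at 1, some at 0, one left free) is what a linear objective produces when the only constraints are 0 ≤ w_i ≤ 1. The sketch below illustrates that simplified case only: it assumes the objective is the sum of strength differences between the correct option and each wrong option, with strength taken as a weighted sum of path-type counts. The count vectors are invented and are not derived from the example graph, and the formulation in the paper may include further constraints.

```python
from typing import List, Sequence, Tuple

# One training item: (counts for the correct option,
#                     [counts for each wrong option, ...]).
TrainingItem = Tuple[Sequence[int], Sequence[Sequence[int]]]


def objective_coefficients(training: Sequence[TrainingItem]) -> List[int]:
    """Coefficient of each w_i in the sum of (correct - wrong) strengths."""
    n = len(training[0][0])
    coeffs = [0] * n
    for correct, wrongs in training:
        for wrong in wrongs:
            for i in range(n):
                coeffs[i] += correct[i] - wrong[i]
    return coeffs


def solve_box_lp(coeffs: Sequence[int]) -> List[Tuple[float, float]]:
    """Maximize sum(coeffs[i] * w[i]) subject to 0 <= w[i] <= 1.

    Returns, per weight, the interval of optimal values: a positive
    coefficient forces w = 1, a negative one forces w = 0, and a zero
    coefficient leaves the weight free anywhere in [0, 1].
    """
    solution = []
    for d in coeffs:
        if d > 0:
            solution.append((1.0, 1.0))
        elif d < 0:
            solution.append((0.0, 0.0))
        else:
            solution.append((0.0, 1.0))
    return solution


# Invented training data with four path types.
training = [
    ([2, 1, 0, 0], [[0, 1, 1, 0]]),
    ([1, 0, 0, 1], [[0, 0, 1, 2]]),
]
coeffs = objective_coefficients(training)
optimal = solve_box_lp(coeffs)
```

With a richer constraint set, a general-purpose LP solver would be needed in place of the sign test.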

Overview
  • Intro to Data Cleaning
  • RelDC Framework
    • Past work
  • Adapting to data
    • The new part
    • Reduction to an Optimization problem
      • Linear programming
  • Experiments

Experimental Setup

Parameters

  • When looking for L-short simple paths, L = 5
  • L is the path-length limit

RealMov:

  • movies (12K)
  • people (22K)
    • actors
    • directors
    • producers
  • studios (1K)
    • producing
    • distributing
  • ground truth is known

SynPub datasets:

  • many datasets of five different types
  • emulation of RealPub
    • publications (5K)
    • authors (1K)
    • organizations (25K)
    • departments (125K)
  • ground truth is known

Experimental Results on Movies
  • Parameters:
    • Fraction: fraction of uncertain references in the dataset
    • Each reference has 2 choices

Experimental Results on Movies II

Number of options based on PMF Distribution

Hybrid Model: Experimental Results on SynPub

RandomWalk, PTM and the Hybrid Model have the same accuracy

Is RandomWalk the optimal model for the publications domain?

Summary
  • Main Contribution
    • An adaptive solution for connection strength
    • Model learns the weights of different path types
  • Ongoing work
    • Using different models to learn the importance of paths in the connection strength
      • Using standard machine learning techniques, such as decision trees, for learning
      • Different ways to classify paths

Contact Information
  • RelDC project
    • www.ics.uci.edu/~dvk/RelDC
    • www.itr-rescue.org (RESCUE)
  • Rabia Nuray-Turan (contact author)
    • www.ics.uci.edu/~rnuray
  • Dmitri V. Kalashnikov
    • www.ics.uci.edu/~dvk
  • Sharad Mehrotra
    • www.ics.uci.edu/~sharad
