### Self-tuning in Graph-Based Reference Disambiguation

Rabia Nuray-Turan

Dmitri V. Kalashnikov

Sharad Mehrotra

Computer Science Department

University of California, Irvine

Overview

- Intro to Data Cleaning
- Entity resolution

- RelDC Framework
- Past work

- Adapting to data
- The new part
- Reduction to an Optimization problem
- Linear programming

- Experiments

DASFAA 2007, Bangkok, Thailand

Example of the problem: CiteSeer top-K

Suspicious entries

- Let's go to the DBLP website
- which stores bibliographic entries of many CS authors

- Let's check two people
- “A. Gupta”
- “L. Zhang”

CiteSeer: the top-k most cited authors

DBLP


Two Most Common Entity-Resolution Challenges

Fuzzy lookup

- reference disambiguation
- match references to objects
- list of all objects is given

Fuzzy grouping

- group together object representations that correspond to the same object


Standard Approach to Entity Resolution


Overview

- Intro to Data Cleaning
- RelDC Framework
- Past work

- Adapting to data
- The new part
- Reduction to an Optimization problem
- Linear programming

- Experiments



RelDC Framework

- Past work
- SDM’05, TODS’06

- Domain-independent framework
- Viewing the dataset as an Entity Relationship Graph
- Analyzes paths in this graph

- Solid theoretical foundation
- Optimization problem

- Scales to large datasets
- Robust under uncertainty
- High disambiguation quality
- No Self-tuning
- This paper solves this challenge


Entity-Relationship Graph

- Choice node
- For uncertain references
- To encode options/possibilities yr1, … yrN

- Among options yr1, … yrN
- Pick the most strongly connected one
- CAP principle

- Analyze paths in G
- that exist between xr and yrj, for all j

- Use a model to measure connection strength

- Pick the most strongly connected one

- “Connection strength” model
- c(u,v), for nodes u and v in G
- how strongly u and v are connected in G

- RandomWalk-based
- Fixed
- Based on intuition

- This paper, instead, learns such a model from data.
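
The selection step described above (measure the connection strength between the reference's context and each option, then keep the strongest) can be sketched as follows. This is only an illustration: `pick_option`, the node names, and the strength values are hypothetical, and the strength function stands in for whichever c(u,v) model is used.

```python
def pick_option(x, options, strength):
    """Return the option y that maximizes strength(x, y)."""
    return max(options, key=lambda y: strength(x, y))

# Hypothetical connection strengths between one context node and two candidates.
toy_strength = {
    ("paper_7", "gupta_1"): 0.2,
    ("paper_7", "gupta_2"): 0.7,
}

best = pick_option(
    "paper_7",
    ["gupta_1", "gupta_2"],
    lambda x, y: toy_strength[(x, y)],
)
print(best)  # gupta_2
```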


Overview

- Intro to Data Cleaning
- RelDC Framework
- Past work

- Adapting to data
- The new part
- Reduction to an Optimization problem
- Linear programming

- Experiments


Adaptive Solution

- Classify the found paths in the graph into a finite set of path types
ST ={ T1, T2, …, TN}

- If paths p1 and p2 are of the same type then they are treated as identical.
- We can show the connection between nodes u and v with a path-type count vector:
Tuv = { c1, c2, …, cN}

- If each path type Ti can be associated with a weight wi, then the connection strength will be:
$c(u,v) = \sum_{i=1}^{N} c_i w_i$
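
A minimal sketch, assuming the connection strength is the weighted sum of the path-type counts, c(u,v) = c1·w1 + … + cN·wN. The function name and the numbers are made up for illustration.

```python
def connection_strength(counts, weights):
    """Weighted sum of path-type counts: sum_i counts[i] * weights[i]."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    return sum(c * w for c, w in zip(counts, weights))

# Hypothetical example with three path types:
# two paths of type T1, none of type T2, one of type T3.
counts_uv = [2, 0, 1]
weights = [0.5, 1.0, 0.25]
print(connection_strength(counts_uv, weights))  # 2*0.5 + 0*1.0 + 1*0.25 = 1.25
```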


Problems to Answer

- How will we classify the paths?
- How will we associate each path type with a weight?


Classifying Paths

- Path Type Model (PTM):
- Views each path as a sequence of edges
- <e1,e2,e3,…,en>

- Each edge ei has a type Ei associated with it
- Thus, can associate each path p with a string
- <E1,E2,E3,…,En>

- Different strings correspond to different path types
- Associate a weight with each string

- Different models are also possible
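
The Path Type Model can be illustrated with a short sketch: each path is mapped to the sequence of its edge types, and paths with the same sequence are counted together. The edge identifiers and the type names (`writes`, `affiliated`) are hypothetical and only serve the example.

```python
from collections import Counter

def path_type(path_edges, edge_type):
    """Map a path <e1, ..., en> to its type string <E1, ..., En>."""
    return tuple(edge_type[e] for e in path_edges)

def path_type_counts(paths, edge_type):
    """Count how many paths fall into each path type."""
    return Counter(path_type(p, edge_type) for p in paths)

# Hypothetical edges, keyed by an edge id, with their types.
edge_type = {
    "a1-p1": "writes", "a2-p1": "writes",
    "a1-p2": "writes", "a2-p2": "writes",
    "a1-o1": "affiliated", "a2-o1": "affiliated",
}

# Three paths between authors a1 and a2: two via papers, one via an organization.
paths = [
    ["a1-p1", "a2-p1"],
    ["a1-p2", "a2-p2"],
    ["a1-o1", "a2-o1"],
]

counts = path_type_counts(paths, edge_type)
print(counts[("writes", "writes")])          # 2
print(counts[("affiliated", "affiliated")])  # 1
```

Representing a type as a tuple of edge types keeps it hashable, so it can serve directly as a key in the count vector.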


Learning Path Weights : Optimization Problem

- CAP Principle states that:
- the right option will be better connected

- Linear programming
- Learn the path-type weights w.


Final Solution

- The value of c(xr, yrj) - c(xr, yrl) should be maximized for all r and all l ≠ j
- Then the final solution:


Example: Graph

P1= e1-e3-e1 P2= e1-e1-e3

P3= e1-e2-e2-e3 P4= e1-e2-e3-e2-e3


Example: Solution

- w1 =1
- w3 = w4 = 0
- w2 can be anything between 0 and 1.
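
A pattern like this one (some weights at 1, some at 0, one free) is what a linear objective produces when each weight is only bounded to [0, 1]. The sketch below assumes exactly that simplified setting: the objective is the sum of c(xr, yrj) - c(xr, yrl) over the training references, with c taken as a weighted sum of path-type counts, and the only constraints are 0 ≤ wi ≤ 1. The full formulation in the paper may include further constraints, and the training counts here are invented. They are not the graph from the example.

```python
def learn_weights(training):
    """Maximize sum over (correct, wrong) pairs of w . (correct - wrong), with 0 <= w_i <= 1.

    training: list of (correct_counts, [wrong_counts, ...]) per reference.
    Returns (coeff, weights); a weight of None means any value in [0, 1] is optimal.
    """
    n = len(training[0][0])
    coeff = [0] * n
    for correct, wrongs in training:
        for wrong in wrongs:
            for i in range(n):
                coeff[i] += correct[i] - wrong[i]

    # With only box constraints, each weight is decided by the sign of its coefficient.
    weights = []
    for d in coeff:
        if d > 0:
            weights.append(1.0)
        elif d < 0:
            weights.append(0.0)
        else:
            weights.append(None)
    return coeff, weights

# Hypothetical path-type count vectors for two training references,
# each with one correct option and one wrong option.
training = [
    ([2, 1, 0, 0], [[1, 1, 1, 0]]),
    ([1, 0, 0, 0], [[0, 0, 0, 2]]),
]

coeff, weights = learn_weights(training)
print(coeff)    # [2, 0, -1, -2]
print(weights)  # [1.0, None, 0.0, 0.0]
```

With richer constraints this closed-form shortcut no longer applies and a general LP solver would be needed.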


Overview

- Intro to Data Cleaning
- RelDC Framework
- Past work

- Adapting to data
- The new part
- Reduction to an Optimization problem
- Linear programming

- Experiments


Experimental Setup

Parameters

- When looking for L-short simple paths, L = 5
- L is the path-length limit
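
To make the path-length limit concrete, here is a minimal sketch of enumerating simple paths with at most L edges between two nodes using depth-first search. It is not the RelDC implementation; the adjacency-list graph and the node names are hypothetical.

```python
def simple_paths(graph, source, target, max_len):
    """Yield simple paths (lists of nodes) from source to target with at most max_len edges."""
    path = [source]
    visited = {source}

    def dfs(node):
        if node == target:
            yield list(path)
            return
        if len(path) - 1 == max_len:  # edge budget used up
            return
        for nxt in graph.get(node, ()):
            if nxt not in visited:    # "simple": never revisit a node
                visited.add(nxt)
                path.append(nxt)
                yield from dfs(nxt)
                path.pop()
                visited.remove(nxt)

    yield from dfs(source)

# Hypothetical undirected graph given as adjacency lists.
graph = {
    "x": ["a", "b"],
    "a": ["x", "y"],
    "b": ["x", "c"],
    "c": ["b", "y"],
    "y": ["a", "c"],
}

short = list(simple_paths(graph, "x", "y", 2))
longer = list(simple_paths(graph, "x", "y", 5))
print(short)   # [['x', 'a', 'y']]
print(longer)  # [['x', 'a', 'y'], ['x', 'b', 'c', 'y']]
```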

RealMov:

- movies (12K)
- people (22K)
- actors
- directors
- producers

- studios (1K)
- producing
- distributing

- ground truth is known

SynPub datasets:

- many datasets of five different types
- emulation of RealPub
- publications (5K)
- authors (1K)
- organizations (25K)
- departments (125K)

- ground truth is known


Experimental Results on Movies

- Parameters :
- Fraction : fraction of uncertain references in the dataset
- Each reference has 2 choices


Experimental Results on Movies- II

Number of options based on PMF Distribution


Experimental Results on SynPub

RandomWalk, PTM and the Hybrid Model have the same accuracy

Is RandomWalk the optimal model for the Publications domain?


Effect of Random Relationships in the Publications Domain


Summary

- Main Contribution
- An adaptive solution for connection strength
- Model learns the weights of different path types

- Ongoing work
- Using different models to learn the importance of paths in the connection strength
- Use of standard machine learning techniques for learning, such as decision trees
- Different ways to classify paths


Contact Information

- RelDC project
- www.ics.uci.edu/~dvk/RelDC
- www.itr-rescue.org (RESCUE)

- Rabia Nuray-Turan (contact author)
- www.ics.uci.edu/~rnuray

- Dmitri V. Kalashnikov
- www.ics.uci.edu/~dvk

- Sharad Mehrotra
- www.ics.uci.edu/~sharad
