Exploiting Relationships for Object Consolidation

Presentation Transcript


  1. Work supported by NSF Grants IIS-0331707 and IIS-0083489 Exploiting Relationships for Object Consolidation Zhaoqi Chen Dmitri V. Kalashnikov Sharad Mehrotra Computer Science Department University of California, Irvine http://www.ics.uci.edu/~dvk/RelDC http://www.itr-rescue.org (RESCUE) ACM IQIS 2005

  2. Talk Overview • Motivation • Object consolidation problem • Proposed approach • RelDC: Relationship-based data cleaning • Relationship analysis and graph partitioning • Experiments

  3. Why do we need “Data Cleaning”? [Slide cartoon: a CiteSeer rank listing and a publication list] • Jane Smith (fresh Ph.D.): “Hi, my name is Jane Smith. I’d like to apply for a faculty position at your university.” • Tom (recruiter): “Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?” • Tom: “OK, let me check something quickly… ???”

  4. What is the problem? • Names often do not uniquely identify people [Figure: CiteSeer’s list of the top-k most cited authors, with the corresponding DBLP entries]

  5. Comparing raw and cleaned CiteSeer [Figure: raw CiteSeer top-k list vs. cleaned CiteSeer top-k list]

  6. Object Consolidation Problem • Cluster representations that correspond to the same real-world object/entity • Two instances of the problem: the real-world objects are known / unknown [Figure: representations r1, r2, …, rN in the database mapped to the real objects o1, o2, …, oM]

  7. RelDC Approach • Exploit relationships among objects to disambiguate when the traditional approach of clustering based on feature similarity does not work • RelDC = Relationship-based Data Cleaning [Figure: the RelDC framework, combining traditional feature-based methods (features and context) with relationship analysis over the ARG]

  8. Attributed Relational Graph (ARG) • View the database as an ARG • Nodes • one per cluster of representations (if already resolved by the feature-based approach) • one per representation (for “tough” cases) • Edges • Regular – correspond to relationships between entities • Similarity – created using feature-based methods on representations
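A minimal sketch of how such an ARG could be assembled with networkx; the node kinds and edge attributes ("cluster", "representation", "regular", "similarity", rel_type, sim) are illustrative names, not the authors' implementation.

```python
import networkx as nx

def build_arg(resolved_clusters, tough_representations, relationships, similarity_pairs):
    """Assemble an attributed relational graph (ARG) as described on slide 8."""
    arg = nx.Graph()
    # One node per already-resolved cluster and one per unresolved ("tough") representation.
    arg.add_nodes_from(resolved_clusters, kind="cluster")
    arg.add_nodes_from(tough_representations, kind="representation")
    # Regular edges correspond to relationships between entities.
    for u, v, rel_type in relationships:
        arg.add_edge(u, v, kind="regular", rel_type=rel_type)
    # Similarity edges connect representations that feature-based methods could not separate.
    for u, v, sim in similarity_pairs:
        arg.add_edge(u, v, kind="similarity", sim=sim)
    return arg
```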

  9. Context Attraction Principle (CAP) • Who is “J. Smith”? • Jane? • John?

  10. Questions to Answer • Does the CAP principle hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve? • Can we design a generic strategy that exploits CAP for consolidation?

  11. Consolidation Algorithm • Step 1: Construct the ARG and identify all virtual clusters (VCSs) • use FBS (feature-based similarity methods) in constructing the ARG • Step 2: Choose a VCS and compute the connection strength between nodes • for each pair of representations connected via a similarity edge • Step 3: Partition the VCS • use a graph partitioning algorithm • partitioning is based on connection strength • after partitioning, adjust the ARG accordingly • go to Step 2 if more potential clusters exist
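A sketch of this loop in Python, assuming an ARG built as in the networkx sketch above; here each connected component of similarity edges is treated as a VCS, and the connection-strength model and partitioning routine are passed in as callables (sketched under slides 12–14 and 26). This illustrates the control flow only, not the authors' code.

```python
import networkx as nx

def identify_vcs(arg):
    """Treat each connected component of the similarity edges as a virtual cluster (VCS)."""
    sim_edges = [(u, v) for u, v, d in arg.edges(data=True) if d.get("kind") == "similarity"]
    sim_graph = arg.edge_subgraph(sim_edges)
    return [set(c) for c in nx.connected_components(sim_graph) if len(c) > 1]

def consolidate(arg, connection_strength, partition):
    worklist = identify_vcs(arg)                      # Step 1: ARG built with FBS, VCSs found
    while worklist:
        vcs = worklist.pop()                          # Step 2: choose a VCS
        strengths = {(u, v): connection_strength(arg, u, v)
                     for u, v in arg.subgraph(vcs).edges()
                     if arg[u][v].get("kind") == "similarity"}
        parts = partition(vcs, strengths)             # Step 3: partition based on strengths
        if len(parts) > 1:
            # Adjust the ARG: drop similarity edges that now cross different parts.
            label = {node: i for i, part in enumerate(parts) for node in part}
            for u, v in list(arg.subgraph(vcs).edges()):
                if arg[u][v].get("kind") == "similarity" and label[u] != label[v]:
                    arg.remove_edge(u, v)
            worklist.extend(p for p in parts if len(p) > 1)   # back to Step 2
    return arg
```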

  12. Connection Strength c(u,v) • Models for c(u,v) • many possibilities • diffusion kernels, random walks, etc. • none is fully adequate • cannot learn similarity from data • Diffusion kernels • σ_1(x,y): “base similarity” • via direct links (of length 1) • σ_k(x,y): “indirect similarity” • via links of length k • B: base similarity matrix, where B_xy = B^1_xy = σ_1(x,y) • B^k: indirect similarity matrix • K: total similarity matrix, or “kernel”
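To make the diffusion-kernel view concrete, here is a truncated kernel K ≈ Σ_k (λ^k / k!) B^k computed with numpy. The exponential weighting is one standard choice for diffusion kernels; the slide itself does not fix how the B^k terms are combined.

```python
import numpy as np

def diffusion_kernel(B, decay=0.5, max_len=7):
    """Truncated diffusion kernel: K ~= sum_{k=1..max_len} (decay^k / k!) * B^k."""
    n = B.shape[0]
    K = np.zeros((n, n))
    Bk = np.eye(n)
    factorial = 1.0
    for k in range(1, max_len + 1):
        Bk = Bk @ B                 # B^k: indirect similarity via connections of length k
        factorial *= k
        K += (decay ** k / factorial) * Bk
    return K

# Usage: c(u, v) is then read off as K[u, v] for node indices u and v.
```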

  13. Connection Strength c(u,v) (cont.) • Instantiating parameters • Determining σ(x,y) • regular edges have types T1,...,Tn • types T1,...,Tn have weights w1,...,wn • σ(x,y) = wi • get the type of a given edge • assign its weight as the base similarity • Handling similarity edges • σ(x,y) assigned a value proportional to the similarity (heuristic) • Approach to learn σ(x,y) from data (ongoing work) • Implementation • we do not compute the whole matrix K • we compute one c(u,v) at a time • limit path lengths by L
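A small sketch of this parameter instantiation over the ARG from the earlier networkx example: regular edges take the weight of their type as base similarity, and similarity edges take a value proportional to their FBS score. The type names and weights are assumed for illustration, not taken from the paper.

```python
# Assumed, illustrative edge-type weights.
TYPE_WEIGHTS = {"co_author": 1.0, "affiliation": 0.5, "co_starred": 0.8}

def base_similarity(arg, u, v, sim_scale=0.5):
    """sigma(x,y) for one ARG edge: type weight for regular edges, scaled FBS score otherwise."""
    data = arg[u][v]
    if data.get("kind") == "regular":
        return TYPE_WEIGHTS.get(data.get("rel_type"), 0.0)
    if data.get("kind") == "similarity":
        return sim_scale * data.get("sim", 0.0)   # heuristic: proportional to similarity
    return 0.0
```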

  14. Consolidation via Partitioning • Observations • each VCS contains representations of at least one object • if a representation is in a VCS, then the rest of the representations of the same object are in it too • Partitioning • two cases: k, the number of entities in the VCS, is known, or k is unknown • when k is known, use any partitioning algorithm • maximize inside connections, minimize outside connections • we use the normalized cut of [Shi, Malik 2000] • when k is unknown • split into two, just to see the cut • compare the cut against a threshold • decide “to split” or “not to split” • iterate
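A sketch of the "k is unknown" case: tentatively bisect the VCS with a spectral, normalized-cut style split over its connection-strength matrix and accept the split only if the cut value falls below a threshold. This follows the spirit of Shi and Malik's normalized cut; the threshold value and the spectral-bisection details are illustrative, not the authors' exact procedure.

```python
import numpy as np

def ncut_value(W, mask):
    """Normalized cut of the bipartition given by boolean mask over weight matrix W."""
    cut = W[mask][:, ~mask].sum()
    assoc_a, assoc_b = W[mask].sum(), W[~mask].sum()
    return cut / max(assoc_a, 1e-12) + cut / max(assoc_b, 1e-12)

def split_or_not(W, threshold=0.3):
    """W: symmetric connection-strength matrix of one VCS. Returns a list of index arrays."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    vals, vecs = np.linalg.eigh(L_sym)
    fiedler = vecs[:, 1]                                   # second-smallest eigenvector
    mask = fiedler >= np.median(fiedler)                   # tentative bisection
    if mask.all() or (~mask).all():
        return [np.arange(len(W))]                         # degenerate: do not split
    if ncut_value(W, mask) > threshold:
        return [np.arange(len(W))]                         # cut too expensive: keep one cluster
    return [np.where(mask)[0], np.where(~mask)[0]]         # accept the split; recurse on each part
```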

  15. Measuring Quality of Outcome • Dispersion • for an entity, into how many clusters its representations are clustered; ideal is 1 • Diversity • for a cluster, how many distinct entities it covers; ideal is 1 • Entity uncertainty • for an entity, if out of its m representations m1 go to cluster C1, …, mn go to cluster Cn, then H = -Σi (mi/m) log(mi/m) • Cluster uncertainty • if a cluster of m representations consists of m1 of entity E1, …, mn of entity En, then H is defined the same way • ideal entropy is zero
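The same measures restated as a small Python sketch (my own reading of the standard definitions; the slide does not fix the logarithm base, so natural log is used here).

```python
from collections import Counter
from math import log

def dispersion(clusters_of_entity):
    """Number of distinct clusters an entity's representations ended up in (ideal: 1)."""
    return len(set(clusters_of_entity))

def diversity(entities_in_cluster):
    """Number of distinct entities covered by one cluster (ideal: 1)."""
    return len(set(entities_in_cluster))

def entropy(labels):
    """Entity/cluster uncertainty: H = -sum_i (m_i/m) log(m_i/m); ideal is 0."""
    m = len(labels)
    return -sum((mi / m) * log(mi / m) for mi in Counter(labels).values())

# Example: an entity whose 4 representations land in clusters C1, C1, C2, C2
# has dispersion 2 and entropy log(2).
```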

  16. Experimental Setup • Uncertainty • d1, d2, …, dn are director entities • pick a fraction d1, d2, …, dm • group entities into groups of size k, e.g. groups of two: {d1,d2}, …, {d9,d10} • make all representations within a group indiscernible by FBS, … • Baseline 1 • one cluster per VCS, regardless • equivalent to using only FBS • ideal dispersion & H(E)! • Baseline 2 • knows the grouping statistics • guesses the number of entities in a VCS • randomly assigns representations to clusters • Parameters • L-short simple paths, L = 7 • L is the path-length limit • Note • the algorithm is applied to the “tough cases”, after FBS has already successfully consolidated many entries! • RealMov • movies (12K) • people (22K): actors, directors, producers • studios (1K): producing, distributing
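For concreteness, Baseline 2 as described above could look like the following sketch: it guesses the number of entities in a VCS from the known grouping statistics and assigns each representation to a cluster uniformly at random (illustrative only, not the authors' exact procedure).

```python
import random

def baseline2(vcs_representations, guessed_num_entities, seed=0):
    """Randomly assign each representation in a VCS to one of the guessed clusters."""
    rng = random.Random(seed)
    return {rep: rng.randrange(guessed_num_entities) for rep in vcs_representations}
```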

  17. Sample Movies Data

  18. The Effect of L on Quality [Plots: cluster entropy & diversity; entity entropy & dispersion]

  19. Effect of Threshold and Scalability

  20. Summary RelDC • domain-independent data cleaning framework • uses relationships for data cleaning • reference disambiguation [SDM’05] • object consolidation [IQIS’05] Ongoing work • “learning” the importance of relationships from data • Exploiting relationships among entities for other data cleaning problems

  21. Contact Information RelDC project www.ics.uci.edu/~dvk/RelDC www.itr-rescue.org (RESCUE) Zhaoqi Chen chenz@ics.uci.edu Dmitri V. Kalashnikov www.ics.uci.edu/~dvk dvk@ics.uci.edu Sharad Mehrotra www.ics.uci.edu/~sharad sharad@ics.uci.edu

  22. extra slides…

  23. Object Consolidation Notation • O = {o1, …, o|O|}: set of entities • unknown in general • X = {x1, …, x|X|}: set of representations • d[xi]: the entity xi refers to • unknown in general • C[xi]: all representations that refer to d[xi] • the “group set” • unknown in general • the goal is to find it for each xi • S[xi]: all representations that can be xi • the “consolidation set” • determined by FBS • we assume C[xi] ⊆ S[xi]

  24. Object Consolidation Problem • Let O = {o1, …, o|O|} be the set of entities • unknown in general • Let X = {x1, …, x|X|} be the set of representations • Goal: map each xi to its corresponding entity oj in O • d[xi]: the entity xi refers to • unknown in general • C[xi]: all representations that refer to d[xi] • the “group set” • unknown in general • the goal is to find it for each xi • S[xi]: all representations that can be xi • the “consolidation set” • determined by FBS • we assume C[xi] ⊆ S[xi]
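A toy illustration of this notation in Python; all values are made up, and the final assertion checks the stated assumption C[xi] ⊆ S[xi].

```python
X = ["x1", "x2", "x3", "x4"]                            # representations
d = {"x1": "o1", "x2": "o1", "x3": "o2", "x4": "o2"}    # true entity of each repr. (unknown in general)

# C[x]: group set = all representations referring to the same entity as x (unknown in general).
C = {x: {y for y in X if d[y] == d[x]} for x in X}

# S[x]: consolidation set = representations FBS could not tell apart from x (toy values).
S = {"x1": {"x1", "x2", "x3"}, "x2": {"x1", "x2"},
     "x3": {"x1", "x3", "x4"}, "x4": {"x3", "x4"}}

assert all(C[x] <= S[x] for x in X)                     # the assumption C[xi] is a subset of S[xi]
```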

  25. RelDC Framework

  26. Connection Strength • Computation of c(u,v) • Phase 1: discover connections • all L-short simple paths between u and v • the bottleneck • optimizations exist, not covered in IQIS’05 • Phase 2: measure the strength of the discovered connections • many c(u,v) models exist • we use a model similar to diffusion kernels
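A sketch of the two phases using networkx: Phase 1 enumerates all simple paths of length at most L between u and v; Phase 2 scores each discovered connection with a per-step decay times the product of edge base similarities, one plausible model "similar to diffusion kernels" rather than the exact one from the paper. The base_similarity callable could be the type-weight function sketched after slide 13.

```python
import networkx as nx

def connection_strength(arg, u, v, L=7, decay=0.5, base_similarity=None):
    """Two-phase c(u,v): enumerate L-short simple paths, then score and sum them."""
    base_similarity = base_similarity or (lambda g, a, b: 1.0)   # plug in the weight model
    strength = 0.0
    # Phase 1: all L-short simple paths between u and v (the bottleneck step).
    for path in nx.all_simple_paths(arg, u, v, cutoff=L):
        # Phase 2: score the discovered connection.
        score = decay ** (len(path) - 1)
        for a, b in zip(path, path[1:]):
            score *= base_similarity(arg, a, b)
        strength += score
    return strength
```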

  27. Our c(u,v) Model • Our model & diffusion kernels • virtually identical, but... • we do not compute the whole matrix K • we compute one c(u,v) at a time • we limit path lengths by L • σ(x,y) is unknown in general • the analyst assigns them • learn from data (ongoing work) • Our c(u,v) model • regular edges have types T1,...,Tn • types T1,...,Tn have weights w1,...,wn • σ(x,y) = wi • get the type of a given edge • assign its weight as the base similarity • paths with similarity edges • might not exist, use heuristics
