Record Linkage with Uniqueness Constraints and Erroneous Values

Zhang Xiaojian, 26 November 2010, WAMDM Group Meeting.


Presentation Transcript


  1. Record Linkage with Uniqueness Constraints and Erroneous Values • Zhang Xiaojian • 26 November 2010 • WAMDM Group Meeting

  2. Data integration process • Application2 • Data fusion • Felix ACMC08 • Application1 • Schema matching • E. Rahm, VLDBJ’01 • Data fusion • Felix WWW06 • Duplicate detection • Record linkage • A.K. Elmagarmid, TKDE’07 • Entity resolution • Tech Report, Stanford • Data fusion • X. Dong, VLDB’09 • Data exchange • R. Fagin, TODS’05 • Cleaned Data • uncertainty • [Diagram: data sources, each labeled s]

  3. Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions • Some problems drawn from the paper

  4. Motivation • [Figure: sources s1, s2, s3, s4; integration; Cleaned Data; Search Box]

  5. Current Solution • Current two-step solution • Step 1: Record Linkage • link records that are likely to refer to the same real-world entity • [A.K. Elmagarmid, TKDE’07], [W. Winkler, Tech Report’06] • Step 2: Data Fusion • merge the linked records and decide the correct values for each result entity in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys’08] • Uniqueness constraint • Many real-world entities have a unique value for an attribute, e.g., Website (IP), Phone, Facebook account • Co-existence of conflicts and duplicates makes the problem hard to solve

  6. Limitations of Current Solution • Example records (name, phone, address): (Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way); (Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.) • Assume that Phone and Address satisfy uniqueness constraints • Erroneous values may prevent correct matching • Current solutions may fall short when uniqueness constraints exist • (PHONE) 9400 missing

  7. Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions and Future work

  8. Problem Definition • Input • A set of records provided by a set of independent data sources • A set of (hard or soft) uniqueness constraints • Output: • Real-world entities • For each (hard or soft) uniqueness attribute of each entity • True value

  9. Concepts • Entity and Attribute • Value vs. Representations (e.g., New York City → New York City, NYC, N.Y.C.) • Constraint • Uniqueness constraint (hard constraint): D ↔ A • Business Name, Business Phone, Business Address • Soft uniqueness constraint (soft constraint): D ↔ A, with the two directions annotated 1−p1 and 1−p2 • Business Phone (e.g., p1 = 30%, p2 = 10%), where p1 is the upper bound on the probability of an entity having multiple values for A and p2 is the upper bound on the probability of a value of A being shared by multiple entities • Special case: key attribute • Example records: (Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.); (Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way)

  10. Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions and Future work

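  11. K-Partite Graph Encoding • Example records: (Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way); (Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.) • [Figure: source S1 provides the record (Microsofe Corp., xxx-1255, 1 Microsoft Way); it is encoded as a name node N1, a phone node P1 (xxx-1255), and an address node A1 (1 Microsoft Way), with each connecting edge labeled s(1)]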
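
The encoding on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `encode_k_partite`, the attribute keys, and the second source `S2` are assumptions made for the example. Each value representation becomes a node in the part for its attribute, and an edge between two nodes of different attributes is labeled with the set of sources that provide both values in one record.

```python
from collections import defaultdict
from itertools import combinations


def encode_k_partite(records):
    """Encode (source, record) pairs as a k-partite graph.

    records: iterable of (source_id, {attribute: value_representation}).
    Returns (nodes, edges) where nodes maps attribute -> set of values and
    edges maps a pair of (attribute, value) nodes -> set of supporting sources.
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for source, record in records:
        items = sorted(record.items())
        for attribute, value in items:
            nodes[attribute].add(value)
        # Dict keys are unique, so every pair links two different attributes.
        for left, right in combinations(items, 2):
            edges[frozenset((left, right))].add(source)
    return nodes, edges


# S1 is the record shown on the slide; S2 is a hypothetical second source.
RECORDS = [
    ("S1", {"name": "Microsofe Corp.", "phone": "xxx-1255",
            "address": "1 Microsoft Way"}),
    ("S2", {"name": "Microsoft Corp.", "phone": "xxx-1255",
            "address": "1 Microsoft Way"}),
]
NODES, EDGES = encode_k_partite(RECORDS)
```

With these two records, the phone-address edge is supported by both sources, while each name-phone edge is supported by a single source.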

  12. Encoding of the ideal solution • Pre-processing for the K-partite graph • Clustering within each partite set (subset) • [Figure: name nodes N1, N2, N3, N4 with values Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.; phone nodes P1, P2, P3, P4 with values xxx-9400, xxx-1255, xxx-2255, xxx-0500; address nodes A1, A2, A3 with values 1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.]

  13. Clustering with Hard Constraint • Clustering the whole graph G(S) • [Figure: name nodes N1, N2, N3, N4 (Microsoft Corp., MS Corp., Microsofe Corp., Macrosoft Inc.), phone nodes P1, P2, P3, P4 (xxx-9400, xxx-1255, xxx-0500, xxx-2255), and address nodes A1, A2, A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.) partitioned into clusters C1, C2, C3, C4]

  14. Clustering w.r.t. hard constraint • Ideal clustering should meet two requirements • High cohesion within each cluster • Low correlation between different clusters • Objective function for getting the “best” clustering • Choosing the Davies-Bouldin index [Davies and Bouldin, TPAMI’79] • The goal is to minimize the Davies-Bouldin index: $\min(\Phi)$ • The intra-cluster distance $d(C_i, C_i)$ corresponds to the complement of cohesion • The inter-cluster distance $d(C_i, C_j)$ corresponds to the complement of correlation • [Figure: two clusters, each with high cohesion, and low correlation between them]
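
As a rough illustration of the objective, the sketch below computes a Davies-Bouldin style index: for each cluster, take the worst ratio of (intra-cluster distance of i plus intra-cluster distance of j) to the inter-cluster distance between i and j, then average over clusters. Several things here are assumptions made for the example: the helper names, the use of average pairwise distance for both intra- and inter-cluster distance, and the toy one-dimensional points. The slides derive the cluster distance from similarity and association distances instead.

```python
from itertools import combinations, product


def average(values):
    """Mean of an iterable, or 0.0 when it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def cluster_distance(ci, cj, dist):
    """Average pairwise distance within one cluster or between two clusters."""
    if ci is cj:
        return average(dist(a, b) for a, b in combinations(ci, 2))
    return average(dist(a, b) for a, b in product(ci, cj))


def davies_bouldin(clusters, dist):
    """Average over clusters of max_j (d(Ci,Ci) + d(Cj,Cj)) / d(Ci,Cj)."""
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    intra = [cluster_distance(c, c, dist) for c in clusters]
    worst_ratios = []
    for i, ci in enumerate(clusters):
        ratios = []
        for j, cj in enumerate(clusters):
            if i == j:
                continue
            inter = cluster_distance(ci, cj, dist)
            ratios.append((intra[i] + intra[j]) / inter if inter > 0
                          else float("inf"))
        worst_ratios.append(max(ratios))
    return average(worst_ratios)


def absolute_difference(a, b):
    return abs(a - b)


GOOD = [[0.0, 1.0], [10.0, 11.0]]   # tight, well-separated clusters
BAD = [[0.0, 10.0], [1.0, 11.0]]    # loose, interleaved clusters
GOOD_SCORE = davies_bouldin(GOOD, absolute_difference)
BAD_SCORE = davies_bouldin(BAD, absolute_difference)
```

The cohesive, well-separated clustering gets the lower index, which is why the slide states the goal as a minimization.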

  15. Computing cluster distance • Cluster distance function $d(C_i, C_j)$ • $d_S$ is the similarity distance, measuring similarity between value representations of the same attribute • $d_A$ is the association distance, measuring association between value representations of different attributes • The key is how to calculate $d_S$ and $d_A$ for computing the cluster distance

  16. Similarity Distance • How to get $d_S$ • Within the same cluster: $d_S^1(C_1,C_1) = 1 - (0.95+0.65+0.65)/3 = 0.25$ (name); $d_S^2(C_1,C_1) = 0$ (phone); $d_S^3(C_1,C_1) = 0$ (address); $d_S(C_1,C_1) = (0.25+0+0)/3 = 0.083$ • Between different clusters: $d_S^1(C_1,C_4) = 1 - (0.7+0.7+0.4)/3 = 0.4$ (name); $d_S^2(C_1,C_4) = 1 - 0 = 1$ (phone); $d_S^3(C_1,C_4) = 1 - 0 = 1$ (address); $d_S(C_1,C_4) = (0.4+1+1)/3 = 0.8$ • [Figure: clusters C1 and C4 over name nodes N1, N2, N3, N4, phone nodes P1 (xxx-1255) and P4 (xxx-0500), and address nodes A1, A2, A3; name edges carry the pairwise similarities 0.95, 0.65, 0.65 (within C1) and 0.7, 0.7, 0.4 (between C1 and C4); the remaining edge labels are 0.9 and 0]
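
The arithmetic on this slide can be reproduced with a short sketch. The function names are hypothetical, and one convention is an assumption read off the slide's numbers: an attribute with no value pairs to compare inside a cluster (a single phone, a single address) contributes distance 0. Per attribute, the distance is one minus the average pairwise similarity; the similarity distance is the average over the three attributes.

```python
def attribute_distance(similarities):
    """1 - average pairwise similarity; 0.0 when there is nothing to compare."""
    if not similarities:
        return 0.0
    return 1.0 - sum(similarities) / len(similarities)


def similarity_distance(per_attribute_similarities):
    """Average the per-attribute distances (name, phone, address)."""
    distances = [attribute_distance(s) for s in per_attribute_similarities]
    return sum(distances) / len(distances)


# Pairwise similarities copied from the slide's worked example.
WITHIN_C1 = similarity_distance([
    [0.95, 0.65, 0.65],  # name pairs inside C1
    [],                  # a single phone in C1
    [],                  # a single address in C1
])
C1_VS_C4 = similarity_distance([
    [0.7, 0.7, 0.4],     # C1 names vs. the C4 name
    [0.0],               # phones share no similarity
    [0.0, 0.0],          # addresses share no similarity
])
```

Rounded, `WITHIN_C1` is 0.083 and `C1_VS_C4` is 0.8, matching the slide.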

  17. Association Distance • How to get the association distance $d_A$ • Within the same cluster: $d_A^{1,2}(C_1,C_1) = 1 - 7/9 = 0.22$; $d_A^{1,3}(C_1,C_1) = 1 - 8/9 = 0.11$; $d_A^{2,3}(C_1,C_1) = 1 - 7/8 = 0.125$; $d_A(C_1,C_1) = (0.22+0.11+0.125)/3 = 0.153$ • Between different clusters: $d_A^{1,2}(C_1,C_4) = 1 - \max(1/10, 0/10) = 0.9$; $d_A^{1,3}(C_1,C_4) = 0.9$; $d_A^{2,3}(C_1,C_4) = 1$; $d_A(C_1,C_4) = (0.9+0.9+1)/3 = 0.93$ • [Figure: clusters C1 and C4 over name nodes N1, N2, N3, N4, phone nodes P1 (xxx-1255) and P4 (xxx-0500), and address nodes A1, A2, A3; edges are labeled with their supporting sources: S(10), S(1-9), s(2-5), s(1), S(7-8), s(1), S(2-9), S(10), s(2-6), S(7-8), s(1-2), S(2-10), s(1), s(1-5,7,8)]
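
The sketch below only reproduces the arithmetic of this slide. How the support fractions (7/9, 8/9, 7/8, 1/10, 0/10) are derived from the source sets is not spelled out in the transcript, so they are passed in as given. The function names are hypothetical, and the fractions used for the (1,3) and (2,3) attribute pairs between C1 and C4 are assumptions chosen to match the stated distances 0.9 and 1.

```python
def pair_distance(support_fractions):
    """1 minus the strongest support fraction for one attribute pair."""
    return 1.0 - max(support_fractions)


def association_distance(per_pair_fractions):
    """Average the distances of the attribute pairs (1,2), (1,3), (2,3)."""
    distances = [pair_distance(f) for f in per_pair_fractions]
    return sum(distances) / len(distances)


# Fractions taken from the slide for C1 against itself.
WITHIN_C1 = association_distance([[7 / 9], [8 / 9], [7 / 8]])

# C1 against C4: the first entry is from the slide; the other two are
# assumed inputs that yield the slide's distances 0.9 and 1.
C1_VS_C4 = association_distance([[1 / 10, 0 / 10], [1 / 10], [0.0]])
```

Computed without intermediate rounding, `WITHIN_C1` is about 0.153 and `C1_VS_C4` about 0.93, in line with the slide.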

  18. Greedy Algorithm: CLUSTER • Obtaining the optimal clustering is intractable • [T.F. Gonzalez, 82], [J. Šíma et al., 06] • Algorithm: CLUSTER • Step 1: Initialization • Cluster value representations according to their similarity distance and association distance • Step 2: Adjustment • For each node, move it to the cluster that minimizes the Davies-Bouldin (DB) index • Step 3: Convergence checking • Stop if step 2 does not change the clustering result; otherwise, repeat step 2
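
The adjustment and convergence steps can be sketched as a simple local search. This is a toy version under stated assumptions, not the CLUSTER algorithm itself: the initial clustering is supplied by the caller rather than built from similarity and association distances, nodes are distinct numbers compared by absolute difference, intra- and inter-cluster distances are average pairwise distances, and the names `db_index` and `greedy_cluster` are hypothetical.

```python
from itertools import combinations, product


def average(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def db_index(clusters, dist):
    """Davies-Bouldin style index; infinite when fewer than two clusters remain."""
    clusters = [c for c in clusters if c]
    if len(clusters) < 2:
        return float("inf")
    intra = [average(dist(a, b) for a, b in combinations(c, 2)) for c in clusters]
    total = 0.0
    for i, ci in enumerate(clusters):
        worst = 0.0
        for j, cj in enumerate(clusters):
            if i == j:
                continue
            inter = average(dist(a, b) for a, b in product(ci, cj))
            ratio = (intra[i] + intra[j]) / inter if inter > 0 else float("inf")
            worst = max(worst, ratio)
        total += worst
    return total / len(clusters)


def greedy_cluster(initial, dist, max_rounds=100):
    """Move each node to the cluster that lowers the index; stop when stable."""
    clusters = [list(c) for c in initial]
    for _ in range(max_rounds):
        changed = False
        for node in [n for c in clusters for n in c]:
            src = next(i for i, c in enumerate(clusters) if node in c)
            best, best_score = src, db_index(clusters, dist)
            for dst in range(len(clusters)):
                if dst == src:
                    continue
                # Tentatively move the node, score the result, then undo.
                clusters[src].remove(node)
                clusters[dst].append(node)
                score = db_index(clusters, dist)
                clusters[dst].remove(node)
                clusters[src].append(node)
                if score < best_score:
                    best, best_score = dst, score
            if best != src:
                clusters[src].remove(node)
                clusters[best].append(node)
                changed = True
        if not changed:
            break
    return [sorted(c) for c in clusters if c]


RESULT = greedy_cluster([[0, 1, 10], [11]], lambda a, b: abs(a - b))
```

Starting from a clustering that wrongly groups 10 with 0 and 1, the search moves 10 next to 11 and then stops because no further move lowers the index.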

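  19. [Figure: the example graph over name nodes N1, N2, N3, N4 (Microsoft Corp., Microsofe Corp., MS Corp., Macrosoft Inc.), phone nodes P1, P2, P3, P4 (xxx-0500, xxx-9400, xxx-1255, xxx-2255), and address nodes A1, A2, A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.), with clusters C1, C2, C3, C4; annotated index values: Φ = 0.94, 0.93, 0.71, 0.92, 1.15, 1.16, 0.89, 0.71, 0.45]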

  20. Matching w.r.t. Soft Constraints • Next step is to find the best matching between the key attribute and the soft uniqueness attributes • How to match? • [Figure: Graph Transform. Cluster nodes NC1 and NC4 (names: Microsoft Corp., Microsofe Corp., MS Corp., Macrosoft Inc.), PC1, PC2, PC3, PC4 (phones: xxx-2255, xxx-9400, xxx-1255, xxx-0500), and AC1, AC4 (addresses: 2 Sylvan W., 1 Microsoft Way, 2 Sylvan Way); each edge is weighted by its number of supporting sources: 7 s(1-5,7,8), 9 S(1-9), 1 S(6), 1 S(10), 5 s(1-5), 9 S(1-9), 8 S(1-8), 1 S(10)]

  21. Matching w.r.t. Soft Constraint • Goals • Maximize the sum of the weights w(e) of the selected edges • Minimize the gap Gap(N) for each node • How to balance these two goals? Define a score function that balances w(e) and Gap(N) • Getting the “best” matching • Maximize the score function • Greedy algorithm: MATCH • Getting Gap(N) and w(u,v) • [Figure: node N1 connected to P1, P2, P3 by edges labeled 9 (s2-s10), 1 (s1), 7 (s4-s10)]
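
A greedy matching of this kind can be sketched as follows. Nearly everything specific here is an assumption: the transcript does not give the score function, so the sketch uses a hypothetical score w / (Gap(u) + Gap(v) + beta); Gap is taken to be a node's largest incident edge weight minus the weight of the edge under consideration, which agrees with values such as Gap(N1) = 9 for w(N1,P1) = 1 on the example slide; each node is matched at most once, which is stricter than a soft constraint allows; and the edge list is a toy graph loosely modeled on the example.

```python
def max_incident(edges):
    """Largest edge weight touching each node."""
    best = {}
    for u, v, w in edges:
        best[u] = max(best.get(u, 0), w)
        best[v] = max(best.get(v, 0), w)
    return best


def gap(node, weight, best):
    """How far the given edge falls short of the node's heaviest edge."""
    return best[node] - weight


def greedy_match(edges, beta=1.0):
    """Pick edges in decreasing hypothetical score, using each node once."""
    best = max_incident(edges)

    def score(edge):
        u, v, w = edge
        return w / (gap(u, w, best) + gap(v, w, best) + beta)

    used, chosen = set(), []
    for u, v, w in sorted(edges, key=score, reverse=True):
        if u not in used and v not in used:
            chosen.append((u, v, w))
            used.update((u, v))
    return chosen


# Toy name-phone edges: (name node, phone node, number of supporting sources).
EDGES = [
    ("N1", "P1", 1), ("N1", "P2", 7), ("N1", "P3", 9), ("N1", "P4", 10),
    ("N2", "P2", 3), ("N2", "P4", 8),
]
MATCHING = greedy_match(EDGES)
```

On this toy graph the greedy pass selects (N1, P4) first and is then forced to pair N2 with P2. The total weight of 13 is lower than the 17 obtained by pairing N1 with P3 and N2 with P4, which shows why a greedy choice is a heuristic rather than an optimum.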

  22. Continue the example • [Figure: four candidate solutions over name nodes N1, N2 and phone nodes P1, P2, P3, P4; edges are labeled 10 (s1-s10), 9 (s2-s10), 8 (s2-s9), 7 (s4-s10), 3 (s3-s5), 1 (s1); edges are selected greedily] • Solution 1 and Solution 2 (values in slide order): Gap(N1) = 9, Gap(N1) = 3, Gap(N2) = 5, Gap(N2) = 0, Gap(P1) = 0, Gap(P2) = 4, Gap(P2) = 4, Gap(P4) = 2, w(N1,P1) = 1, w(N1,P2) = 7, w(N2,P2) = 3, w(N2,P4) = 8 • Solution 3 and Solution 4 (values in slide order): Gap(N1) = 0, Gap(N2) = 0, Gap(N1) = 1, Gap(N2) = 0, Gap(P4) = 2, Gap(P4) = 2, Gap(P3) = 0, Gap(P4) = 2, w(N1,P4) = 10, w(N2,P2) = 8, w(N1,P3) = 9, w(N2,P2) = 8

  23. Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions and Future work

  24. Experiment Settings • Dataset I • Business listings for two zip codes (07035, 07715) from multiple sources

  25. Experiment Settings • Implementation • MATCH + CLUSTER • LINK: linkage only • FUSE: data fusion only • LINKFUSE: LINK first, then FUSE • Gold standard: built by manual checking • Measures: Precision / Recall / F-measure
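
For reference, the three measures can be computed from a set of predicted results and a gold-standard set as sketched below. The function name and the sample name-phone pairs are hypothetical, and the F-measure shown is the usual balanced harmonic mean of precision and recall; the slides do not say whether a different weighting was used.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and balanced F-measure of predicted vs. gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical name-phone matches, for illustration only.
GOLD = {("Microsoft Corp.", "xxx-1255"), ("Macrosoft Inc.", "xxx-0500")}
PREDICTED = GOLD | {("Microsoft Corp.", "xxx-0500")}
PRECISION, RECALL, F1 = precision_recall_f1(PREDICTED, GOLD)
```

Here one of the three predicted matches is wrong and none is missing, so precision is 2/3, recall is 1, and the F-measure is 0.8.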

  26. Accuracy • [Figures, one set per zip code (07035 and 07715): Matching (NAME-PHONE), Matching (NAME-ADDRESS), Clustering (NAME)]

  27. Efficiency and Scalability

  28. Conclusions • In the real world, we need to resolve duplicates and conflicts at the same time • We reduce the problem to a k-partite graph clustering and matching problem • Combine linkage and fusion • Experiments show high efficiency and scalability

  29. Thank You!
