
Group Linkage





  1. Group Linkage
  Byung-Won On (Penn State Univ.), Nick Koudas (Univ. of Toronto), Dongwon Lee (Penn State Univ.), Divesh Srivastava (AT&T Labs – Research)
  ICDE 2007

  2. Outline
  • Introduction
  • Matching
  • Bipartite Graph
  • Group Linkage
  • Bipartite matching
  • Pre-processing step to speed up
  • Greedy matching
  • Heuristic measure
  • Experiment & Result
  • Conclusion

  3. Introduction
  • Poor quality data in databases
  • Transcription errors
  • Lack of standards for recording
  • Poor database design
  • How to identify whether two entities are approximately the same?
  • Group linkage problem. Ex: “J. Ullman”, “J. D. Ullman”, “Ullman, Jeffrey”

  4. Example
  [Figure: author “K. L. Hsueh” in DBLP and author “Lily Hsueh” in ACM, each with a list of papers]
  • Group: each author
  • Records: a list of citations per author
  • Linking the two authors is a group linkage problem

  5. Matching
  • Matching: a matching in a graph G is a set of non-loop edges with no shared endpoints
  • Maximum matching: a matching that contains the largest possible number of edges
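
The matching condition above is easy to check programmatically. A minimal sketch in Python (the function name and edge representation are mine, not from the slides):

```python
def is_matching(edges):
    """True iff the edge set is a matching: no loops, no shared endpoints."""
    seen = set()
    for u, v in edges:
        if u == v:                  # loop edge: not allowed
            return False
        if u in seen or v in seen:  # shared endpoint: not a matching
            return False
        seen.add(u)
        seen.add(v)
    return True
```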

  6. Bipartite Graph
  • Bipartite graph: a graph is bipartite if V is the union of two disjoint independent sets, called the partite sets of G
  • Bipartite matching: a matching in a bipartite graph

  7. Group Linkage (1)
  • Jaccard similarity measure between two sets s1 and s2
  • Records from the two groups can be matched when they are identical
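
The Jaccard measure referenced above is |s1 ∩ s2| / |s1 ∪ s2|; a small sketch (function name is illustrative):

```python
def jaccard(s1, s2):
    """Jaccard similarity between two sets: |intersection| / |union|."""
    s1, s2 = set(s1), set(s2)
    if not s1 and not s2:
        return 1.0  # convention: two empty sets are identical
    return len(s1 & s2) / len(s1 | s2)
```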

  8. Group Linkage (2)

  9. Group Linkage (3)
  [Figure: group g1 (K. L. Hsueh, records r11–r14) and group g2 (Lily Hsueh, records r21–r25)]
  • Similar records, e.g. “Register Allocation & Spilling via graph coloring” vs. “Register Allocation and Spilling via graph coloring”, are matched
  • Matched record similarities are normalized to obtain the group similarity

  10. Bipartite Matching
  • Record-level similarity measure
  [5] S. Chaudhuri, V. Ganti, and R. Kaushik. “A Primitive Operator for Similarity Joins in Data Cleaning”. In IEEE ICDE, 2006
  • Maximum-weight bipartite matching (BM)
  [10] S. Guha, N. Koudas, A. Marathe, and D. Srivastava. “Merging the Results of Approximate Match Operations”. In VLDB, pages 636-647, 2004
  • Applying this strategy to every pair of groups is infeasible, so a pre-processing step is used:
  • Greedy matching
  • Heuristic measure
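
To illustrate the BM idea, here is a brute-force maximum-weight bipartite matching over record pairs with sim ≥ ρ. This is a sketch under my own naming: the paper's BM measure also normalises the matching weight, which is omitted here, and a real implementation would use a polynomial-time matching algorithm rather than enumeration:

```python
from itertools import permutations

def bm_weight(g1, g2, sim, rho):
    """Weight of a maximum-weight bipartite matching between the records
    of g1 and g2, counting only edges with sim >= rho.
    Brute force over permutations: exponential, illustration only."""
    small, large = (g1, g2) if len(g1) <= len(g2) else (g2, g1)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = 0.0
        for i, j in enumerate(perm):
            s = sim(small[i], large[j])
            if s >= rho:  # edges below the threshold contribute nothing
                total += s
        best = max(best, total)
    return best
```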

  11. Greedy Matching (1)
  • S1: For each record ri ∈ g1, find a record rj ∈ g2 with the highest record-level similarity among those with sim() ≥ ρ
  • S2: Same as S1, with the roles of g1 and g2 swapped
  [Figure: groups g1 (r11–r14) and g2 (r21–r25) with greedy edges]
  • The result may not be a matching!
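
Step S1 above can be sketched as follows (names are mine); note that, as the slide warns, two records of g1 may greedily pick the same record of g2, so the returned edge set is not guaranteed to be a matching:

```python
def greedy_match(g1, g2, sim, rho):
    """S1: for each record in g1, pick its most similar partner in g2
    among those with sim >= rho (S2 is the same with g1 and g2 swapped).
    The result may share endpoints, i.e. it may not be a matching."""
    pairs = []
    for r1 in g1:
        best, best_s = None, rho  # only partners with sim >= rho qualify
        for r2 in g2:
            s = sim(r1, r2)
            if s >= best_s:
                best, best_s = r2, s
        if best is not None:
            pairs.append((r1, best, best_s))
    return pairs
```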

  12. Greedy Matching (2)
  • Upper and lower bounds to BMsim,ρ
  [Figure: greedy edge sets on g1 (r11–r14) and g2 (r21–r25) yielding the upper and lower bounds]

  13. Greedy Matching (3)
  • BMsim,ρ is bounded: LBsim,ρ ≤ BMsim,ρ ≤ UBsim,ρ
  • Only when the decision threshold falls between LBsim,ρ and UBsim,ρ would the more expensive BM computation be needed
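
The pruning logic implied by these bounds can be written out directly; a sketch with a decision threshold `theta` (naming mine):

```python
def bm_decision(lb, ub, theta):
    """Filtering with LB <= BM <= UB against a similarity threshold theta:
    - if UB < theta, then BM < theta too: prune without computing BM
    - if LB >= theta, then BM >= theta too: accept without computing BM
    - otherwise the exact (expensive) BM value is needed."""
    if ub < theta:
        return "prune"
    if lb >= theta:
        return "accept"
    return "compute BM"
```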

  14. Heuristic Measure
  • In practice, pairs of groups with a high value of BMsim,ρ will share at least one record with a high record-level similarity
  • This motivates a simpler and faster measure
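
The heuristic measure described (the best single record pair, MAXsim,ρ from the next slide) can be sketched as:

```python
def max_sim(g1, g2, sim, rho):
    """Heuristic MAX measure: the highest record-level similarity over
    all record pairs, or 0 if no pair reaches the threshold rho."""
    best = max((sim(r1, r2) for r1 in g1 for r2 in g2), default=0.0)
    return best if best >= rho else 0.0
```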

  15. Implementation
  • Implemented UBsim,ρ, LBsim,ρ, and MAXsim,ρ in SQL (only UB is discussed here)
  • Notation: [notation table not captured in transcript]

  16. Experiment
  • Real data sets: data sets from the ACM and DBLP citation digital libraries
  • R1: uniform data sets
  • R1a: average # of citations: left = 41, right = 25
  • R1b: average # of citations: left = 40, right = 55
  • R2: skewed data sets
  • R2DB: average # of citations: left = 30, right = 9
  • R2AI: average # of citations: left = 31, right = 10
  • R2Net: average # of citations: left = 22, right = 6

  17. Experiment
  • Synthetic data sets:
  • S1a and S1b: same as R1a, but dummy authors are injected into the right data set
  • S1a: dummy authors with 1/3 the # of citations
  • S1b: dummy authors with 3x the # of citations
  • S2: the “dbgen” tool generates dummy authors with varying levels of errors, which are inserted into the right data set

  18. Experiment
  • Evaluation metric: average recall
  • If a2 is included in the top-k answer window for a1, recall is 1, and 0 otherwise
  • Compared methods: A(k1)|B(k2)
  • Step 1: method A, window size k1
  • Step 2: method B, window size k2
  • Microsoft SQL Server 2000 on a Pentium III 3 GHz / 512 MB machine
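
The average-recall metric above can be computed as follows (the data layout, a dict of true matches and a dict of top-k windows, is my assumption):

```python
def average_recall(answers, windows):
    """answers: maps each query group a1 to its true match a2.
    windows: maps a1 to its top-k candidate list.
    Per-query recall is 1 if a2 appears in a1's window, else 0;
    the metric is the mean over all queries."""
    hits = sum(1 for a1, a2 in answers.items() if a2 in windows.get(a1, []))
    return hits / len(answers)
```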

  19. Results
  • R1 (uniform) real data set

  20. Results
  • S1 and S2 synthetic data sets
  • BM outperforms JA by 16-17%; JA incorrectly selects dummy authors
  • JA and BM are directly applied to S2

  21. Results
  • R2 real data set
  • UB outperforms MAX in recall
  • Pre-processing using UB vs. MAX

  22. Results
  • Record-level similarity measure: cosine similarity with TF/IDF weighting
  • Running time against R2 (in sec)

  23. Results • Window size

  24. Conclusion
  • Proposed a bipartite-matching-based group similarity measure to solve the group linkage problem
  • Proved that upper and lower bounds of BM can be used for speed-up
  • BM is a more robust group similarity measure than the others
