
Reference Reconciliation in Complex Information Spaces

Reference Reconciliation in Complex Information Spaces. Xin (Luna) Dong, Alon Halevy, Jayant Madhavan, University of Washington. SIGMOD 2005.


Presentation Transcript


  1. Reference Reconciliation in Complex Information Spaces Xin (Luna) Dong, Alon Halevy, Jayant Madhavan @ Sigmod 2005 University of Washington

  2. Semex: Personal Information Management System Homepage(1) SenderOfEmails(7595) RecipientOfEmails(8547) AuthorOfArticles(52) MentionedIn(315)

  3. Semex: Personal Information Management System Email Contacts(1145) Co-authors(24)

  4. Semex: Personal Information Management System Article: Reference Reconciliation in Complex Information Spaces Authors PublishedIn Cites(33) CitedBy FromFile

  5. Semex: Personal Information Management System Xin (Luna) Dong Lab-#dong xin dong xin luna xinluna dong Names luna x. dong dongxin Emails xin dong

  6. Semex Without Deduplication Search results for luna 23 persons luna dong SenderOfEmails(3043) RecipientOfEmails(2445) MentionedIn(94)

  7. Semex Without Deduplication Search results for luna 23 persons Xin (Luna) Dong AuthorOfArticles(49) MentionedIn(20)

  8. Semex Without Deduplication A Platform for Personal Information Management and Integration

  9. Semex Without Deduplication 9 Persons: dong xin xin dong

  10. Semex NEEDS Deduplication (Reference Reconciliation)

  11. Reference Reconciliation in Complex Information Spaces Xin (Luna) Dong, Alon Halevy, Jayant Madhavan @ Sigmod 2005 University of Washington

  12. Complex Information Space Example – An Abstract View of Personal Information
      • Article:
        a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
        a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
      • Venue:
        c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
        c2=(“ACM SIGMOD”, “1978”, null)
      • Person:
        p1=(“Robert S. Epstein”, null)
        p2=(“Michael Stonebraker”, null)
        p3=(“Eugene Wong”, null)
        p4=(“Epstein, R.S.”, null)
        p5=(“Stonebraker, M.”, null)
        p6=(“Wong, E.”, null)

  13. Complex Information Space Example – An Abstract View of Personal Information
      • Article:
        a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
        a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
      • Venue:
        c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
        c2=(“ACM SIGMOD”, “1978”, null)
      • Person:
        p1=(“Robert S. Epstein”, null)
        p2=(“Michael Stonebraker”, null)
        p3=(“Eugene Wong”, null)
        p4=(“Epstein, R.S.”, null)
        p5=(“Stonebraker, M.”, null)
        p6=(“Wong, E.”, null)
        p7=(“Eugene Wong”, “eugene@berkeley.edu”)
        p8=(null, “stonebraker@csail.mit.edu”)
        p9=(“mike”, “stonebraker@csail.mit.edu”)
      • Callouts: Association Attribute, Class, Atomic Attribute, Reference
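
The callouts on this slide (class, reference, atomic attribute, association attribute) can be made concrete with a small data model. The following is only a sketch: the class and field names (`ref_id`, `authored_by`, `published_in`) are illustrative assumptions and do not come from the talk; only the example values are taken from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Person:
    ref_id: str
    name: Optional[str]    # atomic attribute, may be missing (null)
    email: Optional[str]   # atomic attribute, may be missing (null)


@dataclass(frozen=True)
class Venue:
    ref_id: str
    name: Optional[str]
    year: Optional[str]
    location: Optional[str]


@dataclass(frozen=True)
class Article:
    ref_id: str
    title: Optional[str]
    pages: Optional[str]
    authored_by: Tuple[str, ...]   # association attribute: ids of Person references
    published_in: Optional[str]    # association attribute: id of a Venue reference


# A few of the references from the slide, one instance per reference.
p2 = Person("p2", "Michael Stonebraker", None)
p8 = Person("p8", None, "stonebraker@csail.mit.edu")
c1 = Venue("c1", "ACM Conference on Management of Data", "1978", "Austin, Texas")
a1 = Article("a1", "Distributed Query Processing", "169-180", ("p1", "p2", "p3"), "c1")
```

Association attributes hold reference ids rather than nested objects, so two references that point at each other (as contact lists do later in the deck) can be stored without cycles.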

  14. Other Complex Information Spaces • Citation portals, e.g., Citeseer, Cora • Online product catalogs in E-commerce

  15. Real-World Objects
      • Article:
        a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
        a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
      • Venue:
        c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
        c2=(“ACM SIGMOD”, “1978”, null)
      • Person:
        p1=(“Robert S. Epstein”, null)
        p2=(“Michael Stonebraker”, null)
        p3=(“Eugene Wong”, null)
        p4=(“Epstein, R.S.”, null)
        p5=(“Stonebraker, M.”, null)
        p6=(“Wong, E.”, null)
        p7=(“Eugene Wong”, “eugene@berkeley.edu”)
        p8=(null, “stonebraker@csail.mit.edu”)
        p9=(“mike”, “stonebraker@csail.mit.edu”)

  16. Reference Reconciliation
      • Input: A set of references R
      • Output: A partitioning over R, such that
        • Each partition refers to a single real-world object (high precision)
        • Different partitions refer to different objects (high recall)

  17. Related Work
      • A very active area of research in Databases, Data Mining and AI
      • Most current approaches assume matching tuples from a single database table
      • Traditional approaches (surveyed in [Cohen et al., 2003])
        • Step I. Compare attributes
        • Step II. Combine attribute similarities to decide tuple match/non-match
        • Step III. Compute transitive closures to get partitions
      • New approaches explore relationships between reconciliation decisions using probability models [Russell et al., 2002] [Domingos et al., 2004]
      • Harder for complex information spaces
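
The three traditional steps listed above can be sketched as a small single-table pipeline. This is a generic illustration, not any specific system from the cited surveys: the string measure (`difflib.SequenceMatcher`), the averaging rule, and the `0.8` threshold are all assumptions chosen for brevity.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attr_sim(a, b):
    """Step I: compare one attribute; a missing value gives no evidence."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuple_match(t1, t2, threshold=0.8):
    """Step II: combine attribute similarities (here: a plain average)."""
    sims = [attr_sim(x, y) for x, y in zip(t1, t2)]
    return sum(sims) / len(sims) >= threshold


def transitive_closure(n, matched_pairs):
    """Step III: union-find over matched pairs, returning the partitions."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in matched_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def reconcile(tuples, threshold=0.8):
    pairs = [(i, j) for i, j in combinations(range(len(tuples)), 2)
             if tuple_match(tuples[i], tuples[j], threshold)]
    return transitive_closure(len(tuples), pairs)


rows = [
    ("Eugene Wong", "eugene@berkeley.edu"),
    ("eugene wong", "eugene@berkeley.edu"),
    ("Robert S. Epstein", None),
]
partitions = reconcile(rows)
```

Each pairwise decision here is made in isolation and every tuple has the same attributes, which is the limitation the deck goes on to address for complex information spaces.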

  18. Challenges in Complex Information Spaces
      • Article:
        a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
        a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
      • Venue:
        c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
        c2=(“ACM SIGMOD”, “1978”, null)
      • Person:
        p1=(“Robert S. Epstein”, null)
        p2=(“Michael Stonebraker”, null)
        p3=(“Eugene Wong”, null)
        p4=(“Epstein, R.S.”, null)
        p5=(“Stonebraker, M.”, null)
        p6=(“Wong, E.”, null)
        p7=(“Eugene Wong”, “eugene@berkeley.edu”)
        p8=(null, “stonebraker@csail.mit.edu”)
        p9=(“mike”, “stonebraker@csail.mit.edu”)
      • Challenges: 1. Multiple Classes 2. Limited Information 3. Multi-value Attributes

  19. Intuition • Complex information spaces can be considered as networks of instances and associations between the instances • Key: exploit the network, specifically, the clues hidden in the associations

  20. Outline • Introduction and problem definition • Reconciliation algorithm • Experimental results • Conclusions

  21. Framework: Dependency Graph
      • References: p2=(“Michael Stonebraker”, null, {p1, p3}); p3=(“Eugene Wong”, null, {p1, p2}); p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p9=(“mike”, “stonebraker@csail.mit.edu”, null)
      • Nodes: (p3, p7); (“Michael Stonebraker”, “stonebraker@”); (“Michael Stonebraker”, p7); (p2, p8); (p1, “stonebraker@csail.mit.edu”); (p1, p7); (p3, “stonebraker@csail.mit.edu”)
      • Labels: Cross-attr similarity, Compare contacts
      • Legend: Reference Similarity, Attribute Similarity

  22. Framework: Dependency Graph
      • References: p2=(“Michael Stonebraker”, null, {p1, p3}); p3=(“Eugene Wong”, null, {p1, p2}); p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p9=(“mike”, “stonebraker@csail.mit.edu”, null)
      • Nodes: (p3, p7); (“Michael Stonebraker”, “stonebraker@”); (p2, p8)
      • Labels: Cross-attr similarity, Compare contacts
      • Legend: Reference Similarity, Attribute Similarity

  23. Framework: Dependency Graph
      • References: p2=(“Michael Stonebraker”, null, {p1, p3}); p3=(“Eugene Wong”, null, {p1, p2}); p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p9=(“mike”, “stonebraker@csail.mit.edu”, null)
      • Nodes: (“Eugene Wong”, “Eugene Wong”); (p3, p7); (“Michael Stonebraker”, “stonebraker@”); (“Michael Stonebraker”, “mike”); (p2, p8); (p2, p9); (p8, p9); (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)
      • Legend: Reference Similarity, Attribute Similarity
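
A rough sketch of how such a dependency graph could be assembled from the five person references on these slides. The node encoding (`("ref", a, b)` and `("attr", v1, v2)` tuples) and the function name are assumptions made for illustration. For simplicity the sketch creates a node for every pair of references, whereas the slides show only a selected subset of nodes.

```python
from itertools import combinations, product

# id -> (name, email, contacts), copied from the slide (null written as None)
persons = {
    "p2": ("Michael Stonebraker", None, {"p1", "p3"}),
    "p3": ("Eugene Wong", None, {"p1", "p2"}),
    "p7": ("Eugene Wong", "eugene@berkeley.edu", {"p8"}),
    "p8": (None, "stonebraker@csail.mit.edu", {"p7"}),
    "p9": ("mike", "stonebraker@csail.mit.edu", set()),
}


def build_dependency_graph(refs):
    """Return {node: set of neighbor nodes} as an undirected graph."""
    graph = {}

    def link(u, v):
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)

    for a, b in combinations(sorted(refs), 2):
        node = ("ref", a, b)  # reference-similarity node
        graph.setdefault(node, set())
        name_a, email_a, contacts_a = refs[a]
        name_b, email_b, contacts_b = refs[b]
        # Attribute-similarity nodes, including cross-attribute (name vs. email).
        for v1, v2 in product((name_a, email_a), (name_b, email_b)):
            if v1 is not None and v2 is not None:
                link(node, ("attr", v1, v2))
        # "Compare contacts": depend on the similarity of the two contact lists.
        for ca, cb in product(contacts_a, contacts_b):
            if ca in refs and cb in refs and ca != cb:
                other = ("ref",) + tuple(sorted((ca, cb)))
                if other != node:
                    link(node, other)
    return graph


graph = build_dependency_graph(persons)
```

With this input, the node for (p2, p8) ends up adjacent to the node for (p3, p7) through the contact lists and to the name/email attribute node, which mirrors the two labeled edges on slide 22.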

  24. Exploit the Dependency Graph
      • Reference similarity nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2)
      • Attribute similarity nodes: (“Distributed…”, “Distributed…”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)

  25. Dependency Graph Example II
      • Reference similarity nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2)
      • Attribute similarity nodes: (“Distributed…”, “Distributed…”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)
      • Label: Compare authored papers

  26. Strategy I. Consider Richer Evidence
      • Cross-attribute similarity – Name & email
        • p5=(“Stonebraker, M.”, null)
        • p8=(null, “stonebraker@csail.mit.edu”)
      • Context Information I – Contact list
        • p5=(“Stonebraker, M.”, null, {p4, p6})
        • p8=(null, “stonebraker@csail.mit.edu”, {p7})
        • p6=p7
      • Context Information II – Authored articles
        • p2=(“Michael Stonebraker”, null)
        • p5=(“Stonebraker, M.”, null)
        • p2 and p5 authored the same article
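
To make the cross-attribute case concrete, here is a deliberately crude name-versus-email comparison. The talk does not specify its similarity function, so the heuristic below (name token contained in the email's local part, else an initials check) and the scores `1.0` / `0.5` / `0.0` are purely hypothetical.

```python
import re


def name_email_similarity(name, email):
    """Hypothetical score in [0, 1] for how well an email matches a name."""
    if not name or not email:
        return 0.0
    local = email.split("@", 1)[0].lower()
    tokens = [t for t in re.split(r"[^a-z]+", name.lower()) if t]
    if not tokens:
        return 0.0
    # A full name token (for example the last name) inside the local part
    # is treated as strong evidence.
    if any(len(t) > 2 and t in local for t in tokens):
        return 1.0
    # Otherwise, weak evidence if the local part is just the initials.
    initials = "".join(t[0] for t in tokens)
    return 0.5 if local in (initials, initials[::-1]) else 0.0


score_p5_p8 = name_email_similarity("Stonebraker, M.", "stonebraker@csail.mit.edu")
score_wong = name_email_similarity("Eugene Wong", "stonebraker@csail.mit.edu")
```

In the slide's example this is what lets p5, which has only a name, be compared with p8, which has only an email address, even though the two references share no populated attribute.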

  27. Considering Only Attribute-wise Similarities Cannot Merge Persons Well • Person references: 24076 • Real-world persons (gold-standard): 1750 • Chart values: 1409, 3159

  28. Considering Richer Evidence Improves the Recall • Person references: 24076 • Real-world persons: 1750 • Chart values: 1409, 346

  29. Exploit the Dependency Graph
      • Reference similarity nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2)
      • Attribute similarity nodes: (“Distributed…”, “Distributed…”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)

  30. Exploit the Dependency Graph
      • Nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2); (“Distributed…”, “Distributed…”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)
      • Legend: Reconciled, Similar

  31. Exploit the Dependency Graph
      • Nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2); (“Distributed…”, “Distributed…”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)
      • Legend: Reconciled, Similar

  32. Exploit the Dependency Graph
      • Nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2); (“Distributed…”, “Distributed…”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)
      • Legend: Reconciled, Similar

  33. Exploit the Dependency Graph
      • Nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2); (“Distributed…”, “Distributed…”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)
      • Legend: Reconciled, Similar

  34. Exploit the Dependency Graph
      • Nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2); (“Distributed…”, “Distributed…”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)
      • Legend: Reconciled, Similar

  35. Strategy II. Propagate Information between Reconciliation Decisions
      • After changing the similarity score of one node, re-compute the similarity scores of its neighbors
      • This process converges if
        • The similarity score is monotone in the similarity values of neighbors
        • Neighbor similarities are re-computed only if the similarity increase is not too small
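
The propagation loop and its two convergence conditions can be sketched with a work queue. The scoring rule used here (`base + weight * best neighbor score`, capped at 1) is an assumed stand-in that happens to be monotone; the talk does not give the actual similarity functions, and `weight` and `epsilon` are illustrative parameters.

```python
from collections import deque


def propagate(graph, base, weight=0.3, epsilon=0.01):
    """graph: {node: set of neighbors}; base: {node: initial similarity in [0, 1]}."""
    sim = dict(base)
    queue = deque(graph)
    queued = set(graph)
    while queue:
        node = queue.popleft()
        queued.discard(node)
        best = max((sim[nb] for nb in graph[node]), default=0.0)
        # Monotone in neighbor scores: a higher neighbor never lowers this value.
        new = min(1.0, base[node] + weight * best)
        # Only a non-trivial increase triggers re-computation of the neighbors.
        if new - sim[node] > epsilon:
            sim[node] = new
            for nb in graph[node]:
                if nb not in queued:
                    queue.append(nb)
                    queued.add(nb)
    return sim


# Toy fragment of the article example: the article pair depends on an
# author pair and on the venue pair. The base scores are made up.
toy_graph = {
    "a1~a2": {"p2~p5", "c1~c2"},
    "p2~p5": {"a1~a2"},
    "c1~c2": {"a1~a2"},
}
toy_base = {"a1~a2": 0.9, "p2~p5": 0.5, "c1~c2": 0.4}
scores = propagate(toy_graph, toy_base)
```

Termination follows from the two conditions on the slide: scores only ever increase, each increase is larger than `epsilon`, and every score is bounded by 1, so each node can be updated only finitely many times.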

  36. Propagating Information between Reconciliation Decisions Further Improves Recall • Person references: 24076 • Real-world persons: 1750

  37. Strategy III. Enrich References in Reconciliation
      • Enrich knowledge of a real-world object for later reconciliation
      • Naïve: Construct graph → Compute similarity → Transitive closure
      • Problems
        • Dependency-graph construction is expensive
        • Reference enrichment does not take effect until the next pass
      • Solution
        • Instant enrichment by adding neighbors in the dependency graph

  38. Enrich References by Adding Neighbors
      • References: p2=(“Michael Stonebraker”, null, {p1, p3}); p3=(“Eugene Wong”, null, {p1, p2}); p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p9=(“mike”, “stonebraker@csail.mit.edu”, null)
      • Nodes: (p3, p7); (“Michael Stonebraker”, “stonebraker@”); (“Michael Stonebraker”, “mike”); (p2, p8); (p2, p9); (p8, p9); (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)
      • Legend: Reconciled, Similar

  39. Enrich References by Adding Neighbors
      • References: p2=(“Michael Stonebraker”, null, {p1, p3}); p3=(“Eugene Wong”, null, {p1, p2}); p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p9=(“mike”, “stonebraker@csail.mit.edu”, null)
      • Nodes: (p3, p7); (“Michael Stonebraker”, “stonebraker@”); (“Michael Stonebraker”, “mike”); (p2, p8); (p2, p9); (p8, p9); (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)
      • Legend: Reconciled, Similar

  40. Enrich References by Adding Neighbors
      • References: p2=(“Michael Stonebraker”, null, {p1, p3}); p3=(“Eugene Wong”, null, {p1, p2}); p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p9=(“mike”, “stonebraker@csail.mit.edu”, null)
      • Nodes: (p3, p7); (“Michael Stonebraker”, “stonebraker@”); (“Michael Stonebraker”, “mike”); (p2, p8); (p2, p9); (p8, p9); (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)
      • Legend: Reconciled, Similar

  41. Enrich References by Adding Neighbors
      • References: p2=(“Michael Stonebraker”, null, {p1, p3}); p3=(“Eugene Wong”, null, {p1, p2}); p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p9=(“mike”, “stonebraker@csail.mit.edu”, null)
      • Nodes: (p3, p7); (“Michael Stonebraker”, “stonebraker@”); (“Michael Stonebraker”, “mike”); (p2, p8); (p8, p9); (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)
      • Legend: Reconciled, Similar

  42. Enrich References by Adding Neighbors
      • References: p2=(“Michael Stonebraker”, null, {p1, p3}); p3=(“Eugene Wong”, null, {p1, p2}); p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p9=(“mike”, “stonebraker@csail.mit.edu”, null)
      • Nodes: (p3, p7); (“Michael Stonebraker”, “stonebraker@”); (“Michael Stonebraker”, “mike”); (p2, p8); (p8, p9); (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)
      • Legend: Reconciled, Similar
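
In the sequence above, the node (p2, p9) disappears once p8 and p9 are reconciled. One way to read that step is that (p2, p9) is folded into (p2, p8), which then inherits its neighbors. The sketch below illustrates that reading only: the edges are assumed, since the transcript keeps the node labels but not the edges, and the short node names are placeholders.

```python
def add_edge(graph, u, v):
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def merge_nodes(graph, keep, drop):
    """Fold node `drop` into `keep`; `keep` gets the union of both neighbor sets."""
    if keep == drop or drop not in graph:
        return
    graph.setdefault(keep, set())
    for nb in graph.pop(drop):
        if nb == drop:          # ignore a self-loop on the dropped node
            continue
        graph[nb].discard(drop)
        if nb != keep:
            add_edge(graph, keep, nb)


# Assumed edges, for illustration only.
g = {}
add_edge(g, "(p2,p8)", "(p3,p7)")
add_edge(g, "(p2,p8)", "name~email")      # ("Michael Stonebraker", "stonebraker@")
add_edge(g, "(p2,p9)", "name~mike")       # ("Michael Stonebraker", "mike")
add_edge(g, "(p8,p9)", "email~email")

# p8 and p9 are reconciled, so comparisons against p9 become comparisons
# against the enriched reference p8.
merge_nodes(g, "(p2,p8)", "(p2,p9)")
```

After the merge, the name comparison with “mike” counts as evidence for (p2, p8) immediately, without rebuilding the graph for another pass.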

  43. Reference Enrichment Improves Recall More than Information Propagation • Person references: 24076 • Real-world persons: 1750

  44. Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall • Person references: 24076 • Real-world persons: 1750 • Chart values: 1409, 346, 125

  45. Outline • Introduction and problem definition • Reconciliation algorithm • Experimental results • Conclusions

  46. Experiment Settings
      • Datasets
        • Four personal datasets
        • Cora dataset for citations
        • Use the same parameters and thresholds for all datasets
      • Measures
        • Precision, recall, and F-measure
          • Precision: The percentage of correctly reconciled reference pairs over all reconciled reference pairs
          • Recall: The percentage of correctly reconciled reference pairs over pairs of references that refer to the same real-world object
        • Diversity and Dispersion
          • Diversity: For every result partition, how many real-world objects are included; ideally should be 1 (related to precision)
          • Dispersion: For every real-world object, how many result partitions include them; ideally should be 1 (related to recall)
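
The measures defined above translate fairly directly into code. This is a sketch under two assumptions: diversity and dispersion are reported as averages over partitions and objects (the slide does not say how they are aggregated), and the result and the gold standard cover the same set of references. The example partitions are made up for illustration, using reference ids from the running example.

```python
from itertools import combinations


def co_clustered_pairs(partition):
    """All unordered pairs of references placed in the same cluster."""
    return {frozenset(pair)
            for cluster in partition
            for pair in combinations(sorted(cluster), 2)}


def pairwise_scores(result, gold):
    found, truth = co_clustered_pairs(result), co_clustered_pairs(gold)
    correct = len(found & truth)
    precision = correct / len(found) if found else 1.0
    recall = correct / len(truth) if truth else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def diversity_and_dispersion(result, gold):
    object_of = {ref: i for i, cluster in enumerate(gold) for ref in cluster}
    partition_of = {ref: i for i, cluster in enumerate(result) for ref in cluster}
    # Average number of real-world objects per result partition.
    diversity = sum(len({object_of[r] for r in c}) for c in result) / len(result)
    # Average number of result partitions per real-world object.
    dispersion = sum(len({partition_of[r] for r in c}) for c in gold) / len(gold)
    return diversity, dispersion


gold = [{"p2", "p5", "p8", "p9"}, {"p3", "p6", "p7"}]
result = [{"p2", "p5"}, {"p8", "p9"}, {"p3", "p6", "p7"}]
precision, recall, f_measure = pairwise_scores(result, gold)
diversity, dispersion = diversity_and_dispersion(result, gold)
```

In this made-up case nothing is merged wrongly, so precision and diversity are both 1, while splitting one person across two partitions lowers recall to 5/9 and raises dispersion to 1.5.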

  47. Recall Results on One Personal Dataset • Person references: 24076 • Real-world persons: 1750 • Chart values: 1409, 346, 125

  48. Results Considering All Occurrences of Person Instances Both precision and recall increase compared with attr-wise matching.

  49. Results Considering Only Distinct Person References Precision and recall increase substantially compared with attr-wise matching.

  50. Diversity and Dispersion Are Very Close to 1
