1 / 28

Entity Profiling with Varying Source Reliabilities

Entity Profiling with Varying Source Reliabilities. Furong Li, Mong Li Lee, Wynne Hsu { furongli , leeml , whsu } @ comp.nus.edu.sg National University of Singapore. Outline. Motivation Proposed Method Performance Study Conclusion. Entities in Multiple Sources.

garin
Download Presentation

Entity Profiling with Varying Source Reliabilities

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Entity Profiling with Varying Source Reliabilities Furong Li, Mong Li Lee, Wynne Hsu {furongli, leeml, whsu} @ comp.nus.edu.sg National University of Singapore

  2. Outline Motivation Proposed Method Performance Study Conclusion

  3. Entities in Multiple Sources • Various name representations

  4. Entities in Multiple Sources • Various name representations • Erroneous attribute values

  5. Entities in Multiple Sources No opening hours • Various name representations • Erroneous attribute values • Incomplete information

  6. Entities in Multiple Sources • Various name representations • Erroneous attribute values • Incomplete information • Ambiguous references

  7. What We Want • Various name representations • Erroneous attribute values • Incomplete information • Ambiguous references Entity Profiling

  8. The Problem Involves Two Tasks • Record linkage • [Getoor et al, VLDB’12], [Negahban et al, CIKM’12] • Truth discovery • [Li et al, VLDB’13], [Yin et al, KDD’07] Frank • The erroneous values may prevent correct linkages • Incomplete picture of the data limits the effectiveness of truth discovery

  9. State-of-the-art Method • [Guo et al, VLDB’10] • Assume a set of (soft) uniqueness constraints • Transform records into attribute value pairs • Limitations • Make wrong associations when the percentage of erroneous values increases • Computationally expensive • The uniqueness constraint limits its generality

  10. Outline Motivation Proposed Method Performance Study Conclusion

  11. A Motivating Example -> profiles matching ? √ √ X √ X

  12. A Motivating Example -> profiles matching ? √ √ X √ X

  13. The Example Tells Us • The data sources are not equally reliable among different attributes • Introduce a reliability matrix • Lower the impact of erroneous values on matching decisions • Rectifying errors in attribute values provides additional evidence for linking records • Interleave the processes of record linkage and error correction

  14. The Proposed Two-phase Method X X

  15. Confidence Based Matching X X Bootstrap the framework with a small set of confident matches Initialize reliability matrix based on confident matches

  16. Confidence Based Matching • Distinguish sources • Reliable: • Unreliable:

  17. Confidence Based Matching • Distinguish sources • Reliable: • Unreliable:

  18. Confidence Based Matching • Distinguish sources • Reliable: • Unreliable:

  19. Adaptive Matching X X • Cluster signatures <- current truth • Update reliability matrix • Refine clusters by pruning • the most dissimilar records Discount records belonging to multiple clusters Lower the impact of erroneous values on our matching decision

  20. Outline Motivation Proposed Method Performance Study Conclusion

  21. Comparative Methods • PIPELINE • Record linkage [1] + Truth discovery [2] • MATCH [3] • State-of-the-art method • COMET • The proposed method • Negahban et al. Scaling multiple-source entity resolution using statistically efficient transfer learning. In CIKM, 2012 • Yin et al. TruthFinder. In KDD, 2007 • Guo et al. Record linkage with uniqueness constraints and erroneous values. VLDB, 2010

  22. Results on Restaurant Dataset

  23. %err: percentage of erroneous values • %ambi: percentage of records with abbreviated names

  24. %err: percentage of erroneous values • %ambi: percentage of records with abbreviated names

  25. Scalability Experiments

  26. Outline Motivation Proposed Method Performance Study Conclusion

  27. Conclusion Address the problem of building entity profiles by collating data records from multiple sources in the presence of erroneousvalues Interleaverecord linkage with truth discovery Varying source reliabilities Reduce the impact of erroneous values on matching decisions

  28. Thanks!Q & A furongli@comp.nus.edu.sg

More Related