
Constraint-Based Entity Matching



Presentation Transcript


  1. Constraint-Based Entity Matching Warren Shen, Xin Li, AnHai Doan Database & AI Groups University of Illinois, Urbana

  2. Entity Matching • Decide if mentions refer to the same real-world entity • Key problem in numerous applications • Information integration • Natural language understanding • Semantic Web • Example mentions: Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001 | Chen Li, Doug Chan. “Ensemble Learning” | C. Li, D. Chan. “Ensemble Learning”. ICML 2003

  3. State of the Art • Numerous solutions in the AI, Database, and Web communities • Cohen, Ravikumar, & Fienberg 2003 • Li, Morie, & Roth 2004 • Bhattacharya & Getoor 2004 • McCallum, Nigam, & Ungar 2000 • Pasula et al. 2003 • Wellner et al. 2004 • Most solutions largely exploit only syntactic similarity • “Jeff Smith” ≈ “J. Smith” • “(217) 235-1234” ≈ “235-1234”
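A minimal sketch of the kind of syntactic matcher the slide alludes to (“Jeff Smith” ≈ “J. Smith”), not the code of any cited system. It also shows why syntax alone is insufficient: two distinct people, Chen Li and Chris Li, match under it.

```python
# Illustrative syntactic name matcher: last names must agree, first
# names must agree fully or up to an initial.

def normalize(name):
    """Lowercase, strip trailing punctuation, split into tokens."""
    return [t.strip(".,") for t in name.lower().split()]

def syntactic_match(a, b):
    """True if last names agree and first names agree up to initials."""
    ta, tb = normalize(a), normalize(b)
    if not ta or not tb or ta[-1] != tb[-1]:   # compare last names
        return False
    fa, fb = ta[0], tb[0]
    return fa == fb or fa[0] == fb[0]          # full match or shared initial

print(syntactic_match("Jeff Smith", "J. Smith"))   # True
print(syntactic_match("Chen Li", "Chris Li"))      # True, yet different people
```

The second call illustrates the motivation for semantic constraints: syntactic similarity alone cannot separate Chen Li from Chris Li.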

  4. Semantic Constraints • Example constraint types: Incompatible, Subsumption, Layout • [Slide figure: Chris Li’s Homepage lists C. Li. “User Interfaces”. SIGCHI 2000; C. Li, J. Smith. “Numerical Analysis”. SIAM 2001; and “Numerical Analysis”, SIAM 2001 with J. Smith. DBLP lists Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001 and Chen Li, Doug Chan. “Ensemble Learning”. ICML 2003. Chen Li’s Homepage lists C. Li. “Data Mining”. KDD 2000.]
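One possible encoding of the constraint types named on the slide, as a hedged sketch; the field names and schema are assumptions for illustration, not the authors’ representation.

```python
# Hypothetical record for a semantic constraint: its type, the mention
# ids it relates, and a probability (1.0 for hard constraints).

from dataclasses import dataclass

@dataclass
class Constraint:
    kind: str          # e.g. "incompatible" | "subsumption" | "layout"
    mentions: tuple    # mention ids the constraint relates
    prob: float = 1.0  # soft constraints carry a probability < 1.0

c1 = Constraint("layout", ("m1", "m2"), prob=0.8)
print(c1.kind, c1.prob)   # layout 0.8
```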

  5. Numerous Semantic Constraint Types

  6. Our Contributions • Develop a solution to exploit semantic constraints • Models constraints in a uniform probabilistic manner • Clusters mentions using a generative model • Uses relaxation labeling to handle constraints • Adds a pairwise layer to further improve accuracy • Experimental results on two real-world domains • Researchers, IMDB • Improved accuracy over the state of the art by 3-12% F1

  7. Probabilistic Modeling of Constraints • A constraint is modeled as its effect on the probability that a mention refers to a real-world entity • Example: “If two mentions in the same document share similar names, they are likely to match” becomes P(m2 = e1 | m1 = e1) = 0.8, where m1: Chen Li → e1 and m2: C. Li • Constraint probabilities have a natural interpretation • Can be learned or manually specified by a domain expert
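An illustrative computation of the effect described above, not the paper’s exact formula: a constraint that holds with probability p_c mixes the constrained outcome with the baseline estimate from name similarity alone.

```python
# Sketch: conditioning P(m2 = e1) on m1 = e1 under a soft constraint.
# With probability p_c the constraint is enforced (m2 must be e1);
# with probability 1 - p_c the baseline prior stands unchanged.

def apply_constraint(prior, p_c):
    return p_c * 1.0 + (1 - p_c) * prior

prior = 0.5                            # from name similarity alone
print(apply_constraint(prior, 0.8))    # 0.9
```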

  8. The Entity Matching Problem • Solution: model document generation, then cluster mentions using this model • [Slide figure: documents d1 and d2 with mentions m1: Chen Li, m2: C. Li, m3: Chris Lee; constraints: c1 = layout constraint, p(c1) = 0.8; matching pairs: m1 = m2]

  9. Modeling Document Generation (extension of the model in Li, Morie & Roth 2004) • Generate mentions for each document • Select entities • Generate and “sprinkle” mentions • Check constraints for each mention • Decide whether to enforce constraint c • If enforced, check if the mention violates c • If yes, discard the documents and repeat the process • [Slide figure: entities e1: Chen Li and e2: Chris Lee generate mentions m1: Chen Li, m2: C. Li, m3: Chris Lee in documents d1 and d2; c1: layout constraint, p(c1) = 0.8]

  10. Clustering with the Generative Model • Find mention assignments F and model parameters θ to maximize P(D, F | θ) • Difficult to compute exactly, so use a variant of EM
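A minimal hard-EM sketch of the alternating structure of such a clustering loop, under simplifying assumptions: each entity is represented by a canonical string, the E-step assigns each mention to its most similar entity, and the M-step re-estimates each entity’s string from its assigned mentions. This illustrates the iteration shape only, not the paper’s generative model.

```python
# Toy hard-EM loop for mention clustering (illustrative only).

from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def em_cluster(mentions, entities, iters=3):
    assign = {}
    for _ in range(iters):
        # "E-step": assign each mention to its most similar entity
        assign = {m: max(entities, key=lambda e: sim(m, e)) for m in mentions}
        # "M-step": re-estimate each entity's representative string as the
        # longest mention assigned to it (a stand-in for refitting parameters)
        entities = [max((m for m in mentions if assign[m] == e), key=len, default=e)
                    for e in entities]
    return assign

print(em_cluster(["Chen Li", "C. Li", "Chris Lee"], ["Chen Li", "Chris Lee"]))
```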

  11. Incorporating Constraints • Extend the step that assigns mentions to entities • Basic mention assignment (formula on slide) • Extension: use constraints to improve mention assignments

  12. Enforcing Constraints on Clusters • Apply constraints at each iteration of the clustering loop: compute parameters → assign mentions → apply constraints • Use relaxation labeling to apply constraints to mention assignments

  13. Relaxation Labeling • Start with an initial labeling of mentions with entities • Iteratively improve mention labels, given constraints • Can be extended to probabilistic constraints • Scalable • [Slide example: Chen Li = e1, C. Li = e2, Y. Lee = e3, Chris Lee = e2, Jane Smith = e4, C. Lee = e2, Smith, J = e4; constraints: c1 = layout constraint, p(c1) = 0.8]

  14. Relaxation Labeling (continued) • The layout constraint relabels C. Li from e2 to e1 • [Slide example: Chen Li = e1, C. Li = e2 → e1, Y. Lee = e3, Chris Lee = e2, Jane Smith = e4, C. Lee = e2, Smith, J = e4; constraints: c1 = layout constraint, p(c1) = 0.8]
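A hedged sketch of one relaxation-labeling update under a single probabilistic constraint: mentions linked by the layout constraint (p = 0.8) pull toward each other’s entity labels. The real algorithm iterates such updates over a full neighborhood of constraints until the labeling stabilizes.

```python
# One illustrative relaxation step (not the paper's exact update rule).

def relax_step(labels, same_section, p_c=0.8):
    """labels: mention -> {entity: belief}; same_section: (m_a, m_b)
    pairs linked by the layout constraint. Returns updated beliefs."""
    new = {m: dict(dist) for m, dist in labels.items()}
    for a, b in same_section:
        for e, belief in labels[b].items():
            # pull mention a toward b's labels, weighted by p_c
            new[a][e] = new[a].get(e, 0.0) + p_c * belief
    for m, dist in new.items():            # renormalize each distribution
        z = sum(dist.values())
        new[m] = {e: v / z for e, v in dist.items()}
    return new

labels = {"C. Li": {"e1": 0.5, "e2": 0.5}, "Chen Li": {"e1": 1.0}}
updated = relax_step(labels, [("C. Li", "Chen Li")])
print(updated["C. Li"])   # belief in e1 now exceeds belief in e2
```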

  15. Handling Probabilistic Constraints • Relaxation labeling can combine multiple probabilistic constraints

  16. Pairwise Layer • So far, we have applied constraints to clusters • It may be unclear how to enforce a constraint on a cluster: given the hard constraint C. Li ≠ Li, C. and the cluster {Li, Chen; Chen Li; C. Li; Li, C.}, should we remove C. Li or Li, C.? • Add a pairwise layer • Convert clusters into predicted matching pairs • Remove only the pairs that negative pairwise hard constraints apply to
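The pairwise layer described above can be sketched directly: expand each cluster into its predicted matching pairs, then drop only the pairs a negative pairwise hard constraint applies to. Names are illustrative.

```python
# Sketch of the pairwise layer: clusters -> pairs, minus forbidden pairs.

from itertools import combinations

def pairwise_layer(clusters, negative_pairs):
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs - {frozenset(p) for p in negative_pairs}

clusters = [{"Li, Chen", "Chen Li", "C. Li", "Li, C."}]
negative = [("C. Li", "Li, C.")]
result = pairwise_layer(clusters, negative)
print(frozenset({"C. Li", "Li, C."}) in result)    # False: pair removed
print(frozenset({"C. Li", "Chen Li"}) in result)   # True: other pairs kept
```

Note the design point from the slide: rather than deciding which mention to evict from the cluster, only the offending pair is removed, so all other pairs involving both mentions survive.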

  17. Empirical Evaluation • Two real-world domains: Researchers, IMDB • For each domain: collected documents (Researchers: homepages from DBLP and the web; IMDB: text and structured records from IMDB), marked up mentions and their attributes (4,991 researcher mentions; 3,889 movie titles from IMDB), and manually identified all correct matching pairs • Evaluation metrics: Precision = # true positives / # predicted pairs; Recall = # true positives / # correct pairs; F1 = (2 × P × R) / (P + R)
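The evaluation metrics from the slide, computed over sets of predicted and correct pairs:

```python
# Pairwise precision, recall, and F1 as defined on the slide.

def prf1(predicted, correct):
    tp = len(predicted & correct)                   # true positive pairs
    p = tp / len(predicted) if predicted else 0.0   # precision
    r = tp / len(correct) if correct else 0.0       # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0      # harmonic mean
    return p, r, f1

pred = {("m1", "m2"), ("m1", "m3")}
gold = {("m1", "m2"), ("m2", "m4")}
print(prf1(pred, gold))   # (0.5, 0.5, 0.5)
```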

  18. Using Constraints Improves Accuracy
                                  Researchers F1 (P/R)   Movies F1 (P/R)
  Baseline                        .66 (.67/.65)          .69 (.61/.79)
  Baseline + Relax                .78 (.78/.78)          .72 (.63/.83)
  Baseline + Relax + Pairwise     .79 (.80/.79)          .73 (.64/.83)
  • The relaxation labeler improves F1 by 3-12% • Relaxation labeling is very fast

  19. Using Constraints Individually • Each constraint makes a contribution
  Researchers       F1 (P/R)        Movies            F1 (P/R)
  Baseline          .66 (.67/.65)   Baseline          .69 (.61/.79)
  + Rare Value      .66 (.67/.66)   + Incompatible    .70 (.62/.79)
  + Subsumption     .67 (.68/.65)   + Neighborhood    .70 (.62/.81)
  + Neighborhood    .70 (.68/.72)   + Individual      .71 (.62/.82)
  + Individual      .70 (.77/.64)
  + Layout          .71 (.68/.74)

  20. Related Work • Much work in entity matching: Cohen, Ravikumar, & Fienberg 2003; Li, Morie, & Roth 2004; Bhattacharya & Getoor 2004; McCallum, Nigam, & Ungar 2000; Pasula et al. 2003; Wellner et al. 2004 • Recent work has looked at exploiting semantic constraints • Personal information management (Dong et al. 2004) • Profiler-based entity matching (Doan et al. 2003) • Semantic constraints successfully exploited in other applications • Clustering algorithms (Bilenko et al. 2004), ontology matching (Doan et al. 2002)

  21. Summary and Future Work • Exploit semantic constraints in entity matching • Models constraints in a uniform probabilistic manner • Uses a generative model and relaxation labeling to handle constraints in a scalable way • Experimental results on two real-world domains show effectiveness • Future work: Learning constraints effectively from current or external data
