
Active Sampling for Entity Matching




Presentation Transcript


  1. Active Sampling for Entity Matching Aditya Parameswaran (Stanford University). Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi (Yahoo! Research)

  2. Entity Matching Goal: Find duplicate entities in a given data set. Fundamental data cleaning primitive → decades of prior work. Especially important at Yahoo! (and other web companies). Example pair: "Homma's Brown Rice Sushi, California Avenue, Palo Alto" vs. "Homma's Sushi, Cal Ave, Palo Alto"
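The two listings above differ superficially but denote the same restaurant. A token-overlap score, a deliberately simplistic stand-in for similarity features (the talk does not specify its actual feature set), already separates such near-duplicate pairs from clearly distinct ones:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

dup = ("Homma's Brown Rice Sushi California Avenue Palo Alto",
       "Homma's Sushi Cal Ave Palo Alto")
print(round(jaccard(*dup), 2))   # 0.4: high overlap despite abbreviations
```

Abbreviations like "Cal Ave" vs. "California Avenue" are exactly why a single fixed similarity threshold is brittle and a learned classifier over many such features is preferred.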

  3. Why is it important? Applications: Business Listings in Y! Local, Celebrities in Y! Movies, Events in Y! Upcoming, and more. Dirty entities arrive from many sources (websites such as Yelp, Zagat, and Foursquare, plus databases and content providers); the task is to find duplicates and output deduplicated entities.

  4. How? Reformulated Goal: Construct a high-quality classifier that identifies duplicate entity pairs. Problem: How do we select training data? Answer: Active learning with human experts!

  5. Reformulated Workflow Dirty entities (from websites, databases, and content providers) → Our Technique → Deduplicated Entities

  6. Active Learning (AL) Primer Properties of an AL algorithm: label complexity, time complexity, consistency. Prior work: Uncertainty Sampling, Query by Committee, …, and Importance Weighted Active Learning (IWAL), which works even under noisy settings. We build on online IWAL without constraints, implemented in Vowpal Wabbit (VW): it targets the 0-1 metric, is time- and label-efficient, and is provably consistent.
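The importance-weighting idea behind IWAL can be sketched in a few lines (an illustrative toy, not VW's actual query rule, which derives the query probability from hypothesis-class disagreement):

```python
import random

def iwal_step(x, query_prob, label_oracle, sample, rng=random.random):
    """One IWAL round: query the label with probability p and, if
    queried, keep the example with importance weight 1/p so the
    weighted sample remains an unbiased view of the full stream."""
    p = query_prob(x)                  # in (0, 1], set by the AL strategy
    if rng() < p:
        sample.append((x, label_oracle(x), 1.0 / p))

# toy run: a constant query probability of 0.5 halves labeling cost
# in expectation, while weights of 2.0 keep loss estimates unbiased
rng = random.Random(0)
sample = []
for x in range(1000):
    iwal_step(x, lambda x: 0.5, lambda x: x % 2, sample, rng.random)
print(len(sample))   # roughly 500 queried examples, each with weight 2.0
```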

  7. Problem One: Imbalanced Data Non-match:match ratios of 100:1 are typical, even after blocking. Under plain 0-1 error, the degenerate classifier that labels every pair a non-match achieves 0-1 error ≈ 0 (and vacuously 100% precision) while identifying no matches at all. Solution: use the metric from [Arasu11] instead: maximize recall (the % of correct matches identified), such that precision (the fraction of predicted matches that are correct) > τ.
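The degenerate classifier on this slide is easy to reproduce: predicting "non-match" for all 101 pairs looks excellent under 0-1 error yet finds nothing (a small self-contained illustration):

```python
# 100 non-matches (label 0) and 1 match (label 1): the 100:1 ratio
# that is typical even after blocking
labels = [0] * 100 + [1]
preds = [0] * 101              # degenerate: call every pair a non-match

error = sum(p != y for p, y in zip(preds, labels)) / len(labels)
recall = sum(p == y == 1 for p, y in zip(preds, labels)) / sum(labels)
print(f"0-1 error = {error:.3f}, recall = {recall:.1f}")
# 0-1 error = 0.010, recall = 0.0: near-perfect error, zero matches found
```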

  8. Problem Two: Guarantees Prior work on entity matching offers no guarantees on recall/precision; even approaches that do have high time and label complexity. Can we adapt prior work on AL to the new objective (maximize recall such that precision > τ) with sub-linear label complexity and efficient time complexity?

  9. Overview of Our Approach Recall optimization with precision constraint → [reduction: convex-hull search in relaxed Lagrangian (this talk)] → weighted 0-1 error → [reduction: rejection sampling (paper)] → active learning with 0-1 error.

  10. Objective Given: hypothesis class H, threshold τ ∈ [0,1]. Objective: find h in H that maximizes recall(h) such that precision(h) ≥ τ. Equivalently: maximize −falseneg(h) such that truepos(h) − ε·falsepos(h) ≥ 0, where ε = τ/(1−τ).
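The equivalence follows from rearranging precision(h) = truepos/(truepos + falsepos) ≥ τ. A quick numeric check (illustrative code, not the paper's) confirms that with ε = τ/(1−τ) the linear constraint truepos − ε·falsepos ≥ 0 agrees with the precision constraint at every count:

```python
def precision_ok(tp, fp, tau):
    """The original constraint: precision >= tau."""
    return tp / (tp + fp) >= tau

def linear_ok(tp, fp, tau):
    """The rewritten constraint: tp - eps * fp >= 0, eps = tau/(1-tau)."""
    eps = tau / (1 - tau)
    return tp - eps * fp >= 0

# tau = 0.75 gives eps = 3.0 exactly, avoiding float boundary noise
tau = 0.75
for tp in range(1, 40):
    for fp in range(0, 40):
        assert precision_ok(tp, fp, tau) == linear_ok(tp, fp, tau)
print("precision >= tau  <=>  tp - eps*fp >= 0, for tau =", tau)
```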

  11. Unconstrained Objective Current formulation: maximize X(h) = −falseneg(h) such that Y(h) = truepos(h) − ε·falsepos(h) ≥ 0. If we introduce a Lagrange multiplier λ and maximize X(h) + λ·Y(h), this can be rewritten as: minimize δ·falseneg(h) + (1 − δ)·falsepos(h), a weighted 0-1 objective.
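To see why the Lagrangian is a weighted 0-1 objective: on a fixed data set with P positive pairs, truepos(h) = P − falseneg(h), so X(h) + λ·Y(h) = λP − (1+λ)·falseneg(h) − λε·falsepos(h), and maximizing it ranks classifiers exactly like minimizing δ·falseneg + (1−δ)·falsepos with δ = (1+λ)/(1+λ+λε). That normalization is derived here from the algebra (the paper may write it differently); a small check:

```python
def lagrangian(fn, fp, P, lam, eps):
    """X(h) + lam*Y(h), with X = -falseneg and Y = truepos - eps*falsepos."""
    tp = P - fn
    return -fn + lam * (tp - eps * fp)

def weighted01(fn, fp, lam, eps):
    """The equivalent weighted 0-1 error, delta = (1+lam)/(1+lam+lam*eps)."""
    delta = (1 + lam) / (1 + lam + lam * eps)
    return delta * fn + (1 - delta) * fp

# maximizing the Lagrangian and minimizing the weighted error induce
# the same ranking over all error profiles (fn, fp)
P, lam, eps = 10, 2.0, 3.0
profiles = [(fn, fp) for fn in range(P + 1) for fp in range(20)]
agree = all(
    (lagrangian(*a, P, lam, eps) - lagrangian(*b, P, lam, eps))
    * (weighted01(*b, lam, eps) - weighted01(*a, lam, eps)) >= -1e-9
    for a in profiles for b in profiles)
print(agree)   # True
```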

  12. Convex Hull of Classifiers Plot each classifier at the point (X(h), Y(h)); the convex hull is the shape formed by joining the classifiers that strictly dominate the others. We want the classifier maximizing X(h) such that Y(h) ≥ 0; an exponential number of classifiers can lie inside the hull.

  13. Convex Hull of Classifiers For any λ > 0, some vertex or edge of the hull attains the largest value of X + λ·Y. Plugging λ into the weighted objective yields the classifier h with the highest X(h) + λ·Y(h). If λ = −1/slope of a hull edge u–v, we get a classifier on that edge; otherwise we get a vertex classifier. Objective: maximize X(h) such that Y(h) ≥ 0.

  14. Convex Hull of Classifiers Naïve strategy: try all λ (equivalently, try all slopes); this takes too long, and in the worst case we still land on a suboptimal vertex. Instead, binary search for λ. Problem: when to stop? (1) Bounds; (2) discretization of λ. Details in the paper!
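The binary search over λ can be sketched against an abstract weighted-learning oracle (a toy stand-in: the real oracle is the active learner of the next slide, and the paper's stopping bounds and discretization are omitted):

```python
def lambda_search(oracle, lam_lo, lam_hi, tol=1e-3):
    """Binary search for lambda. `oracle(lam)` returns the (X, Y) of the
    classifier maximizing X + lam*Y. Feasible answers (Y >= 0) appear at
    large lambda; we shrink lambda to recover the best feasible X."""
    best = None
    while lam_hi - lam_lo > tol:
        lam = (lam_lo + lam_hi) / 2
        X, Y = oracle(lam)
        if Y >= 0:                 # feasible: record it, try smaller lambda
            best, lam_hi = (X, Y), lam
        else:                      # infeasible: increase lambda
            lam_lo = lam
    return best

# toy oracle over four hull vertices (X, Y): picks the max of X + lam*Y
vertices = [(-1.0, 3.0), (-0.5, 1.0), (-0.2, -0.5), (0.0, -2.0)]
toy = lambda lam: max(vertices, key=lambda v: v[0] + lam * v[1])
print(lambda_search(toy, 0.0, 10.0))
# (-0.5, 1.0): the feasible vertex with the largest X
```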

  15. Algorithm I (Ours → Weighted) Given: AL black box C for weighted 0-1 error. Goal: the precision-constrained objective. Range of λ: [Λmin, Λmax]. Don't enumerate all candidate λ (too expensive: O(n³)); instead, discretize using factor θ (see paper) and binary search over the discretized values. Same complexity as binary search: O(log n).

  16. Algorithm II (Weighted → 0-1) Given: AL black box B for 0-1 error. Goal: AL black box C for weighted 0-1 error. Use a trick from supervised learning [Zadrozny03]: reduce the cost-sensitive objective to a binary one by rejection sampling.
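The rejection-sampling reduction can be sketched as follows: keep each weighted example with probability proportional to its weight, then hand the accepted examples, unweighted, to any binary 0-1 learner (a simplified sketch of the [Zadrozny03] idea, not the paper's online variant):

```python
import random

def reject_sample(weighted_examples, rng=random.Random(0)):
    """Keep (x, y) with probability w / w_max; the accepted, now
    unweighted, sample follows the cost-weighted distribution, so a
    plain 0-1 learner trained on it optimizes the weighted objective."""
    w_max = max(w for _, _, w in weighted_examples)
    return [(x, y) for x, y, w in weighted_examples
            if rng.random() < w / w_max]

# toy: matches cost 10x non-matches, so the lone match survives while
# roughly 90% of non-matches are rejected
data = [(i, 0, 1.0) for i in range(100)] + [(100, 1, 10.0)]
sample = reject_sample(data)
print(sum(y for _, y in sample), "match kept,", len(sample), "examples total")
```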

  17. Overview of Our Approach Recall optimization with precision constraint → [reduction: convex-hull search in relaxed Lagrangian (this talk), O(log n)] → weighted 0-1 error → [reduction: rejection sampling (paper), O(log n)] → active learning with 0-1 error. Overall: Labels = O(log² n)·L(B), Time = O(log² n)·T(B).

  18. Experiments Four real-world data sets, all labels known, so active learning can be simulated. Two approaches for AL with a precision constraint: (1) Ours, with Vowpal Wabbit as the 0-1 AL black box; (2) Monotone [Arasu11], which assumes monotonicity of similarity features and has high computational and label complexity.

  19. Results I (Runtime with #Features) Computational complexity on UCI Person

  20. Results II (Quality & #Label Queries) Business Person

  21. Results II (Contd.) DBLP-ACM Scholar

  22. Results III (0-1 Active Learning) Precision constraint satisfaction rate (%) of plain 0-1 AL

  23. Conclusion Active learning for entity matching that can use any 0-1 AL algorithm as a black box. Great real-world performance: computationally efficient (600k examples in 25 seconds), label-efficient, and better F-1 on four real-world tasks. Guarantees on the precision of the matcher and on time and label complexity.
