
Interactive Deduplication using Active Learning

Interactive Deduplication using Active Learning. Sunita Sarawagi and Anuradha Bhamidipaty. Presented by Doug Downey.



Presentation Transcript


  1. Interactive Deduplication using Active Learning Sunita Sarawagi and Anuradha Bhamidipaty Presented by Doug Downey

  2. Active Learning for de-duplication • De-duplication systems try to learn a function f : D × D → {duplicate, non-duplicate}, where D is the data set. • f is learned from a labeled training set Lp of record pairs. • Normally, D is large, so many sets Lp are possible. • Choosing a representative & useful Lp is hard. • Instead of a fixed set Lp, in Active Learning the learner interactively chooses pairs from D × D to be labeled and added to Lp.

  3. The ALIAS de-duplicator • Input: • Set Dp of pairs of data records represented as feature vectors (features might include edit distance, soundex, etc.). • Initial set Lp of some elements of Dp labeled as duplicates or non-duplicates. • Set T = Lp. • Loop until user satisfaction: • Train classifier C using T. • Use C to choose a set S of instances from Dp for labeling. • Get labels for S from the user, and set T = T ∪ S.
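The loop on this slide can be sketched in a few lines of Python. The callables `train`, `select_uncertain`, and `get_labels` are hypothetical stand-ins for ALIAS's classifier training, instance selection, and interactive user labeling, and the fixed `rounds` cap stands in for "until user satisfaction":

```python
def alias_dedup_loop(D_pairs, L_init, train, select_uncertain, get_labels,
                     rounds=20):
    """Sketch of the ALIAS active-learning loop.

    D_pairs          -- unlabeled feature vectors for record pairs (Dp)
    L_init           -- initial labeled pairs Lp: list of (features, is_dup)
    train            -- callable: labeled set T -> classifier C
    select_uncertain -- callable: (C, pool) -> instances S to label
    get_labels       -- callable standing in for the human labeler
    """
    T = list(L_init)                   # T = Lp
    pool = list(D_pairs)
    for _ in range(rounds):            # "loop until user satisfaction"
        C = train(T)                   # train classifier C using T
        S = select_uncertain(C, pool)  # choose instances from Dp to label
        if not S:
            break
        for x in S:
            pool.remove(x)
        T.extend(get_labels(S))        # T = T ∪ S (now labeled)
    return train(T)
```

With a toy 1-D "classifier" (a threshold between the known positives and negatives), the loop homes in on the decision boundary using only a handful of labels.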

  4. The ALIAS de-duplicator

  5. Active Learning • How do we choose the set S of instances to label? • Idea: choose the most uncertain instances. • We’re given that +’s and –’s can be separated by some point, and assume that the probability of – or + varies linearly between labeled examples r and b. • The midpoint m is • maximally uncertain, and • also the point that reduces our “confusion region” the most. • So choose m!
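In one dimension, this argument reduces to binary search: querying the midpoint between the highest known – (r) and the lowest known + (b) halves the confusion region on every query. A minimal sketch, where `oracle` stands in for the human labeler:

```python
def locate_boundary(oracle, lo, hi, tol=1e-3):
    """Binary search for the +/- boundary on [lo, hi].

    oracle(x) -> True if x would be labeled '+'. Assumes labels are
    monotone: '-' below the boundary, '+' above it. Each query at the
    midpoint m (the maximally uncertain point) halves the remaining
    "confusion region" [lo, hi].
    """
    while hi - lo > tol:
        m = (lo + hi) / 2
        if oracle(m):
            hi = m       # boundary is at or below m
        else:
            lo = m       # boundary is above m
    return (lo + hi) / 2

# boundary at 0.37: ~10 midpoint queries shrink the region below 1e-3
est = locate_boundary(lambda x: x >= 0.37, 0.0, 1.0)
```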

  6. Measuring Uncertainty with Committees • Train a committee of several slightly different versions of a classifier. • Uncertainty(x) = entropy of the committee’s votes on x. • Form committees by: • Randomizing model parameters • Partitioning training data • Partitioning attributes
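The vote-entropy measure can be sketched as follows. The `perturbed_committee` helper is a hypothetical illustration of the first committee-forming method (randomizing model parameters), using a jittered 1-D threshold classifier as the model:

```python
import math
import random

def committee_uncertainty(x, committee):
    """Uncertainty of instance x = entropy of the committee's votes.

    committee -- list of classifiers; each maps x -> 0 (non-dup) or 1 (dup).
    Returns 0.0 when all members agree; 1.0 (max, in bits) on a 50/50 split.
    """
    votes = [clf(x) for clf in committee]
    p = sum(votes) / len(votes)            # fraction voting "duplicate"
    if p in (0.0, 1.0):
        return 0.0                         # full agreement: no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def perturbed_committee(base_threshold, size=5, spread=0.1, seed=0):
    """Form a committee by randomizing model parameters: jitter a 1-D
    threshold classifier around base_threshold (illustrative only)."""
    rng = random.Random(seed)
    thresholds = [base_threshold + rng.uniform(-spread, spread)
                  for _ in range(size)]
    return [lambda x, t=t: int(x >= t) for t in thresholds]
```

Instances far from the boundary get unanimous votes (uncertainty 0), while instances the committee splits on get high entropy, so they are the ones chosen for labeling.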

  7. Methods for Forming Committees

  8. Committee Size

  9. Representativeness of an Instance • We need informative instances, not just uncertain ones. • Solution: sample n of the kn most uncertain instances, weighted by uncertainty. • k = 1 ⇒ no sampling • kn = all data ⇒ full sampling • Why not use information gain?
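The partial-sampling scheme above can be sketched as follows; `sample_representative` is a hypothetical name, and uncertainty weights are assumed to be positive:

```python
import random

def sample_representative(pool, uncertainty, n, k, seed=None):
    """Pick n instances: keep the k*n most uncertain, then sample n of
    them without replacement, weighted by uncertainty.

    k = 1            -> no sampling: just the n most uncertain
    k*n = len(pool)  -> full sampling over all data
    """
    rng = random.Random(seed)
    top = sorted(pool, key=uncertainty, reverse=True)[:k * n]
    if k == 1:
        return top                     # degenerate case: no sampling
    chosen = []
    while top and len(chosen) < n:
        pick = rng.choices(top, weights=[uncertainty(x) for x in top],
                           k=1)[0]
        top.remove(pick)               # sample without replacement
        chosen.append(pick)
    return chosen
```

Sampling (k > 1) trades a little peak uncertainty for diversity, so the labeled set better represents the data distribution rather than clustering around one boundary region.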

  10. Sampling for Representativeness

  11. Evaluation – Different Classifiers • Decision Trees & Naïve Bayes: committees of 5 via parameter randomization. • SVMs: uncertainty = distance from the separator. • Start with one duplicate and one non-duplicate, add a new training example each round (n = 1), partial sampling (k = 5). • Similarity functions: 3-gram match, % overlapping words, approximate edit distance, special handling of numbers/nulls. • Data sets: • Bibliography: 32,131 citation pairs from Citeseer, 0.5% duplicates. • Address: 44,850 pairs, 0.25% duplicates.
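The similarity functions on this slide turn a pair of record strings into the feature vector the classifier sees. A sketch under stated assumptions: `edit_distance` here is plain Levenshtein rather than the paper's approximate variant, and the special handling of numbers/nulls is omitted:

```python
def ngrams(s, n=3):
    """Character n-grams of a string (the slide's 3-gram match)."""
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def edit_distance(a, b):
    """Plain Levenshtein distance (standing in for the paper's
    approximate edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def pair_features(a, b):
    """Feature vector for a record pair: 3-gram Jaccard similarity,
    fraction of overlapping words, and edit distance."""
    ga, gb = ngrams(a), ngrams(b)
    gram_sim = len(ga & gb) / max(len(ga | gb), 1)
    wa, wb = set(a.lower().split()), set(b.lower().split())
    word_overlap = len(wa & wb) / max(len(wa | wb), 1)
    return [gram_sim, word_overlap, edit_distance(a, b)]
```

For example, `pair_features("J. Smith", "J Smith")` yields high 3-gram similarity and an edit distance of 1, the kind of near-match signal the classifier learns to map to "duplicate".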

  12. Evaluation – different classifiers

  13. Evaluation – different classifiers

  14. Value of Active Learning

  15. Value of Active Learning

  16. Example Decision Tree

  17. Conclusions • Active Learning improves performance over random selection, using two orders of magnitude less training data. • Note: the gain is not due just to the change in the +/– mix of training examples. • In these experiments, Decision Trees outperformed SVMs and Naïve Bayes.
