
Analysis of Bootstrapping Algorithms
Seminar of Machine Learning for Text Mining, UPC, 18/11/2004


Presentation Transcript


  1. Analysis of Bootstrapping Algorithms. Seminar of Machine Learning for Text Mining, UPC, 18/11/2004. Mihai Surdeanu

  2. Goals • Introduce Steven Abney’s “Understanding the Yarowsky Algorithm” paper (Computational Linguistics 30(3), 2004) • What are the bootstrapping algorithms covered and their properties? • Theorem proofs are skipped • What do they mean in the context of document clustering and pattern acquisition? • How do they compare with other iterative-refinement clustering algorithms and with Yangarber (2003)?

  3. Notations • WSD: x – word; j – word sense; f – word/context feature • Clustering: x – document; j – category/domain; f – document feature (word, pattern)

  4. Generic Yarowsky Algorithm (Y-0) • Needs a base learner • Changes the labeling of an example only if the prediction score exceeds an arbitrary threshold • Does not change the labels of seeds • Nothing formal can be shown about Y-0.
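As a reading aid, here is a minimal sketch of the loop described on this slide, assuming a scikit-learn-style base learner with fit/predict_proba; the function name, the threshold value, and the stopping test are illustrative, not Abney's or Yarowsky's pseudocode.

```python
# Minimal sketch of the generic Yarowsky-style loop (Y-0); names, the threshold
# value, and the stopping test are illustrative, not the paper's pseudocode.
import numpy as np

def yarowsky_y0(base_learner, X, seed_labels, threshold=0.8, max_iter=100):
    """seed_labels: integer array, -1 marks an unlabeled example."""
    labels = seed_labels.copy()
    is_seed = seed_labels != -1
    for _ in range(max_iter):
        mask = labels != -1
        base_learner.fit(X[mask], labels[mask])           # retrain on current labels
        proba = base_learner.predict_proba(X)
        pred = base_learner.classes_[proba.argmax(axis=1)]
        conf = proba.max(axis=1)
        new_labels = labels.copy()
        free = ~is_seed
        # Non-seed examples take the predicted label only when the prediction
        # exceeds the (arbitrary) threshold; otherwise they revert to unlabeled.
        # Seeds never change.
        new_labels[free] = np.where(conf[free] > threshold, pred[free], -1)
        if np.array_equal(new_labels, labels):
            break                                         # labeling is stable
        labels = new_labels
    return labels
```

The Y-1 modification on the next slide changes only the relabeling step of this loop.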

  5. Modified Algorithm (Y-1) • A labeled example cannot become unlabeled again • Fixed threshold
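Continuing the sketch above (same illustrative names), the Y-1 change would replace only the relabeling line: an example that already has a label keeps one.

```python
# Y-1 relabeling rule, replacing the np.where(...) update in the Y-0 sketch:
# once an example is labeled it stays labeled, and the threshold is fixed in advance.
confident = (conf > threshold) & ~is_seed
new_labels = labels.copy()
new_labels[confident] = pred[confident]   # label (or relabel) confident examples
# below-threshold examples simply keep whatever label they already have
```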

  6. Properties of Y-1 • If the base learner reduces the divergence on the labeled (or all) examples, algorithm Y-1 decreases H (cross entropy – equation (6)) at each iteration until it reaches a critical point of H
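For reference, a sketch of the objective this property refers to, written in the paper's usual notation (labeling distribution φ_x, prediction distribution π_x); the exact form of equation (6) should be checked against the paper.

```latex
% Cross entropy between the labeling distribution phi_x and the prediction pi_x
% (reconstruction of equation (6) in Abney 2004).
H \;=\; -\sum_{x \in X} \sum_{j} \phi_x(j)\,\log \pi_x(j)
```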

  7. The Original Decision List Induction Algorithm (DL-0) • Smooth precision with an arbitrary (fixed) value ε • Pick the label given by the rule with the best score • The resulting prediction score is NOT a probability distribution! • Nothing formal can be shown about DL-0.
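A small sketch of this style of decision-list induction, with an assumed smoothing constant eps and illustrative names; the paper's exact smoothing formula may differ.

```python
# Sketch of DL-0-style decision-list training and prediction; the smoothing
# form (a single arbitrary constant eps) and the names are illustrative.
from collections import defaultdict

def train_dl0(examples, labels, num_labels, eps=0.1):
    """examples: list of feature sets; labels: parallel list of label ids."""
    count_fj = defaultdict(float)            # feature f seen with label j
    count_f = defaultdict(float)             # feature f seen in labeled data
    for feats, j in zip(examples, labels):
        for f in feats:
            count_fj[(f, j)] += 1.0
            count_f[f] += 1.0
    # Smoothed precision of the rule f -> j (one common smoothing choice;
    # the paper's exact formula may differ).
    return {(f, j): (count_fj.get((f, j), 0.0) + eps) / (count_f[f] + num_labels * eps)
            for f in count_f for j in range(num_labels)}

def predict_dl0(theta, feats, num_labels):
    # The label comes from the single strongest rule among the example's features,
    # rather than from a mixture of all of them (contrast with DL-EM below).
    best_score, best_label = max(((theta.get((f, j), 0.0), j)
                                  for f in feats for j in range(num_labels)),
                                 default=(0.0, 0))
    return best_label
```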

  8. The EM-based Decision List Algorithm (DL-EM) • A mixture of the feature scores θ_fj is used to compute the prediction π_x (see above). Because θ is a probability distribution (over labels), π_x is also a probability distribution. • Whereas in DL-0 the prediction is given by the “strongest” feature, here the algorithm permits a block of “weaker” features to outweigh the strongest feature. • DL-EM does not construct a classifier from scratch (like DL-0), but rather builds upon the previous classifier (the old θ_fj and π_x values).
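The mixture prediction the first bullet refers to, reconstructed in the θ/π notation used above (the exact normalization is an assumption):

```latex
% Mixture prediction: each feature of x contributes its score theta_{fj},
% averaged over the features F_x present in x, so pi_x(.) sums to one
% whenever each theta_{f.} does.
\pi_x(j) \;=\; \frac{1}{|F_x|} \sum_{f \in F_x} \theta_{fj}
```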

  9. The EM-based Decision List Algorithm (DL-EM) • Probability that feature f was responsible for label j for object x • Normalization over all features
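A hedged reconstruction of the update these two bullets describe, in EM terms (E-step responsibilities, M-step re-estimation); the exact normalizations are assumptions and should be checked against the paper.

```latex
% E-step: probability that feature f (of example x, currently labeled Y_x) was
% responsible for that label, normalized over the features of x.
r_x(f) \;=\; \frac{\theta^{old}_{f\,Y_x}}{\sum_{g \in F_x} \theta^{old}_{g\,Y_x}}
% M-step: re-estimate the score of feature f for label j from the
% responsibilities, normalized so that theta_{f.} remains a distribution over j.
\theta_{fj} \;\propto\; \sum_{x \,:\, f \in F_x,\; Y_x = j} r_x(f)
```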

  10. Algorithm DL-EM-Λ • What are the “(0)” parameters (the initial values)? • A similar algorithm exists in which the feature score is computed over all examples, not just the labeled ones: DL-EM-V.

  11. Properties of DL-EM-* • Y-1/DL-EM-Λ and Y-1/DL-EM-V decrease H at each iteration until they reach a critical point of H (a local minimum).

  12. Algorithm DL-1-R • “Raw” (unsmoothed) precision • Mixture of feature scores

  13. Algorithm DL-1-VS • Precision with variable smoothing for each feature • Mixture of feature scores
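For comparison, sketches of the two scores (both plugged into the same mixture prediction π_x): the raw-precision form follows directly from the slide, while the variable-smoothing formula is a reconstruction from the slide's wording, not a quote from the paper.

```latex
% DL-1-R: raw (unsmoothed) precision of feature f for label j over the
% labeled examples Lambda that contain f.
\theta^{R}_{fj} \;=\; \frac{|\Lambda_{fj}|}{|\Lambda_{f}|}
% DL-1-VS: precision with a per-feature amount of smoothing (reconstruction:
% the unlabeled occurrences V_f of f act as the smoothing mass, spread
% uniformly over the L labels).
\theta^{VS}_{fj} \;=\; \frac{|\Lambda_{fj}| + \tfrac{1}{L}|V_{f}|}{|\Lambda_{f}| + |V_{f}|}
```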

  14. Properties of DL-1-* • Y-1/DL-1-R minimizes K (an upper bound on H) over the labeled examples Λ • Y-1/DL-1-VS minimizes K over all examples X

  15. So far… • Y-0/DL-0 – the original Yarowsky algorithm. Cannot be shown to minimize H or K. • Y-1/DL-EM-Λ and Y-1/DL-EM-V minimize H • Y-1/DL-1-R and Y-1/DL-1-VS minimize K

  16. Sequential Algorithms • All previous algorithms do “parallel” updates, in the sense that the parameters {θ_fj} are all recomputed at every iteration. • Sequential algorithms select one feature at each iteration: S_{t+1} = S_t ∪ {f_t} • Only the score of the selected feature and the scores of the documents containing the chosen feature are recomputed. • More flexible – shown to converge for more base learners.

  17. Algorithm YS • Choose a feature that: (1) is not a seed, (2) is seen in the training data, and (3) whose score has changed
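A minimal sketch of a sequential loop with this candidate test, using illustrative names (score_feature and predict stand in for whichever YS base learner is used); it is not the paper's pseudocode.

```python
# Sketch of a sequential YS-style loop: one feature f_t is admitted per
# iteration, S_{t+1} = S_t U {f_t}, and only the scores touched by f_t
# are recomputed (illustrative only).
def ys_sequential(examples, labels, seeds, score_feature, predict, max_iter=1000):
    """examples: list of feature sets; labels: parallel list (None = unlabeled,
    updated in place); seeds: features fixed in advance;
    score_feature(f, labels) -> per-label scores for feature f;
    predict(feats, theta) -> label, using only the currently selected features."""
    selected = set(seeds)
    theta = {f: score_feature(f, labels) for f in selected}
    all_features = {f for feats in examples for f in feats}
    for _ in range(max_iter):
        # A candidate feature (1) is not a seed, (2) occurs in the currently
        # labeled data, and (3) has a score that changed under the current labeling.
        candidates = [f for f in all_features - set(seeds)
                      if any(f in feats and y is not None
                             for feats, y in zip(examples, labels))
                      and score_feature(f, labels) != theta.get(f)]
        if not candidates:
            break                               # nothing left to add or update
        f_t = candidates[0]                     # selection rule depends on the base learner
        selected.add(f_t)                       # S_{t+1} = S_t U {f_t}
        theta[f_t] = score_feature(f_t, labels)
        # Relabel only the documents that contain the chosen feature.
        for i, feats in enumerate(examples):
            if f_t in feats:
                labels[i] = predict(feats & selected, theta)
    return selected, theta, labels
```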

  18. Base Learners for YS • Biased towards the feature that maximizes raw precision (“anti-smoothing”)
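One way to read “anti-smoothing” is a peaked score that concentrates almost all of a feature's mass on its best label; the constants below (ε and the uniform split of the remainder) are assumptions, not the paper's definition.

```latex
% Peaked ("anti-smoothed") score: nearly all mass goes to the label j*(f) with
% the highest raw precision for feature f (epsilon small, L labels).
% Reconstruction only.
j^{*}(f) \;=\; \arg\max_{j} \frac{|\Lambda_{fj}|}{|\Lambda_{f}|}
\qquad
\theta_{fj} \;=\;
\begin{cases}
1-\epsilon & \text{if } j = j^{*}(f)\\[2pt]
\dfrac{\epsilon}{L-1} & \text{otherwise}
\end{cases}
```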

  19. Properties of YS-* • YS-P and YS-R reduce K in every iteration. • YS-FS reduces K in every iteration for new features.

  20. Yarowsky versus Co-training • Co-training attempts to maximize agreement on unlabeled data between classifiers trained on different “views” of the data. • The modified Yarowsky algorithms introduced in this paper reduce the cross entropy H (or its upper bound K), similarly to co-training. • Co-training assumes at least two independent views of the data, and is hence more restricted.

  21. YS versus Yangarber (1) • The labeling is hard: the label score is set to 1 (accepted) or 0 otherwise – NOT a probability distribution • The feature scores are then recomputed

  22. YS versus Yangarber (2) • Yangarber does not require the computation of Y, as its goal is to learn patterns (features) relevant for each label (category) • A plus for Yangarber, as Y_x = ŷ is a VERY strong statement in document classification: it classifies a document based on the limited information available in the current iteration • Y can be computed as a side effect when the algorithm completes. This is used as an indirect evaluation.

  23. YS versus Yangarber (3) • The base learner for Yangarber generates scores that are NOT probability distributions, which makes the algorithm hard to analyze formally: θ_fj = raw_precision(f, j) × log(number of documents containing f) • The raw-precision part is similar to YS-R (?)
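For concreteness, the score from the bullet above as a tiny sketch; the function and argument names are illustrative, while the formula itself is taken from the slide.

```python
import math

def yangarber_score(count_fj, count_f, doc_freq_f):
    """Yangarber-style feature score from the slide: raw precision of feature f
    for label j, weighted by the log of f's document frequency.
    The scores over j do not sum to 1, i.e. they are not a probability distribution."""
    raw_precision = count_fj / count_f if count_f else 0.0
    return raw_precision * math.log(doc_freq_f) if doc_freq_f > 0 else 0.0
```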

  24. Bootstrapping versus K-Means and EM • K-Means and bootstrapping “hard” classify objects in each iteration: Y_x = ŷ. EM (and Yangarber) compute Y only in the last iteration. • I think K-Means and EM converge more rapidly because they accumulate features faster than bootstrapping does. • In K-Means, essentially all features are in use after the first iteration. • In YS-FS (and Yangarber) only one feature (or a very small number) is selected in every iteration.

  25. Conclusions • Abney shows that simple modifications of the Yarowsky bootstrapping algorithm can be formally proven to converge to a local minimum (like EM) • Judged against this work, Yangarber (and Riloff) are far from the formalization required to show that they converge • Is there a better algorithm for pattern learning?
