  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 21 • 4/3/2013

  2. Recommended reading • Banko & Brill. 2001. Scaling to very very large corpora for natural language disambiguation. • Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. Proceedings of COLT. • Best 10-year paper, awarded in 2008 • Thorsten Joachims. 1999. Transductive inference for text classification using Support Vector Machines. ICML. • Best 10-year paper, awarded in 2009 • David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. ACL. • Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. EMNLP.

  3. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  4. Data quantity vs. performance • NLP: fully annotated data sets for testing machine learning algorithms • WSJ (1.3 million words) • Brown corpus (1 million words) • Prague dependency treebank (2 million words) • What happens when we train on much larger data sets?

  5. Banko & Brill 2001 • “Confusion set disambiguation” • { principle, principal } { then, than } • { to, two, too } { weather, whether } • Corpus generation • Replace every occurrence of a confusion-set word with a marker • The school hired a new principal → The school hired a new PRINCIPLE/PRINCIPAL. • Easy to generate data sets from very large corpora • 1 billion word corpus • Task • Algorithm must choose the correct word • Similar to word sense disambiguation
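A minimal sketch of this corpus-generation idea in Python. The function name, regex-based masking, and example sentence are illustrative assumptions, not Banko & Brill's actual code; the point is that the removed word itself serves as the gold label, so labeled data can be produced from raw text at any scale.

```python
import re

# One confusion set; Banko & Brill used several, e.g. {principle, principal}.
CONFUSION_SET = ("principle", "principal")
MARKER = "PRINCIPLE/PRINCIPAL"

def make_examples(sentences):
    """Replace each occurrence of a confusion-set word with a marker.
    The word that was removed becomes the gold label, so labeled
    examples can be generated from unannotated text."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(CONFUSION_SET), re.IGNORECASE)
    examples = []
    for sent in sentences:
        for match in pattern.finditer(sent):
            label = match.group(1).lower()
            masked = sent[:match.start()] + MARKER + sent[match.end():]
            examples.append((masked, label))
    return examples

print(make_examples(["The school hired a new principal."]))
# [('The school hired a new PRINCIPLE/PRINCIPAL.', 'principal')]
```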

  6. Banko & Brill 2001 • Algorithms tested • Winnow • Perceptron • Naive Bayes • Memory-based / nearest-neighbor • Features: • Words within a window • POS n-grams • Word n-grams

  7. Performance comparison

  8. Banko & Brill 2001 • Conclusions: • Quantity of data positively affects performance (more data is better) • Relative performance of different algorithms differs depending on the amount of training data (this is disturbing; it makes standardized test sets for algorithm comparison seem less meaningful)

  9. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  10. Annotated data is expensive ($ and time) • (from J. Zhu via A. Blum)

  11. Data, annotation, and performance • Not much labeled data (annotated) • Lots of unlabeled data (unannotated) • Limited level of performance from training on labeled data only • Can we use unlabeled data to improve performance?

  12. Utilizing unlabeled data • Easy to collect unlabeled data • Existing corpora containing billion(s) of words • Internet • Unlabeled data: • Missing the most important information (labels) • But there are other statistical regularities that can be exploited

  13. Amount of supervision • Supervised learning: • Given a sample of object-label pairs (xi , yi), find the predictive relationship between objects and labels • Unsupervised learning: • Discover structures in unlabeled data • Semi-supervised learning: use both labeled and unlabeled data • Supervised learning + additional unlabeled data • Unsupervised learning + additional labeled data (“bootstrapping”)

  14. Semi-supervised learning algorithms and applications • Supervised learning + additional unlabeled data • Transductive SVM • Co-training • Web page classification • Unsupervised learning + additional labeled data (“bootstrapping”) • Yarowsky algorithm; bootstrapping with seeds • Word sense disambiguation • Co-training with decision list • Named Entity Recognition

  15. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  16. Inductive vs. transductive SVM • Inductive • Find max margin hyperplane on training set • Standard SVM algorithm • Transductive • Useful when only a small amount of data is labeled • Goal is really to minimize error on test set • Take testing data into account when finding max margin hyperplane

  17. Inductive vs. transductive SVM • Transductive SVM has better performance than standard SVM • [Figure: inductive vs. transductive max-margin hyperplanes; the additional unlabeled data points are assigned to the nearest class from the labeled data]
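scikit-learn has no built-in transductive SVM, so the sketch below only approximates the intuition with a simple self-training loop around a linear SVM: unlabeled (test) points that the current hyperplane is most confident about are folded into the training pool, letting the unlabeled data shift the boundary. The function name, number of rounds, and growth size are assumptions, and this is not Joachims' TSVM optimization.

```python
import numpy as np
from sklearn.svm import LinearSVC

def self_training_svm(X_lab, y_lab, X_unlab, n_rounds=5, per_round=10):
    """Crude stand-in for transductive learning (binary case): repeatedly
    train a linear SVM on the labeled pool, then move the unlabeled points
    it is most confident about into the pool with their predicted labels."""
    X_pool, y_pool = X_lab.copy(), y_lab.copy()
    remaining = X_unlab.copy()
    clf = LinearSVC()
    for _ in range(n_rounds):
        clf.fit(X_pool, y_pool)
        if len(remaining) == 0:
            break
        scores = clf.decision_function(remaining)
        # Most confident = largest |distance| from the current hyperplane.
        idx = np.argsort(-np.abs(scores))[:per_round]
        X_pool = np.vstack([X_pool, remaining[idx]])
        y_pool = np.concatenate([y_pool, clf.predict(remaining[idx])])
        remaining = np.delete(remaining, idx, axis=0)
    return clf
```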

  18. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  19. Co-training: Blum & Mitchell 1998 • Combines 2 ideas • Semi-supervised training • Small amount of labeled data • Larger amount of unlabeled data • Use two supervised classifiers simultaneously • Outperforms a single classifier

  20. Example problem • Collected a data set • 1051 web pages from CS departments at 4 universities • Manually labeled as + or - • + is a course home page (22% of web pages) • - is not a course home page (the rest of the web pages) • Use Naïve Bayes to classify web pages

  21. Features for web page classification • [Figure: a web page instance x has link info (x1) and text info (x2), e.g. anchor text “Prof. Avrim Blum, My Advisor”] • x1: text in hyperlinks (bag of words): <a href = … >CS 100, Fall semester</a> • x2: text in the page (bag of words): <html>Exam #1</html> • Training instances contain both features: x = (x1, x2)

  22. Views • A sufficient set of features is called a view • Each view by itself is sufficient to produce an optimal classifier • For web page example, pages can be classified accurately with either text or hyperlinks • Two views are conditionally independent (given the label) • p(x1|x2, y) = p(x1|y) • p(x2|x1, y) = p(x2|y)

  23. Co-Training algorithm • Start with small portion of labeled data • Train two classifiers from the same pool of data; each classifier is based on a different “view” • Use the two classifiers to label the data set • Data points that are classified with high confidence are added to pool of labeled data • Amount of labeled data gradually increases until it covers the entire data set
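A compact sketch of the loop just described, assuming Python, scikit-learn, and two bag-of-words count matrices as the two views. The number of rounds and growth size per round are arbitrary assumptions, not Blum & Mitchell's exact settings.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labeled_idx, n_rounds=30, grow=5):
    """X1, X2: the two views (e.g. hyperlink words vs. page words) as count
    matrices; y: labels for the initially labeled examples; labeled_idx:
    row indices of those labeled examples."""
    labeled = list(labeled_idx)
    labels = {i: y[k] for k, i in enumerate(labeled_idx)}
    unlabeled = [i for i in range(X1.shape[0]) if i not in labels]
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(n_rounds):
        ytr = np.array([labels[i] for i in labeled])
        clf1.fit(X1[labeled], ytr)
        clf2.fit(X2[labeled], ytr)
        if not unlabeled:
            break
        # Each classifier labels the unlabeled pool; its most confident
        # predictions are added to the shared labeled pool for the next round.
        for clf, X in ((clf1, X1), (clf2, X2)):
            probs = clf.predict_proba(X[unlabeled])
            conf = probs.max(axis=1)
            best = np.argsort(-conf)[:grow]
            for b in best:
                labels[unlabeled[b]] = clf.classes_[probs[b].argmax()]
            for b in sorted(best, reverse=True):
                labeled.append(unlabeled.pop(b))
    return clf1, clf2
```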

  24. Co-Training algorithm

  25. Error rate in classifying web pages • Combined classifier • Supervised: combine features with Naïve Bayes: p(cj | x1, x2) ∝ p(cj | x1) p(cj | x2) • Co-training: use both the page-based and hyperlink-based classifiers

  26. Co-training: error rate vs. # of iterations • [Plot: error rate over co-training iterations for the page-based and hyperlink-based classifiers, compared to a baseline that always predicts “not a course web page”]

  27. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  28. Decision List • A simple discriminative classifier • Compute argmaxC p(C|F) • Compare: p(C1|f1), p(C2|f1), … p(C1|fn), p(C2|fn) • Choose class based on largest difference in p( Ci | fj ) for a feature fj in the data to be classified

  29. Decision List for WSD: p(sense|feature) • The decision list compares the conditional probabilities of senses given various features, to determine the probabilistically most likely sense for a word. • Example: disambiguate ‘bank’ in this sentence: • I checked my boat at the marina next to the bank of the river near the shore. • p( money-sense | ‘check’ ) • p( river-sense | ‘check’ ) • … • p( money-sense | ‘shore’ ) • p( river-sense | ‘shore’ ) ← let’s say this has the highest probability

  30. Automatically build disambiguation system • Yarowsky’s method: • Get corpus with words annotated for different categories • Formulate templates for generation of disambiguating rules • Algorithm constructs all such rules from a corpus • Algorithm selects relevant rules through statistics of usage for each category • Methodology can be applied to any binary disambiguation problem

  31. Rule templates • [Diagram: rule templates + annotated corpus → possible rules → statistics of usage → ranked rules]

  32. Decision list algorithm: step 1, identify ambiguities • Example problem: accent restoration

  33. Step 2: Collect training contexts • Begin with an annotated corpus • (In this context, a corpus with accents indicated)

  34. Step 3: Specify rule templates • Given a particular training context, collect: • Word immediately to the right (+1 W) or left (-1 W) • Word found in ±k word window • Pair of words at fixed offsets • Other evidence can be used: • Lemma (morphological root) • Part of speech category • Other types of word classes (e.g. set of days of week)
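A sketch of how these templates might be instantiated in Python for a target token at position i. The template names, window size k, and the particular word-pair offsets are illustrative assumptions.

```python
def extract_rules(tokens, i, k=3):
    """Generate candidate decision-list rules (template, value) for the
    ambiguous token at position i, following the templates above."""
    rules = []
    if i + 1 < len(tokens):
        rules.append(("+1 W", tokens[i + 1]))            # word immediately to the right
    if i - 1 >= 0:
        rules.append(("-1 W", tokens[i - 1]))            # word immediately to the left
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            rules.append(("+-k W", tokens[j]))           # word within a +/-k window
    if i - 2 >= 0:
        rules.append(("-2 W, -1 W", (tokens[i - 2], tokens[i - 1])))   # word pair at fixed offsets
    if i + 2 < len(tokens):
        rules.append(("+1 W, +2 W", (tokens[i + 1], tokens[i + 2])))
    return rules
```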

  35. Step 4a: Count frequency of rules for each category

  36. Step 4b: Turn rule frequencies into probabilities

  37. Which rules are indicative of a category? • Two categories c1 and c2; p(c1|rule) + p(c2|rule) = 1 • Log-likelihood ratio: log( p(c1|rule) / p(c2|rule) ) • If p(c1|rule) = 0.5 and p(c2|rule) = 0.5, the rule doesn’t distinguish the categories: log( p(c1|rule) / p(c2|rule) ) = 0 • If p(c1|rule) > 0.5 and p(c2|rule) < 0.5, c1 is more likely: log( p(c1|rule) / p(c2|rule) ) > 0 • If p(c1|rule) < 0.5 and p(c2|rule) > 0.5, c2 is more likely: log( p(c1|rule) / p(c2|rule) ) < 0

  38. Which rules are best for disambiguating between categories? • Use absolute value of log-likelihood ratio: abs(log( p(sense1 | rule) / p(sense2 | rule) )) • Rank rules by abs. value of log-likelihood ratio • Rules that best distinguish between the two categories are ranked highest

  39. Step 5: Choose rules that are indicative of categories: sort by abs(LogL) • This is the final decision list
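Steps 4 and 5 (counting rule frequencies, converting them to a log-likelihood ratio, and ranking) can be sketched in a few lines of Python, assuming the (rules, category) pairs come from the template extraction above. The add-alpha smoothing constant is an assumption to avoid log(0); Yarowsky's actual smoothing of the ratio differs.

```python
import math
from collections import Counter

def build_decision_list(tagged_examples, alpha=0.1):
    """tagged_examples: iterable of (rules, category) pairs, where rules is
    the list produced by extract_rules and category is "c1" or "c2".
    Returns rules sorted by |log-likelihood ratio|, highest first."""
    counts = Counter()                      # (rule, category) -> frequency
    for rules, cat in tagged_examples:
        for rule in rules:
            counts[(rule, cat)] += 1
    all_rules = {r for (r, _) in counts}
    ranked = []
    for r in all_rules:
        c1 = counts[(r, "c1")] + alpha      # smoothed counts avoid log(0)
        c2 = counts[(r, "c2")] + alpha
        llr = math.log(c1 / c2)             # = log p(c1|r)/p(c2|r), shared denominator cancels
        ranked.append((abs(llr), "c1" if llr > 0 else "c2", r))
    return sorted(ranked, key=lambda t: t[0], reverse=True)
```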

  40. Step 6: classify new data with decision list • For a sentence with a word to be disambiguated: • Go down the ranked list of rules in the decision list • Find the first rule with a matching context • Assign a sense according to that rule • Finished. • Ignore other lower-ranked rules, even if they have matching contexts as well
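And a sketch of step 6, reusing the hypothetical extract_rules helper and the ranked list from the previous sketches: walk down the list and stop at the first rule whose context matches.

```python
def classify(decision_list, tokens, i):
    """Return the category of the highest-ranked rule whose context matches
    the token at position i; lower-ranked matching rules are ignored."""
    context = set(extract_rules(tokens, i))
    for _, category, rule in decision_list:
        if rule in context:
            return category          # first match wins
    return None                      # no rule matched
```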

  41. Example: disambiguate “plant” • Radiation from the crippled nuclear plant in Japan is showing up in rain in the United States.

  42. Limitations of supervised WSD • Practical issue in applying the algorithm to WSD: you need a corpus tagged for word senses • If you had a large, fully annotated corpus, WSD would be easy enough • But producing such a corpus would be extremely laborious • Senseval and other corpora only provide partial coverage • Another problem: each word is a unique disambiguation problem • Later: apply this algorithm in a semi-supervised setting (Yarowsky 1995)

  43. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  44. Bootstrapping

  45. Yarowsky 1995 • “Unsupervised word sense disambiguation rivaling supervised methods” • Actually semi-supervised bootstrapping: • Very small amount of human-annotated data • Iterative procedure for label propagation over unlabeled data set

  46. One sense per discourse hypothesis • “Words tend to exhibit only one sense in a given discourse or document”

  47. Step 1: identify all examples of target word • Store contexts in initial untagged training set

  48. Step 2: tag a small number of examples • For each sense, label a small set of training examples by hand • Do this according to seed words (features) • Example, “plant”: • manufacturing vs. life
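A sketch of this seed-labeling step, using the "manufacturing" vs. "life" seeds from the slide. The window size, sense names, tie handling, and function name are illustrative assumptions; contexts that match no seed (or conflicting seeds) stay untagged for later bootstrapping iterations.

```python
SEEDS = {"manufacturing": "sense-A", "life": "sense-B"}   # seed collocations for "plant"

def seed_label(contexts, k=5):
    """contexts: list of token lists, each containing the target word 'plant'.
    Label a context if exactly one seed's word occurs within +/-k tokens of
    the target; everything else remains in the untagged residual."""
    labeled, residual = [], []
    for tokens in contexts:
        i = tokens.index("plant")
        window = tokens[max(0, i - k): i + k + 1]
        senses = {SEEDS[w] for w in window if w in SEEDS}
        if len(senses) == 1:
            labeled.append((tokens, senses.pop()))
        else:
            residual.append(tokens)
    return labeled, residual
```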

  49. Sample initial state after step 2
