  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 21 • 4/3/2013

  2. Recommended reading • Banko & Brill. 2001. Scaling to very very large corpora for natural language disambiguation. • Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. Proceedings of COLT. • Best 10-year paper, awarded in 2008 • Thorsten Joachims. 1999. Transductive inference for text classification using Support Vector Machines. ICML. • Best 10-year paper, awarded in 2009 • David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. ACL. • Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. EMNLP.

  3. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  4. Data quantity vs. performance • NLP: fully annotated data sets for testing machine learning algorithms • WSJ (1.3 million words) • Brown corpus (1 million words) • Prague dependency treebank (2 million words) • What happens when we train on much larger data sets?

  5. Banko & Brill 2001 • “Confusion set disambiguation” • { principle, principal } { then, than } • { to, two, too } { weather, whether } • Corpus generation • Replace all occurrences by a marker • The school hired a new principal → The school hired a new PRINCIPLE/PRINCIPAL. • Easy to generate data sets for very large corpora • 1 billion word corpus • Task • Algorithm must choose correct word • Similar to word sense disambiguation
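A minimal sketch of the corpus-generation idea on this slide, in Python: every occurrence of a confusion-set word in raw text is replaced by a marker, and the word the writer actually used becomes the gold label. The confusion set, marker string, and function name are illustrative, not Banko & Brill's actual pipeline.

```python
import re

# Illustrative confusion set; Banko & Brill evaluate several such sets.
CONFUSION_SET = {"principle", "principal"}
MARKER = "PRINCIPLE/PRINCIPAL"

def make_examples(sentences):
    """Mask each confusion-set word and keep the original word as the label.
    Any raw text becomes labeled training data, since the writer's own
    choice of word supplies the annotation for free."""
    examples = []
    pattern = re.compile(r"\b(%s)\b" % "|".join(sorted(CONFUSION_SET)), re.IGNORECASE)
    for sent in sentences:
        for m in pattern.finditer(sent):
            masked = sent[:m.start()] + MARKER + sent[m.end():]
            examples.append((masked, m.group(1).lower()))
    return examples

print(make_examples(["The school hired a new principal."]))
# [('The school hired a new PRINCIPLE/PRINCIPAL.', 'principal')]
```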

  6. Banko & Brill 2001 • Algorithms tested • Winnow • Perceptron • Naive Bayes • Memory-based / nearest-neighbor • Features: • Words within a window • POS n-grams • Word n-grams

  7. Performance comparison

  8. Banko & Brill 2001 • Conclusions: • Quantity of data positively affects performance (more data is better) • Relative performance of different algorithms differs depending on the amount of training data (this is disturbing; it makes standardized test sets for algorithm comparison seem less meaningful)

  9. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  10. Annotated data is expensive ($ and time) • (from J. Zhu via A. Blum)

  11. Data, annotation, and performance • Not much labeled data (annotated) • Lots of unlabeled data (unannotated) • Limited level of performance from training on labeled data only • Can we use unlabeled data to improve performance?

  12. Utilizing unlabeled data • Easy to collect unlabeled data • Existing corpora containing billion(s) of words • Internet • Unlabeled data: • Missing the most important information (labels) • But there are other statistical regularities that can be exploited

  13. Amount of supervision • Supervised learning: • Given a sample of object-label pairs (xi , yi), find the predictive relationship between objects and labels • Unsupervised learning: • Discover structures in unlabeled data • Semi-supervised learning: use both labeled and unlabeled data • Supervised learning + additional unlabeled data • Unsupervised learning + additional labeled data (“bootstrapping”)

  14. Semi-supervised learning algorithms and applications • Supervised learning + additional unlabeled data • Transductive SVM • Co-training • Web page classification • Unsupervised learning + additional labeled data (“bootstrapping”) • Yarowsky algorithm; bootstrapping with seeds • Word sense disambiguation • Co-training with decision list • Named Entity Recognition

  15. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  16. Inductive vs. transductive SVM • Inductive • Find max margin hyperplane on training set • Standard SVM algorithm • Transductive • Useful when only a small amount of data is labeled • Goal is really to minimize error on test set • Take testing data into account when finding max margin hyperplane

  17. Inductive vs. transductive SVM • Transductive SVM has better performance than standard SVM • [Figure: the inductive SVM max margin hyperplane vs. the transductive SVM max margin hyperplane; the additional unlabeled data points are assigned to the nearest class from the labeled data]
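The hyperplane comparison on this slide comes from a figure; as a supplement, here is a minimal sketch of the transductive objective in the spirit of Joachims (1999), written as a simplified linear, penalty-form function. The variable names and the unconstrained formulation are assumptions for illustration; the actual TSVM solves a combinatorial optimization over the unlabeled labels.

```python
import numpy as np

def tsvm_objective(w, b, X_lab, y_lab, X_unl, C_lab=1.0, C_unl=0.1):
    """Simplified transductive SVM objective (illustrative only).
    Labeled points pay the usual hinge loss; each unlabeled point is
    implicitly given whichever label puts it on the correct side, so its
    loss uses |w.x + b|. Minimizing this pushes the hyperplane away from
    dense regions of unlabeled data, as in the slide's figure."""
    margins_lab = y_lab * (X_lab @ w + b)   # y in {-1, +1}
    margins_unl = np.abs(X_unl @ w + b)     # best-case label per unlabeled point
    hinge_lab = np.maximum(0.0, 1.0 - margins_lab).sum()
    hinge_unl = np.maximum(0.0, 1.0 - margins_unl).sum()
    return 0.5 * np.dot(w, w) + C_lab * hinge_lab + C_unl * hinge_unl
```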

  18. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  19. Co-training: Blum & Mitchell 1998 • Combines 2 ideas • Semi-supervised training • Small amount of labeled data • Larger amount of unlabeled data • Use two supervised classifiers simultaneously • Outperforms a single classifier

  20. Example problem • Collected a data set • 1051 web pages from CS departments at 4 universities • Manually labeled as + or - • + is a course home page (22% of web pages) • - is not a course home page (rest of web pages) • Use Naïve Bayes to classify web pages

  21. Features for web page classification • [Figure: example page "Prof. Avrim Blum, My Advisor", with x1 = link info, x2 = text info, and x = link info & text info] • x1: text in hyperlinks (bag of words) <a href = … >CS 100, Fall semester</a> • x2: text in the page (bag of words) <html>Exam #1</html> • Training instances contain both features: x = (x1, x2)

  22. Views • A sufficient set of features is called a view • Each view by itself is sufficient to produce an optimal classifier • For web page example, pages can be classified accurately with either text or hyperlinks • Two views are conditionally independent (given the label) • p(x1|x2, y) = p(x1|y) • p(x2|x1, y) = p(x2|y)

  23. Co-Training algorithm • Start with small portion of labeled data • Train two classifiers from the same pool of data; each classifier is based on a different “view” • Use the two classifiers to label the data set • Data points that are classified with high confidence are added to pool of labeled data • Amount of labeled data gradually increases until it covers the entire data set

  24. Co-Training algorithm
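The algorithm box from this slide is not reproduced in the transcript; the sketch below is a minimal Python version of the loop described on the previous slide, using scikit-learn Naive Bayes classifiers for the two views. It simplifies Blum & Mitchell's version (which samples from a small pool U' and adds a fixed number of positive and negative examples per round); the function name and parameters are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, n_iter=10, k=5):
    """Co-training sketch. X1, X2: count-feature matrices for the two views
    (e.g. hyperlink words and page words); y: labels with -1 = unlabeled.
    Each round, each view's classifier labels its k most confident
    unlabeled examples and adds them to the shared labeled pool."""
    y = y.copy()
    for _ in range(n_iter):
        for X in (X1, X2):
            labeled = y != -1
            unlabeled = np.where(~labeled)[0]
            if len(unlabeled) == 0:
                return y
            clf = MultinomialNB().fit(X[labeled], y[labeled])
            confidence = clf.predict_proba(X[unlabeled]).max(axis=1)
            most_confident = unlabeled[np.argsort(-confidence)[:k]]
            y[most_confident] = clf.predict(X[most_confident])
    return y
```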

  25. Error rate in classifying web pages • Combined classifier • Supervised: combine features with Naïve Bayes: p(cj | x1, x2) = p(cj | x1) p(cj | x2) • Co-training: use both page-based and hyperlink-based classifiers

  26. Co-training: error rate vs. # iterations • [Plot: error rate over co-training iterations for the page-based classifier and the hyperlink-based classifier, compared to a baseline that always predicts "not a course web page"]

  27. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  28. Decision List • A simple discriminative classifier • Compute argmaxC p(C|F) • Compare: p(C1|f1), p(C2|f1), … p(C1|fn), p(C2|fn) • Choose class based on largest difference in p( Ci | fj ) for a feature fj in the data to be classified

  29. Decision List for WSD: p(sense|feature) • The decision list compares the conditional probabilities of senses given various features, to determine the probabilistically most likely sense for a word. • Example: disambiguate ‘bank’ in this sentence: • I checked my boat at the marina next to the bank of the river near the shore. • p( money-sense | ‘check’ ) • p( river-sense | ‘check’ ) • … • p( money-sense | ‘shore’ ) • p( river-sense | ‘shore’ ) ← let’s say this has highest prob

  30. Automatically build disambiguation system • Yarowsky’s method: • Get corpus with words annotated for different categories • Formulate templates for generation of disambiguating rules • Algorithm constructs all such rules from a corpus • Algorithm selects relevant rules through statistics of usage for each category • Methodology can be applied to any binary disambiguation problem

  31. Rule templates • [Diagram: rule templates applied to the annotated corpus generate the possible rules; statistics of usage for each category turn the possible rules into ranked rules]

  32. Decision list algorithm: step 1, identify ambiguities • Example problem: accent restoration

  33. Step 2: Collect training contexts • Begin with an annotated corpus • (In this context, a corpus with accents indicated)

  34. Step 3: Specify rule templates • Given a particular training context, collect: • Word immediately to the right (+1 W) or left (-1 W) • Word found in ±k word window • Pair of words at fixed offsets • Other evidence can be used: • Lemma (morphological root) • Part of speech category • Other types of word classes (e.g. set of days of week)
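A minimal sketch of extracting these template rules from a tokenized context in Python; the window size, template names, and function name are illustrative:

```python
def extract_rules(tokens, i, k=3):
    """Generate candidate rule features for the ambiguous token at position i:
    the word immediately to the left/right, every word in a +/-k window,
    and one fixed-offset word pair (-1 W, +1 W)."""
    rules = []
    if i > 0:
        rules.append(("-1 W", tokens[i - 1]))
    if i + 1 < len(tokens):
        rules.append(("+1 W", tokens[i + 1]))
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            rules.append(("+/-k W", tokens[j]))
    if 0 < i < len(tokens) - 1:
        rules.append(("-1 W, +1 W", (tokens[i - 1], tokens[i + 1])))
    return rules

# e.g. the context from slide 29, disambiguating "bank" at position 5
print(extract_rules("checked my boat at the bank of the river".split(), 5))
```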

  35. Step 4a: Count frequency of rules for each category

  36. Step 4b: Turn rule frequencies into probabilities

  37. Which rules are indicative of a category? • Two categories c1 and c2; p(c1|rule) + p(c2|rule) = 1 • Log-likelihood ratio: log( p(c1|rule) / p(c2|rule) ) • If p(c1|rule) = 0.5 and p(c2|rule) = 0.5, the rule doesn’t distinguish the categories: log( p(c1|rule) / p(c2|rule) ) = 0 • If p(c1|rule) > 0.5 and p(c2|rule) < 0.5, c1 is more likely: log( p(c1|rule) / p(c2|rule) ) > 0 • If p(c1|rule) < 0.5 and p(c2|rule) > 0.5, c2 is more likely: log( p(c1|rule) / p(c2|rule) ) < 0

  38. Which rules are best for disambiguating between categories? • Use absolute value of log-likelihood ratio: abs(log( p(sense1 | rule) / p(sense2 | rule) )) • Rank rules by abs. value of log-likelihood ratio • Rules that best distinguish between the two categories are ranked highest
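A minimal sketch of steps 4b–5, assuming per-rule counts for the two categories have already been collected (step 4a); the add-alpha smoothing constant is an illustrative stand-in for Yarowsky's more careful smoothing:

```python
import math

def build_decision_list(counts, alpha=0.1):
    """counts: {rule: (n_c1, n_c2)}, frequencies of each rule in the two categories.
    Convert counts to smoothed probabilities p(c|rule), score each rule by
    |log(p(c1|rule) / p(c2|rule))|, and sort the most discriminating rules first.
    Each entry in the returned list is (rule, predicted_category, score)."""
    ranked = []
    for rule, (n1, n2) in counts.items():
        p1 = (n1 + alpha) / (n1 + n2 + 2 * alpha)
        llr = math.log(p1 / (1.0 - p1))
        ranked.append((rule, "c1" if llr > 0 else "c2", abs(llr)))
    ranked.sort(key=lambda entry: -entry[2])
    return ranked
```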

  39. Step 5: Choose rules that are indicative of categories: sort by abs(LogL) • This is the final decision list

  40. Step 6: classify new data with decision list • For a sentence with a word to be disambiguated: • Go down the ranked list of rules in the decision list • Find the first rule with a matching context • Assign a sense according to that rule • Finished. • Ignore other lower-ranked rules, even if they have matching contexts as well
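Continuing the sketch above, classification walks down the ranked list and fires only the first rule whose context matches; here a "match" is simply set membership of the rule among the features extracted from the new context, which is an illustrative simplification:

```python
def classify(decision_list, context_rules, default="c1"):
    """Apply the highest-ranked matching rule and ignore all lower-ranked ones.
    context_rules: the set of rule features extracted from the new context."""
    for rule, predicted_category, _score in decision_list:
        if rule in context_rules:
            return predicted_category
    return default  # no rule matched; fall back to a default category
```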

  41. Example: disambiguate “plant” • Radiation from the crippled nuclear plant in Japan is showing up in rain in the United States.

  42. Limitations of supervised WSD • Practical issue in applying the algorithm to WSD: need a corpus tagged for word senses • If you have a large corpus fully annotated, WSD would be easy enough • But producing such a corpus would be extremely laborious • Senseval and other corpora only provide partial coverage • Another problem: each word is a unique disambiguation problem • Later: apply this algorithm in a semi-supervised setting (Yarowsky 1995)

  43. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  44. Bootstrapping

  45. Yarowsky 1995 • “Unsupervised word sense disambiguation rivaling supervised methods” • Actually semi-supervised bootstrapping: • Very small amount of human-annotated data • Iterative procedure for label propagation over an unlabeled data set

  46. One sense per discourse hypothesis • “Words tend to exhibit only one sense in a given discourse or document”
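A minimal sketch of how this heuristic can be applied after a tagging pass: within each document, force every occurrence of the target word to the document's majority sense. The (doc_id, sense) data layout is an assumption for illustration.

```python
from collections import Counter

def one_sense_per_discourse(tagged):
    """tagged: list of (doc_id, sense) pairs, one per occurrence of the target
    word; untagged occurrences carry sense=None. Returns the same list with
    every occurrence in a document relabeled to that document's majority sense."""
    majority = {}
    for doc_id in {d for d, _ in tagged}:
        senses = [s for d, s in tagged if d == doc_id and s is not None]
        if senses:
            majority[doc_id] = Counter(senses).most_common(1)[0][0]
    return [(d, majority.get(d, s)) for d, s in tagged]
```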

  47. Step 1: identify all examples of target word • Store contexts in initial untagged training set

  48. Step 2: tag a small number of examples • For each sense, label a small set of training examples by hand • Do this according to seed words (features) • Example, “plant”: • manufacturing vs. life
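A minimal sketch of this seeding step, assuming contexts are token lists and using the two seed words from the slide; the sense names are illustrative labels:

```python
SEEDS = {"manufacturing": "plant/industrial", "life": "plant/living"}

def seed_label(contexts):
    """Tag a context only if it contains seed words for exactly one sense;
    everything else stays untagged (None) for later bootstrapping iterations."""
    labels = []
    for tokens in contexts:
        senses = {SEEDS[t] for t in tokens if t in SEEDS}
        labels.append(senses.pop() if len(senses) == 1 else None)
    return labels

print(seed_label([["the", "manufacturing", "plant", "closed"],
                  ["plant", "and", "animal", "life"],
                  ["the", "plant", "grew"]]))
# ['plant/industrial', 'plant/living', None]
```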

  49. Sample initial state after step 2