  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 21 • 4/3/2013

  2. Recommended reading • Banko & Brill. 2001. Scaling to very very large corpora for natural language disambiguation. • Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. Proceedings of COLT. • Best 10-year paper, awarded in 2008 • Thorsten Joachims. 1999. Transductive inference for text classification using Support Vector Machines. ICML. • Best 10-year paper, awarded in 2009 • David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. ACL. • Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. EMNLP.

  3. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  4. Data quantity vs. performance • NLP: fully annotated data sets for testing machine learning algorithms • WSJ (1.3 million words) • Brown corpus (1 million words) • Prague dependency treebank (2 million words) • What happens when we train on much larger data sets?

  5. Banko & Brill 2001 • “Confusion set disambiguation” • { principle, principal } { then, than } • { to, two, too } { weather, whether } • Corpus generation • Replace every occurrence of a confusion-set word with a marker • The school hired a new principal → The school hired a new PRINCIPLE/PRINCIPAL. • Easy to generate data sets from very large corpora • 1 billion word corpus • Task • Algorithm must choose the correct word • Similar to word sense disambiguation
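A minimal sketch of this corpus-generation idea in Python. The function name, regex-based masking, and example sentence are illustrative assumptions, not Banko & Brill's actual code; the point is that the removed word itself serves as the gold label, so labeled data can be produced from raw text at any scale.

```python
import re

# One confusion set; Banko & Brill used several, e.g. {principle, principal}.
CONFUSION_SET = ("principle", "principal")
MARKER = "PRINCIPLE/PRINCIPAL"

def make_examples(sentences):
    """Replace each occurrence of a confusion-set word with a marker.
    The word that was removed becomes the gold label, so labeled
    examples can be generated from unannotated text."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(CONFUSION_SET), re.IGNORECASE)
    examples = []
    for sent in sentences:
        for match in pattern.finditer(sent):
            label = match.group(1).lower()
            masked = sent[:match.start()] + MARKER + sent[match.end():]
            examples.append((masked, label))
    return examples

print(make_examples(["The school hired a new principal."]))
# [('The school hired a new PRINCIPLE/PRINCIPAL.', 'principal')]
```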

  6. Banko & Brill 2001 • Algorithms tested • Winnow • Perceptron • Naive Bayes • Memory-based / nearest-neighbor • Features: • Words within a window • POS n-grams • Word n-grams

  7. Performance comparison

  8. Banko & Brill 2001 • Conclusions: • Quantity of data positively affects performance (more data is better) • Relative performance of different algorithms differs depending on the amount of training data (this is disturbing; it makes standardized test sets for algorithm comparison seem less meaningful)

  9. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  10. Annotated data is expensive ($ and time) • (from J. Zhu via A. Blum)

  11. Data, annotation, and performance • Not much labeled data (annotated) • Lots of unlabeled data (unannotated) • Limited level of performance from training on labeled data only • Can we use unlabeled data to improve performance?

  12. Utilizing unlabeled data • Easy to collect unlabeled data • Existing corpora containing billion(s) of words • Internet • Unlabeled data: • Missing the most important information (labels) • But there are other statistical regularities that can be exploited

  13. Amount of supervision • Supervised learning: • Given a sample of object-label pairs (xi , yi), find the predictive relationship between objects and labels • Unsupervised learning: • Discover structures in unlabeled data • Semi-supervised learning: use both labeled and unlabeled data • Supervised learning + additional unlabeled data • Unsupervised learning + additional labeled data (“bootstrapping”)

  14. Semi-supervised learning algorithms and applications • Supervised learning + additional unlabeled data • Transductive SVM • Co-training • Web page classification • Unsupervised learning + additional labeled data (“bootstrapping”) • Yarowsky algorithm; bootstrapping with seeds • Word sense disambiguation • Co-training with decision list • Named Entity Recognition

  15. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  16. Inductive vs. transductive SVM • Inductive • Find max margin hyperplane on training set • Standard SVM algorithm • Transductive • Useful when only a small amount of data is labeled • Goal is really to minimize error on test set • Take testing data into account when finding max margin hyperplane

  17. Inductive vs. transductive SVM • Transductive SVM has better performance than standard SVM • [Figure: inductive vs. transductive max-margin hyperplanes; the additional unlabeled data points are assigned to the nearest class from the labeled data]
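scikit-learn has no built-in transductive SVM, so the sketch below only approximates the intuition with a simple self-training loop around a linear SVM: unlabeled (test) points that the current hyperplane is most confident about are folded into the training pool, letting the unlabeled data shift the boundary. The function name, number of rounds, and growth size are assumptions, and this is not Joachims' TSVM optimization.

```python
import numpy as np
from sklearn.svm import LinearSVC

def self_training_svm(X_lab, y_lab, X_unlab, n_rounds=5, per_round=10):
    """Crude stand-in for transductive learning (binary case): repeatedly
    train a linear SVM on the labeled pool, then move the unlabeled points
    it is most confident about into the pool with their predicted labels."""
    X_pool, y_pool = X_lab.copy(), y_lab.copy()
    remaining = X_unlab.copy()
    clf = LinearSVC()
    for _ in range(n_rounds):
        clf.fit(X_pool, y_pool)
        if len(remaining) == 0:
            break
        scores = clf.decision_function(remaining)
        # Most confident = largest |distance| from the current hyperplane.
        idx = np.argsort(-np.abs(scores))[:per_round]
        X_pool = np.vstack([X_pool, remaining[idx]])
        y_pool = np.concatenate([y_pool, clf.predict(remaining[idx])])
        remaining = np.delete(remaining, idx, axis=0)
    return clf
```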

  18. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  19. Co-training: Blum & Mitchell 1998 • Combines 2 ideas • Semi-supervised training • Small amount of labeled data • Larger amount of unlabeled data • Use two supervised classifiers simultaneously • Outperforms a single classifier

  20. Example problem • Collected a data set • 1051 web pages from CS departments at 4 universities • Manually labeled as + or - • + is a course home page (22% of web pages) • - is not a course home page (the rest of the web pages) • Use Naïve Bayes to classify web pages

  21. Features for web page classification • [Figure: a web page instance x has link info (x1) and text info (x2), e.g. anchor text “Prof. Avrim Blum, My Advisor”] • x1: text in hyperlinks (bag of words): <a href = … >CS 100, Fall semester</a> • x2: text in the page (bag of words): <html>Exam #1</html> • Training instances contain both features: x = (x1, x2)

  22. Views • A sufficient set of features is called a view • Each view by itself is sufficient to produce an optimal classifier • For web page example, pages can be classified accurately with either text or hyperlinks • Two views are conditionally independent (given the label) • p(x1|x2, y) = p(x1|y) • p(x2|x1, y) = p(x2|y)

  23. Co-Training algorithm • Start with small portion of labeled data • Train two classifiers from the same pool of data; each classifier is based on a different “view” • Use the two classifiers to label the data set • Data points that are classified with high confidence are added to pool of labeled data • Amount of labeled data gradually increases until it covers the entire data set
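A compact sketch of the loop just described, assuming Python, scikit-learn, and two bag-of-words count matrices as the two views. The number of rounds and growth size per round are arbitrary assumptions, not Blum & Mitchell's exact settings.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labeled_idx, n_rounds=30, grow=5):
    """X1, X2: the two views (e.g. hyperlink words vs. page words) as count
    matrices; y: labels for the initially labeled examples; labeled_idx:
    row indices of those labeled examples."""
    labeled = list(labeled_idx)
    labels = {i: y[k] for k, i in enumerate(labeled_idx)}
    unlabeled = [i for i in range(X1.shape[0]) if i not in labels]
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(n_rounds):
        ytr = np.array([labels[i] for i in labeled])
        clf1.fit(X1[labeled], ytr)
        clf2.fit(X2[labeled], ytr)
        if not unlabeled:
            break
        # Each classifier labels the unlabeled pool; its most confident
        # predictions are added to the shared labeled pool for the next round.
        for clf, X in ((clf1, X1), (clf2, X2)):
            probs = clf.predict_proba(X[unlabeled])
            conf = probs.max(axis=1)
            best = np.argsort(-conf)[:grow]
            for b in best:
                labels[unlabeled[b]] = clf.classes_[probs[b].argmax()]
            for b in sorted(best, reverse=True):
                labeled.append(unlabeled.pop(b))
    return clf1, clf2
```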

  24. Co-Training algorithm

  25. Error rate in classifying web pages • Combined classifier • Supervised: combine features with Naïve Bayes: p(cj | x1, x2) ∝ p(cj | x1) p(cj | x2) • Co-training: use both the page-based and hyperlink-based classifiers

  26. Co-training: error rate vs. # of iterations • [Plot: error rate over co-training iterations for the page-based and hyperlink-based classifiers, compared to a baseline that always predicts “not a course web page”]

  27. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  28. Decision List • A simple discriminative classifier • Compute argmaxC p(C|F) • Compare: p(C1|f1), p(C2|f1), … p(C1|fn), p(C2|fn) • Choose class based on largest difference in p( Ci | fj ) for a feature fj in the data to be classified

  29. Decision List for WSD: p(sense|feature) • The decision list compares the conditional probabilities of senses given various features, to determine the probabilistically most likely sense for a word. • Example: disambiguate ‘bank’ in this sentence: • I checked my boat at the marina next to the bank of the river near the shore. • p( money-sense | ‘check’ ) • p( river-sense | ‘check’ ) • … • p( money-sense | ‘shore’ ) • p( river-sense | ‘shore’ ) ← let’s say this has the highest probability

  30. Automatically build disambiguation system • Yarowsky’s method: • Get corpus with words annotated for different categories • Formulate templates for generation of disambiguating rules • Algorithm constructs all such rules from a corpus • Algorithm selects relevant rules through statistics of usage for each category • Methodology can be applied to any binary disambiguation problem

  31. Rule templates • [Diagram: rule templates + annotated corpus → possible rules → statistics of usage → ranked rules]

  32. Decision list algorithm: step 1, identify ambiguities • Example problem: accent restoration

  33. Step 2: Collect training contexts • Begin with an annotated corpus • (In this context, a corpus with accents indicated)

  34. Step 3: Specify rule templates • Given a particular training context, collect: • Word immediately to the right (+1 W) or left (-1 W) • Word found in ±k word window • Pair of words at fixed offsets • Other evidence can be used: • Lemma (morphological root) • Part of speech category • Other types of word classes (e.g. set of days of week)
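A sketch of how these templates might be instantiated in Python for a target token at position i. The template names, window size k, and the particular word-pair offsets are illustrative assumptions.

```python
def extract_rules(tokens, i, k=3):
    """Generate candidate decision-list rules (template, value) for the
    ambiguous token at position i, following the templates above."""
    rules = []
    if i + 1 < len(tokens):
        rules.append(("+1 W", tokens[i + 1]))            # word immediately to the right
    if i - 1 >= 0:
        rules.append(("-1 W", tokens[i - 1]))            # word immediately to the left
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            rules.append(("+-k W", tokens[j]))           # word within a +/-k window
    if i - 2 >= 0:
        rules.append(("-2 W, -1 W", (tokens[i - 2], tokens[i - 1])))   # word pair at fixed offsets
    if i + 2 < len(tokens):
        rules.append(("+1 W, +2 W", (tokens[i + 1], tokens[i + 2])))
    return rules
```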

  35. Step 4a: Count frequency of rules for each category

  36. Step 4b: Turn rule frequencies into probabilities

  37. Which rules are indicative of a category? • Two categories c1 and c2; p(c1|rule) + p(c2|rule) = 1 • Log-likelihood ratio: log( p(c1|rule) / p(c2|rule) ) • If p(c1|rule) = 0.5 and p(c2|rule) = 0.5, the rule doesn’t distinguish the categories: log( p(c1|rule) / p(c2|rule) ) = 0 • If p(c1|rule) > 0.5 and p(c2|rule) < 0.5, c1 is more likely: log( p(c1|rule) / p(c2|rule) ) > 0 • If p(c1|rule) < 0.5 and p(c2|rule) > 0.5, c2 is more likely: log( p(c1|rule) / p(c2|rule) ) < 0

  38. Which rules are best for disambiguating between categories? • Use absolute value of log-likelihood ratio: abs(log( p(sense1 | rule) / p(sense2 | rule) )) • Rank rules by abs. value of log-likelihood ratio • Rules that best distinguish between the two categories are ranked highest

  39. Step 5: Choose rules that are indicative of categories: sort by abs(LogL) • This is the final decision list
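Steps 4 and 5 (counting rule frequencies, converting them to a log-likelihood ratio, and ranking) can be sketched in a few lines of Python, assuming the (rules, category) pairs come from the template extraction above. The add-alpha smoothing constant is an assumption to avoid log(0); Yarowsky's actual smoothing of the ratio differs.

```python
import math
from collections import Counter

def build_decision_list(tagged_examples, alpha=0.1):
    """tagged_examples: iterable of (rules, category) pairs, where rules is
    the list produced by extract_rules and category is "c1" or "c2".
    Returns rules sorted by |log-likelihood ratio|, highest first."""
    counts = Counter()                      # (rule, category) -> frequency
    for rules, cat in tagged_examples:
        for rule in rules:
            counts[(rule, cat)] += 1
    all_rules = {r for (r, _) in counts}
    ranked = []
    for r in all_rules:
        c1 = counts[(r, "c1")] + alpha      # smoothed counts avoid log(0)
        c2 = counts[(r, "c2")] + alpha
        llr = math.log(c1 / c2)             # = log p(c1|r)/p(c2|r), shared denominator cancels
        ranked.append((abs(llr), "c1" if llr > 0 else "c2", r))
    return sorted(ranked, key=lambda t: t[0], reverse=True)
```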

  40. Step 6: classify new data with decision list • For a sentence with a word to be disambiguated: • Go down the ranked list of rules in the decision list • Find the first rule with a matching context • Assign a sense according to that rule • Finished. • Ignore other lower-ranked rules, even if they have matching contexts as well
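And a sketch of step 6, reusing the hypothetical extract_rules helper and the ranked list from the previous sketches: walk down the list and stop at the first rule whose context matches.

```python
def classify(decision_list, tokens, i):
    """Return the category of the highest-ranked rule whose context matches
    the token at position i; lower-ranked matching rules are ignored."""
    context = set(extract_rules(tokens, i))
    for _, category, rule in decision_list:
        if rule in context:
            return category          # first match wins
    return None                      # no rule matched
```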

  41. Example: disambiguate “plant” • Radiation from the crippled nuclear plant in Japan is showing up in rain in the United States.

  42. Limitations of supervised WSD • Practical issue in applying the algorithm to WSD: you need a corpus tagged for word senses • If you had a large, fully annotated corpus, WSD would be easy enough • But producing such a corpus would be extremely laborious • Senseval and other corpora only provide partial coverage • Another problem: each word is a unique disambiguation problem • Later: apply this algorithm in a semi-supervised setting (Yarowsky 1995)

  43. Outline • Data quantity vs. performance in supervised classification • Combining supervised and unsupervised learning • Transductive SVM • Co-training • Review of decision list • Bootstrapping Word Sense Disambiguation • Decision list co-training for Named Entity Recognition • Programming Assignment #4

  44. Bootstrapping

  45. Yarowsky 1995 • “Unsupervised word sense disambiguation rivaling supervised methods” • Actually semi-supervised bootstrapping: • Very small amount of human-annotated data • Iterative procedure for label propagation over unlabeled data set

  46. One sense per discourse hypothesis • “Words tend to exhibit only one sense in a given discourse or document”

  47. Step 1: identify all examples of target word • Store contexts in initial untagged training set

  48. Step 2: tag a small number of examples • For each sense, label a small set of training examples by hand • Do this according to seed words (features) • Example, “plant”: • manufacturing vs. life
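A sketch of this seed-labeling step, using the "manufacturing" vs. "life" seeds from the slide. The window size, sense names, tie handling, and function name are illustrative assumptions; contexts that match no seed (or conflicting seeds) stay untagged for later bootstrapping iterations.

```python
SEEDS = {"manufacturing": "sense-A", "life": "sense-B"}   # seed collocations for "plant"

def seed_label(contexts, k=5):
    """contexts: list of token lists, each containing the target word 'plant'.
    Label a context if exactly one seed's word occurs within +/-k tokens of
    the target; everything else remains in the untagged residual."""
    labeled, residual = [], []
    for tokens in contexts:
        i = tokens.index("plant")
        window = tokens[max(0, i - k): i + k + 1]
        senses = {SEEDS[w] for w in window if w in SEEDS}
        if len(senses) == 1:
            labeled.append((tokens, senses.pop()))
        else:
            residual.append(tokens)
    return labeled, residual
```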

  49. Sample initial state after step 2
