
Course Summary

LING 572

Fei Xia

03/06/07

Outline
  • Problem description
  • General approach
  • ML algorithms
  • Important concepts
  • Assignments
  • What’s next?
Two types of problems
  • Classification problem
  • Sequence Labeling problem
  • In both cases:
    • A predefined set of labels: C = {c1, c2, …, cn}
    • Training data: {(xi, yi)}, where yi ∈ C; yi is known (supervised) or unknown (unsupervised)
    • Test data
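
For concreteness, the two data shapes can be illustrated as below (a toy example, not from the slides):

```python
# Toy illustration of the two problem types.
# Classification: one label per instance.
classification_data = [
    (["free", "win", "cash"], "spam"),
    (["meeting", "at", "noon"], "ham"),
]
# Sequence labeling: one label per token in the sequence.
sequence_data = [
    (["the", "dog", "barks"], ["D", "N", "V"]),
]
```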
NLP tasks
  • Classification problems:
    • Document classification
    • Spam detection
    • Sentiment analysis
  • Sequence labeling problems:
    • POS tagging
    • Word segmentation
    • Sentence segmentation
    • NE detection
    • Parsing
    • IGT detection
Step 1: Preprocessing
  • Converting the NLP task to a classification or sequence labeling problem
  • Creating the attribute-value table:
    • Define feature templates
    • Instantiate feature templates and select features
    • Decide what kind of feature values to use (e.g., binarizing features or not; see the sketch after this list)
    • Converting a multi-class problem to a binary problem (optional)
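
A minimal sketch of the template and binarization steps; the helper names are illustrative, not from the course code:

```python
# Minimal sketch: instantiate feature templates, then binarize values.

def instantiate_features(words, i):
    """Apply simple feature templates at position i of a sentence."""
    return {
        f"cur_word={words[i]}": 1,
        f"prev_word={words[i - 1] if i > 0 else '<BOS>'}": 1,
        f"suffix3={words[i][-3:]}": 1,
    }

def binarize(counts, threshold=1):
    """Turn count-valued features into binary present/absent features."""
    return {f: 1 for f, v in counts.items() if v >= threshold}

print(instantiate_features(["the", "dog", "barks"], 1))
print(binarize({"w=a": 3, "w=b": 0}))  # {'w=a': 1}
```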
Feature selection
  • Dimensionality reduction
    • Feature selection
      • Wrapping methods
      • Filtering methods:
        • Mutual info, 2, Information gain, ….
    • Feature extraction
      • Term clustering:
      • Latent semantic indexing (LSI)
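
As one concrete filtering method, here is a minimal χ2 scoring sketch for a binary feature against a binary class; the counts are made up for illustration:

```python
# Minimal sketch of chi-square feature scoring from a 2x2 contingency table.

def chi_square(a, b, c, d):
    """a = feature present & class positive, b = present & negative,
       c = absent & positive,                d = absent & negative."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Rank features by score and keep the top k.
scores = {"word=free": chi_square(90, 10, 60, 840),
          "word=the": chi_square(75, 425, 75, 425)}
print(sorted(scores, key=scores.get, reverse=True))  # 'word=free' first
```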
Multiclass → Binary
  • One-vs-all (sketch after this list)
  • All-pairs
  • Error-correcting Output Codes (ECOC)
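
As an illustration of the first reduction, a minimal one-vs-all sketch; the centroid scorer is a toy stand-in for whatever binary learner is actually used:

```python
# Minimal sketch of the one-vs-all reduction to binary classification.

class CentroidScorer:
    """Toy binary learner: score = closeness to the positive centroid."""
    def fit(self, X, y):
        pos = [x for x, lab in zip(X, y) if lab == 1]
        self.centroid = [sum(col) / len(pos) for col in zip(*pos)]
    def score(self, x):
        return -sum((a - b) ** 2 for a, b in zip(x, self.centroid))

class OneVsAll:
    def fit(self, X, y):
        self.models = {}
        for c in set(y):
            binary = [1 if lab == c else -1 for lab in y]  # relabel
            model = CentroidScorer()
            model.fit(X, binary)
            self.models[c] = model
    def predict(self, x):
        # Pick the class whose binary model is most confident.
        return max(self.models, key=lambda c: self.models[c].score(x))

ova = OneVsAll()
ova.fit([[0, 0], [0, 1], [5, 5]], ["a", "a", "b"])
print(ova.predict([4, 4]))  # -> 'b'
```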
Step 2: Training and decoding
  • Choose an ML learner
  • Train and test on the development set with different settings of non-model parameters
  • Choose the setting that works best on the development set
  • Run the learner on the test data with that setting
Step 3: Post-processing
  • Label sequence → the output we want
  • System combination
    • Voting: majority voting, weighted voting (sketch below)
    • More sophisticated models
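
The voting step itself is simple; a minimal sketch (the weights here are invented for illustration):

```python
# Minimal sketch of system combination by majority / weighted voting.
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """predictions: one label per system; equal weights = majority vote."""
    weights = weights or [1.0] * len(predictions)
    tally = defaultdict(float)
    for label, w in zip(predictions, weights):
        tally[label] += w
    return max(tally, key=tally.get)

print(weighted_vote(["N", "V", "N"]))                   # majority -> 'N'
print(weighted_vote(["N", "V", "N"], [0.2, 0.9, 0.3]))  # weighted -> 'V'
```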
Main ideas
  • kNN and Rocchio: finding the nearest neighbors / prototypes
  • DT and DL: finding the right group
  • NB, MaxEnt: calculating P(y | x)
  • Bagging: Reducing the instability
  • Boosting: Forming a committee
  • TBL: Improving the current guess
ML learners
  • Modeling
  • Training
  • Testing (a.k.a. decoding)
Modeling
  • NB: assuming features are conditionally independent given the class: P(x | c) = Πj P(fj | c)
  • MaxEnt: modeling P(y | x) = exp(Σj λj fj(x, y)) / Z(x), where the λj are feature weights and Z(x) normalizes over classes
Training
  • kNN: no training
  • Rocchio: calculate prototypes (sketch after this list)
  • DT: build a decision tree
    • Choose a feature and then split data
  • DL: build a decision list:
    • Choose a decision rule and then split data
  • TBL: build a transformation list by
    • Choose a transformation and then update the current label field
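
Rocchio's training step is especially compact; a minimal sketch:

```python
# Minimal sketch of Rocchio training: one prototype (centroid) per class.
from collections import defaultdict

def train_rocchio(X, y):
    sums, counts = {}, defaultdict(int)
    for x, c in zip(X, y):
        counts[c] += 1
        sums[c] = x if c not in sums else [a + b for a, b in zip(sums[c], x)]
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

print(train_rocchio([[1, 0], [3, 0], [0, 4]], ["x", "x", "y"]))
# {'x': [2.0, 0.0], 'y': [0.0, 4.0]}
```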
Training (cont)
  • NB: calculate P(ci) and P(fj | ci) by simple counting (sketch below).
  • MaxEnt: calculate the weights of feature functions by iteration.
  • Bagging: create bootstrap samples and learn base classifiers.
  • Boosting: learn base classifiers and their weights.
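
A minimal sketch of the NB counting step; the add-one smoothing is an assumption here, not something the slide specifies:

```python
# Minimal sketch of Naive Bayes training by counting (add-one smoothing).
from collections import Counter

def train_nb(docs, labels):
    class_counts = Counter(labels)
    feat_counts = {c: Counter() for c in class_counts}
    for feats, c in zip(docs, labels):
        feat_counts[c].update(feats)
    vocab = {f for feats in docs for f in feats}
    prior = {c: class_counts[c] / len(labels) for c in class_counts}
    cond = {c: {f: (feat_counts[c][f] + 1) /
                   (sum(feat_counts[c].values()) + len(vocab))
                for f in vocab}
            for c in class_counts}
    return prior, cond

prior, cond = train_nb([["free", "win"], ["hi", "mom"]], ["spam", "ham"])
print(prior["spam"], cond["spam"]["free"])  # 0.5, 1/3
```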
Testing
  • kNN: calculate distances between x and xi, find the closest neighbors (sketch after this list).
  • Rocchio: calculate distances between x and prototypes.
  • DT: traverse the tree
  • DL: find the first matched decision rule.
  • TBL: apply transformations one by one.
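
A minimal kNN decoding sketch; Euclidean distance is one choice among several:

```python
# Minimal sketch of kNN testing: no training, just distances at test time.
from collections import Counter

def knn_predict(x, train_X, train_y, k=3):
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: dist(x, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict([1, 1], [[0, 0], [1, 2], [9, 9]], ["a", "a", "b"]))  # 'a'
```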
Testing (cont)
  • NB: calc P(c) Πj P(fj | c) for each class and pick the highest.
  • MaxEnt: calc P(c | x) under the trained model and pick the highest.
  • Bagging: run the base classifiers and choose the class with highest votes.
  • Boosting: run the base classifiers and calc the weighted sum.
Sequence labeling problems
  • With classification algorithms:
    • Having features that refer to previous tags
    • Using beam search to find good sequences (see the sketch after this list)
  • With sequence labeling algorithms:
    • HMM
    • TBL
    • MEMM
    • CRF
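
A minimal beam search sketch; `score(word, prev_tag, tag)` is a hypothetical stand-in for a trained classifier's log-probability of a tag given the context features:

```python
# Minimal sketch of beam search for sequence labeling with a classifier.
import math

def beam_search(words, tags, score, beam_size=3):
    beams = [([], 0.0)]  # (partial tag sequence, cumulative log-prob)
    for w in words:
        candidates = []
        for seq, logp in beams:
            prev = seq[-1] if seq else "<BOS>"
            for t in tags:
                candidates.append((seq + [t], logp + score(w, prev, t)))
        # Keep only the beam_size best partial sequences.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy scorer: prefer tag N after tag D, otherwise indifferent.
toy = lambda w, prev, t: math.log(0.8 if (prev, t) == ("D", "N") else 0.2)
print(beam_search(["the", "dog"], ["D", "N"], toy, beam_size=2))  # ['D', 'N']
```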
Semi-supervised algorithms
  • Self-training
  • Co-training

 → Adding some unlabeled data to the labeled data
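
A minimal self-training sketch; the learner interface (fit, predict_with_confidence) and the confidence threshold are assumptions for illustration:

```python
# Minimal sketch of self-training: repeatedly add the most confidently
# auto-labeled examples to the labeled set and retrain.

def self_train(learner, labeled_X, labeled_y, unlabeled_X,
               threshold=0.9, rounds=5):
    X, y, pool = list(labeled_X), list(labeled_y), list(unlabeled_X)
    for _ in range(rounds):
        learner.fit(X, y)
        confident = []
        for x in pool:
            label, conf = learner.predict_with_confidence(x)
            if conf >= threshold:
                confident.append((x, label))
        if not confident:
            break
        for x, label in confident:  # move into the labeled set
            X.append(x)
            y.append(label)
            pool.remove(x)
    return learner
```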

Unsupervised algorithms
  • MLE
  • EM:
    • General algorithm: E-step, M-step (in symbols after this list)
    • EM for PM models
      • Forward-backward for HMM
      • Inside-outside for PCFG
      • IBM models for MT
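
In symbols, one EM iteration takes the standard generic form below (this is the textbook statement, not slide content):

```latex
% E-step: expected complete-data log-likelihood under current parameters.
% M-step: re-estimate the parameters by maximizing that expectation.
\begin{align*}
  Q(\theta \mid \theta^{(t)}) &= \sum_{z} P(z \mid x, \theta^{(t)})
      \log P(x, z \mid \theta) \\
  \theta^{(t+1)} &= \arg\max_{\theta} Q(\theta \mid \theta^{(t)})
\end{align*}
```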
Concepts
  • Attribute-value table
  • Feature templates vs. features
  • Weights:
    • Feature weights
    • Classifier weights
    • Instance weights
    • Feature values
Concepts (cont)
  • Maximum entropy vs. Maximum likelihood
  • Maximize likelihood vs. minimize training error
  • Training time vs. test time
  • Training error vs. test error
  • Greedy algorithm vs. iterative approach
Concepts (cont)
  • Local optima vs. global optima
  • Beam search vs. Viterbi algorithm
  • Sample vs. resample
  • Model parameters vs. non-model parameters
Assignments
  • Read code:
    • NB: binary features?
    • DT: difference between DT and C4.5
    • Boosting: AdaBoost and AdaBoostM2
    • MaxEnt: binary features?
  • Write code:
    • Info2Vectors
    • BinVectors
    • 2
  • Complete two projects
Projects
  • Steps:
    • Preprocessing
    • Training and testing
    • Postprocessing
  • Two projects:
    • Project 1: Document classification
    • Project 2: IGT detection
Project 1: Document classification
  • A typical classification problem
  • Data are prepared already
    • Feature template: words appearing in the doc
    • Feature value: word frequency
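
That setup is small enough to sketch directly; a minimal version of the word-frequency feature extraction described above:

```python
# Minimal sketch of Project 1's feature setup: one feature per word,
# valued by its frequency in the document.
from collections import Counter

def doc_to_features(text):
    return Counter(text.lower().split())

print(doc_to_features("the cat saw the dog"))
# Counter({'the': 2, 'cat': 1, 'saw': 1, 'dog': 1})
```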
Project 2: IGT detection
  • Can be framed as a sequence labeling problem
    • Preprocessing: Define label set
    • Postprocessing: Tag sequence → spans
  • Sequence labeling problem → using classification algorithms with beam search
  • To use classification algorithms:
    • Preprocessing:
      • Define features
      • Choose feature values
Project 2 (cont)
  • Preprocessing:
    • Define label set
    • Define feature templates
    • Decide on feature values
  • Training and decoding
    • Write beam search
  • Postprocessing
    • Convert label sequence → spans
Project 2 (cont)
  • Presentation
  • Final report
  • A typical conference paper:
    • Introduction
    • Previous work
    • Methodology
    • Experiments
    • Discussion
    • Conclusion
Using Mallet
  • Difficulties:
    • Java
    • A large package
  • Benefits:
    • Java
    • A large package
    • Many learning algorithms: comparing the implementation with “standard” algorithms
Bugs in Mallet?
  • In Hw9, include a new section:
    • Bugs
    • Complaints
    • Things you like about Mallet
Course summary
  • 9 weeks: 18 sessions
  • 2 kinds of problems
  • 9 supervised algorithms
  • 1 semi-supervised algorithm
  • 1 unsupervised algorithm
  • 4 related issues: feature selection, multiclass → binary, system combination, beam search
  • 2 projects
  • 1 well-known package
  • 9 assignments, including 1 presentation and 1 final report
  • N papers
What’s next?
  • Learn more about the algorithms covered in class.
  • Learn new algorithms:
    • SVM, CRF, regression algorithms, graphical models, …
  • Try new tasks:
    • Parsing, spam filtering, reference resolution, …
Misc
  • Hw7: due tomorrow 11pm
  • Hw8: due Thursday 11pm
  • Hw9: due 3/13 11pm
  • Presentation: no more than 15 minutes, plus 5 for questions
What must be included in the presentation?
  • Label set
  • Feature templates
  • Effect of beam search
  • 3+ ways to improve the system and results on dev data (test_data/)
  • Best system: results on dev data and the setting
  • Results on test data (more_test_data/)
Grades, etc.
  • 9 assignments + class participation
  • Hw1-Hw6:
    • Total: 740
    • Max: 696.56
    • Min: 346.52
    • Ave: 548.74
    • Median: 559.08