
Learning for Structured Prediction: Overview of the Material


Presentation Transcript


  1. Learning for Structured Prediction: Overview of the Material

  2. Outline
  • Types of structures considered
  • Generative vs discriminative
  • Global discriminative vs local discriminative
  • Decoding:
    • at testing vs at learning
    • methods for decoding
  • Predefined features vs latent features
  • I will use red italics to illustrate methods and will oversimplify some points

  3. Types of Structures
  • Sequences:
    • chain CRFs, HMMs, (chain-type) M3Ns, ...
  • Trees:
    • Constituency trees: weighted CFGs (including LA-PCFGs), left-corner / shift-reduce parsers (the MaxEnt parser, the ISBN parser, ...)
    • Dependency structures: the MST parser, Nivre’s shift-reduce parser, ...
  • Rankings:
    • PRank (today)
  • Not considered: DAGs (e.g., some semantic representations), bipartite graphs (machine translation), or more general graphs ...

  4. Generative vs Discriminative
  • Discriminative: CRFs, MEMMs, the structured perceptron, Max-Margin Markov Networks (M3Ns), ...
    • Learn a mapping from x to y such that the expected error is minimal
    • Pros:
      • model what you actually care about
      • complex features of x are easy to integrate
      • different error functions can be considered
      • fewer assumptions (and therefore better asymptotic performance)
  • Generative: HMMs, PCFGs (including LA-PCFGs), ...
    • Score how likely the combination of input and output is, i.e., model the joint p(x, y)
    • Pros:
      • easier to learn (if everything is observable, the ML parameters are just normalized counts; see the sketch below)
      • “cleaner” semi-supervised learning: select the parameters that maximize the marginal likelihood p(x) = Σ_y p(x, y) of unlabeled data
      • often better with small datasets
      • some applications build on a generative component p(x | y) (speech recognition, statistical machine translation, ...)
      • arguably preferable with latent variables
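To make the “ML parameters are normalized counts” point concrete, here is a minimal sketch of fully supervised HMM estimation. The toy corpus and tag set are made up for illustration and are not from the slides.

```python
from collections import Counter, defaultdict

# Toy fully observed training data: (word, tag) sequences (invented example).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

transition_counts = defaultdict(Counter)  # prev_tag -> Counter of next tags
emission_counts = defaultdict(Counter)    # tag -> Counter of emitted words

for sentence in corpus:
    prev = "<s>"  # start-of-sentence state
    for word, tag in sentence:
        transition_counts[prev][tag] += 1
        emission_counts[tag][word] += 1
        prev = tag

# Maximum-likelihood parameters of the generative HMM are just normalized counts.
transition_probs = {
    prev: {tag: c / sum(cnt.values()) for tag, c in cnt.items()}
    for prev, cnt in transition_counts.items()
}
emission_probs = {
    tag: {word: c / sum(cnt.values()) for word, c in cnt.items()}
    for tag, cnt in emission_counts.items()
}

print(transition_probs["<s>"])  # {'DET': 1.0}
print(emission_probs["NOUN"])   # {'dog': 0.5, 'cat': 0.5}
```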

  5. Global Discr. vs Local Discr.
  • Local (distributions over small decisions): MEMMs, the SVM decision classifiers in Nivre’s shift-reduce parser
    • Pros:
      • no real decoding at training time (cheap learning)
      • complex features of y can be integrated easily (this concerns training only! you still need to decode at testing)
    • Cons:
      • mismatch between test and train modes: the model relies on true features in training and on predicted ones at testing
      • label bias (the model cannot down-weight an unlikely transition if the number of outgoing transitions is not sufficiently large)
  • Global (distributions over entire sequences): the structured perceptron, CRFs, M3Ns (model: the MST parser)
    • Pros:
      • theoretically much cleaner, and in practice it works better
    • Cons:
      • decoding at training time (plus the partition function for CRFs); but approximate learning methods exist
      • learning can be very problematic if complex features of y are used
  • Both models require decoding at testing. Decoding does not really depend on the training criterion but on the features of y (the two normalizations are contrasted in the sketch below).
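The local/global distinction can be seen by scoring the same toy label sequence under both normalizations. A hedged sketch with random stand-in scores (the label set, sequence length, and fixed start label are assumptions): the MEMM-style model normalizes at every step, the CRF-style model normalizes once over all complete label sequences, so the two generally assign different probabilities to the same y.

```python
import numpy as np
from itertools import product

labels = ["A", "B"]
T = 3  # toy sequence length
rng = np.random.default_rng(0)
# Arbitrary scores s[t, prev, cur], stand-ins for w . f(x, y_{t-1}, y_t, t).
scores = rng.normal(size=(T, len(labels), len(labels)))

def local_prob(y, start=0):
    """MEMM-style: product of per-step softmaxes p(y_t | y_{t-1}, x)."""
    p, prev = 1.0, start
    for t, cur in enumerate(y):
        step = np.exp(scores[t, prev])
        p *= step[cur] / step.sum()
        prev = cur
    return p

def global_prob(y, start=0):
    """CRF-style: exp(total score), normalized once over all label sequences."""
    def total(seq):
        s, prev = 0.0, start
        for t, cur in enumerate(seq):
            s += scores[t, prev, cur]
            prev = cur
        return s
    Z = sum(np.exp(total(seq)) for seq in product(range(len(labels)), repeat=T))
    return np.exp(total(y)) / Z

y = (0, 1, 1)  # label indices into `labels`
print(local_prob(y), global_prob(y))  # generally not equal
```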

  6. Specific learning criteria
  • CRFs
    • maximize the conditional log-likelihood of the training data, Σ_i log p(y_i | x_i)
  • Perceptron
    • ensure separability on the training set (with a large margin in some variations, e.g., ALMA): rank the correct structure above the incorrect ones (see the sketch below)
  • Max-Margin Markov Networks (M3Ns)
    • separate the training set with the maximal margin (sensitive to the error)
    • for every labeled example (x_i, y_i), require w · f(x_i, y_i) ≥ w · f(x_i, y) + L(y_i, y) (up to a slack variable), where y is any structure and L is some loss function (e.g., the Hamming distance for sequences, measuring how many labels do not match)
    • “wrong sequences with small errors should be penalized less than those with more errors”
  • SVM-Struct, Boosting, ...
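The perceptron criterion on this slide reduces to a very small update loop. A minimal sketch, assuming generic `feats` and `decode` callables; both are hypothetical placeholders here (`decode` would be, e.g., Viterbi for sequences).

```python
from collections import defaultdict

def structured_perceptron(train, feats, decode, epochs=5):
    """Train weights so the gold structure outscores the decoder's best guess.

    train  : list of (x, y_gold) pairs
    feats  : (x, y) -> dict of feature counts
    decode : (x, w) -> argmax_y  w . feats(x, y)   (e.g., Viterbi)
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in train:
            y_hat = decode(x, w)
            if y_hat != y_gold:
                # Standard perceptron update: promote gold features, demote predicted ones.
                for f, v in feats(x, y_gold).items():
                    w[f] += v
                for f, v in feats(x, y_hat).items():
                    w[f] -= v
    return w
```

Swapping `decode` for loss-augmented decoding (argmax of the score plus L(y_gold, y)) moves the same loop roughly toward the “margin scaled by the loss” idea behind M3Ns.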

  7. Decoding at training vs testing: examples
  • Different combinations are possible ...

  8. Inference (argmax)
  • Simple dependencies in y:
    • Viterbi to find the most likely sequence (or Chu-Liu-Edmonds for the MST parser); see the sketch below
    • or marginal decoding to find the most likely label for every “position”
  • Complex dependencies:
    • beam or greedy search (or some smarter search methods)
    • reformulate the inference problem as an integer linear program and use methods known from ILP
  • (We do not care here when the inference is used: at training, at testing, or at both)
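For the simple-dependency case, Viterbi is a short dynamic program over log-scores. A minimal sketch; the emission and transition matrices below are made-up toy numbers (in a trained HMM or CRF they would come from the learned parameters).

```python
import numpy as np

def viterbi(emit, trans):
    """emit: (T, K) per-position label log-scores; trans: (K, K) transition log-scores.
    Returns the highest-scoring label sequence (argmax over all K^T sequences)."""
    T, K = emit.shape
    delta = np.zeros((T, K))            # best score of a prefix ending in label k at t
    back = np.zeros((T, K), dtype=int)  # back-pointers to the best previous label
    delta[0] = emit[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + trans + emit[t][None, :]  # (K, K) candidates
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0)
    # Follow back-pointers from the best final label.
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

emit = np.log([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]])  # toy numbers
trans = np.log([[0.8, 0.2], [0.3, 0.7]])
print(viterbi(emit, trans))
```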

  9. Latent Variables vs Explicit Features
  • Explicit features (most of the models we considered: CRFs, MEMMs, etc.):
    • Pros:
      • mostly convex optimization (no local minima)
      • cheaper to learn
    • Cons:
      • the model is only as good as its features: extensive feature engineering is needed
      • non-local dependencies in y are often necessary
  • Latent variable models (LA-PCFGs, ISBNs):
    • Pros:
      • learn how to propagate relevant information (learn complex features from simple ones)
      • can learn a model with simple decompositions over the extended y, which gives efficient decoding
      • the latent representation (e.g., extended parsing states or an extended grammar) can potentially be useful in other tasks (multi-task learning)
    • Cons:
      • non-convex optimization: need to avoid local minima (tricky)
      • more expensive to train

  10. Last bits
  • Term paper: due Mar 31, but send me ideas, outlines, and drafts well before the deadline (soon!)
  • Feedback on the content would be very much appreciated (as I am preparing a lecture class with a similar set of topics)
  • Thanks for participating!!!
