  1. Learning structured outputs P. Gallinari Patrick.gallinari@lip6.fr www-connex.lip6.fr University Pierre et Marie Curie – Paris – Fr NATO ASI Mining Massive Data Sets for security MMDSS - P. Gallinari

  2. Outline • Motivation and examples • Approaches for structured learning • Generative models • Discriminant models • Search models MMDSS - P. Gallinari

  3. Machine learning and structured data • Different types of problems • Model, classify, cluster structured data • Predict structured outputs • Learn to associate structured representations • Structured data and applications in many domains • chemistry, biology, natural language, web, social networks, databases, etc. MMDSS - P. Gallinari

  4. Sequence labeling: POS tagging (example sentence annotated with tags such as coordinating conjunction, noun, verb 3rd person, adverb, plural noun, determiner, verb gerund, plural verb, adjective) MMDSS - P. Gallinari

  5. Penn Treebank tag set MMDSS - P. Gallinari

  6. Segmentation + labeling: syntactic chunking (Washington Univ. tagger) (example sentence segmented into chunks labeled adverbial phrase, noun phrase, verb phrase, etc.) MMDSS - P. Gallinari

  7. Segmentation + labeling: Named Entity recognition • Entities • locations, persons, organizations • Time expressions: dates, times • Numeric expression: $ amount, percentages • NEW YORK (Reuters) -Goldman Sachs Group Inc. agreed on Thursday to pay $9.3 million to settle charges related to a former economist …. Goldman's GS.N settlement with securities regulators stemmed from charges that it failed to properly oversee John Youngdahl, a one-time economist …. James Comey, U.S. Attorney for the Southern District of New York, announced on Thursday a seven-count indictment of Youngdahl for insider trading, making false statements, perjury, and other charges. Goldman agreed to pay a $5 million fine and disgorge $4.3 million from illegal trading profits. MMDSS - P. Gallinari

  8. Information extraction MMDSS - P. Gallinari

  9. Syntactic parsing (Stanford Parser) MMDSS - P. Gallinari

  10. Document mapping problem • Problem: query heterogeneous XML databases or collections • Need to know the correspondence between the structured representations, usually made by hand • Learn the correspondence between the different sources: a labeled tree mapping problem MMDSS - P. Gallinari

  11. Others • Taxonomies • Social networks • Adversarial computing: Web spam, blog spam, … • Translation • Biology • … MMDSS - P. Gallinari

  12. Is structure really useful? Can we make use of structure? • Yes • Evidence from many domains or applications • Mandatory for many problems • e.g. a 10 K-class classification problem • Yes, but • Complex or long-term dependencies often correspond to rare events • Practical evidence for large-size problems • Simple models sometimes offer competitive results • Information retrieval • Speech recognition, etc. MMDSS - P. Gallinari

  13. Structured learning • X, Y: input and output spaces • Structured output • y ∈ Y decomposes into parts of variable size • y = (y1, y2,…, yT) • Dependencies • Relations between the parts of y • Local, long-term, global • Cost function (formulas reconstructed below) • 0/1 loss • Hamming loss • F-score • BLEU, etc. MMDSS - P. Gallinari
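
The loss formulas themselves did not survive the transcript; the standard definitions over the parts y = (y1, …, yT), given here as a hedged reconstruction, are:

```latex
% 0/1 loss: any error on the full structure counts as one mistake
\Delta_{0/1}(y,\hat y) = \mathbb{1}\,[\hat y \neq y]

% Hamming loss: number of wrongly predicted parts
\Delta_{H}(y,\hat y) = \sum_{t=1}^{T} \mathbb{1}\,[\hat y_t \neq y_t]

% F-score based loss (e.g. for segmentation), with P = precision, R = recall
\Delta_{F_1}(y,\hat y) = 1 - \frac{2PR}{P+R}
```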

  14. General approach • Predictive approach: predict the output maximizing a score function F : X × Y → R used to rank potential outputs • F trained to optimize some loss function • Inference problem • |Y| is sometimes exponential • The argmax is often intractable; hypotheses: • decomposability of the score function over the parts of y • restricted set of outputs MMDSS - P. Gallinari
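
A hedged reconstruction of the formulas missing from this slide, consistent with the surrounding text:

```latex
% prediction: return the highest-scoring output
f(x) = \operatorname*{arg\,max}_{y \in Y} F(x, y), \qquad F : X \times Y \to \mathbb{R}

% usual decomposability hypothesis making the argmax tractable:
% the score is a sum of local scores over the parts of y
F(x, y) = \sum_{p \in \mathrm{parts}(y)} F_p(x, y_p)
```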

  15. Structured algorithms differ by: • Feature encoding • Hypothesis on the output structure • Hypothesis on the cost function MMDSS - P. Gallinari

  16. Generative models Hidden Markov Models Probabilistic Context Free grammars Tree labeling model

  17. Usual hypothesis • Features: “natural” encoding of the input • Hypothesis on the output structure: local output dependencies, Markov property • The score decomposes, e.g. as a sum of local costs over the subparts • Inference: usually dynamic programming MMDSS - P. Gallinari

  18. HMMs • Sequence labeling – segmentation • Dependencies • Outputs: Markov assumption • Observations: conditional independence given the states • Decoding and learning • Dynamic programming • Viterbi for the argmax • Forward-Backward for learning • Decoding complexity O(n|Q|²) for a sequence of length n and |Q| states MMDSS - P. Gallinari
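
To make the decoding step concrete, here is a minimal Viterbi sketch in Python; the two-state tagger and its probabilities are assumptions for illustration, not taken from the slides. The two nested loops over states make the O(n|Q|²) complexity visible.

```python
import numpy as np

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most probable state sequence for an observation sequence.

    log_start[q]: log P(q at t=0); log_trans[q, q']: log P(q' | q);
    log_emit[q][o]: log P(o | q).  Complexity: O(n * |Q|^2).
    """
    n, Q = len(obs), len(states)
    delta = np.full((n, Q), -np.inf)    # best log-prob of a path ending in state q at time t
    back = np.zeros((n, Q), dtype=int)  # backpointers
    for q in range(Q):
        delta[0, q] = log_start[q] + log_emit[q].get(obs[0], -np.inf)
    for t in range(1, n):
        for q in range(Q):
            scores = delta[t - 1] + log_trans[:, q]
            back[t, q] = int(np.argmax(scores))
            delta[t, q] = scores[back[t, q]] + log_emit[q].get(obs[t], -np.inf)
    # backtrack from the best final state
    path = [int(np.argmax(delta[n - 1]))]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[q] for q in reversed(path)]

# toy example (assumed numbers): a two-state POS-like tagger
states = ["N", "V"]
log_start = np.log([0.6, 0.4])
log_trans = np.log([[0.3, 0.7],    # N -> N, N -> V
                    [0.8, 0.2]])   # V -> N, V -> V
log_emit = [{"flies": np.log(0.4), "time": np.log(0.6)},
            {"flies": np.log(0.7), "time": np.log(0.3)}]
print(viterbi(["time", "flies"], states, log_start, log_trans, log_emit))
```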

  19. State space (trellis, with a Start node) for an input sequence of size 3 • Consider a simple HMM MMDSS - P. Gallinari

  20. Probabilistic Context-Free Grammar (after Manning & Schütze) • Set of terminals {w1,…,wV} • Set of non-terminals {N1,…,Nn} • N1: start symbol • Set of rules {Ni → ζi} with ζi a sequence of terminals and non-terminals • To each rule is associated a probability P(Ni → ζi) • Special case: Chomsky Normal Form grammars • ζi = wj • ζi = Nk Nm MMDSS - P. Gallinari

  21. Example parse tree for “astronomers saw stars with ears”: S → NP VP, NP → astronomers, VP → V NP, V → saw, NP → NP PP, NP → stars, PP → P NP, P → with, NP → ears MMDSS - P. Gallinari

  22. Notations • Sentence w1…wn • Wp,q = wp wp+1 … wq • Nj dominates the sequence Wp,q if Nj may rewrite wp wp+1 … wq • Assumptions • Context-freeness: the probability of a subtree does not depend on words outside the subtree • Independence from ancestors: the probability does not depend on nodes in the derivation outside the subtree MMDSS - P. Gallinari

  23. Inside and outside probabilities • As for the forward-backward variables in HMMs, two probabilities may be defined • Inside: probability of generating wk…wl starting from Nj • Outside: probability of generating Nj and all the words outside wk…wl MMDSS - P. Gallinari
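
The defining formulas are missing from the transcript; the standard definitions, following Manning & Schütze's notation, are:

```latex
% inside probability: N^j derives exactly the span w_k ... w_l
\beta_j(k,l) = P\big(w_{k\,l} \mid N^j_{k\,l}\big)

% outside probability: the words outside the span, together with a node N^j
% covering w_k ... w_l, starting from the root N^1
\alpha_j(k,l) = P\big(w_{1\,(k-1)},\; N^j_{k\,l},\; w_{(l+1)\,n}\big)
```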

  24. Probability of a sentence: CKY algorithm • Probability of the sentence w1,n • Induction over the spans of the sequence, left to right • For k = 1 .. n • For l = k+1 .. n, calculate the inside probability of each non-terminal over wk…wl (see the sketch below) MMDSS - P. Gallinari
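
A short sketch of this inside (CKY-style) computation for a grammar in Chomsky Normal Form. The toy grammar and its probabilities are assumptions for illustration (the classic “astronomers saw stars with ears” example), not the exact grammar of the slides.

```python
from collections import defaultdict

def sentence_probability(words, lexical, binary, start="S"):
    """Inside (CKY-style) computation of P(w_1..n) for a CNF PCFG.

    lexical[(A, w)] = P(A -> w); binary[(A, B, C)] = P(A -> B C).
    inside[i][j][A] = probability that A derives words[i:j+1].
    Complexity: O(m^3 n^3), as stated on the slide that follows.
    """
    n = len(words)
    inside = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                      # spans of length 1
        for (A, word), p in lexical.items():
            if word == w:
                inside[i][i][A] += p
    for length in range(2, n + 1):                     # longer spans
        for i in range(n - length + 1):
            j = i + length - 1
            for m in range(i, j):                      # split point
                for (A, B, C), p in binary.items():
                    inside[i][j][A] += p * inside[i][m][B] * inside[m + 1][j][C]
    return inside[0][n - 1][start]

# toy grammar (probabilities assumed for illustration)
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
          ("NP", "NP", "PP"): 0.4, ("PP", "P", "NP"): 1.0}
lexical = {("NP", "astronomers"): 0.2, ("NP", "stars"): 0.2, ("NP", "ears"): 0.2,
           ("V", "saw"): 1.0, ("P", "with"): 1.0}
print(sentence_probability("astronomers saw stars with ears".split(), lexical, binary))
```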

  25. Inference and learning • Inference • Similar to the probability of a sentence, with a max instead of the sum • Complexity: O(m³n³) • n = length of the sentence, m = number of non-terminals in the grammar • Learning • Inside-Outside algorithm • Each step is O(m³n³) MMDSS - P. Gallinari

  26. Tree generative models Classification / clustering of structured documents (Denoyer et al. 2004) Document annotation / conversion (Wisniewski et al. 2006)

  27. Context: XML semi-structured documents • Example document tree: an <article> element with a header <hdr> and a body <bdy>; the body contains figures (<fig>, <fgc>) and sections (<sec>, <st>, <p>) whose leaves carry the text MMDSS - P. Gallinari

  28. Document model • Structural probability and content probability • Scalability! MMDSS - P. Gallinari

  29. Document Model: Structure • Belief Networks MMDSS - P. Gallinari

  30. Document Model: Content • Model for each node • 1st order dependency • Use of a local generative model for each label MMDSS - P. Gallinari

  31. Final network MMDSS - P. Gallinari

  32. Different learning techniques • Likelihood maximization • Discriminant learning: logistic function, error minimization • Fisher kernel MMDSS - P. Gallinari

  33. Document mapping problem • Problem • Learn from examples how to map heterogeneous sources onto a predefined target schema • Preserve the document semantics • Sources: semi-structured, HTML, PDF, flat text, etc. • A labeled tree mapping problem with different instances: flat text to XML, HTML to XML, XML to XML, … MMDSS - P. Gallinari

  34. Document mapping problem • Central issue: complexity • Large collections • Large feature space: 10³ to 10⁶ • Large search space (exponential) • Approach • Learn generative models of XML target documents from a training set • Decoding of unknown sources according to the learned model MMDSS - P. Gallinari

  35. Problem formulation • Given sT a target format and an input document d with source format sin(d), find the most probable target document • Learned transformation model • Decoding MMDSS - P. Gallinari
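
The decoding objective on the slide is lost; a hedged reconstruction consistent with the slide text is:

```latex
% most probable target document under the learned transformation model,
% searching over documents d' that follow the target format s_T
d^{*} = \operatorname*{arg\,max}_{d' \in s_T} P\big(d' \mid d\big)
```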

  36. General restructuring model MMDSS - P. Gallinari

  37. Example : HTML to XML (Tree annotation) • Hypothesis • Input document • HTML tags mostly for visualization • Remove tags • Keep only the segmentation (leaves) • Annotation • Leaves are the same in the HTML and XML document • Target document model: node label depends only on its local context • Context = content, left sibling, father MMDSS - P. Gallinari

  38. Model and training • Probability of the target tree (reconstructed below) • Solve the argmax by decoding • Exact dynamic programming decoding: O(|leaf nodes|³ · |tags|) • Approximate solution with LaSO (Hal Daumé III, ICML 2005): O(|leaf nodes| · |tags| · |tree nodes|) MMDSS - P. Gallinari
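
The tree probability is missing from the transcript; under the hypothesis of slide 37 (a node label depends only on its content, left sibling and father), it would factorize roughly as follows (a hedged reconstruction, not necessarily the exact model of the slides):

```latex
P(y \mid d) \;\approx\; \prod_{n \in \mathrm{nodes}(y)}
  P\big(\ell_n \mid \mathrm{content}(n),\ \ell_{\mathrm{leftsibling}(n)},\ \ell_{\mathrm{father}(n)}\big)
```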

  39. Experiments: HTML to XML • IEEE collection / INEX corpus • 12 K documents • Average: 500 leaf nodes, 200 internal nodes, 139 tags • Movie DB • 10 K movie descriptions (IMDb) • Average: 100 leaf nodes, 35 internal nodes, 28 tags • Shakespeare: 39 plays • Few documents, but: • Average: 4100 leaf nodes, 850 internal nodes, 21 tags • Mini-Shakespeare • 60 randomly chosen scenes from the plays • 85 leaf nodes, 20 internal nodes, 7 tags • For all collections: ½ train, ½ test MMDSS - P. Gallinari

  40. Performance MMDSS - P. Gallinari

  41. MMDSS - P. Gallinari

  42. Summary • 30 years of generative models • Hierarchical HMMs, factorial HMMs, etc. • Local dependency hypothesis • On the outputs • On the inputs • Inference and learning often use dynamic programming • Prohibitive for some/many problems • Other methods: loopy belief propagation, search (e.g. A*), … • Cost function: joint likelihood – decomposes MMDSS - P. Gallinari

  43. Discriminant models Structured Perceptron (Collins 2002) Large margin methods (Tsochantaridis et al. 2004, Taskar 2004)

  44. Usual hypothesis • Joint representation of input and output Φ(x, y) • Encodes potential dependencies among and between input and output • e.g. histogram of state transitions observed in the training set, frequency of (xi, yj), POS tags, etc. • Large feature sets (10² to 10⁴) • Linear score function (see below) • Decomposability of the feature set (over the outputs) and of the loss function MMDSS - P. Gallinari
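
A typical instantiation of the linear score and joint feature map for sequence labeling, given as an assumed example rather than the exact features of the slides:

```latex
% linear score with a joint input-output feature map
F(x, y; w) = \langle w, \Phi(x, y) \rangle

% decomposable features, e.g. emission-like and transition-like counts
\Phi(x, y) = \sum_{t=1}^{T} \phi(x_t, y_t, y_{t-1})
```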

  45. Structured Perceptron (Collins 2002) • Discriminant model based on a Perceptron variant for sequence labeling • Initially proposed for POS tagging and chunking • Possible extension to other structured output tasks • Inference: Viterbi • Encodes input and output (local) dependencies • Simplicity MMDSS - P. Gallinari

  46. Algorithm • Training algorithm • Initialize w = 0 • Repeat N times over all training examples (x, y) • Compute ŷ = argmax over y′ of w·Φ(x, y′) • If ŷ ≠ y, update the parameters: w ← w + Φ(x, y) − Φ(x, ŷ) MMDSS - P. Gallinari
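
A compact sketch of this training loop (Collins 2002). For brevity the argmax is brute force over all taggings instead of Viterbi, and the feature map and toy data are assumptions for illustration.

```python
import itertools
from collections import defaultdict

def features(x, y):
    """Joint feature map: emission (word, tag) and transition (tag, tag) counts."""
    phi = defaultdict(float)
    prev = "<s>"
    for word, tag in zip(x, y):
        phi[("emit", word, tag)] += 1.0
        phi[("trans", prev, tag)] += 1.0
        prev = tag
    return phi

def predict(x, w, tags):
    """argmax_y w . phi(x, y); brute force over all taggings (Viterbi in practice)."""
    def score(y):
        return sum(w[f] * v for f, v in features(x, y).items())
    return max(itertools.product(tags, repeat=len(x)), key=score)

def train(data, tags, epochs=5):
    """Structured perceptron: on a mistake, update w by phi(x, y) - phi(x, y_hat)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            y_hat = predict(x, w, tags)
            if y_hat != y:
                for f, v in features(x, y).items():
                    w[f] += v
                for f, v in features(x, y_hat).items():
                    w[f] -= v
    return w

# toy POS-like data (assumed for illustration)
data = [(("time", "flies"), ("N", "V")), (("flies", "bite"), ("N", "V"))]
w = train(data, tags=("N", "V"))
print(predict(("time", "flies"), w, tags=("N", "V")))
```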

  47. Inference: dynamic programming • Restricted to the 0/1 cost • Also • Convergence and generalization bounds (Freund & Schapire, 1999) • The number of mistakes depends only on the margin, not on the size of the output space (number of potential candidates) MMDSS - P. Gallinari

  48. Extension of large margin methods • Two problems • Generalize the max-margin principle to loss functions other than the 0/1 loss • Number of constraints proportional to |Y|, i.e. potentially exponential MMDSS - P. Gallinari

  49. SVM ISO (Tsochantaridis et al. 2004) • Extension of multi-class SVMs • Principle: the correct output must score higher than every other output by a margin (formulation below) MMDSS - P. Gallinari

  50. SVM formulation, non-linearly separable case, 0/1 cost (one slack variable per non-linear constraint, as in Crammer & Singer, 2001): MMDSS - P. Gallinari
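
The optimization problem did not survive the transcript; the standard soft-margin formulation with one shared slack variable per example reads roughly as follows (a hedged reconstruction):

```latex
\min_{w,\ \xi \ge 0} \ \frac{1}{2}\lVert w\rVert^{2} + \frac{C}{n}\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad
\forall i,\ \forall y \in Y\setminus\{y_i\}:\
\langle w, \Phi(x_i, y_i)\rangle - \langle w, \Phi(x_i, y)\rangle \ \ge\ 1 - \xi_i
```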
