
Magic Moments: Moment-based Approaches to Structured Output Prediction


Presentation Transcript


  1. The Analysis of Patterns. Magic Moments: Moment-based Approaches to Structured Output Prediction. Elisa Ricci, joint work with Nobuhisa Ueda, Tijl De Bie, Nello Cristianini. Thursday, October 25th.

  2. Outline
  • Learning in structured output spaces
  • New algorithms based on Z-score
  • Experimental results and computational issues
  • Conclusions

  3. Structured data everywhere!!!
  • Many problems involve highly structured data which can be represented by sequences, trees and graphs.
  • Temporal, spatial and structural dependencies between objects are modeled.
  • This phenomenon is observed in several fields such as computational biology, computer vision, natural language processing and web data analysis.

  4. Learning with structured data
  • Machine learning and data mining algorithms must be able to analyze vast amounts of complex, structured data efficiently and automatically.
  • The goal of structured learning algorithms is to predict complex structures, such as sequences, trees, or graphs.
  • Using traditional algorithms on problems involving structured data often implies a loss of information about the structure.

  5. Supervised learning
  • Data are available in the form of examples and their associated correct answers.
  • Training set: pairs of inputs and their correct outputs. Hypothesis space: a family of candidate functions.
  • Learning: find the function in the hypothesis space that fits the training pairs.
  • Prediction: apply the learned function on a new test sample x.

  6. Classification
  • A typical supervised learning task is classification (here, multiclass classification).
  • Named entity recognition (NER): locate named entities in text. Entities of interest are person names, location names, organization names and miscellaneous (dates, times, ...).
  • Observed variable x: a word in a sentence. Label y: its entity tag.
  • Example: PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA Merida.
    Tags: O N N N M m m N N N O L

  7. Sequence labeling
  • Can we consider the interactions between adjacent words?
  • Goal: realize a joint labeling for all the words in the sentence.
  • Sequence labeling: given an input sequence x, reconstruct the associated label sequence y of equal length.
  • Observed sequence x = (x1...xn): the words in a sentence. Label sequence y = (y1...yn): the entity tags.

  8. Sequence alignment
  • Biological sequence alignment is used to determine the similarity between biological sequences, e.g.
    ACTGATTACGTGAACTGGATCCA
    ACTC--TAGGTGAAGTG-ATCCA
  • Given two sequences S1, S2 over the alphabet Σ = {A, T, G, C}, a global alignment is an assignment of gaps so as to line up each letter in one sequence with either a gap or a letter in the other sequence.
  • Example: S1 = ATGCTTTC and S2 = CTGTCGCC aligned as
    ATGCTTTC---
    ---CTGTCGCC

  9. Sequence alignment
  • Sequence alignment: given a pair of sequences x (e.g. S1 = ATGCTTTC, S2 = CTGTCGCC), predict the correct sequence y of alignment operations (e.g. matches, mismatches, gaps), such as
    ATGCTTTC---
    ---CTGTCGCC
  • Alignments can be represented as paths from the upper-left to the lower-right corner in the alignment graph (the DP table indexed by the letters of the two sequences).
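As an illustration of the path view, here is a minimal Needleman-Wunsch sketch in Python (not the talk's learned model): it fills the alignment DP table with hypothetical match, mismatch and gap scores and traces back one optimal path from the lower-right corner.

```python
# Minimal Needleman-Wunsch global alignment (illustrative sketch).
# The scores below are hypothetical; in the talk they would be learned.
MATCH, MISMATCH, GAP = 1.0, -1.0, -1.0

def needleman_wunsch(s1, s2):
    n, m = len(s1), len(s2)
    # score[i][j] = best score of aligning s1[:i] with s2[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (MATCH if s1[i-1] == s2[j-1] else MISMATCH)
            score[i][j] = max(diag, score[i-1][j] + GAP, score[i][j-1] + GAP)
    # Traceback: recover one optimal path from lower-right to upper-left corner.
    a1, a2, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (MATCH if s1[i-1] == s2[j-1] else MISMATCH):
            a1.append(s1[i-1]); a2.append(s2[j-1]); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i-1][j] + GAP:
            a1.append(s1[i-1]); a2.append('-'); i -= 1
        else:
            a1.append('-'); a2.append(s2[j-1]); j -= 1
    return ''.join(reversed(a1)), ''.join(reversed(a2))

print(*needleman_wunsch("ATGCTTTC", "CTGTCGCC"), sep="\n")
```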

  10. RNA secondary structure prediction
  • RNA secondary structure prediction: given an RNA sequence, predict the most likely secondary structure.
  • The study of RNA structure is important for understanding its functions.
  • Example sequence: AUGAGUAUAAGUUAAUGGUUAAAGUAAAUGUCUUCCACACAUUCCAUCUGAUUUCGAUUCUCACUACUCAU

  11. Sequence parsing
  • Sequence parsing: given an input sequence x (e.g. x = GAUCGAUCGAUC), determine the associated parse tree y under a given context-free grammar.
  • Example: context-free grammar G = {V, Σ, R, S} with non-terminal symbols V = {S}, terminal symbols Σ = {G, A, U, C}, start symbol S and rules R = {S → SS | GSC | CSG | ASU | USA | e}.

  12. Generative models
  • Sequence labeling: label sequence y = (y1...yn), observed sequence x = (x1...xn).
  • Traditionally, HMMs have been used for sequence labeling.
  • Two main drawbacks:
    • The conditional independence assumptions are often too restrictive: HMMs cannot represent multiple interacting features or long-range dependencies between the observations.
    • They are typically trained by maximum likelihood (ML) estimation.

  13. Discriminative models
  • Specify the probability of a possible output y given an observation x: consider the conditional probability P(y|x) rather than the joint probability P(y,x).
  • Do not require the strict independence assumptions of generative models.
  • Arbitrary features of the observations are considered.
  • Conditional Random Fields (CRFs) [Lafferty et al., 01].

  14. Learning in structured output spaces
  • Several discriminative algorithms have emerged recently in order to predict complex structures, such as sequences, trees, or graphs.
  • New discriminative approaches.
  • Problems analyzed:
    • Given a training set of correct pairs of sentences and their associated entity tags, learn to extract entities from a new sentence.
    • Given a training set of correct biological alignments, learn to align two unknown sequences.
    • Given a training set of correct RNA secondary structures associated with a set of sequences, learn to determine the secondary structure of a new sequence.
  • This is not an exhaustive list of possible applications.

  15. Learning in structured output spaces
  • Multilabel supervised classification (output: y = (y1...yn)).
  • Training set: input-output pairs. Hypothesis space: score functions parameterized by a weight vector w.
  • Learning: find w such that the correct outputs score higher than incorrect ones on the training pairs.
  • Prediction: maximize the score on a new test sample x.
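A plausible rendering of this formulation, assuming the linear score over a joint feature map f(x, y) that the later slides build on (the notation is reconstructed, not quoted):

```latex
% Training set of input-output pairs
T = \{(x_i, y_i)\}_{i=1}^{\ell}

% Hypothesis space: linear scores over a joint feature map f(x, y)
s_w(x, y) = w^\top f(x, y)

% Learning: choose w so that, for every training pair, the correct output
% scores higher than every incorrect output
w^\top f(x_i, y_i) > w^\top f(x_i, y) \quad \forall\, y \neq y_i

% Prediction on a new test sample x
\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} w^\top f(x, y)
```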

  16. Learning in structured output spaces
  • Three main phases:
    • Encoding: define a suitable feature map f(x,y).
    • Compression: characterize the output space in a synthetic and compact way.
    • Optimization: define a suitable objective function and use it for learning.

  17. Learning in structured output spaces
  • Encoding: define a suitable feature map f(x,y).
  • Compression: characterize the output space in a synthetic and compact way.
  • Optimization: define a suitable objective function and use it for learning.

  18. Encoding
  • Features must be defined in such a way that prediction can be computed efficiently.
  • The feature vector f(x,y) decomposes as a sum of elementary features defined on "parts"; parts are typically edges or nodes in graphs.
  • The output space associated with an input x is typically huge.

  19. Encoding
  • Sequence labeling. Example: CRF with HMM features (transition and emission counts).
  • In general, features reflect long-range interactions (when labeling xi, past and future observations are taken into account).
  • Arbitrary features of the observations are considered (e.g. spelling properties in NER).
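A minimal sketch of such a feature map, with transition and emission counts as the HMM features (the feature names and the toy input are illustrative assumptions, not the talk's exact encoding); note that it decomposes as a sum over the parts of the chain, as slide 18 requires.

```python
from collections import Counter

def hmm_features(x, y):
    """Joint feature map f(x, y) with HMM-style features: a sketch.

    x: sequence of observation symbols, y: label sequence of equal length.
    Returns a sparse dict of feature counts that decomposes as a sum of
    elementary features on the parts (nodes and edges) of the chain.
    """
    f = Counter()
    for i in range(len(x)):
        f[("emit", y[i], x[i])] += 1          # node feature: label emits symbol
        if i > 0:
            f[("trans", y[i-1], y[i])] += 1   # edge feature: label transition
    return f

# Hypothetical toy example: words and entity tags.
print(hmm_features(["PP", "ESTUDIA", "YA"], ["O", "N", "N"]))
```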

  20. Encoding
  • Sequence alignment.
  • 3-parameter model: f(x,y) = (#matches, #mismatches, #gaps), e.g. (4, 1, 4) in the example alignment.
  • In practice more complex models are used:
    • 4-parameter model: affine gap penalties, i.e. different costs if the gap starts in a given position (gap opening penalty) or if it continues (gap extension penalty).
    • 211/212-parameter model: f(x,y) contains the statistics associated with the gap penalties and with all possible pairs of amino acids.
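A minimal sketch of the 3-parameter feature map, counting matches, mismatches and gap positions in a given alignment (the input convention of two equal-length, '-'-padded rows is an assumption; for the slide-8 alignment the counts come out as 4 matches, 1 mismatch, 6 gaps).

```python
def alignment_features(a1, a2):
    """3-parameter feature map f(x, y) = (#matches, #mismatches, #gaps).

    a1, a2: the two rows of an alignment, padded with '-' to equal length.
    """
    matches = mismatches = gaps = 0
    for c1, c2 in zip(a1, a2):
        if c1 == '-' or c2 == '-':
            gaps += 1
        elif c1 == c2:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps

print(alignment_features("ATGCTTTC---", "---CTGTCGCC"))  # -> (4, 1, 6)
```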

  21. Encoding
  • Sequence parsing (e.g. x = GAUCGAUCGAUC with rules S → SS | GSC | CSG | ASU | USA | e).
  • The feature vector contains the statistics associated with the occurrences of the rules in the parse tree y.

  22. Encoding
  • Having defined these features, predictions can be computed efficiently with dynamic programming (DP) over a DP table:
    • Sequence labeling: Viterbi algorithm.
    • Sequence alignment: Needleman-Wunsch algorithm.
    • Sequence parsing: Cocke-Younger-Kasami (CYK) algorithm.
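A minimal Viterbi decoding sketch for the sequence-labeling case, reusing the hypothetical transition/emission feature names from the sketch after slide 19: it fills a position-by-label DP table and returns the label sequence maximizing the linear score w·f(x, y).

```python
def viterbi(x, labels, w):
    """Best label sequence under sum_i w[('emit', y_i, x_i)] + w[('trans', y_{i-1}, y_i)].

    w: dict of feature weights (missing features score 0).  A sketch, not the
    talk's exact implementation.
    """
    n = len(x)
    # dp[i][p] = best score of any labeling of x[:i+1] that ends in label p
    dp = [{p: w.get(("emit", p, x[0]), 0.0) for p in labels}]
    back = [{}]
    for i in range(1, n):
        dp.append({}); back.append({})
        for p in labels:
            best_prev = max(labels, key=lambda q: dp[i-1][q] + w.get(("trans", q, p), 0.0))
            dp[i][p] = (dp[i-1][best_prev] + w.get(("trans", best_prev, p), 0.0)
                        + w.get(("emit", p, x[i]), 0.0))
            back[i][p] = best_prev
    # Traceback from the best final label
    y = [max(labels, key=lambda p: dp[n-1][p])]
    for i in range(n - 1, 0, -1):
        y.append(back[i][y[-1]])
    return list(reversed(y))

# Hypothetical weights favouring 'O' for the word 'PP' and 'N' for 'ESTUDIA'.
w = {("emit", "O", "PP"): 2.0, ("emit", "N", "ESTUDIA"): 1.0, ("trans", "N", "N"): 0.5}
print(viterbi(["PP", "ESTUDIA", "YA"], ["O", "N"], w))  # -> ['O', 'N', 'N']
```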

  23. Learning in structured output spaces
  • Encoding: define a suitable feature map f(x,y).
  • Compression: characterize the output space in a synthetic and compact way.
  • Optimization: define a suitable objective function and use it for learning.

  24. Computing moments
  • The number N of possible output vectors yk given an observation x is typically huge.
  • To characterize the distribution of the scores, its mean and its variance are considered; these follow from the mean vector m and the covariance matrix C of the feature vectors.
  • C and m can be computed efficiently with DP techniques.

  25. Computing moments
  • Sequence labeling: the number N of possible label sequences yk given an observation sequence x is exponential in the length of the sequence.
  • An algorithm similar to the forward algorithm is used to compute m and C: for example, the mean value associated with the feature that represents the emission of a symbol q at state p is obtained by a recursive formula that sweeps the positions j = 2…n and the states i = 1…M of the chain.

  26. Computing moments
  • Basic idea behind the recursive formulas:
    • Mean values are computed first, by the recursions above.
    • Variances are computed by centering the second-order moments, i.e. subtracting the products of the corresponding means.
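To make the compressed quantities concrete, here is a brute-force sketch (exponential in the sequence length, so only for tiny examples; the talk computes the same b and C with forward-like DP recursions): enumerate every label sequence, stack the feature vectors, and take their mean and their centred second-order moments. All names and the toy input are illustrative assumptions.

```python
import itertools
import numpy as np

def chain_features(x, y, labels, alphabet):
    """Dense HMM-style feature vector: transition counts then emission counts."""
    trans = {(a, b): i for i, (a, b) in enumerate(itertools.product(labels, labels))}
    emit = {(p, q): len(trans) + i
            for i, (p, q) in enumerate(itertools.product(labels, alphabet))}
    v = np.zeros(len(trans) + len(emit))
    for i in range(len(x)):
        v[emit[(y[i], x[i])]] += 1
        if i > 0:
            v[trans[(y[i-1], y[i])]] += 1
    return v

def moments_bruteforce(x, labels, alphabet):
    """Mean b and covariance C of f(x, y) over all label sequences y.

    Brute-force enumeration; a sketch only, since the number of label
    sequences grows as |labels| ** len(x).
    """
    F = np.array([chain_features(x, y, labels, alphabet)
                  for y in itertools.product(labels, repeat=len(x))])
    b = F.mean(axis=0)                       # first-order moments (means)
    C = (F.T @ F) / len(F) - np.outer(b, b)  # centred second-order moments
    return b, C

b, C = moments_bruteforce(("a", "b", "a"), labels=("0", "1"), alphabet=("a", "b"))
print(b, C.shape)
```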

  27. Computing moments
  • Problem: high computational cost for large feature spaces.
  • 1st solution: exploit the structure and the sparseness of the covariance matrix C. Example: in sequence labeling with a CRF with HMM features, the number of different values in C is linear in the size of the observation alphabet.
  • 2nd solution: sampling strategy.

  28. Learning in structured output spaces
  • Encoding: define a suitable feature map f(x,y).
  • Compression: characterize the output space in a synthetic and compact way.
  • Optimization: define a suitable objective function and use it for learning.

  29. Z-score
  • New optimization criterion, particularly suited for non-separable cases.
  • Minimize the number of output vectors with score higher than the score of the correct pairs.
  • Maximize the Z-score: the distance between the score of the correct output and the mean score, measured in units of the standard deviation of the scores.
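Given the mean vector m and covariance matrix C of the feature vectors over the output space, the Z-score of a correct pair (x, y) can be written in the usual standardized-score form; this is a reconstruction from those definitions, not a quoted formula:

```latex
% score of the correct output, standardized by the mean and standard
% deviation of the scores over the whole output space
Z(w) \;=\; \frac{w^\top f(x, y) \;-\; w^\top m}{\sqrt{\,w^\top C\, w\,}}
```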

  30. Z-score
  • The Z-score can be expressed as a function of the parameters w.
  • Maximizing it can be written as two equivalent optimization problems.

  31. Z-score
  • Ranking loss: the number of output vectors with score higher than the score of the correct pairs.
  • An upper bound on the ranking loss is minimized.
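A plausible rendering of the ranking loss, counting over the training pairs the outputs that outscore the correct one (my notation, not quoted from the deck):

```latex
R(w) \;=\; \sum_{i=1}^{\ell} \; \sum_{y \neq y_i}
  \mathbf{1}\!\left[\, w^\top f(x_i, y) \;>\; w^\top f(x_i, y_i) \,\right]
```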

  32. Previous approaches
  • Minimize the number of incorrect macrolabels y: CRFs [Lafferty et al., 01], HMSVM [Altun et al., 03], averaged perceptron [Collins 02].
  • Minimize the number of incorrect microlabels y: M3Ns [Taskar et al., 03], SVMISO [Tsochantaridis et al., 04].

  33. SODA (Structured Output Discriminant Analysis)
  • Given a training set T, the empirical risk associated with the upper bound on the ranking loss is minimized.
  • An equivalent formulation in terms of C and b is considered to solve it.

  34. SODA
  • Convex optimization problem.
  • If C* is not PSD, regularization can be introduced.
  • Solution: a simple matrix inversion.
  • Fast conjugate gradient methods are available.
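A sketch of that solution step, assuming a Fisher-discriminant-style closed form w = (C* + λI)^{-1} b* consistent with "simple matrix inversion" plus regularization (the names b_star, C_star and lam are placeholders, not quoted from the talk), with a conjugate gradient option for large feature spaces.

```python
import numpy as np
from scipy.sparse.linalg import cg

def soda_weights(b_star, C_star, lam=1e-3, use_cg=False):
    """Solve (C* + lam * I) w = b*: a sketch of a SODA-style closed form.

    The exact definitions of b_star and C_star follow the moment statistics
    accumulated during training; lam regularizes a non-PSD C*.
    """
    A = C_star + lam * np.eye(C_star.shape[0])
    if use_cg:
        w, info = cg(A, b_star)             # iterative solver for large problems
        assert info == 0, "conjugate gradient did not converge"
        return w
    return np.linalg.solve(A, b_star)       # direct solve ("simple matrix inversion")

# Toy usage with random (hypothetical) moment statistics.
rng = np.random.default_rng(0)
C_star = rng.standard_normal((5, 5)); C_star = C_star @ C_star.T
b_star = rng.standard_normal(5)
print(soda_weights(b_star, C_star))
```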

  35. Rademacher bound
  • The bound shows that learning based on the upper bound on the ranking loss is effectively achieved.
  • The bound also holds in the case where b* and C* are estimated by sampling.
  • Two directions of sampling:
    • For each training pair, only a limited number n of incorrect outputs is considered to estimate b* and C*.
    • Only a finite number ℓ of input-output pairs is given in the training set.
  • The empirical expectation of the estimated loss (estimated by computing b* and C* by random sampling) is a good approximate upper bound for the expected loss.
  • The latter is an upper bound for the ranking loss, so the Rademacher bound is also a bound on the expectation of the ranking loss.

  36. Rademacher bound
  • Theorem (Rademacher bound for SODA). With probability at least 1-δ over the joint draw of the random sample T and of the random samples from the output space taken for each training pair to approximate b* and C*, the bound holds for any w with squared norm smaller than c, whereby M is a constant and the number of random samples for each training pair is assumed equal to n.
  • The Rademacher complexity terms decrease with n and with ℓ respectively, such that the bound becomes tight for increasing n and ℓ, as long as n grows faster than log(ℓ).

  37. Z-score approach
  • How to define the Z-score of a training set? Another possible approach (independence assumption).
  • Convex optimization problem which can again be solved by simple matrix inversion.
  • By maximizing the Z-score, most of the linear constraints are satisfied.

  38. Iterative approach
  • One may want to impose the violated constraints explicitly.
  • This is again a convex optimization problem that can be solved with an iterative algorithm similar to previous approaches (HMSVM [Altun et al., 03], averaged perceptron [Collins 02]).
  • If needed, relax the constraints (e.g. add slack variables for non-separable problems).

  39. Iterative approach
  Input: training set T
  1: C ← ∅
  2: Compute bi, Ci for all i = 1…ℓ   (moments computation)
  3: Compute b* = Σi bi, C* = Σi Ci
  4: Find w solving the QP   (Z-score maximization)
  5: repeat
  6:   for i = 1…ℓ do
  7:     Compute yi' = argmaxy wT f(xi, y)   (identify the most violated constraint)
  8:     if wT f(xi, yi') > wT f(xi, yi) then
  9:       C ← C ∪ { wT (f(xi, yi) - f(xi, yi')) > 0 }
  10:      Find w solving the QP subject to the constraints in C   (constrained Z-score maximization)
  11:    end if
  12:  end for
  13: until C is not changed during the current iteration
  Output: w

  40. Experimental results
  • Sequence labeling: artificial data.
  • Chain CRF with HMM features. Sequence length: 50. Training set size: 20 pairs. Test set size: 100 pairs.
  • Comparison with SVMISO [Tsochantaridis et al., 04], Perceptron [Collins 02], CRFs [Lafferty et al., 01].
  • Average number of incorrect labels, varying the level of noise p.

  41. Experimental results
  • Sequence labeling: artificial data.
  • HMM features. Noise level p = 0.2.
  • Average number of incorrect labels and computational time as functions of the training set size.

  42. Experimental results
  • Sequence labeling: artificial data.
  • Chain CRF with HMM features. Sequence length: 10. Training set size: 50 pairs. Test set size: 100 pairs. Noise level p = 0.2.
  • Comparison with SVMISO [Tsochantaridis et al., 04].
  • Labeling error on the test set and average training time as functions of the observation alphabet size.

  43. Experimental results
  • Sequence labeling: artificial data. Chain CRF with HMM features.
  • Adding constraints is not very useful when data are noisy and not linearly separable.

  44. Experimental results
  • Sequence labeling: NER. Spanish news wire articles, Special Session of CoNLL-02. 300 sentences with an average length of 30 words.
  • 9 labels: non-name, plus beginning and continuation of person, organization, location and miscellaneous names.
  • Two sets of binary features: S1 (HMM features) and S2 (S1 plus HMM features for the previous and the next word).
  • Labeling error on the test set (5-fold cross-validation).

  45. Experimental results
  • Sequence alignment: artificial sequences.
  • Test error (number of incorrectly aligned pairs) as a function of the training set size.
  • Original and reconstructed substitution matrices.

  46. Experimental results
  • Sequence parsing: G6 grammar in [Dowell and Eddy, 2004]. RNA sequences of five families extracted from the Rfam database [Griffiths-Jones et al., 2003].
  • Prediction results with five-fold cross-validation.

  47. Conclusions
  • New methods for learning in structured output spaces.
  • Accuracy comparable with state-of-the-art techniques.
  • Easy to implement (DP for the matrix computations and a simple optimization problem).
  • Fast for large training sets and a reasonable number of features:
    • Mean and variance computations are parallelizable over large training sets.
    • Conjugate gradient techniques are used in the optimization phase.
  • Three applications analyzed: sequence labeling, sequence parsing and sequence alignment.
  • Future work:
    • Test the scalability of this approach using approximate techniques.
    • Develop a dual version with kernels.

  48. Thank you
