Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging

NAACL-HLT 2009 Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics, June 5, 2009
Peter A. Chew, Brett W. Bader (Sandia National Laboratories)
Alla Rozovskaya (University of Illinois, Urbana-Champaign)
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

Outline
• Previous approaches to part-of-speech (POS) tagging
• The DEDICOM model
• Testing framework
• Preliminary results and discussion

Approaches to POS tagging (1)
• Supervised
  • Rule-based (e.g. Harris 1962)
    • Dictionary + manually developed rules
    • Brittle: approach doesn’t port to new domains
  • Stochastic (e.g. Stolz et al. 1965, Church 1988)
    • Examples: HMMs, CRFs
    • Relies on estimation of emission and transition probabilities from a tagged training corpus
    • Again, difficulty in porting to new domains

Approaches to POS tagging (2)
• Unsupervised
  • All approaches exploit distributional patterns
  • Singular Value Decomposition (SVD) of term-adjacency matrix (Schütze 1993, 1995)
  • Graph clustering (Biemann 2006)
• Our approach: DEDICOM of term-adjacency matrix
  • Most similar to Schütze (1993, 1995)
  • Advantages:
    • can be reconciled with stochastic approaches
    • like SVD and graph clustering, completely unsupervised
    • initial results (to be shown) appear promising

Introduction to DEDICOM
• DEcomposition into DIrectional COMponents
• Harshman (1978)
• A linear-algebraic decomposition method comparable to SVD
• First used for analysis of marketing data

DEDICOM – an example (domain = shampoo marketing!)
[Figure panels: original data matrix; reduced data matrix; “loadings” matrix]
• DEDICOM decomposes the 8 x 8 matrix into a simplified k x k “summary” (here k = 2), and a matrix showing the loadings for each phrase in each dimension
• A key assumption is that stimulus and evoked phrases are a “single set of objects”

DEDICOM – algebraic details
• Let X be the original data matrix
• Let R be the reduced matrix of directional relationships
• Let A be the “loadings” matrix
• DEDICOM: $X \approx A R A^T$
• Compare to SVD: $X \approx U S V^T$
• U, V and A are all dense
• But R is dense while S is diagonal, and $U \neq V$
• In SVD, U and V differ; in DEDICOM, A is repeated as $A^T$
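The structural contrast between the two factorizations can be sketched numerically. This is an illustration under stated assumptions, not the authors' fitting procedure: the sizes `n` and `k` and the random `A` and `R` are made up (they are not the shampoo data), and X is built from a known A and R rather than estimated from data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2                      # assumed sizes: 8 objects, 2 dimensions

A = rng.random((n, k))           # loadings: one row per object
R = rng.random((k, k))           # dense and, in general, asymmetric
X = A @ R @ A.T                  # DEDICOM form: the same A on both sides

# SVD of the same matrix uses two different bases, U and V,
# with a diagonal matrix of singular values in the middle.
U, s, Vt = np.linalg.svd(X)
X_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("X is symmetric:", np.allclose(X, X.T))
print("rank-k SVD reproduces X:", np.allclose(X, X_svd))
```

Because R need not be symmetric, X can encode directional (asymmetric) relationships even though a single loadings matrix A describes both rows and columns.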

DEDICOM – application to POS tagging
[Figure panels: ‘R’ matrix; term adjacency matrix; ‘A’ matrix]
• The assumption that terms are a “single set of objects”, whether they precede or follow, sets DEDICOM apart from SVD and other unsupervised approaches
• This assumption models the fact that tokens play the same syntactic role whether we view them as the first or second element in a bigram
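As a minimal sketch of what a term adjacency matrix looks like, the snippet below tabulates bigram counts from untagged text. The five-token sentence is a made-up example, and a dense NumPy array is used only for brevity; a real vocabulary would call for a sparse matrix.

```python
from collections import Counter
import numpy as np

# Hypothetical untagged text (an assumption for illustration only)
tokens = "the dog chased the cat".split()

vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}   # one index set for rows AND columns

# Count each (first term, second term) pair of adjacent tokens
bigrams = Counter(zip(tokens, tokens[1:]))

X = np.zeros((len(vocab), len(vocab)), dtype=int)
for (w1, w2), count in bigrams.items():
    X[index[w1], index[w2]] = count           # row = preceding term, column = following term

print(vocab)
print(X)
```

Rows and columns share the same vocabulary index, which is the “single set of objects” assumption in matrix form.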

Comparing DEDICOM output to HMM input
  Output of DEDICOM    Input to HMM (after normalization of counts)
  ‘R’ matrix           Transition prob. matrix
  ‘A’ matrix           Emission prob. matrix
• The output of DEDICOM is essentially a transition probability matrix and an emission probability matrix
• DEDICOM offers the possibility of getting the familiar transition and emission probabilities without training data

Validation: method 1 (theoretical)
• Hypothetical example: suppose a tagged training corpus exists
[Figure panels: corpus; X: sparse matrix of bigram counts; A*: term-tag counts; R*: tag-adjacency counts]
• By definition (subject to a difference of 1 for the final token):
  • rowsums of X = colsums of X = rowsums of A*
  • colsums of A* = rowsums of R* = colsums of R*
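The marginal identities above can be checked on a toy example. The corpus below is hypothetical (it is not the one shown on the slide), and the variable names `A_star` and `R_star` are simply stand-ins for A* and R*.

```python
import numpy as np

# Hypothetical toy tagged corpus (an assumption; not the corpus on the slide)
corpus = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"),
          ("the", "DT"), ("cat", "NN")]

terms = sorted({w for w, _ in corpus})
tags = sorted({t for _, t in corpus})
term_id = {w: i for i, w in enumerate(terms)}
tag_id = {t: i for i, t in enumerate(tags)}

X = np.zeros((len(terms), len(terms)))        # term-by-term bigram counts
A_star = np.zeros((len(terms), len(tags)))    # term-by-tag counts
R_star = np.zeros((len(tags), len(tags)))     # tag-by-tag adjacency counts

for w, t in corpus:
    A_star[term_id[w], tag_id[t]] += 1
for (w1, t1), (w2, t2) in zip(corpus, corpus[1:]):
    X[term_id[w1], term_id[w2]] += 1
    R_star[tag_id[t1], tag_id[t2]] += 1

# Largest gap between marginals that the identities say should agree
term_gap = max(np.abs(X.sum(axis=1) - A_star.sum(axis=1)).max(),
               np.abs(X.sum(axis=0) - A_star.sum(axis=1)).max())
tag_gap = max(np.abs(R_star.sum(axis=1) - A_star.sum(axis=0)).max(),
              np.abs(R_star.sum(axis=0) - A_star.sum(axis=0)).max())
print("largest term-marginal gap:", term_gap)
print("largest tag-marginal gap:", tag_gap)
```

In this toy corpus the gaps never exceed 1, and they arise only at the corpus boundaries, where a token has no neighbor on one side.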

Validation: method 1 (theoretical)
• To turn A* and R* into emission and transition probability matrices, respectively, we simply multiply each by a diagonal matrix D whose entries are the inverses of the rowsum vector
• But if the DEDICOM model is a good one, the product $A^* D R^* D (A^*)^T$ should approximate the original matrix X
• In this case, $A^* D R^* D (A^*)^T$ = [matrix shown on the original slide]
• This not only approximates X, but also captures some syntactic regularities which aren’t instantiated in the corpus (this is one reason HMM-based POS tagging is successful)
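The normalization and reconstruction steps can be sketched with hand-entered counts for a hypothetical five-token corpus, the/DT dog/NN chased/VBD the/DT cat/NN (not the slide's corpus). One assumption is labeled in the code: D is built from the tag totals (column sums of A*), which the previous slide equates with the row sums of R* up to the final token.

```python
import numpy as np

# Hand-entered counts for the hypothetical corpus
# "the/DT dog/NN chased/VBD the/DT cat/NN"
terms = ["the", "dog", "cat", "chased"]
tags = ["DT", "NN", "VBD"]
A_star = np.array([[2, 0, 0],     # the
                   [0, 1, 0],     # dog
                   [0, 1, 0],     # cat
                   [0, 0, 1]],    # chased
                  dtype=float)
R_star = np.array([[0, 2, 0],     # DT -> NN twice
                   [0, 0, 1],     # NN -> VBD once
                   [1, 0, 0]],    # VBD -> DT once
                  dtype=float)

# Assumption: D holds the inverse tag totals (column sums of A*)
D = np.diag(1.0 / A_star.sum(axis=0))

emission = A_star @ D             # P(term | tag): each column sums to 1
transition = R_star / R_star.sum(axis=1, keepdims=True)   # P(next tag | tag)

# Reconstruction of the bigram counts from tag-level information only
X_hat = A_star @ D @ R_star @ D @ A_star.T

print(np.round(X_hat, 2))
```

In this toy case the reconstruction assigns weight 0.5 to the bigram "cat chased", which never occurs in the corpus: the NN-to-VBD regularity is shared across all nouns, which is the kind of generalization the last bullet describes.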

Validation: method 2 (empirical)
• Use a tagged corpus (CoNLL 2000) as gold standard
  • CoNLL 2000 has 19,440 distinct terms
  • There are 44 distinct tags in the tagset
• Tabulate X matrix (solely from bigram frequencies, blind to tags)
• Apply DEDICOM to ‘learn’ emission and transition probability matrices
• Use these as input to an HMM; tag each token with a numerical index (one of the DEDICOM ‘dimensions’)
• Evaluate by looking at correlation of induced tags with gold standard tags in a confusion matrix
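The HMM tagging step can be sketched with standard Viterbi decoding. The slides do not say which decoder was used, so Viterbi is an assumption here, as are the function name `viterbi`, the uniform initial distribution, and the small hand-entered probability matrices (they stand in for normalized DEDICOM output and are not actual results).

```python
import numpy as np

def viterbi(obs, trans, emit, init):
    """Most likely class sequence for a list of term indices.

    trans[i, j] = P(class j follows class i)
    emit[w, i]  = P(term w | class i)
    init[i]     = P(first class is i)
    """
    eps = 1e-12                                   # avoids log(0)
    log_t, log_e, log_i = (np.log(m + eps) for m in (trans, emit, init))
    score = log_i + log_e[obs[0]]                 # best log-score ending in each class
    backpointers = []
    for w in obs[1:]:
        cand = score[:, None] + log_t             # cand[i, j]: come from i, move to j
        backpointers.append(cand.argmax(axis=0))  # best predecessor of each class j
        score = cand.max(axis=0) + log_e[w]
    path = [int(score.argmax())]
    for bp in reversed(backpointers):             # trace back from the last token
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Hypothetical inputs: 4 terms (the, dog, cat, chased) and 3 induced classes
emit = np.array([[1.0, 0.0, 0.0],
                 [0.0, 0.5, 0.0],
                 [0.0, 0.5, 0.0],
                 [0.0, 0.0, 1.0]])
trans = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
init = np.full(3, 1.0 / 3.0)

obs = [0, 1, 3, 0, 2]                             # "the dog chased the cat"
path = viterbi(obs, trans, emit, init)
print(path)
```

The output is a list of numerical class indices, one per token, matching the slide's description of tagging each token with a DEDICOM dimension rather than a named tag.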

Validation: method 2 (empirical)
• Examples of DEDICOM dimensions or clusters: [table shown on the original slide]

Validation: method 2 (empirical)
• Confusion matrix: correlation with ‘ideal’ diagonal matrix = 0.494
• Ideally, the confusion matrix would have one DEDICOM class per ‘gold standard’ tag (either a diagonal matrix or some permutation thereof), although this assumes the gold standard is the optimal tagging scheme
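The slides do not spell out how the correlation with the ‘ideal’ diagonal matrix was computed. The sketch below shows one plausible reading, offered as an assumption only: tabulate a gold-tag by induced-class confusion matrix, take as ‘ideal’ the diagonal matrix that puts every token of a gold tag into a single class, and compute the Pearson correlation between the two flattened matrices. The toy labels are made up, and the induced classes are assumed to be already permuted to line up with the gold tags.

```python
import numpy as np

# Hypothetical gold tags and induced class indices for six tokens
gold = ["DT", "NN", "VBD", "DT", "NN", "NN"]
induced = [0, 1, 2, 0, 1, 2]

gold_tags = sorted(set(gold))
gold_id = {t: i for i, t in enumerate(gold_tags)}
n_classes = max(induced) + 1

# Confusion matrix: rows = gold tags, columns = induced classes
C = np.zeros((len(gold_tags), n_classes))
for t, c in zip(gold, induced):
    C[gold_id[t], c] += 1

# Assumed 'ideal': each gold tag maps to exactly one class
# (requires as many classes as gold tags, already aligned)
ideal = np.diag(C.sum(axis=1))

r = np.corrcoef(C.ravel(), ideal.ravel())[0, 1]
print(C)
print("correlation with ideal:", round(r, 3))
```

Under this reading a perfect one-to-one tagging would give a correlation of 1, and every token assigned to an off-diagonal cell lowers it.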

Conclusions
• DEDICOM, like other completely unsupervised POS-tagging methods, is hard to evaluate empirically
• But we believe it holds promise because:
  • unlike other unsupervised approaches, it can be reconciled with stochastic approaches (like HMMs) which have a successful track record
  • unlike traditional stochastic approaches, it is truly completely unsupervised
  • initial objective and subjective results do appear promising

Future work
• We believe the key to evaluating DEDICOM, or other methods of POS tagging, is to do so within a larger system
• For example, use DEDICOM to disambiguate tokens which are ambiguous w.r.t. part of speech
  • e.g. ‘claims’ (NN) versus ‘claims’ (VBZ)
• Then use this, for example, within an information retrieval system to establish separate indices (rows in a term-by-document matrix) for disambiguated terms
• Evaluate based on standard metrics such as precision; see if DEDICOM-based disambiguation results in improved precision

QUESTIONS?
POINTS OF CONTACT: Brett W. Bader (bwbader@sandia.gov), Peter A. Chew (pchew@sandia.gov)
