
Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging


Presentation Transcript


  1. Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging
     NAACL-HLT 2009 Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics, June 5, 2009
     Peter A. Chew, Brett W. Bader (Sandia National Laboratories)
     Alla Rozovskaya (University of Illinois, Urbana-Champaign)
     Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

  2. Outline
     • Previous approaches to part-of-speech (POS) tagging
     • The DEDICOM model
     • Testing framework
     • Preliminary results and discussion

  3. Approaches to POS tagging (1)
     • Supervised
       • Rule-based (e.g. Harris 1962)
         • Dictionary + manually developed rules
         • Brittle – approach doesn't port to new domains
       • Stochastic (e.g. Stolz et al. 1965, Church 1988)
         • Examples: HMMs, CRFs
         • Relies on estimation of emission and transition probabilities from a tagged training corpus
         • Again, difficulty in porting to new domains

  4. Approaches to POS tagging (2)
     • Unsupervised
       • All approaches exploit distributional patterns
       • Singular Value Decomposition (SVD) of term-adjacency matrix (Schütze 1993, 1995)
       • Graph clustering (Biemann 2006)
     • Our approach: DEDICOM of term-adjacency matrix
       • Most similar to Schütze (1993, 1995)
       • Advantages:
         • can be reconciled to stochastic approaches
         • like SVD and graph clustering, completely unsupervised
         • initial results (to be shown) appear promising

  5. Introduction to DEDICOM
     • DEcomposition into DIrectional COMponents
     • Harshman (1978)
     • A linear-algebraic decomposition method comparable to SVD
     • First used for analysis of marketing data

  6. DEDICOM – an example (domain = shampoo marketing!)
     [Slide shows the original data matrix, the reduced data matrix, and the "loadings" matrix]
     • DEDICOM decomposes the 8 x 8 matrix into a simplified k x k "summary" (here k = 2), and a matrix showing the loadings for each phrase in each dimension
     • A key assumption is that stimulus and evoked phrases are a "single set of objects"

  7. DEDICOM – algebraic details
     • Let X be the original data matrix
     • Let R be the reduced matrix of directional relationships
     • Let A be the "loadings" matrix
     • X ≈ ARAᵀ
     • Compare to SVD: X ≈ USVᵀ
       • U, V and A are all dense
       • But R is dense while S is diagonal, and U ≠ V
       • In SVD, U and V differ; in DEDICOM, A is repeated as Aᵀ
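
     For concreteness, the decomposition X ≈ ARAᵀ can be sketched in a few lines of NumPy. The alternating, least-squares-style update scheme below (and the SVD initialization and iteration count) is an illustrative assumption, not necessarily the fitting procedure the authors used:

        import numpy as np

        def dedicom(X, k, n_iter=100):
            # Fit X ~= A @ R @ A.T with A (n x k, the "loadings") and R
            # (k x k, the asymmetric "summary").  Alternate between an exact
            # least-squares fit of R with A fixed, and an update of A that
            # holds the opposite-side copy of A fixed.
            U, _, _ = np.linalg.svd(X, full_matrices=False)
            A = U[:, :k]                                  # SVD-based initialization
            for _ in range(n_iter):
                Ainv = np.linalg.pinv(A)
                R = Ainv @ X @ Ainv.T                     # least-squares fit of R
                G = A.T @ A
                num = X @ A @ R.T + X.T @ A @ R
                den = R @ G @ R.T + R.T @ G @ R
                A = num @ np.linalg.pinv(den)             # combined update of A
            return A, R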

  8. DEDICOM – application to POS tagging
     [Slide shows the term-adjacency matrix decomposed into the 'A' matrix and the 'R' matrix]
     • The assumption that terms are a "single set of objects", whether they precede or follow, sets DEDICOM apart from SVD and other unsupervised approaches
     • This assumption models the fact that tokens play the same syntactic role whether we view them as the first or second element in a bigram
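
     On the input side, the term-adjacency matrix is simply a table of bigram counts. The helper below is a sketch: the function name, the toy corpus and the dense representation are assumptions for illustration (the real X is a large sparse matrix built from a full corpus):

        import numpy as np

        def adjacency_matrix(tokens):
            # Term-adjacency matrix: X[i, j] counts how often term i is
            # immediately followed by term j in the token stream.
            vocab = sorted(set(tokens))
            index = {w: i for i, w in enumerate(vocab)}
            X = np.zeros((len(vocab), len(vocab)))
            for first, second in zip(tokens, tokens[1:]):
                X[index[first], index[second]] += 1.0
            return X, vocab

        # Tiny illustrative corpus (the paper uses a full corpus such as CoNLL 2000)
        X, vocab = adjacency_matrix("the dog saw the cat and the cat saw the dog".split())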

  9. Comparing DEDICOM output to HMM input
     [Slide pairs the output of DEDICOM with the input to an HMM (after normalization of counts): the 'R' matrix corresponds to the transition probability matrix, and the 'A' matrix to the emission probability matrix]
     • The output of DEDICOM is essentially a transition and emission probability matrix
     • DEDICOM offers the possibility of getting the familiar transition and emission probabilities without training data
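
     One way to read the DEDICOM factors as the familiar HMM matrices is sketched below. The clipping to nonnegative values and the normalization directions are simplifying assumptions on my part: DEDICOM itself does not constrain the factors to behave like probabilities.

        import numpy as np

        def to_hmm_params(A, R, eps=1e-12):
            # Read the DEDICOM factors as HMM parameters, per the analogy on
            # this slide: R -> transition probabilities, A -> emission
            # probabilities.  Clipping negative entries to zero is an
            # illustrative simplification.
            A = np.clip(A, 0.0, None)
            R = np.clip(R, 0.0, None)
            trans = R / (R.sum(axis=1, keepdims=True) + eps)      # k x k, rows sum to ~1
            emit = (A / (A.sum(axis=0, keepdims=True) + eps)).T   # k x |V|, rows sum to ~1
            return trans, emit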

  10. Validation: method 1 (theoretical)
      • Hypothetical example: suppose a tagged training corpus exists
        • From the corpus, tabulate X (sparse matrix of bigram counts), A* (term-tag counts) and R* (tag-adjacency counts)
      • By definition (subject to a difference of 1 for the final token):
        • rowsums of X = colsums of X = rowsums of A*
        • colsums of A* = rowsums of R* = colsums of R*
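
     These count identities can be checked directly on a toy tagged corpus; the corpus and variable names below are invented for illustration:

        import numpy as np

        # Toy tagged corpus of (token, tag) pairs -- purely illustrative.
        corpus = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
                  ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]

        terms = sorted({w for w, _ in corpus})
        tags = sorted({t for _, t in corpus})
        ti = {w: i for i, w in enumerate(terms)}
        gi = {t: i for i, t in enumerate(tags)}

        X = np.zeros((len(terms), len(terms)))       # bigram (term-adjacency) counts
        A_star = np.zeros((len(terms), len(tags)))   # term-tag counts
        R_star = np.zeros((len(tags), len(tags)))    # tag-adjacency counts

        for (w1, t1), (w2, t2) in zip(corpus, corpus[1:]):
            X[ti[w1], ti[w2]] += 1.0
            R_star[gi[t1], gi[t2]] += 1.0
        for w, t in corpus:
            A_star[ti[w], gi[t]] += 1.0

        # The identities above, up to a difference of 1 for the final token/tag:
        print(X.sum(axis=1), A_star.sum(axis=1))         # rowsums of X vs. rowsums of A*
        print(A_star.sum(axis=0), R_star.sum(axis=1))    # colsums of A* vs. rowsums of R*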

  11. Validation: method 1 (theoretical)
      • To turn A* and R* into transition and emission probability matrices, we simply multiply each by a diagonal matrix D whose entries are the inverses of the rowsum vector
      • But if the DEDICOM model is a good one, we should be able to multiply A*DR*D(A*)ᵀ to approximate the original matrix X
      • In this case, A*DR*D(A*)ᵀ [the resulting matrix is shown on the slide] not only approximates X, but also captures some syntactic regularities which aren't instantiated in the corpus (this is one reason HMM-based POS tagging is successful)
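
     Continuing the toy example from the previous sketch, the reconstruction A*DR*D(A*)ᵀ can be computed directly. Taking D to be the diagonal of inverse tag counts is my reading of the slide (it is consistent with the sum identities on slide 10) and should be treated as an assumption:

        # Continues the toy example above (uses X, A_star, R_star).
        tag_counts = A_star.sum(axis=0)                # = colsums of A*, ~ rowsums of R*
        D = np.diag(1.0 / tag_counts)
        X_hat = A_star @ D @ R_star @ D @ A_star.T
        print(np.round(X_hat, 2))                      # compare against the bigram matrix X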

  12. Validation: method 2 (empirical)
      • Use a tagged corpus (CoNLL 2000) as the gold standard
        • CoNLL 2000 has 19,440 distinct terms
        • There are 44 distinct tags in the tagset
      • Tabulate the X matrix (solely from bigram frequencies, blind to tags)
      • Apply DEDICOM to 'learn' emission and transition probability matrices
      • Use these as input to an HMM; tag each token with a numerical index (one of the DEDICOM 'dimensions')
      • Evaluate by looking at the correlation of induced tags with gold standard tags in a confusion matrix
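
     Tagging each token with a DEDICOM dimension index can be done with standard Viterbi decoding over the induced matrices. The decoder below is a generic sketch (uniform start probabilities, my own function names); the paper's exact decoding setup may differ:

        import numpy as np

        def viterbi(obs, trans, emit, eps=1e-12):
            # Standard Viterbi decoding: obs is a sequence of term indices,
            # trans is k x k, emit is k x |V| (e.g. from to_hmm_params above).
            k = trans.shape[0]
            logt = np.log(trans + eps)
            loge = np.log(emit + eps)
            delta = np.log(np.full(k, 1.0 / k)) + loge[:, obs[0]]   # uniform start
            back = []
            for o in obs[1:]:
                scores = delta[:, None] + logt + loge[:, o][None, :]
                back.append(scores.argmax(axis=0))     # best previous state per state
                delta = scores.max(axis=0)
            path = [int(delta.argmax())]
            for bp in reversed(back):
                path.append(int(bp[path[-1]]))
            return path[::-1]                          # one DEDICOM dimension index per token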

  13. Validation: method 2 (empirical)
      • Examples of DEDICOM dimensions or clusters:
        [Slide shows a table of example term clusters for several DEDICOM dimensions]

  14. Validation: method 2 (empirical)
      • Confusion matrix: correlation with the 'ideal' diagonal matrix = 0.494
      • Ideally, the confusion matrix would have one DEDICOM class per 'gold standard' tag (either a diagonal matrix or some permutation thereof), although this assumes the gold standard is the optimal tagging scheme
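
     The slide does not spell out how the correlation with the 'ideal' diagonal matrix is computed; the sketch below is one plausible reconstruction (Hungarian matching of induced classes to gold tags, then a Pearson correlation of the flattened matrices) and should not be taken as the authors' exact metric:

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def diagonal_correlation(induced_tags, gold_tags, n_classes, n_gold):
            # Build the confusion matrix of induced classes vs. gold tags,
            # find the best one-to-one class-to-tag matching, and correlate
            # the confusion matrix with an "ideal" matrix that puts each
            # class's total count on its matched tag and zeros elsewhere.
            C = np.zeros((n_classes, n_gold))
            for c, g in zip(induced_tags, gold_tags):
                C[c, g] += 1.0
            rows, cols = linear_sum_assignment(-C)     # maximize matched counts
            ideal = np.zeros_like(C)
            ideal[rows, cols] = C.sum(axis=1)[rows]
            return np.corrcoef(C.ravel(), ideal.ravel())[0, 1]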

  15. Conclusions
      • DEDICOM, like other completely unsupervised POS-tagging methods, is hard to evaluate empirically
      • But we believe it holds promise because:
        • unlike other unsupervised approaches, it can be reconciled to stochastic approaches (like HMMs) which have a successful track record
        • unlike traditional stochastic approaches, it is truly completely unsupervised
        • initial objective and subjective results do appear promising

  16. Future work
      • We believe the key to evaluating DEDICOM, or other methods of POS tagging, is to do so within a larger system
      • For example, use DEDICOM to disambiguate tokens which are ambiguous with respect to part of speech
        • e.g. 'claims' (NN) versus 'claims' (VBZ)
      • Then use this, for example, within an information retrieval system to establish separate indices (rows in a term-by-document matrix) for disambiguated terms
      • Evaluate based on standard metrics such as precision; see if DEDICOM-based disambiguation results in improved precision

  17. QUESTIONS?
      POINTS OF CONTACT: Brett W. Bader (bwbader@sandia.gov), Peter A. Chew (pchew@sandia.gov)
