
ICA of Text Documents

Jaakko Peltonen (jaakko.peltonen@hut.fi), 26 October 2000. Based on "Unsupervised Topic Separation and Keyword Identification in Document Collections: A Projection Approach" by Ata Kabán and Mark Girolami.


Presentation Transcript


  1. ICA of Text Documents
  Jaakko Peltonen (jaakko.peltonen@hut.fi), 26 October 2000
  Based on "Unsupervised Topic Separation and Keyword Identification in Document Collections: A Projection Approach" by Ata Kabán and Mark Girolami

  2. 1 Introduction
  • ICA: proposed as a useful technique for finding meaningful directions in multivariate data
  • The objective function affects the form of potential structure discovered
  • Here, the problem is partitioning and analysis of sparse multivariate data
  • Prior knowledge is used to derive a computationally inexpensive ICA

  3. 2 Introduction, continued
  • Two complementary architectures:
    - separate observed documents → document prototypes
    - separate observed words → topic-features
  • Skewness (asymmetry) is the right objective to optimize
  • The two tasks will be unified in a single algorithm
  • Result: fast convergence; computational cost linear in the number of training points

  4. 3 Data Representation
  • Vector space representation: document = [t1, t2, ..., tT]^T
  • T = number of words in the dictionary (tens of thousands)
  • Elements are binary indicators or frequencies → sparse representation
  • D = term × document matrix (T × N, N = number of documents)
  [Figure: the matrix D^T, rows doc 1 ... doc N, columns term 1 ... term T]
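
The sparse term × document representation described on this slide can be sketched as follows; the toy corpus and vocabulary here are invented for illustration, and in practice T runs to tens of thousands of terms:

```python
import numpy as np
from scipy.sparse import csc_matrix

# Hypothetical toy corpus; each document is a bag of words.
docs = ["orbit nasa launch", "god church bible", "orbit launch shuttle"]
vocab = sorted({w for d in docs for w in d.split()})
t_index = {w: i for i, w in enumerate(vocab)}

# Build the T x N term-document matrix D with raw term frequencies.
rows, cols, vals = [], [], []
for n, d in enumerate(docs):
    for w in d.split():
        rows.append(t_index[w])   # term index
        cols.append(n)            # document index
        vals.append(1.0)          # frequency increment
D = csc_matrix((vals, (rows, cols)), shape=(len(vocab), len(docs)))

print(D.shape)  # (T, N)
print(D.nnz)    # count of nonzero entries; the matrix is sparse
```

Binary indicators would simply clip the counts to {0, 1}; either way most entries of D are zero, which is what the sparse storage exploits.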

  5. 4 Preprocessing
  • Assumption: observations = a noisy expansion of some denser group of latent topics
  • Number of clusters or topics set a priori
  • The K-dimensional LSA space is used as the topic-concepts subspace
  • PCA may lose important data components: with sparse data, infrequent but meaningful correlations are a concern
  • Reconstruction: D ≈ D_K = U E V^T
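
The rank-K LSA reconstruction D ≈ D_K = U E V^T can be checked numerically; this sketch uses NumPy's dense SVD on a random stand-in for D (the slides use a Lanczos routine on the real sparse matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dense stand-in for the T x N term-document matrix (T=50, N=20).
D = rng.poisson(0.3, size=(50, 20)).astype(float)

K = 5  # number of latent topics, fixed a priori
U, e, Vt = np.linalg.svd(D, full_matrices=False)
U_K, E_K, Vt_K = U[:, :K], np.diag(e[:K]), Vt[:K, :]

# Rank-K reconstruction D_K = U E V^T using only the K leading components.
D_K = U_K @ E_K @ Vt_K
rel_err = np.linalg.norm(D - D_K) / np.linalg.norm(D)
print(D_K.shape, round(rel_err, 3))
```

The relative error is strictly below 1 because the K leading singular directions capture the dominant co-occurrence structure; everything discarded is the "noise" part of the expansion.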

  6. 5 Prototype Documents from a Corpus
  • Assumption: documents = a noisy linear mixture of (~independent) document prototypes
  • Number of prototypes = number of topics; prototypes reside in the LSA space (K dimensions)
  • Data projection onto the right eigenvectors + variance normalization: X^(1) := E^-1 V^T D^T = U^T (a K × T matrix)
  • Task: find the mixing matrix W^(1) and source documents S^(1) so that X^(1) = (W^(1))^T S^(1) (S^(1): K × T matrix)
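
The identity behind this projection, E^-1 V^T D^T = U^T, follows from D = U E V^T; a quick numerical check on toy data (the matrix shape and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.poisson(0.4, size=(30, 12)).astype(float)  # toy T x N matrix
K = 4

U, e, Vt = np.linalg.svd(D, full_matrices=False)
U_K = U[:, :K]
E_inv = np.diag(1.0 / e[:K])

# Project D^T onto the K leading right eigenvectors and normalize variance:
# X1 = E^-1 V^T D^T, which collapses to U^T since D = U E V^T.
X1 = E_inv @ Vt[:K, :] @ D.T
print(X1.shape)                     # (K, T)
print(np.allclose(X1, U_K.T))      # the projection equals U^T
```

So the whitened document-side data X^(1) is simply the transposed left singular vectors, which is why the subsequent ICA step only needs an orthogonal unmixing matrix.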

  7. 6 Prototype Documents from a Corpus, continued
  • Basis vectors of the topic space are assumed different → to separate the prototypes, find independent components
  • Words in documents are distributed in a positively skewed way → search restricted to skewed (asymmetric) distributions
  • LSA → the unmixing matrix must be orthogonal: (W^(1))^-1 = (W^(1))^T
  [Figure: D^T is unmixed by W^(1) E^-1 V^T into S^(1), rows topic 1 ... topic K over terms 1 ... T]

  8. 7 Prototype Documents from a Corpus, continued
  • Objective: a skewness measure, the Fisher skewness γ(s) = E[(s - μ)^3] / σ^3
  • Prior knowledge: small component mean, and projection variance restricted to unity → simplified objective G(s) = E[s^3] (the 3rd-order moment)
  • Prevent degenerate solutions → restrict w^T w = 1 for stationary points
  • Solve with gradient methods or iteratively
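
A minimal sketch of maximizing G(s) = E[s^3] under w^T w = 1 on whitened data, using a fixed-point update w ← E[x (w^T x)^2] in the spirit of the FastICA-style derivation the slides mention (the toy sources, mixing matrix, and iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Whitened toy data: one positively skewed source (a centered squared
# Gaussian) mixed with a symmetric one; rows = dims, columns = samples.
s1 = rng.standard_normal(5000) ** 2
s2 = rng.standard_normal(5000)
S = np.vstack([(s1 - s1.mean()) / s1.std(), s2 / s2.std()])
A = np.array([[0.8, 0.6], [-0.6, 0.8]])  # orthogonal mixing keeps data white
X = A @ S

# Fixed-point iteration for the third-moment objective G(s) = E[s^3].
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(20):
    y = w @ X
    w = (X * y**2).mean(axis=1)   # w <- E[x (w^T x)^2]
    w /= np.linalg.norm(w)        # enforce w^T w = 1

skew = np.mean((w @ X) ** 3)
print(skew > 1.0)  # the recovered component is strongly positively skewed
```

Because the update is proportional to the square of the current projection, the symmetric source's contribution dies out while the skewed one is amplified, and the output sign comes out positive, matching the slide's remark that positive sources make the sign relevant.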

  9. 8 Prototype Documents from a Corpus, continued
  • Sources are positive → the skewness is positive (the output sign is relevant!)
  • K orthonormal projection directions → matrix iteration
  • Similar to an approximate Newton-Raphson optimization (FastICA-type derivation + a small additional term)
  • Computational complexity: O(2K^2 T + KT + 4K^3)

  10. 9 Topic Features from Word Features
  • Assumption: terms = a noisy linear expansion of (~independent) concepts (topics)
  • Data compression: X^(2) := E^-1 U^T D = V^T (a K × N matrix)
  • Task: find the unmixing matrix W^(2) and topic features S^(2) so that X^(2) = (W^(2))^T S^(2) (S^(2): K × N matrix)
  • This time, use a clustering criterion

  11. 10 Topic Features from Word Features, continued
  • Objective function (the z_kn indicate the class of x_n) [equation shown on the slide]
  • Stochastic minimization → EM-type algorithm
  [Figure: D is unmixed by W^(2) E^-1 U^T into S^(2), rows topic 1 ... topic K over docs 1 ... N]

  12. 11 Topic Features from Word Features, continued
  • Comparison approach: a set of binary classifiers → algorithm
  • Maximizes a skewed, monotonically increasing function of topic s_k → a skewed prior is appropriate
  • Variance is normalized after LSA; independent topics → source components aligned to orthonormal axes
  • Similar to the previous architecture

  13. 12 Combining the Tasks
  • Joint optimization problem
  • Information from the linear outputs and from the weights is complementary:
    - Topic clustering: weight peaks → representative words; projections → clustering information
    - Document prototype search: weight peaks → clustering information; projections → index terms
  • Review the separating weights on D: (W^(2))^T E^-1 U^T

  14. 13 Combining the Tasks, continued
  • Whitening allows inspection but isn't practical → normalize variance along the K principal directions: D' := U E^-1 U^T D
  • Find a new unmixing matrix W^(2') to maximize G((W^(2'))^T U^T D') = ... = G((W^(2'))^T X^(2)) → W^(2') = W^(2)
  • Solve the relation: (W^(2))^T U^T = S^(1) ↔ (W^(1))^T U^T = S^(1)
  • Rewrite the objective → concatenate the data: [U^T, V^T] → W^(1) = W^(2) = W

  15. 14 Combining the Tasks, continued
  • Resultant algorithm: O(2K^2(T + N) + K(T + N) + 4K^3)
  • Inputs: D, K
    1. Decompose D with the Lanczos algorithm; retain the K first singular values → obtain U, E, V.
    2. Let X = [U^T, V^T].
    3. Iterate until convergence.
  • Outputs: S ∈ ℝ^(K×(T+N)), W ∈ ℝ^(K×K)
  • S: [T document prototypes | N topic-features]; W: structure information of the identified topics in the corpus
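
The three steps above can be sketched end to end; this is a simplified reading of the combined algorithm, with NumPy's dense SVD standing in for the Lanczos decomposition, a third-moment fixed-point update standing in for the exact iteration (which the slides don't spell out), and symmetric orthogonalization enforcing the orthogonality of W:

```python
import numpy as np

def skew_ica_lsa(D, K, n_iter=30, seed=0):
    """Sketch: truncated SVD of D, then skewness-driven ICA with an
    orthogonal W on the concatenated data X = [U^T, V^T]."""
    rng = np.random.default_rng(seed)
    # Step 1: decompose D, retain the K leading singular triplets.
    U, e, Vt = np.linalg.svd(D, full_matrices=False)
    U_K, V_K = U[:, :K], Vt[:K, :].T
    # Step 2: concatenate the document-side and word-side data.
    X = np.hstack([U_K.T, V_K.T])            # K x (T + N)
    # Step 3: iterate a third-moment fixed point with orthogonalization.
    W = rng.standard_normal((K, K))
    for _ in range(n_iter):
        Y = W.T @ X
        W = (X @ (Y.T ** 2)) / X.shape[1]     # column k <- E[x s_k^2]
        u, _, vt = np.linalg.svd(W)
        W = u @ vt                            # W <- W (W^T W)^(-1/2)
    S = W.T @ X   # [T document prototypes | N topic-features]
    return S, W

D = np.abs(np.random.default_rng(3).standard_normal((40, 25)))
S, W = skew_ica_lsa(D, K=3)
print(S.shape, W.shape)   # (3, 65) and (3, 3)
```

The cost per iteration is dominated by the K × (T + N) matrix products, matching the stated linear dependence on T + N.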

  16. 15 Simulations
  • Simulation 1: newsgroup data ('sci.crypt', 'sci.med', 'sci.space', 'soc.religion.christian')
  [Table: per topic, the 10 most representative words selected by the algorithm vs. the 10 most frequent words; stemmed examples include 'kei', 'encrypt', 'clipper', 'escrow' (sci.crypt); 'medic', 'doctor', 'patient' (sci.med); 'space', 'orbit', 'nasa', 'launch' (sci.space); 'christian', 'god', 'church', 'bibl' (soc.religion.christian)]
  • The selected words are conformal with human labeling
  • Simulation 2: 10 most representative words per topic, using 5 topics (I-V) and 2 document classes ('sci.space', 'soc.religion.christian')
  [Table: representative stemmed words for topics I-V, e.g. 'space', 'nasa', 'shuttl', 'orbit', 'launch'; 'god', 'church', 'christian', 'jesu', 'bibl'; 'sex', 'issu', 'moral'-related terms]

  17. 16 Conclusions
  [Figure: dependency structure of the splitting in simulation 2: sci.space splits into space shuttle design (IV) and space shuttle mission (III); soc.religion.christian splits into christian church (I), christian religion (II), and christian morality (V)]
  • Clustering and keyword identification by an ICA variant that maximizes skewness
  • Key assumption: an asymmetrical latent prior
  • The joint problem (D and D^T) is solved → 'spatio-temporal' ICA
  • The algorithm is linear in the number of documents, O(K^2 N)
  • Fast convergence (3-8 steps)
  • The potential number of topics can be greater than indicated by a human labeler → discover subtopics
  • Hierarchical partitioning is possible (recursive binary splits)

  18. 17 Further Work
  [Figure: scatter plots 1-3 of the projected documents, marked by class: 'sci.crypt' (x), 'sci.space' (o), 'sci.med', 'soc.religion.christian' (·)]
  • Study links with other methods → improve flexibility
  • Or develop a mechanism to allow a more structured representation, in a mixed or hierarchical manner
  • For example: build model estimation into the algorithm
  • Relax the equal w_k norm assumption
