Integrated Instance- and Class-based Generative Modeling for Text Classification

Integrated Instance- and Class-based Generative Modeling for Text Classification

121 Views

Download Presentation
## Integrated Instance- and Class-based Generative Modeling for Text Classification

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Integrated Instance- and Class-based Generative Modeling for**Text Classification Antti PuurulaUniversity of Waikato Sung-HyonMyaeng KAIST 5/12/2013 Australasian Document Computing Symposium**Instance vs. Class-based Text Classification**• Class-based learning • Multinomial Naive Bayes, Logistic Regression, Support Vector Machines, … • Pros: compact models, efficient inference, accurate with text data • Cons: document-level information discarded • Instance-based learning • K-Nearest Neighbors, Kernel Density Classifiers, … • Pros: document-level information preserved, efficient learning • Cons: data sparsityreduces accuracy**Instance vs. Class-based Text Classification 2**• Proposal: Tied Document Mixture • integrated instance- and class-based model • retains benefits from both types of modeling • exact linear time algorithms for estimation and inference • Main ideas: • replace Multinomial class-conditional in MNB with a mixture over documents • smooth document models hierarchically with class and background models**Multinomial Naive Bayes**• Standard generative model for text classification • Result of simple generative assumptions • Bayes • Naive • Multinomial**Tied Document Mixture**• Replace Multinomial in MNB by a mixture over all documents • , where documents models are smoothed hierarchically • , where class models are estimated by averaging the documents**Tied Document Mixture 3**• Can be described as constraints on a two-level mixture • Document level mixture: • Number of components= • Components assigned to instances • Component weights= • Word level mixture: • Number of components= (hierarchy depth) • Components assigned to hierarchy • Component weights= , , and**Tied Document Mixture 3**• Can be described as constraints on a two-level mixture • Document level mixture: • Number of components= • Components assigned to instances • Component weights= • Word level mixture: • Number of components= (hierarchy depth) • Components assigned to hierarchy • Component weights= , , and**Tied Document Mixture 3**• Can be described as constraints on a two-level mixture • Document level mixture: • Number of components= • Components assigned to instances • Component weights= • Word level mixture: • Number of components= (hierarchy depth) • Components assigned to hierarchy • Component weights= , , and**Tied Document Mixture 4**• Can be described as a class-smoothed Kernel Density Classifier • Document mixture equivalent to a Multinomial kernel density • Hierarchical smoothing corresponds to mean shift or data sharpening with class-centroids**Hierarchical Sparse Inference**• Reduces complexity from to • Same complexity as K-Nearest Neighbors based on inverted indices (Yang, 1994)**Hierarchical Sparse Inference 2**• Precompile values: • decomposes: • Store and in inverted indices**Hierarchical Sparse Inference 3**• Compute first • Update by • Update by to get • Compute • Bayes rule**Hierarchical Sparse Inference 3**• Compute first • Update by • Update by to get • Compute • Bayes rule**Hierarchical Sparse Inference 3**• Compute first • Update by • Update by to get • Compute • Bayes rule**Hierarchical Sparse Inference 3**• Compute first • Update by • Update by to get • Compute • Bayes rule**Hierarchical Sparse Inference 3**• Compute first • Update by • Update by to get • Compute • Bayes rule**Hierarchical Sparse Inference 3**• Compute first • Update by • Update by to get • Compute • Bayes rule**Hierarchical Sparse Inference 2**• Compute first • Update by • Update by to get • Compute • Bayes rule**Experimental Setup**• 14 classification datasets used: • 3 spam classification • 3 sentiment analysis • 5 multi-class classification • 3 multi-label classification • Scripts and datasets in LIBSVM format: • http://sourceforge.net/projects/sgmweka/**Experimental Setup 2**• Classifiers compared: • Multinomial Naive Bayes (MNB) • Tied Document Mixture (TDM) • K-Nearest Neighbors(KNN) (Multinomial distance, distance-weighted vote) • Kernel Density Classifier (KDC) (Smoothed multinomial kernel) • Logistic Regression (LR, LR+) (L2-regularized) • Support Vector Machine (SVM, SVM+) (L2-regularized L2-loss) • LR+ and SVM+ weighted feature vectors by TFIDF • Smoothing parameters optimized for MicroFscore on held-out development sets using Gaussian Random Searches**Results**• Training times for MNB, TDM, KNN and KDC linear • At most 70 s for MNB on for OHSU-TREC, 170 s for the others • SVM and LR require iterative algorithms • At most 936 s, for LR on Amazon12 • Did not scale to multi-label datasets in practical times • Classification times for instance-based classifiers higher • At most mean 226 ms for TDM on OHSU-TREC, compared to 70 ms for MNB • (with 290k terms, 196k labels, 197k documents)**Results 2**TDM significantly improves on MNB, KNN and KDC Across comparable datasets, TDM is on par with SVM+ • SVM+ is significantly better on multi-class datasets • TDM is significantly better on spam classification**Results 2**TDM significantly improves on MNB, KNN and KDC Across comparable datasets, TDM is on par with SVM+ • SVM+ is significantly better on multi-class datasets • TDM is significantly better on spam classification**Results 3**TDM reduces classification errors compared to MNB by: >65% in spam classification >26% in sentiment analysis Some correlation between error reduction and number of instances/class. Task types form clearly separate clusters**Conclusion**• Tied Document Mixture • Integrated instance- and class-based model for text classification • Exact linear time algorithms, with same complexities as KNN and KDC • Accuracy substantially improved over MNB, KNN and KDC • Competitive with optimized SVM, depending on task type • Many improvements to the basic model possible • Sparse inference scales to hierarchical mixtures of >340k components • Toolkit, datasets and scripts available: • http://sourceforge.net/projects/sgmweka/**Sparse Inference**• Sparse Inference (Puurula, 2012) • Use inverted indices to reduce complexity of computing joint for a given • Instead of computing as dot products, compute and update by for each from the inverted index • Reduces joint inference time complexity from dense to**Sparse Inference 2**• Dense representation: = number of features Time complexity: = number of classes**Sparse Inference 3**• Sparse representation: Time complexity: