## Recent Trends in Text Mining


**Recent Trends in Text Mining**
Girish Keswani, gkeswani@micron.com

**Text Mining?**
• What?
  • Data Mining on Text Data
• Why?
  • Information Retrieval
  • Confusion Set Disambiguation
  • Topic Distillation
• How?
  • Data Mining

**Organization**
• Text Mining Algorithms
• Jargon Used
• Background
  • Data Modeling,
  • Text Classification, and
  • Text Clustering
• Applications
• Experiments {NBC, NN and ssFCM}
• Further Work
• References

**Text Mining Algorithms**
• Classification Algorithms
  • Naïve Bayes Classifier
  • Decision Trees
  • Neural Networks
• Clustering Algorithms
  • EM Algorithms
  • Fuzzy

**Jargon**
• DM: Data Mining
• IR: Information Retrieval
• NBC: Naïve Bayes Classifier
• EM: Expectation Maximization
• NN: Neural Networks
• ssFCM: Semi-Supervised Fuzzy C-Means
• Labeled Data (Training Data)
• Unlabeled Data
• Test Data

**Background: Modeling**
• Vector Space Model

**Background: Modeling**
• Generative Models of Data [13]: Probabilistic
  “to generate a document, a class is first selected based on its prior probability and then a document is generated using the parameters of the chosen class distribution”
• NBC and EM Algorithms are based on this model

**Importance of Unlabeled Data?**
• Provides access to the feature distribution in set F using joint probability distributions

[Figure: diagram of Labeled Data, Unlabeled Data, and Test Data, with regions labeled A through G]

**Experimental Results [1]**
• Using NBC, EM and ssFCM

**Experimental Results [2]**
• Using NBC and EM

**Extensions and Variants of these approaches**
• Authors in [6] propose a concept of Class Distribution Constraint matrix
  • Results on Confusion Set Disambiguation
• Automatic Title Generation [7]:
  • Using EM Algorithm
  • Non-extractive approach

**Relational Data [9]**
• A collection of data in which the relations between entities are explained is known as relational data
• Probabilistic Relational Models

**Commercial Use/Products**
• IBM Text Analyzer [11]: Decision Tree Based
• SAS Text Miner [12]: Singular Value Decomposition
• Filtering Junk Email: Hotmail, Yahoo
• Advanced Search Engines

**Experiments**
• NBC
  • Naïve Bayes Classifier
  • Probabilistic
• NN
  • Neural Networks
• ssFCM
  • Semi-Supervised Fuzzy Clustering
  • Fuzzy

**Datasets (20 Newsgroups Data)**
• Sampling I:
• Sampling II:

[Figure: Raw Data converted to Vectors by Sampling I and by Sampling II]

**NBC**

[Figure: NBC results for Sample25 and Sample30]

**Further Work**
• Ensemble of Classifiers [16]

**Further Work**
• Knowledge Gathering from Experts
• E.g. 3-class data: Input Data {C1, C2, C3}

[Figure: a Classifier trained on classes C1, C2, and C3 assigning a label to unknown Test Data]

**References**
[1] “Text Classification using Semi-Supervised Fuzzy Clustering,” Girish Keswani and L. O. Hall, IEEE WCCI 2002.
[2] “Using Unlabeled Data to Improve Text Classification,” Kamal Paul Nigam.
[3] “Text Classification from Labeled and Unlabeled Documents using EM,” Kamal Paul Nigam et al.
[4] “The Value of Unlabeled Data for Classification Problems,” Tong Zhang.
[5] “Learning from Partially Labeled Data,” Martin Szummer et al.
[6] “Training a Naïve Bayes Classifier via the EM Algorithm with a Class Distribution Constraint,” Yoshimasa Tsuruoka and Jun’ichi Tsujii.
[7] “Automatic Title Generation using EM,” Paul E. Kennedy and Alexander G. Hauptmann.
[8] “Unlabeled Data can degrade Classification Performance of Generative Classifiers,” Fabio G. Cozman and Ira Cohen.
[9] “Probabilistic Classification and Clustering in Relational Data,” Ben Taskar et al.
[10] “Using Clustering to Boost Text Classification,” Y. C. Fang et al.
[11] IBM Text Analyzer: “A decision-tree-based symbolic rule induction system for text categorization,” D. E. Johnson et al.
[12] “SAS Text Miner,” Reincke.
[13] “Pattern Recognition,” Duda and Hart, 2000.
[14] “Machine Learning,” Tom Mitchell.
[15] “Data Mining,” Margaret Dunham.
[16] http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/
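
The generative model quoted under Background: Modeling (pick a class from its prior, then generate the document from that class's distribution) is what a Naïve Bayes Classifier inverts at prediction time. The following is a minimal sketch of that idea, not the implementation used in the experiments: the toy corpus, the function names `train_nbc` and `classify`, the multinomial word model, and the Laplace smoothing constant `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nbc(docs, labels, alpha=1.0):
    """Estimate log priors and per-class log word probabilities.

    Assumes a multinomial bag-of-words model with Laplace smoothing
    (alpha); neither choice is stated in the slides.
    """
    vocab = sorted({word for doc in docs for word in doc.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    # Prior: fraction of training documents in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}

    # Likelihood: smoothed relative frequency of each word within a class.
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_lik


def classify(doc, log_prior, log_lik):
    """Return the class with the highest posterior score for doc."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus standing in for labeled training data.
docs = [
    "goal match team",
    "team win match",
    "stock market price",
    "market price fall",
]
labels = ["sport", "sport", "finance", "finance"]

log_prior, log_lik = train_nbc(docs, labels)
print(classify("team match tonight", log_prior, log_lik))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, which matters for documents much longer than these toy examples.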