Recent Trends in Text Mining

Presentation Transcript

  1. Recent Trends in Text Mining Girish Keswani gkeswani@micron.com

  2. Text Mining? • What? Data mining on text data • Why? Information retrieval, confusion set disambiguation, topic distillation • How? Data mining

  3. Organization • Text Mining Algorithms • Jargon Used • Background: data modeling, text classification, and text clustering • Applications • Experiments {NBC, NN and ssFCM} • Further Work • References

  4. Text Mining Algorithms • Classification algorithms: Naïve Bayes Classifier, Decision Trees, Neural Networks • Clustering algorithms: EM Algorithms, Fuzzy

  5. Jargon • DM: Data Mining • IR: Information Retrieval • NBC: Naïve Bayes Classifier • EM: Expectation Maximization • NN: Neural Networks • ssFCM: Semi-Supervised Fuzzy C-Means • Labeled Data (Training Data) • Unlabeled Data • Test Data

  6. Background: Modeling • Vector Space Model

  7. Background: Modeling • Generative Models of Data [13] (probabilistic): “to generate a document, a class is first selected based on its prior probability and then a document is generated using the parameters of the chosen class distribution” • NBC and EM algorithms are based on this model
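The quoted generative process can be sketched as below. This is an illustration under stated assumptions, not the deck's own code: the class priors, the per-class word distributions, and the fixed document length are all made-up toy values.

```python
import random

def generate_document(priors, word_probs, length, rng):
    """Pick a class from its prior, then draw words from that class's distribution."""
    classes = list(priors)
    label = rng.choices(classes, weights=[priors[c] for c in classes], k=1)[0]
    words = list(word_probs[label])
    doc = rng.choices(words, weights=[word_probs[label][w] for w in words], k=length)
    return label, doc

# Hypothetical parameters for two classes
priors = {"sport": 0.6, "politics": 0.4}
word_probs = {
    "sport": {"goal": 0.5, "match": 0.4, "vote": 0.1},
    "politics": {"vote": 0.6, "senate": 0.3, "match": 0.1},
}

rng = random.Random(0)  # seeded so repeated runs give the same sample
label, doc = generate_document(priors, word_probs, length=5, rng=rng)
```

A classifier built on this model runs the process in reverse: given a document, it asks which class was most likely to have generated it.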

  8. Importance of Unlabeled Data? Provides access to the feature distribution in set F using joint probability distributions. [Figure: diagram of labeled data, unlabeled data, and test data, with regions labeled A to G]

  9. How to make use of Unlabeled Data?

  10. How to make use of Unlabeled Data?
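The content of these two slides was not captured in the transcript. One approach named elsewhere in the deck (slide 12 and reference [3]) is to combine NBC with EM. The sketch below is a simplified illustration of that idea, assuming a multinomial Naïve Bayes model with Laplace smoothing and documents given as token lists; all function names are hypothetical.

```python
import math
from collections import defaultdict

def train_nb(docs, soft_labels, classes, vocab):
    """M-step: estimate log priors and log word probabilities from fractional labels."""
    prior = {c: 1.0 for c in classes}                  # Laplace smoothing on priors
    counts = {c: defaultdict(float) for c in classes}
    for doc, weights in zip(docs, soft_labels):
        for c in classes:
            prior[c] += weights[c]
            for w in doc:
                counts[c][w] += weights[c]
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_word = {}
    for c in classes:
        denom = sum(counts[c].values()) + len(vocab)   # Laplace smoothing on words
        log_word[c] = {w: math.log((counts[c][w] + 1.0) / denom) for w in vocab}
    return log_prior, log_word

def posterior(doc, model, classes):
    """E-step for one document: P(class | doc) under the current model."""
    log_prior, log_word = model
    scores = {c: log_prior[c] + sum(log_word[c][w] for w in doc if w in log_word[c])
              for c in classes}
    top = max(scores.values())                         # subtract max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def em_naive_bayes(labeled, unlabeled, classes, iterations=10):
    """Start from labeled data only, then alternate E- and M-steps over all data."""
    vocab = {w for doc, _ in labeled for w in doc} | {w for doc in unlabeled for w in doc}
    lab_docs = [doc for doc, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = train_nb(lab_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [posterior(doc, model, classes) for doc in unlabeled]         # E-step
        model = train_nb(lab_docs + unlabeled, hard + soft, classes, vocab)  # M-step
    return model

# Toy data (made up): "match" and "senate" appear only in unlabeled documents
classes = ["sport", "politics"]
labeled = [(["ball", "goal"], "sport"), (["vote", "law"], "politics")]
unlabeled = [["ball", "match"], ["goal", "match"], ["vote", "senate"], ["law", "senate"]]
model = em_naive_bayes(labeled, unlabeled, classes)
```

In this toy data the words "match" and "senate" never occur in a labeled document, yet the final model associates them with a class because they co-occur with labeled words in the unlabeled documents.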

  11. Experimental Results [1] Using NBC, EM and ssFCM

  12. Experimental Results [2] Using NBC and EM

  13. Extensions and Variants of these approaches • The authors of [6] propose the concept of a class distribution constraint matrix, with results on confusion set disambiguation • Automatic title generation [7]: uses the EM algorithm; a non-extractive approach

  14. Relational Data [9] • Relational data is a collection of data in which the relations between entities are explicitly described • Probabilistic Relational Models

  15. Commercial Use/Products • IBM Text Analyzer [11]: decision tree based • SAS Text Miner [12]: singular value decomposition • Filtering junk email: Hotmail, Yahoo • Advanced search engines

  16. Applications: Search Engines

  17. Vivisimo Search Engine: (www.vivisimo.com)

  18. Experiments • NBC: Naïve Bayes Classifier (probabilistic) • NN: Neural Networks • ssFCM: Semi-Supervised Fuzzy Clustering (fuzzy)

  19. Datasets (20 Newsgroups Data) • Sampling I • Sampling II [Figure: raw data, Sampling I and Sampling II, and the resulting vectors]

  20. Naïve Bayes Classifier

  21. Naïve Bayes Classifier

  22. NBC [Figure panels: Sample25, Sample30]

  23. ssFCM

  24. ssFCM
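The ssFCM slides carry no text in this transcript. As a rough illustration of semi-supervised fuzzy c-means, the sketch below uses the standard fuzzy c-means update rules and simply clamps the memberships of labeled points to their known class. This clamping is one simple variant chosen for illustration; the formulation used in [1] may differ. It assumes every cluster has at least one labeled point, and all names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def update_centers(points, memberships, m):
    """Each center is the mean of all points weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for i in range(len(memberships[0])):
        weights = [u[i] ** m for u in memberships]
        total = sum(weights)
        centers.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                        for d in range(dim)])
    return centers

def update_memberships(points, centers, m):
    """Standard FCM membership update; each row sums to 1."""
    memberships = []
    for p in points:
        dists = [sq_dist(p, c) for c in centers]
        if 0.0 in dists:                       # point coincides with a center
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
        else:
            # distances are squared, so the exponent is 1/(m-1) rather than 2/(m-1)
            inv = [d ** (-1.0 / (m - 1)) for d in dists]
            s = sum(inv)
            memberships.append([v / s for v in inv])
    return memberships

def ssfcm(labeled, unlabeled, n_clusters, m=2.0, iterations=20):
    """labeled: list of (point, cluster_index); unlabeled: list of points."""
    lab_points = [p for p, _ in labeled]
    clamped = [[1.0 if i == y else 0.0 for i in range(n_clusters)] for _, y in labeled]
    centers = update_centers(lab_points, clamped, m)   # seed centers from labeled data
    points = lab_points + unlabeled
    for _ in range(iterations):
        free = update_memberships(unlabeled, centers, m)       # only unlabeled points move
        centers = update_centers(points, clamped + free, m)
    return centers, update_memberships(unlabeled, centers, m)

# Toy 2-D data (made up): one labeled point per cluster
labeled = [([0.0, 0.0], 0), ([10.0, 10.0], 1)]
unlabeled = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [10.0, 9.0]]
centers, memberships = ssfcm(labeled, unlabeled, n_clusters=2)
```

Unlike a hard clustering, each unlabeled point receives a degree of membership in every cluster, and the labeled points anchor which cluster corresponds to which class.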

  25. Further Work • Ensemble of Classifiers [16]
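The slide does not say which ensemble method is intended. As a minimal sketch of the general idea, the example below combines several classifiers by majority vote. The keyword-based classifiers are toy stand-ins invented for illustration; in practice they would be trained models such as NBC, NN, or ssFCM.

```python
from collections import Counter

def majority_vote(classifiers, document):
    """Return the label predicted by the most classifiers."""
    votes = [clf(document) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

def keyword_classifier(keyword, label, default):
    """Toy classifier: predicts `label` if `keyword` occurs in the document, else `default`."""
    return lambda doc: label if keyword in doc.split() else default

# Hypothetical ensemble of three weak, rule-based classifiers
classifiers = [
    keyword_classifier("goal", "sport", "politics"),
    keyword_classifier("match", "sport", "politics"),
    keyword_classifier("senate", "politics", "sport"),
]

prediction = majority_vote(classifiers, "late goal wins the match")
```

An odd number of voters avoids ties in the two-class case; with more classes or an even number of voters, a tie-breaking rule would need to be chosen explicitly.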

  26. Further Work • Knowledge gathering from experts • E.g. 3-class data: input data {C1, C2, C3} [Figure: input data in classes C1, C2, and C3, with test data passed to a classifier]

  27. References
      [1] “Text Classification using Semi-Supervised Fuzzy Clustering,” Girish Keswani and L.O. Hall, IEEE WCCI 2002 conference.
      [2] “Using Unlabeled Data to Improve Text Classification,” Kamal Paul Nigam.
      [3] “Text Classification from Labeled and Unlabeled Documents using EM,” Kamal Paul Nigam et al.
      [4] “The Value of Unlabeled Data for Classification Problems,” Tong Zhang.
      [5] “Learning from Partially Labeled Data,” Martin Szummer et al.
      [6] “Training a Naïve Bayes Classifier via the EM Algorithm with a Class Distribution Constraint,” Yoshimasa Tsuruoka and Jun’ichi Tsujii.
      [7] “Automatic Title Generation using EM,” Paul E. Kennedy and Alexander G. Hauptmann.
      [8] “Unlabeled Data can degrade Classification Performance of Generative Classifiers,” Fabio G. Cozman and Ira Cohen.
      [9] “Probabilistic Classification and Clustering in Relational Data,” Ben Taskar et al.
      [10] “Using Clustering to Boost Text Classification,” Y.C. Fang et al.
      [11] IBM Text Analyzer: “A decision-tree-based symbolic rule induction system for text categorization,” D.E. Johnson et al.
      [12] “SAS Text Miner,” Reincke.
      [13] “Pattern Recognition,” Duda and Hart, 2000.
      [14] “Machine Learning,” Tom Mitchell.
      [15] “Data Mining,” Margaret Dunham.
      [16] http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/