1 / 11

Document Clustering

Document Clustering. Content: Document Clustering Essentials. Text Clustering Architecture Preprocessing Different Document Models Probabilistic model Vector space model (VSM) Ontology-based VSM Document Representation Ontology and Semantic Enhancement of Presentation Models

sofia
Download Presentation

Document Clustering

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Document Clustering Content: Document Clustering Essentials. Text Clustering Architecture Preprocessing Different Document Models Probabilistic model Vector space model (VSM) Ontology-based VSM Document Representation Ontology and Semantic Enhancement of Presentation Models Ontology-based VSM

  2. Document Clustering Essentials

  3. Text Clustering Architecture Text Documents Clustering Algorithms Preprocessing VSM Data Modeling Clustering Clusters Ontology Tf-Idf0 BOW (Bag of Visual Words)

  4. Preprocessing • Removing Stopwords • Stopwords • Function words and connectives • Appear in a large number of documents and have little use in • describing the characteristics of documents. • Example • Removing Stopwords • Stopwords: • “of”, “a”, “by”, “and” , “the”, “instead” • Example • “direct prediction continuous output variable method discretizes variable kMeans clustering solves resultant classification problem”

  5. Preprocessing • Stemming • Remove inflections that convey parts of speech, tense. • Techniques • Morphological analysis (e.g., Porter’s algorithm) • Dictionary lookup (e.g., WordNet) • Stems: • “prediction --->predict” • “discretizes --->discretize” • “kMeans ---> kMean” • “clustering --> cluster” • “solves ---> solve” • “classification ---> classify” • Example sentence • “direct predict continuous output variable method discretize • variable kMean cluster solve resultant classify problem”

  6. Different Document Models

  7. Probabilistic and Vector Space Model

  8. Document Representation

  9. Ontology and Semantic Enhancement of Presentation Models • Represent unstructured data (text documents) according to • ontology repository • Each term in a vector is a concept rather than only a word or phrase • Determine the similarity of documents • Methods to Represent Ontology • Terminological ontology • Synonyms: several words for the same concept • employee (HR)=staff (Administration)=researcher (R&D) • car=automobile • Homonyms: one word with several meanings • bank: river bank vs. financial bank • fan: cooling system vs. sports fan

  10. Ontology-based VSM Each element of a document vector considering ontology is represented by: Where Xji1 is the original frequency of ti1 term in the jth document, is the semantic similarity between ti1 term and ti2 term . Advantage: Consider the relationship between terms. Introduce semantic concepts into data models. Combine ontology with the traditional VSM. Terms (dimensions) have semantic relationships, rather than independent

  11. References • S.E. Robertson and K.S. Jones. Relevance weighting of search terms. Journal of the American society for Information Sciences, 27(3): 129-146, 1976. • Joshua Zhexue Huang1 & Michael Ng.2 & Liping Jing1,”Text Clustering: Algorithms, Semantics and Systems”,1 The University of Hong Kong, 2 Hong Kong Baptist University, PAKDD06 Tutorial, April 9, 2006, Singapore. • Hewijin Christine Jiau & Yi-Jen Su & Yeou-Min Lin & Shang-Rong Tsai, “MPM: a hierarchical clustering algorithm using matrix partitioning method for non-numeric data”, J IntellInfSyst (2006) 26: 185–207, DOI 10.1007/s10844-006-0250-2. • Michael W. Berry and MaluCastellanos; urvey of Text Mining: Clustering, Classification, and Retrieval, Second Edition;Springer; 2007; • A. Hotho, S. Staab, and G.Stumme, Text Clustering based on • background knowledge, TR425, AIFB, German, 2003

More Related