Document clustering




Document clustering

Document Clustering

Content:

  • Document Clustering Essentials

  • Text Clustering Architecture

  • Preprocessing

  • Different Document Models

    • Probabilistic model

    • Vector space model (VSM)

    • Ontology-based VSM

  • Document Representation

  • Ontology and Semantic Enhancement of Presentation Models

    • Ontology-based VSM


Document clustering essentials

Document Clustering Essentials


Text clustering architecture

Text Clustering Architecture

[Architecture diagram. Pipeline stages: Text Documents, Preprocessing, VSM Data Modeling, Clustering, Clusters. Supporting components: Clustering Algorithms, Ontology, Tf-Idf, BOW (Bag of Words).]
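
The BOW and Tf-Idf components named in the architecture can be illustrated with a minimal sketch. The slides do not say which tf-idf variant is used, so the weighting below (raw term count times log(N / df)) is an assumption, and the toy corpus and the function names `bag_of_words` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Turn each whitespace-tokenized document into a term -> count mapping."""
    return [Counter(doc.split()) for doc in docs]


def tf_idf(docs):
    """Weight each term count by log(N / df), one common tf-idf variant (assumed)."""
    bows = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for bow in bows for term in bow)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in bow.items()}
        for bow in bows
    ]


# Hypothetical, already preprocessed toy corpus.
docs = ["cluster document cluster", "document model", "ontology model"]
weights = tf_idf(docs)
for doc, vector in zip(docs, weights):
    print(doc, "->", vector)
```

Terms that occur in few documents ("cluster", "ontology") receive higher weights than terms shared across documents ("document", "model"), which is what makes the resulting vectors useful for separating clusters.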


Document clustering

Preprocessing

  • Removing Stopwords

  • Stopwords

    • Function words and connectives

    • Appear in a large number of documents and are of little use in describing the characteristics of documents.

  • Example

    • Stopwords: “of”, “a”, “by”, “and”, “the”, “instead”

    • Sentence after stopword removal:

      • “direct prediction continuous output variable method discretizes variable kMeans clustering solves resultant classification problem”
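
Stopword removal as described on this slide can be sketched as follows. The stopword set is the one listed on the slide; the input sentence is an assumption, reconstructed so that filtering it yields the slide's example, and `remove_stopwords` is a hypothetical helper name.

```python
import re

# Stopword list taken from the slide.
STOPWORDS = {"of", "a", "by", "and", "the", "instead"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Tokenize on alphabetic runs and drop stopwords (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Assumed original sentence; the slide shows only the filtered result.
sentence = (
    "Instead of direct prediction of a continuous output variable, "
    "the method discretizes the variable by kMeans clustering "
    "and solves the resultant classification problem"
)
filtered = " ".join(remove_stopwords(sentence))
print(filtered)
```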


Document clustering

Preprocessing

  • Stemming

    • Remove inflections that convey part of speech and tense.

  • Techniques

    • Morphological analysis (e.g., Porter’s algorithm)

    • Dictionary lookup (e.g., WordNet)

  • Stems:

    • “prediction ---> predict”

    • “discretizes ---> discretize”

    • “kMeans ---> kMean”

    • “clustering ---> cluster”

    • “solves ---> solve”

    • “classification ---> classify”

  • Example sentence

    • “direct predict continuous output variable method discretize variable kMean cluster solve resultant classify problem”
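
The dictionary-lookup technique can be sketched with a small hand-built table. The table below contains only the word-to-stem pairs shown on the slide and stands in for a real lexical resource such as WordNet; `STEM_LOOKUP` and `stem_tokens` are hypothetical names. A rule-based stemmer such as Porter's algorithm would generally produce different, more aggressively truncated stems.

```python
# Word -> stem pairs copied from the slide; a stand-in for a real dictionary.
STEM_LOOKUP = {
    "prediction": "predict",
    "discretizes": "discretize",
    "kMeans": "kMean",
    "clustering": "cluster",
    "solves": "solve",
    "classification": "classify",
}


def stem_tokens(tokens, lookup=STEM_LOOKUP):
    """Replace each token by its stem if listed, otherwise keep it unchanged."""
    return [lookup.get(tok, tok) for tok in tokens]


text = (
    "direct prediction continuous output variable method discretizes "
    "variable kMeans clustering solves resultant classification problem"
)
stemmed = " ".join(stem_tokens(text.split()))
print(stemmed)
```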


Different document models

Different Document Models


Probabilistic and vector space model

Probabilistic and Vector Space Model


Document representation

Document Representation


Document clustering

Ontology and Semantic Enhancement of Presentation Models

  • Represent unstructured data (text documents) according to an ontology repository

    • Each term in a vector is a concept rather than only a word or phrase

    • Determine the similarity of documents

  • Methods to Represent Ontology

    • Terminological ontology

    • Synonyms: several words for the same concept

      • employee (HR)=staff (Administration)=researcher (R&D)

      • car=automobile

    • Homonyms: one word with several meanings

      • bank: river bank vs. financial bank

      • fan: cooling system vs. sports fan
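
The effect of treating vector terms as concepts can be sketched as follows. The synonym table uses the slide's examples, but its form, the canonical concept names, and the helper names `concept_vector` and `cosine` are assumptions. Homonyms such as "bank" would additionally need word-sense disambiguation, which this sketch does not attempt.

```python
import math
from collections import Counter

# Hypothetical terminological ontology: each word maps to one concept.
CONCEPT_OF = {
    "employee": "employee", "staff": "employee", "researcher": "employee",
    "car": "car", "automobile": "car",
}


def concept_vector(tokens, concept_of=CONCEPT_OF):
    """Count concepts instead of surface words; unknown words map to themselves."""
    return Counter(concept_of.get(tok, tok) for tok in tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)  # Counter returns 0 for missing terms
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


doc_a = "staff car".split()
doc_b = "employee automobile".split()

word_sim = cosine(Counter(doc_a), Counter(doc_b))
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))
print(word_sim, concept_sim)
```

At the word level the two documents share no terms, so their similarity is zero; at the concept level they are identical.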


Document clustering

Ontology-based VSM

Each element of a document vector, taking the ontology into account, combines two quantities: $x_{ji_1}$, the original frequency of term $t_{i_1}$ in the $j$th document, and the semantic similarity between term $t_{i_1}$ and term $t_{i_2}$.

Advantages:

Considers the relationships between terms.

Introduces semantic concepts into data models.

Combines ontology with the traditional VSM.

Terms (dimensions) have semantic relationships rather than being treated as independent.
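
One way to combine the two quantities defined above is an additive form, $\tilde{x}_{ji_1} = x_{ji_1} + \sum_{i_2 \neq i_1} \delta_{i_1 i_2}\, x_{ji_2}$, in which each term's frequency is increased by the similarity-weighted frequencies of the other terms. This form, the symbol $\delta_{i_1 i_2}$ for the semantic similarity, and the similarity values below are assumptions made for illustration, not something the slide states. A minimal sketch under those assumptions:

```python
def ontology_weights(freqs, similarity):
    """Assumed additive re-weighting of one document's term frequencies.

    freqs[i] is the original frequency of term i in the document;
    similarity[i1][i2] is the semantic similarity between terms i1 and i2.
    """
    m = len(freqs)
    return [
        freqs[i1]
        + sum(similarity[i1][i2] * freqs[i2] for i2 in range(m) if i2 != i1)
        for i1 in range(m)
    ]


# Hypothetical vocabulary: ["car", "automobile", "bank"].
similarity = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
doc = [2, 0, 1]  # "car" twice, "bank" once, "automobile" never
adjusted = ontology_weights(doc, similarity)
print(adjusted)
```

Although "automobile" never occurs in the document, it receives a non-zero weight through its similarity to "car", so a second document that uses only "automobile" would no longer appear unrelated.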


References

References

  • S.E. Robertson and K.S. Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3): 129-146, 1976.

  • Joshua Zhexue Huang, Michael Ng, and Liping Jing. Text Clustering: Algorithms, Semantics and Systems. PAKDD06 Tutorial, April 9, 2006, Singapore. (Huang and Jing: The University of Hong Kong; Ng: Hong Kong Baptist University.)

  • Hewijin Christine Jiau, Yi-Jen Su, Yeou-Min Lin, and Shang-Rong Tsai. MPM: a hierarchical clustering algorithm using matrix partitioning method for non-numeric data. J Intell Inf Syst, 26: 185–207, 2006. DOI 10.1007/s10844-006-0250-2.

  • Michael W. Berry and Malu Castellanos. Survey of Text Mining: Clustering, Classification, and Retrieval, Second Edition. Springer, 2007.

  • A. Hotho, S. Staab, and G. Stumme. Text Clustering based on background knowledge. TR425, AIFB, Germany, 2003.

