Document clustering
Download
1 / 11

Document Clustering - PowerPoint PPT Presentation


  • 151 Views
  • Uploaded on

Document Clustering. Content: Document Clustering Essentials. Text Clustering Architecture Preprocessing Different Document Models Probabilistic model Vector space model (VSM) Ontology-based VSM Document Representation Ontology and Semantic Enhancement of Presentation Models

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Document Clustering' - sofia


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Document clustering

Document Clustering

Content:

Document Clustering Essentials.

Text Clustering Architecture

Preprocessing

Different Document Models

Probabilistic model

Vector space model (VSM)

Ontology-based VSM

Document Representation

Ontology and Semantic Enhancement of Presentation Models

Ontology-based VSM



Text clustering architecture
Text Clustering Architecture

Text Documents

Clustering Algorithms

Preprocessing

VSM Data Modeling

Clustering

Clusters

Ontology

Tf-Idf0

BOW

(Bag of Visual Words)


Document clustering

Preprocessing

  • Removing Stopwords

  • Stopwords

    • Function words and connectives

    • Appear in a large number of documents and have little use in

    • describing the characteristics of documents.

  • Example

  • Removing Stopwords

    • Stopwords:

    • “of”, “a”, “by”, “and” , “the”, “instead”

    • Example

      • “direct prediction continuous output variable method discretizes variable kMeans clustering solves resultant classification problem”


Document clustering

Preprocessing

  • Stemming

    • Remove inflections that convey parts of speech, tense.

  • Techniques

    • Morphological analysis (e.g., Porter’s algorithm)

    • Dictionary lookup (e.g., WordNet)

  • Stems:

    • “prediction --->predict”

    • “discretizes --->discretize”

    • “kMeans ---> kMean”

    • “clustering --> cluster”

    • “solves ---> solve”

    • “classification ---> classify”

  • Example sentence

    • “direct predict continuous output variable method discretize

    • variable kMean cluster solve resultant classify problem”





Document clustering

Ontology and Semantic Enhancement of Presentation Models

  • Represent unstructured data (text documents) according to

  • ontology repository

    • Each term in a vector is a concept rather than only a word or phrase

    • Determine the similarity of documents

  • Methods to Represent Ontology

    • Terminological ontology

    • Synonyms: several words for the same concept

      • employee (HR)=staff (Administration)=researcher (R&D)

      • car=automobile

    • Homonyms: one word with several meanings

      • bank: river bank vs. financial bank

      • fan: cooling system vs. sports fan


Document clustering

Ontology-based VSM

Each element of a document vector considering ontology is represented by:

Where Xji1 is the original frequency of ti1 term in the jth document,

is the semantic similarity between ti1 term and ti2 term .

Advantage:

Consider the relationship between terms.

Introduce semantic concepts into data models.

Combine ontology with the traditional VSM.

Terms (dimensions) have semantic relationships, rather than independent


References
References

  • S.E. Robertson and K.S. Jones. Relevance weighting of search terms. Journal of the American society for Information Sciences, 27(3): 129-146, 1976.

  • Joshua Zhexue Huang1 & Michael Ng.2 & Liping Jing1,”Text Clustering: Algorithms, Semantics and Systems”,1 The University of Hong Kong, 2 Hong Kong Baptist University, PAKDD06 Tutorial, April 9, 2006, Singapore.

  • Hewijin Christine Jiau & Yi-Jen Su & Yeou-Min Lin & Shang-Rong Tsai, “MPM: a hierarchical clustering algorithm using matrix partitioning method for non-numeric data”, J IntellInfSyst (2006) 26: 185–207, DOI 10.1007/s10844-006-0250-2.

  • Michael W. Berry and MaluCastellanos; urvey of Text Mining: Clustering, Classification, and Retrieval, Second Edition;Springer; 2007;

  • A. Hotho, S. Staab, and G.Stumme, Text Clustering based on

  • background knowledge, TR425, AIFB, German, 2003