Hierarchical Document Clustering using Frequent Itemsets (Benjamin C. M. Fung, Ke Wang, Martin Ester, SDM 2003)


Presentation Transcript


1. Hierarchical Document Clustering using Frequent Itemsets
Benjamin C. M. Fung, Ke Wang, Martin Ester, SDM 2003
Presentation: Serhiy Polyakov, DSCI 5240, Fall 2005

2. Introduction
Applications of document clustering:
• web mining
• search engines
• information retrieval
• topological analysis
Special requirements for document clustering:
• high dimensionality
• high volume of data
• ease of browsing
• meaningful cluster labels

3. Problem statement
Problems with some standard clustering techniques:
• the number of clusters is unknown
• the sizes of the clusters vary greatly
Suggested approach: Frequent Itemset-based Hierarchical Clustering (FIHC):
• reduced dimensionality
• high clustering accuracy
• number of clusters as an optional input parameter
• easy to browse, with meaningful cluster descriptions

4. Algorithm
FIHC preprocessing steps:
• stop-word removal
• stemming of the document set
• each document is represented by a vector of the frequencies of the remaining items within the document
FIHC's two main steps:
• Constructing initial clusters: for each global frequent itemset, construct an initial cluster containing all the documents that contain that itemset (a sketch follows below)
• Making clusters disjoint: after this step, each document belongs to exactly one cluster
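A minimal Python sketch of the preprocessing and the initial-cluster construction, under simplifying assumptions: the stop-word list, the toy suffix-stripping "stemmer", the min_support threshold, and the brute-force itemset enumeration are illustrative placeholders, not the paper's actual implementation.

```python
from collections import Counter
from itertools import combinations

# Illustrative stop-word subset; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def preprocess(text):
    """Stop-word removal plus a toy suffix-stripping 'stemmer' (placeholder)."""
    words = [w.lower().strip(".,;:") for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    return [w[:-1] if w.endswith("s") else w for w in words]   # crude stemming

def doc_vectors(docs):
    """Each document becomes a frequency vector (Counter) of its remaining items."""
    return [Counter(preprocess(d)) for d in docs]

def global_frequent_itemsets(vectors, min_support=0.5, max_size=2):
    """Brute-force mining of itemsets appearing in >= min_support of the documents."""
    n = len(vectors)
    vocab = sorted({t for v in vectors for t in v})
    frequent = {}
    for k in range(1, max_size + 1):
        for items in combinations(vocab, k):
            support = sum(all(t in v for t in items) for v in vectors) / n
            if support >= min_support:
                frequent[items] = support
    return frequent

def initial_clusters(vectors, frequent):
    """One initial cluster per global frequent itemset: all docs that contain it."""
    return {items: [i for i, v in enumerate(vectors) if all(t in v for t in items)]
            for items in frequent}

docs = ["apple banana apple", "banana cherry", "apple cherry banana"]
vecs = doc_vectors(docs)
clusters = initial_clusters(vecs, global_frequent_itemsets(vecs))
print(clusters)   # overlapping clusters keyed by their frequent-itemset label
```

Note that the initial clusters overlap: a document containing several global frequent itemsets appears in several clusters, which is exactly what the next step resolves.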

5. Example of disjoint clusters
The cluster label is the set of mandatory items for the cluster: every document in the cluster must contain every item in the cluster label. A sketch of the disjoint-assignment step follows.
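A minimal sketch of making the clusters disjoint, reusing the document representation above. Assumption to note: FIHC scores a document against a cluster using cluster-frequent and global frequent items; the simplified score below (summed frequencies of the label items, with longer labels preferred on ties) is a stand-in, not the paper's formula.

```python
from collections import Counter

def best_cluster(doc_vector, clusters):
    """Pick the single 'best' initial cluster for a document.
    Simplified stand-in score: sum of the document's frequencies of the label
    items, preferring longer (more specific) labels on ties."""
    candidates = [label for label in clusters
                  if all(item in doc_vector for item in label)]
    return max(candidates,
               key=lambda label: (sum(doc_vector[i] for i in label), len(label)))

def make_disjoint(vectors, clusters):
    """After this step, each document index appears in exactly one cluster."""
    disjoint = {label: [] for label in clusters}
    for i, v in enumerate(vectors):
        disjoint[best_cluster(v, clusters)].append(i)
    return {label: docs for label, docs in disjoint.items() if docs}

# Toy data matching the earlier sketch.
vecs = [Counter("apple banana apple".split()),
        Counter("banana cherry".split()),
        Counter("apple cherry banana".split())]
clusters = {("apple",): [0, 2], ("banana",): [0, 1, 2], ("cherry",): [1, 2],
            ("apple", "banana"): [0, 2], ("banana", "cherry"): [1, 2]}
print(make_disjoint(vecs, clusters))
# e.g. {('apple', 'banana'): [0, 2], ('banana', 'cherry'): [1]}
```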

6. Building the cluster tree
The set of clusters produced by the previous stage can be viewed as a set of topics and subtopics in the document set. A cluster (topic) tree is constructed based on the similarity among clusters. The topic of a parent cluster is more general than the topic of a child cluster, and the two are "similar" to a certain degree.
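A sketch of one way to realize this step: each cluster labeled by a k-itemset is attached under a cluster labeled by one of its (k-1)-subsets. The paper defines its own inter-cluster similarity for choosing among candidate parents; the Jaccard overlap of the initial (pre-disjoint) member sets used below is only an assumed stand-in.

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_tree(disjoint_clusters, initial_clusters):
    """Attach each k-itemset cluster under a (k-1)-itemset cluster whose label
    is a subset of its own, picking the 'most similar' candidate parent
    (stand-in similarity: Jaccard overlap of the initial member sets)."""
    tree = {label: [] for label in disjoint_clusters}   # label -> child labels
    roots = []
    for label in disjoint_clusters:
        parents = [p for p in combinations(label, len(label) - 1)
                   if p in disjoint_clusters]
        if len(label) == 1 or not parents:
            roots.append(label)          # 1-itemset clusters sit at the top level
            continue
        best = max(parents, key=lambda p: jaccard(initial_clusters[label],
                                                  initial_clusters[p]))
        tree[best].append(label)
    return roots, tree

# Toy call reusing the overlapping clusters from the earlier sketch for both roles.
clusters = {("apple",): [0, 2], ("banana",): [0, 1, 2], ("cherry",): [1, 2],
            ("apple", "banana"): [0, 2], ("banana", "cherry"): [1, 2]}
roots, tree = build_tree(clusters, clusters)
print(roots)   # [('apple',), ('banana',), ('cherry',)]
print(tree)    # ('apple',) gets ('apple', 'banana'); ('cherry',) gets ('banana', 'cherry')
```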

7. Tree structure vs. browsing
• A deep hierarchy tree, as produced by other methods, may not be suitable for browsing.
• A flat hierarchy reduces the number of navigation steps, which in turn reduces the chance for a user to make mistakes.
• If a hierarchy is too flat, a parent topic may contain too many subtopics, which increases the time and difficulty for the user to locate her target.
• A balance between the depth and width of the tree is essential for browsing.

8. Evaluation
• Evaluation has been performed in terms of the F-measure.
• The following aspects have been evaluated: sensitivity to parameters, efficiency, and scalability.
• The following competitors have been considered: UPGMA, bisecting k-means, and the frequent itemset-based algorithm HFTC.
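For reference, the clustering F-measure commonly used in this line of work treats each natural class as a query and each cluster as a result set. This is the standard definition, assumed here to match the paper's usage:

```latex
% Standard clustering F-measure (assumed to match the paper's usage):
% n_{ij} = documents of class i that fall in cluster j,
% n_i = size of class i, n_j = size of cluster j, n = total number of documents.
\[
P(i,j) = \frac{n_{ij}}{n_j}, \qquad
R(i,j) = \frac{n_{ij}}{n_i}, \qquad
F(i,j) = \frac{2\, P(i,j)\, R(i,j)}{P(i,j) + R(i,j)}
\]
\[
F = \sum_{i} \frac{n_i}{n} \, \max_{j} F(i,j)
\]
```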

9. Conclusion
The FIHC approach proposed in the paper outperforms its competitors in terms of accuracy, efficiency, and scalability.
