Hierarchical Document Clustering using Frequent Itemsets (Benjamin C. M. Fung, Ke Wang, Martin Ester, SDM 2003)


Presentation Transcript


1. Hierarchical Document Clustering using Frequent Itemsets
Benjamin C. M. Fung, Ke Wang, Martin Ester, SDM 2003
Presentation: Serhiy Polyakov, DSCI 5240, Fall 2005

2. Introduction
Applications of document clustering:
• web mining
• search engines
• information retrieval
• topological analysis
Special requirements for document clustering:
• high dimensionality
• high volume of data
• ease of browsing
• meaningful cluster labels

3. Problem statement
Problems with some standard clustering techniques:
• the number of clusters is unknown
• the sizes of the clusters vary greatly
Suggested approach: Frequent Itemset-based Hierarchical Clustering (FIHC):
• reduced dimensionality
• high clustering accuracy
• number of clusters as an optional input parameter
• easy to browse, with meaningful cluster descriptions

4. Algorithm
FIHC preprocessing steps:
• stop-word removal
• stemming of the document set
• each document is represented by a vector of the frequencies of the remaining items within the document
FIHC's two main steps:
• Constructing initial clusters: for each global frequent itemset, construct an initial cluster containing all the documents that contain that itemset (a sketch follows below)
• Making clusters disjoint: after this step, each document belongs to exactly one cluster
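A minimal Python sketch of the preprocessing and the initial-cluster construction, under simplifying assumptions: the stop-word list, the toy suffix-stripping "stemmer", the min_support threshold, and the brute-force itemset enumeration are illustrative placeholders, not the paper's actual implementation.

```python
from collections import Counter
from itertools import combinations

# Illustrative stop-word subset; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def preprocess(text):
    """Stop-word removal plus a toy suffix-stripping 'stemmer' (placeholder)."""
    words = [w.lower().strip(".,;:") for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    return [w[:-1] if w.endswith("s") else w for w in words]   # crude stemming

def doc_vectors(docs):
    """Each document becomes a frequency vector (Counter) of its remaining items."""
    return [Counter(preprocess(d)) for d in docs]

def global_frequent_itemsets(vectors, min_support=0.5, max_size=2):
    """Brute-force mining of itemsets appearing in >= min_support of the documents."""
    n = len(vectors)
    vocab = sorted({t for v in vectors for t in v})
    frequent = {}
    for k in range(1, max_size + 1):
        for items in combinations(vocab, k):
            support = sum(all(t in v for t in items) for v in vectors) / n
            if support >= min_support:
                frequent[items] = support
    return frequent

def initial_clusters(vectors, frequent):
    """One initial cluster per global frequent itemset: all docs that contain it."""
    return {items: [i for i, v in enumerate(vectors) if all(t in v for t in items)]
            for items in frequent}

docs = ["apple banana apple", "banana cherry", "apple cherry banana"]
vecs = doc_vectors(docs)
clusters = initial_clusters(vecs, global_frequent_itemsets(vecs))
print(clusters)   # overlapping clusters keyed by their frequent-itemset label
```

Note that the initial clusters overlap: a document containing several global frequent itemsets appears in several clusters, which is exactly what the next step resolves.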

5. Example of disjoint clusters
The cluster label is the set of mandatory items for the cluster: every document in the cluster must contain every item in the cluster label. A sketch of the disjoint-assignment step follows.
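A minimal sketch of making the clusters disjoint, reusing the document representation above. Assumption to note: FIHC scores a document against a cluster using cluster-frequent and global frequent items; the simplified score below (summed frequencies of the label items, with longer labels preferred on ties) is a stand-in, not the paper's formula.

```python
from collections import Counter

def best_cluster(doc_vector, clusters):
    """Pick the single 'best' initial cluster for a document.
    Simplified stand-in score: sum of the document's frequencies of the label
    items, preferring longer (more specific) labels on ties."""
    candidates = [label for label in clusters
                  if all(item in doc_vector for item in label)]
    return max(candidates,
               key=lambda label: (sum(doc_vector[i] for i in label), len(label)))

def make_disjoint(vectors, clusters):
    """After this step, each document index appears in exactly one cluster."""
    disjoint = {label: [] for label in clusters}
    for i, v in enumerate(vectors):
        disjoint[best_cluster(v, clusters)].append(i)
    return {label: docs for label, docs in disjoint.items() if docs}

# Toy data matching the earlier sketch.
vecs = [Counter("apple banana apple".split()),
        Counter("banana cherry".split()),
        Counter("apple cherry banana".split())]
clusters = {("apple",): [0, 2], ("banana",): [0, 1, 2], ("cherry",): [1, 2],
            ("apple", "banana"): [0, 2], ("banana", "cherry"): [1, 2]}
print(make_disjoint(vecs, clusters))
# e.g. {('apple', 'banana'): [0, 2], ('banana', 'cherry'): [1]}
```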

6. Building the cluster tree
The set of clusters produced by the previous stage can be viewed as a set of topics and subtopics in the document set. A cluster (topic) tree is constructed based on the similarity among clusters. The topic of a parent cluster is more general than the topic of a child cluster, and the two are "similar" to a certain degree.
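A sketch of one way to realize this step: each cluster labeled by a k-itemset is attached under a cluster labeled by one of its (k-1)-subsets. The paper defines its own inter-cluster similarity for choosing among candidate parents; the Jaccard overlap of the initial (pre-disjoint) member sets used below is only an assumed stand-in.

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_tree(disjoint_clusters, initial_clusters):
    """Attach each k-itemset cluster under a (k-1)-itemset cluster whose label
    is a subset of its own, picking the 'most similar' candidate parent
    (stand-in similarity: Jaccard overlap of the initial member sets)."""
    tree = {label: [] for label in disjoint_clusters}   # label -> child labels
    roots = []
    for label in disjoint_clusters:
        parents = [p for p in combinations(label, len(label) - 1)
                   if p in disjoint_clusters]
        if len(label) == 1 or not parents:
            roots.append(label)          # 1-itemset clusters sit at the top level
            continue
        best = max(parents, key=lambda p: jaccard(initial_clusters[label],
                                                  initial_clusters[p]))
        tree[best].append(label)
    return roots, tree

# Toy call reusing the overlapping clusters from the earlier sketch for both roles.
clusters = {("apple",): [0, 2], ("banana",): [0, 1, 2], ("cherry",): [1, 2],
            ("apple", "banana"): [0, 2], ("banana", "cherry"): [1, 2]}
roots, tree = build_tree(clusters, clusters)
print(roots)   # [('apple',), ('banana',), ('cherry',)]
print(tree)    # ('apple',) gets ('apple', 'banana'); ('cherry',) gets ('banana', 'cherry')
```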

7. Tree structure vs. browsing
• A deep hierarchy tree, as produced by other methods, may not be suitable for browsing.
• A flat hierarchy reduces the number of navigation steps, which in turn reduces the chance for a user to make mistakes.
• If a hierarchy is too flat, a parent topic may contain too many subtopics, which increases the time and difficulty for the user to locate her target.
• A balance between the depth and width of the tree is essential for browsing.

8. Evaluation
• Evaluation has been performed in terms of the F-measure.
• The following aspects have been evaluated: sensitivity to parameters, efficiency, and scalability.
• The following competitors have been considered: UPGMA, bisecting k-means, and the frequent itemset-based algorithm HFTC.
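For reference, the clustering F-measure commonly used in this line of work treats each natural class as a query and each cluster as a result set. This is the standard definition, assumed here to match the paper's usage:

```latex
% Standard clustering F-measure (assumed to match the paper's usage):
% n_{ij} = documents of class i that fall in cluster j,
% n_i = size of class i, n_j = size of cluster j, n = total number of documents.
\[
P(i,j) = \frac{n_{ij}}{n_j}, \qquad
R(i,j) = \frac{n_{ij}}{n_i}, \qquad
F(i,j) = \frac{2\, P(i,j)\, R(i,j)}{P(i,j) + R(i,j)}
\]
\[
F = \sum_{i} \frac{n_i}{n} \, \max_{j} F(i,j)
\]
```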

9. Conclusion
The FIHC approach proposed in the paper outperforms its competitors in terms of accuracy, efficiency, and scalability.
