
The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005






Presentation Transcript


  1. The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005 Learning Topic Hierarchies from Text Documents using a Scalable Hierarchical Fuzzy Clustering Method E. Mendes Rodrigues and L. Sacks {mmendes, lsacks}@ee.ucl.ac.uk http://www.ee.ucl.ac.uk/~mmendes/ Department of Electronic & Electrical Engineering University College London, UK

  2. Outline • Document clustering process • H-FCM: Hyper-spherical Fuzzy C-Means • H2-FCM: Hierarchical H-FCM • Clustering experiments • Topic hierarchies

  3. Document Clustering Process • Pipeline: Document Collection → Pre-processing (Document Representation) → Document Encoding (Document Similarity) → Document Clustering (Clustering Method) → Document Clusters → Cluster Validity → Application • Pre-processing steps: identify all unique words in the document collection; discard common words that are included in the stop list; apply a stemming algorithm and combine identical word stems; discard terms using pre-processing filters; apply a term weighting scheme to the final set of k indexing terms • Result: the Vector-Space Model of Information Retrieval, in which the document vectors form an N×k document-term matrix X = [xij] that is very high-dimensional and very sparse (>95%)

  4. Measures of Document Relationship • FCM applies the Euclidean distance, which is inappropriate for high-dimensional text clustering: the non-occurrence of the same terms in both documents is treated in the same way as the co-occurrence of terms • Cosine (dis)similarity measure: • widely applied in Information Retrieval • represents the cosine of the angle between two document vectors • insensitive to different document lengths, since it is normalised by the length of the document vectors
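As a concrete reference point, the cosine measure described above can be written in a few lines; a minimal NumPy sketch (function name and toy vectors are illustrative):

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two document vectors.

    Dividing by the vector lengths makes the measure insensitive
    to document length, unlike the Euclidean distance."""
    norm = np.linalg.norm(d1) * np.linalg.norm(d2)
    if norm == 0.0:
        return 0.0
    return float(np.dot(d1, d2) / norm)

# Two documents sharing one of two indexed terms:
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
print(cosine_similarity(a, b))  # ≈ 0.707
```

Note that two documents with no terms in common score 0 regardless of which terms are absent, which is exactly the property the Euclidean distance lacks.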

  5. H-FCM: Hyper-spherical Fuzzy C-Means • Applies the cosine measure to assess document relationships • Modified objective function: • Subject to an additional constraint: • Fuzzy memberships (u) and cluster centroids (v):
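The slide's formulas did not survive transcription. As a hedged reconstruction from standard FCM with the cosine dissimilarity (symbols assumed: documents x_j, centroids v_i, memberships u_ij, fuzziness m, S the cosine similarity), the modified objective, constraints and update equations plausibly take the form:

```latex
% Modified objective: squared Euclidean distance replaced by
% cosine dissimilarity (1 - S)
J_m = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^{m}\,\bigl(1 - S(\mathbf{x}_j, \mathbf{v}_i)\bigr),
\qquad
S(\mathbf{x}_j, \mathbf{v}_i) = \frac{\mathbf{x}_j \cdot \mathbf{v}_i}{\lVert \mathbf{x}_j \rVert\, \lVert \mathbf{v}_i \rVert}

% Standard FCM membership constraint, plus the additional
% "hyper-spherical" constraint that centroids have unit length
\sum_{i=1}^{c} u_{ij} = 1 \;\;\forall j,
\qquad
\lVert \mathbf{v}_i \rVert = 1 \;\;\forall i

% Alternating updates: memberships, then re-normalised centroids
u_{ij} = \Biggl[\,\sum_{l=1}^{c}
  \biggl(\frac{1 - S(\mathbf{x}_j, \mathbf{v}_i)}{1 - S(\mathbf{x}_j, \mathbf{v}_l)}\biggr)^{\!\frac{1}{m-1}}
\Biggr]^{-1},
\qquad
\mathbf{v}_i = \frac{\sum_{j=1}^{N} u_{ij}^{m}\, \mathbf{x}_j}
                    {\bigl\lVert \sum_{j=1}^{N} u_{ij}^{m}\, \mathbf{x}_j \bigr\rVert}
```

The centroid re-normalisation is what enforces the additional unit-length constraint and gives the method its "hyper-spherical" name.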

  6. How many clusters? • Usually the final number of clusters is not known a priori • Run the algorithm for a range of c values • Apply validity measures and determine which c leads to the best partition (cluster compactness, density, separation, etc.) • How compact and dense are clusters in a sparse, high-dimensional problem space? • Only a very small percentage of documents within a cluster are highly similar to the respective centroid  clusters are not compact • However, there is always a clear separation between the intra- and inter-cluster similarity distributions
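The intra- vs inter-cluster separation argument can be checked directly on a partition; a minimal sketch (function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def intra_inter_similarity(X, labels, centroids):
    """Split document-to-centroid cosine similarities into
    intra-cluster (similarity to the assigned centroid) and
    inter-cluster (similarity to every other centroid).

    Even when clusters are not compact, a clear gap between the
    two distributions indicates a usable partition."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = Xn @ Cn.T                      # (N, c) cosine similarities
    rows = np.arange(len(X))
    intra = sims[rows, labels]
    mask = np.ones_like(sims, dtype=bool)
    mask[rows, labels] = False
    inter = sims[mask]
    return intra, inter

# Toy check: two clusters hugging different axes
X = np.array([[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
intra, inter = intra_inter_similarity(X, labels, centroids)
print(intra.min() > inter.max())  # True: the distributions separate
```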

  7. H2-FCM: Hierarchical Hyper-spherical Fuzzy C-Means • Key concepts • Apply partitional algorithm (H-FCM) to obtain a sufficiently large number of clusters • Exploit the granularity of the topics associated with each cluster to link cluster centroids hierarchically • Form a topic hierarchy • Asymmetric similarity measure • Identify parent-child type relationships between cluster centroids • Child should be less similar to parent, than parent to child

  8. The H2-FCM Algorithm • Apply H-FCM (c, m) and check that all clusters have size ≥ tND; if not, reduce the number of clusters (c = c − K) and re-apply • Compute the asymmetric similarity S(va, vb) for all pairs of cluster centroids • While the free-centroid set VF ≠ ∅: select the free centroid vb with the maximum similarity, S(vi, vj) = max over vi, vj ∈ VF; if S ≥ tPCS, select the parent and add vb as its child in the hierarchy VH; otherwise add vb as a new root • Example from the slide: S(v1, v5) ≥ tPCS links v1 under v5, while S(v8, v5) < tPCS and S(v8, v1) < tPCS leave v8 as a root
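The centroid linking heuristic can be sketched in a few lines. Both the asymmetric measure (a min-overlap ratio) and the fixed processing order below are illustrative simplifications, not the paper's exact definitions:

```python
import numpy as np

def asym_sim(a, b):
    """Asymmetric similarity of centroid a to centroid b
    (illustrative min-overlap ratio, not the paper's measure).
    It is high when a's terms are contained in b's, so a narrow
    child scores higher towards its broad parent than vice versa."""
    return float(np.minimum(a, b).sum() / a.sum())

def link_centroids(V, t_pcs):
    """Attach each centroid to its most similar candidate parent
    if the similarity reaches the threshold t_pcs; otherwise make
    it a hierarchy root (None marks a root)."""
    parent = {}
    for i in range(len(V)):
        candidates = [(j, asym_sim(V[i], V[j])) for j in parent]
        if candidates:
            j, s = max(candidates, key=lambda t: t[1])
            parent[i] = j if s >= t_pcs else None
        else:
            parent[i] = None          # first centroid seeds a root
    return parent

# Broad topic v0, narrower sub-topic v1, unrelated topic v2
V = np.array([[1.0, 1.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
print(link_centroids(V, t_pcs=0.6))  # {0: None, 1: 0, 2: None}
```

The sub-topic attaches under the broad topic, and the unrelated centroid becomes a second root, mirroring the parent-child behaviour the slide describes.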

  9. Scalability of the Algorithm • H2-FCM time complexity depends on H-FCM and on the centroid linking heuristic • H-FCM computation time is O(Nc²k) • The linking heuristic is at most O(c²k): computing the asymmetric similarity between every pair of cluster centroids is O(c²k), and generating the cluster hierarchy is O(c²) • Overall, H2-FCM time complexity is O(Nc²k), linear in the number of documents N • Scales well to large document sets!

  10. Description of Experiments • Goal: evaluate H2-FCM performance • Evaluation measures: clustering Precision (P) and Recall (R) • H2-FCM run for a range of c values • No. of hierarchy roots = no. of reference classes; tPCS dynamically set • Question: are sub-clusters of the same topic assigned to the same branch?

  11. Test Document Collections Reuters-21578 test collection: http://www.daviddlewis.com/resources/testcollections/reuters21578/ Open Directory Project (ODP): http://dmoz.org/ INSPEC database: http://www.iee.org/publish/inspec/

  12. Clustering Results: H2-FCM • [Figure: Precision and Recall plots for the reuters1, reuters2, odp and inspec collections]

  13. Topic Hierarchy • Each centroid vector consists of a set of weighted terms • Terms describe the topics associated with the document cluster • Centroid hierarchy produces a topic hierarchy • Useful for efficient access to individual documents • Provides context to users in exploratory information access
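Turning the centroid hierarchy into a topic hierarchy then reduces to labelling each centroid by its heaviest terms; a small sketch (vocabulary and weights are made up for illustration):

```python
import numpy as np

def topic_label(centroid, vocabulary, n_terms=3):
    """Label a cluster by its n highest-weighted centroid terms,
    so the centroid hierarchy can be read as a topic hierarchy."""
    top = np.argsort(centroid)[::-1][:n_terms]
    return [vocabulary[t] for t in top]

vocab = ["fuzzy", "cluster", "network", "protocol", "routing"]
v = np.array([0.7, 0.6, 0.1, 0.05, 0.0])
print(topic_label(v, vocab))  # ['fuzzy', 'cluster', 'network']
```

Walking the hierarchy and printing each centroid's label at its depth yields the browsable topic tree used for exploratory access.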

  14. Topic Hierarchy Example

  15. Concluding Remarks • H2-FCM clustering algorithm • Partitional clustering (H-FCM) • Linking heuristic organizes centroids hierarchically based on asymmetric similarity • Scales linearly with the number of documents • Exhibits good clustering performance • A topic hierarchy can be extracted from the centroid hierarchy
