Web Document Clustering: A Feasibility Demonstration

Hui Han, CSE Dept., PSU, 10/15/01

Motivation Low precision of Web search engines—hard for users to locate expected information quickly… Solutions: • Increase precision– by filtering methods? by advanced pruning options?… • Web Document Clustering  - Cluster documents returned by search engine in response to a query and re-present them

Key Requirements for Web Document Clustering • Relevance • Browsable summaries • Overlap • Snippet-tolerance (a “snippet” is a small piece of information or a brief extract) • Speed • Incrementality

Suffix Tree Clustering (STC) • STC is a linear-time clustering algorithm based on a suffix tree, which efficiently identifies sets of documents that share common phrases. • STC satisfies the key requirements: • STC treats a document as a string, making use of proximity information between words. • STC is a novel, incremental, O(n)-time algorithm. • STC succinctly summarizes clusters’ contents for users. • STC is quick because it works on a smaller set of documents and is incremental. • …

Operating Procedure of STC • Step 1: Document “cleaning” • HTML -> plain text • Word stemming • Mark sentence boundaries • Remove non-word tokens • Step 2: Identifying base clusters • Step 3: Combining base clusters
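The cleaning step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `crude_stem` is a hypothetical stand-in for a real stemmer, and sentence boundaries are approximated by splitting on `.`, `!` and `?`. None of these names come from the original slides.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def crude_stem(word):
    """Hypothetical light stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(html):
    """Return a list of sentences, each a list of stemmed word tokens."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)                  # HTML -> plain text
    sentences = []
    for raw in re.split(r"[.!?]+", text):           # mark sentence boundaries
        words = re.findall(r"[a-z]+", raw.lower())  # remove non-word tokens
        if words:
            sentences.append([crude_stem(w) for w in words])
    return sentences
```

Keeping each sentence as its own word list means that phrases never span a sentence boundary when the suffix tree is built later.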

Step 2: Identifying Base Clusters: Suffix Tree • STC treats a document as a set of strings… • Suffix tree of a string S: a compact tree containing all the suffixes of S • Suffix of a word: lovely • Suffix of a string: “Friends” is a lovely show. • Precise definition: • A suffix tree is a rooted, directed tree. • Each internal node has at least two children. • Each edge is labeled with a non-empty sub-string of S. The label of a node is the concatenation of the edge-labels on the path from the root to that node. • No two edges out of the same node can have edge-labels that begin with the same word (hence “compact”).

Example: A Suffix Tree of Strings • String 1: “cat ate cheese” • String 2: “mouse ate cheese too” • String 3: “cat ate mouse too”

Base Clusters • Base clusters corresponding to the suffix tree nodes
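For the three example strings, the base clusters can be reproduced with a small sketch. It assumes a naive word-level suffix trie rather than a linear-time suffix tree construction, and it assumes a base cluster is any node shared by at least two documents (`min_docs=2`). A trie node is kept only if it would also be a node of the compact tree, that is, if it branches or ends a suffix. All function names are hypothetical.

```python
def build_suffix_trie(docs):
    """docs: dict mapping doc id -> list of words. Returns the trie root."""
    root = {"children": {}, "docs": set(), "is_end": False}
    for doc_id, words in docs.items():
        for start in range(len(words)):          # insert every suffix
            node = root
            for word in words[start:]:
                if word not in node["children"]:
                    node["children"][word] = {
                        "children": {}, "docs": set(), "is_end": False}
                node = node["children"][word]
                node["docs"].add(doc_id)
            node["is_end"] = True                # a suffix ends at this node
    return root


def base_clusters(root, min_docs=2):
    """Map each shared phrase to the set of documents that contain it."""
    clusters = {}
    stack = [((), root)]
    while stack:
        phrase, node = stack.pop()
        # Nodes with one child and no suffix end lie inside a compact edge.
        in_compact_tree = len(node["children"]) >= 2 or node["is_end"]
        if phrase and in_compact_tree and len(node["docs"]) >= min_docs:
            clusters[" ".join(phrase)] = set(node["docs"])
        for word, child in node["children"].items():
            stack.append((phrase + (word,), child))
    return clusters


docs = {
    1: "cat ate cheese".split(),
    2: "mouse ate cheese too".split(),
    3: "cat ate mouse too".split(),
}
example_clusters = base_clusters(build_suffix_trie(docs))
```

Here `example_clusters` holds six phrases: "cat ate", "ate", "cheese", "mouse", "too" and "ate cheese". The phrase "cat" is skipped because it is always followed by "ate", so it lies inside a compact edge.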

Cluster Score • s(B) = |B| * f(|P|) • |B| is the number of documents in base cluster B • |P| is the number of words in phrase P that have a non-zero score • Zero-score words: stopwords, and words that appear in too few (<3) or too many (>40%) documents
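A minimal sketch of the score s(B) = |B| * f(|P|) follows. The slide does not define f, so the shape used here is an assumption: it penalizes single-word phrases, grows linearly up to six words, and is constant beyond that. The stopword list, the 0.5 penalty and all function names are hypothetical.

```python
STOPWORDS = {"a", "an", "the", "of", "is"}  # hypothetical stopword list


def effective_length(phrase_words, doc_freq, n_docs,
                     min_df=3, max_df_ratio=0.4):
    """|P|: the number of words in the phrase with a non-zero score."""
    count = 0
    for word in phrase_words:
        df = doc_freq.get(word, 0)   # number of documents containing word
        if word in STOPWORDS or df < min_df or df > max_df_ratio * n_docs:
            continue                 # zero-score word
        count += 1
    return count


def f(n_words):
    """Assumed shape of f; the constants are illustrative only."""
    if n_words <= 0:
        return 0.0
    if n_words == 1:
        return 0.5                   # penalize single-word phrases
    return float(min(n_words, 6))    # linear up to 6 words, then constant


def score(cluster_docs, phrase_words, doc_freq, n_docs):
    """s(B) = |B| * f(|P|)."""
    return len(cluster_docs) * f(
        effective_length(phrase_words, doc_freq, n_docs))
```

For example, with 100 documents, a base cluster of 5 documents for the phrase "the suffix tree" counts only "suffix" and "tree" (if each appears in between 3 and 40 documents), so its score is 5 * f(2).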

Step 3: Combining Base Clusters • Merge base clusters with a high overlap in their document sets • Documents may share multiple phrases. • Similarity of Bm and Bn (0.5 is a parameter): similarity(Bm, Bn) = 1 iff |Bm ∩ Bn| / |Bm| > 0.5 and |Bm ∩ Bn| / |Bn| > 0.5; 0 otherwise
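The similarity test translates directly into code. This sketch assumes base clusters are represented as non-empty Python sets of document ids; the function name and the `threshold` parameter name are hypothetical.

```python
def similarity(bm, bn, threshold=0.5):
    """Return 1 if the overlap exceeds the threshold in both directions."""
    common = len(bm & bn)
    both_high = (common / len(bm) > threshold
                 and common / len(bn) > threshold)
    return 1 if both_high else 0
```

Note that the comparison is strict: two clusters of two documents each that share exactly one document have an overlap ratio of 0.5 and are therefore not similar.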

Base Cluster Graph • Node: base cluster • Edge: similarity between two clusters = 1 • What if “ate” is in the stop word list?
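One way to read final clusters off the base cluster graph is to take its connected components. The sketch below uses a small union-find for that. It also explores the question on the slide under one assumption: a phrase made only of stop words scores zero, so its base cluster is dropped before merging. The function names are hypothetical.

```python
from itertools import combinations


def similar(bm, bn, threshold=0.5):
    """True if the two document sets overlap enough in both directions."""
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold


def merge_base_clusters(base):
    """base: dict phrase -> set of doc ids. Returns [(phrases, docs), ...]."""
    parent = {phrase: phrase for phrase in base}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    for a, b in combinations(base, 2):
        if similar(base[a], base[b]):       # an edge in the graph
            parent[find(a)] = find(b)

    groups = {}
    for phrase in base:
        groups.setdefault(find(phrase), []).append(phrase)
    return [(sorted(ps), set().union(*(base[p] for p in ps)))
            for ps in groups.values()]


base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
with_ate = merge_base_clusters(base)
without_ate = merge_base_clusters(
    {p: d for p, d in base.items() if p != "ate"})
```

With "ate" present, every base cluster is linked through it and `with_ate` holds a single cluster covering all three documents. Without it, `without_ate` holds three clusters: "cat ate" alone, "ate cheese" with "cheese", and "mouse" with "too".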

STC is Incremental • As each document arrives from the Web, we: • “Clean” it (linear in collection size) • Add it to the suffix tree. Each node that is updated or created as a result is tagged (linear) • Update the relevant base clusters and recalculate the similarity of these base clusters to the rest of the k highest-scoring base clusters (linear) • Check for any changes to the final clusters (linear) • Score and sort the final clusters, choose the top 10… (linear)
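The tagging idea can be illustrated with a simplified sketch. It assumes a flat phrase-to-documents dictionary as a stand-in for the suffix tree, so it does not have the linear-time behavior claimed for STC. The class name, its methods and the `max_len` cap on phrase length are all hypothetical.

```python
class IncrementalPhraseIndex:
    """Stand-in for the suffix tree: maps each phrase to its document set."""

    def __init__(self):
        self.phrase_docs = {}

    def add_document(self, doc_id, words, max_len=6):
        """Index one document and return the phrases it touched (tagged)."""
        touched = set()
        for start in range(len(words)):
            stop = min(len(words), start + max_len)
            for end in range(start + 1, stop + 1):
                phrase = " ".join(words[start:end])
                self.phrase_docs.setdefault(phrase, set()).add(doc_id)
                touched.add(phrase)
        return touched

    def base_clusters(self, phrases, min_docs=2):
        """Base clusters among the given phrases only."""
        return {p: set(self.phrase_docs[p]) for p in phrases
                if len(self.phrase_docs.get(p, ())) >= min_docs}
```

After each call to `add_document`, only the returned phrases need to be rescored and compared against the current top base clusters; the rest of the index is left untouched.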

STC Allows Cluster Overlap… • Why is overlap reasonable? A document often has more than one topic • STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents • But the clusters must not be so similar that they are merged into one

Experiments • Clustered the output of a meta search engine using the STC algorithm • Representative of Web search engines • Web clustering, instead of an “IR corpus”

Evaluation: Precision • Precision of different clustering algorithms

Cluster overlap & multi-word phrases are critical to STC’s success

Cluster overlap & multi-word phrases are specifically effective for STC

Why? • Allowing a document to appear in multiple clusters is only advantageous if that document is relevant; placing an irrelevant document in multiple clusters can only hurt cluster quality

Snippets versus Whole Document

Execution Time • Incremental: STC uses “free” CPU time while the system waits for the search engine results to arrive over the Web, so it is fast

Conclusion • The identification of the unique requirements of document clustering of Web search engine results • The definition of STC, an incremental, O(n)-time clustering algorithm that satisfies these requirements • The first experimental evaluation of clustering algorithms on Web search engine results
