
SCATTER/GATHER: A CLUSTER-BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS



  1. SCATTER/GATHER: A CLUSTER-BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS
     GROUPER: A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS
     MINAL PATANKAR, MADHURI WUDALI

  2. DOCUMENT CLUSTERING Process of grouping documents with similar contents into a common cluster

  3. ADVANTAGES OF DOCUMENT CLUSTERING • If a collection is well clustered, we can search only the clusters likely to contain relevant documents • Clustering also improves browsing through the document collection

  4. USER INTERFACES
     SCATTER/GATHER: a tool for browsing
     • Document collection clustering
     • Traditional text-based clustering algorithms: Buckshot, Fractionation
     • Word-based similarity
     GROUPER: a tool for searching
     • Meta-search-engine clustering
     • Clustering algorithm: STC
     • Phrase-based similarity

  5. SCATTER/GATHER INTERFACE

  6. SCATTER/GATHER SESSION • The user is presented with short summaries of a small number of document groups • The user selects one or more groups for further study • This process continues down to the level of individual documents

  7. [Diagram: Fractionation, Cluster Digest, Buckshot]

  8. HOW IS SCATTER/GATHER DONE? • Static offline partitioning phase: the Fractionation algorithm • Online reclustering phase: the Buckshot algorithm (Step 1: group-average agglomerative clustering; Step 2: K-Means)

  9. Clustering
     • Hierarchical
       - Agglomerative: Single Link, Group Average Link, Complete Link
       - Divisive
     • Partitional: K-Means
     • Hybrid: Buckshot, Fractionation

  10. HIERARCHICAL AGGLOMERATIVE CLUSTERING • Create an N×N doc-doc similarity matrix • Each document starts as a cluster of size one • Do until only one cluster remains: combine the two clusters with the greatest similarity, then update the doc-doc matrix

  11. Example
      Initial similarity matrix:

            A    B    C    D    E
      A     _    2    7    6    4
      B     2    _    9   11   14
      C     7    9    _    4    8
      D     6   11    4    _    2
      E     4   14    8    2    _

      B and E are the most similar (14), so clusters A B C D E become A BE C D.
      SC(A,BE) = 4 if we are using single link (take the max of 2 and 4)
      SC(A,BE) = 2 if we are using complete link (take the min)
      SC(A,BE) = 3 if we are using group average (take the average)
      Note: C-BE is now the highest link.

  12. Example
      Matrix after merging B and E (group average):

            A   BE    C    D
      A     _    3    7    6
      BE    3    _  8.5  6.5
      C     7  8.5    _    4
      D     6  6.5    4    _

      SC(C,B) = 9 and SC(C,E) = 8, so SC(C,BE) = 8.5, the highest link.
      Combining BE and C gives clusters A, D, BEC.

  13. Example
      Matrix after merging BE and C:

             A  BEC     D
      A      _    5     6
      BEC    5    _  5.75
      D      6 5.75     _

      A-D is now the highest link (6), so A and D combine into AD.
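The merge sequence in the worked example above can be reproduced with a short group-average HAC sketch (a minimal illustration, not the original implementation; note that the slides average the two cluster-level scores when computing A-BEC, giving 5, while the pure pairwise averaging below gives about 4.33, but the merge order is the same either way):

```python
# Group-average agglomerative clustering over a precomputed
# similarity matrix (values taken from the slide example).

sim = {
    ("A", "B"): 2, ("A", "C"): 7, ("A", "D"): 6, ("A", "E"): 4,
    ("B", "C"): 9, ("B", "D"): 11, ("B", "E"): 14,
    ("C", "D"): 4, ("C", "E"): 8,
    ("D", "E"): 2,
}

def s(x, y):
    # symmetric lookup into the pairwise similarity table
    return sim[(x, y)] if (x, y) in sim else sim[(y, x)]

def group_average(c1, c2):
    # average pairwise similarity between the two clusters' members
    return sum(s(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

clusters = [("A",), ("B",), ("C",), ("D",), ("E",)]
merges = []
while len(clusters) > 1:
    # find the most similar pair of clusters and combine them
    i, j = max(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: group_average(clusters[ij[0]], clusters[ij[1]]),
    )
    merged = clusters[i] + clusters[j]
    merges.append(merged)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merges)   # BE first, then BEC, then AD, then everything
```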

  14. SCATTER/GATHER SESSION STAGE 1: FRACTIONATION • Corpus C is broken into N/m buckets of fixed size m > k • Group-average agglomerative clustering is applied within each bucket • The resulting document groups are the input to the next iteration • Repeat until k centers remain
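The bucketed reduction described above can be sketched as follows. This is a minimal illustration using toy 2-D vectors in place of document term vectors; the bucket size m, the reduction factor rho, and the greedy within-bucket agglomeration are assumptions made for the sketch:

```python
# Fractionation sketch: repeatedly split the corpus into fixed-size
# buckets, agglomerate inside each bucket, and carry the resulting
# group centroids into the next round until k centers remain.
import math

def centroid(group):
    # mean vector of a group of points
    return tuple(sum(c) / len(group) for c in zip(*group))

def agglomerate(points, target):
    # greedy merge of the nearest pair until `target` groups remain
    groups = [[p] for p in points]
    while len(groups) > target:
        i, j = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: math.dist(centroid(groups[ij[0]]), centroid(groups[ij[1]])),
        )
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return [centroid(g) for g in groups]

def fractionation(docs, k, m=4, rho=0.5):
    centers = list(docs)
    while len(centers) > k:
        buckets = [centers[i:i + m] for i in range(0, len(centers), m)]
        centers = []
        for b in buckets:
            # reduce each bucket by the factor rho
            centers.extend(agglomerate(b, max(1, int(len(b) * rho))))
    return centers

docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 1)]
print(fractionation(docs, k=2))
```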

  15. SCATTER/GATHER SESSION STAGE 2: BUCKSHOT, STEP 1: HAC • First, draw a random sample of size sqrt(kn) • Apply group-average agglomerative clustering until k clusters are obtained • Return the obtained clusters

  16. SCATTER/GATHER STAGE 2: BUCKSHOT, STEP 2: K-Means • Arbitrarily select k documents as seeds; they are the initial centroids of the clusters • Assign every other document to the closest centroid • Recompute the centroid of each cluster to get new centroids • Repeat steps 2-3 until the centroids no longer change
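The K-Means step above can be sketched in a few lines. Toy 2-D vectors again stand in for document term vectors; this is an illustrative sketch, not the system's implementation:

```python
# K-Means sketch: seed centroids, assign each document to the
# closest centroid, recompute centroids, and repeat until the
# assignment stops changing.
import math

def kmeans(docs, seeds):
    centroids = [tuple(s) for s in seeds]
    assignment = None
    while True:
        # assign every document to its nearest centroid
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(d, centroids[c]))
            for d in docs
        ]
        if new_assignment == assignment:   # converged: no document moved
            return centroids, assignment
        assignment = new_assignment
        # recompute each centroid as the mean of its members
        for c in range(len(centroids)):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))

docs = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, labels = kmeans(docs, seeds=[(0, 0), (9, 9)])
print(labels)   # → [0, 0, 0, 1, 1, 1]
```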

  17. Fractionation example
      Documents A B C D E F G H are split into Bucket 1 (A B G H) and Bucket 2 (C D E F).
      Group-average agglomerative clustering within each bucket first yields A, BG, H
      and C, DE, F, then AH, BG and DE, CF. The groups are passed to the next
      iteration until k centers remain, e.g. AH and BGCFDE. (Contd.)

  18. Buckshot example
      Documents in the sample: A D E G. Group-average agglomerative clustering first
      yields A, DE, G, then the clusters AG and DE. The remaining documents are
      assigned to these clusters in the K-Means step.

  19. GENESIS OF GROUPER

  20. GROUPER • A dynamic web interface to the HuskySearch meta-search engine • Clusters the top results retrieved by HuskySearch • Groups search results into clusters dynamically • Uses the STC algorithm for clustering

  21. Grouper’s query interface.

  22. Grouper Interface

  23. STC (Suffix Tree Clustering) • A fast, incremental algorithm • Operates on web document snippets • Relies on a suffix tree to identify common phrases • Uses the shared phrases to create clusters

  24. WHAT IS A SUFFIX TREE? • A suffix tree is a rooted, directed tree • Each internal node has at least 2 children • Each edge is labeled with a non-empty substring of the indexed string S • The label of a node is the concatenation of the edge labels on the path from the root to that node • No two edges out of the same node can have edge labels that begin with the same word

  25. STEPS OF STC • Step-1: Document “Cleaning” • Step-2: Identifying Base Clusters • Step-3: Combining Base Clusters • Step-4: Score clusters

  26. DOCUMENT CLEANING • Stemming • Stripping of HTML tags, punctuation, and numbers. Example: <html>2 Cats ate <b>cheese</b>.</html> becomes "cat ate cheese"

  27. Identifying Base Clusters • Create an inverted index of phrases from the web document collection using a suffix tree • Each node of the suffix tree represents a group of documents and a phrase common to all of them • The label of the node is that common phrase • Each node represents a base cluster
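Base-cluster identification can be sketched with a dictionary of phrases standing in for the suffix tree. A real suffix tree gives linear-time construction; the dictionary below is quadratic but finds the same shared phrases on toy input (and, unlike the tree, also lists non-maximal phrases such as "cat" alongside "cat ate"):

```python
# Index every contiguous phrase of every cleaned snippet, then keep
# the phrases shared by at least two documents as base clusters.
from collections import defaultdict

snippets = {
    1: "cat ate cheese",
    2: "mouse ate cheese too",
    3: "cat ate mouse too",
}

index = defaultdict(set)
for doc_id, text in snippets.items():
    words = text.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            phrase = " ".join(words[i:j])   # every contiguous phrase
            index[phrase].add(doc_id)

# a base cluster is a phrase shared by at least two documents
base_clusters = {p: docs for p, docs in index.items() if len(docs) >= 2}
for phrase, docs in sorted(base_clusters.items()):
    print(phrase, sorted(docs))
```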

  28. Example documents:
      1. "cat ate cheese"
      2. "mouse ate cheese too"
      3. "cat ate mouse too"
      [Suffix tree diagram: internal nodes are labeled with shared phrases and the
      documents containing them, e.g. "cat ate" {1,3}, "ate cheese" {1,2},
      "cheese" {1,2}, "mouse" {2,3}, "too" {2,3}, "ate" {1,2,3}]

  29. BASE CLUSTERS IDENTIFIED!! Table 1: Six nodes and their corresponding base clusters

  30. SCORING BASE CLUSTERS
      S(B) = |B| · f(|P|)
      where |B| is the number of documents in base cluster B
      and |P| is the number of words in phrase P.
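The score can be sketched directly from the formula. The shape of f below (penalizing single-word phrases and capping long ones) follows the STC paper's description, but the single-word weight of 0.5 and the cap of 6 are assumed illustrative values:

```python
# s(B) = |B| * f(|P|): more documents and longer shared phrases
# score higher.

def phrase_weight(num_words, cap=6):
    # f penalizes single-word phrases and is capped for long ones;
    # 0.5 and the cap are assumptions for this sketch
    if num_words == 1:
        return 0.5
    return min(num_words, cap)

def score(cluster_docs, phrase):
    return len(cluster_docs) * phrase_weight(len(phrase.split()))

print(score({1, 2, 3}, "ate"))      # → 1.5
print(score({1, 3}, "cat ate"))     # → 4
```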

  31. COMBINING BASE CLUSTERS
      Binary similarity measure: similarity(Bm, Bn) = 1 if
      |Bm ∩ Bn| / |Bm| > 0.5 and |Bm ∩ Bn| / |Bn| > 0.5,
      and 0 otherwise.
      |Bm ∩ Bn| is the number of documents in both clusters;
      |Bm| and |Bn| are the numbers of documents in clusters m and n.
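The binary similarity test and the merge of connected components of the base-cluster graph can be sketched as follows, using the base clusters from the earlier three-snippet example. Here the widely shared phrase "ate" links everything, so the component collapses to one final cluster of documents {1, 2, 3}:

```python
# Combine base clusters: connect two base clusters when each
# contains more than half of the other's documents, then union
# the document sets of each connected component.
from collections import defaultdict

base = {
    "cat ate": {1, 3},
    "ate cheese": {1, 2},
    "cheese": {1, 2},
    "mouse": {2, 3},
    "too": {2, 3},
    "ate": {1, 2, 3},
}

def similar(bm, bn, threshold=0.5):
    # binary similarity from the slide: 1 iff both overlap ratios
    # exceed the threshold, 0 otherwise
    overlap = len(bm & bn)
    return overlap / len(bm) > threshold and overlap / len(bn) > threshold

# union-find over the base-cluster graph
phrases = list(base)
parent = {p: p for p in phrases}

def find(p):
    while parent[p] != p:
        p = parent[p]
    return p

for i, p in enumerate(phrases):
    for q in phrases[i + 1:]:
        if similar(base[p], base[q]):
            parent[find(p)] = find(q)

# each connected component becomes one final cluster of documents
components = defaultdict(set)
for p in phrases:
    components[find(p)].update(base[p])
print(dict(components))
```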

  32. COMBINING THE BASE CLUSTERS
      [Base cluster graph: nodes are the base clusters, e.g. "cat ate" {1,3},
      "mouse" {2,3}, "too" {2,3}, "ate cheese" {1,2}, "cheese" {1,2},
      "ate" {1,2,3}; edges connect similar base clusters]

  33. STC is incremental • As each document arrives from the web: • "Clean" it • Add it to the suffix tree, tagging each node that is updated or created as a result • Update the relevant base clusters and recalculate their similarity to the rest of the k highest-scoring base clusters • Check for any changes to the final clusters • Score and sort the final clusters and choose the top 10

  34. STC allows cluster overlap • Why is overlap reasonable? A document often covers more than one topic • STC allows a document to appear in more than one cluster, since it may share different phrases with different groups of documents

  35. REFERENCES
      • http://www.math.unipd.it/~aiolli/corsi/0708/IR/Lez18.pdf
      • http://www.ir.iit.edu/~dagr/cs529/files/handouts/08Clustering.pdf
      • http://www.cs.washington.edu/research/projects/WebWare1/www/metacrawler/
      • http://sils.unc.edu/research/publications/reports/TR-2007-06.pdf
