Scatter/Gather : A Cluster Based Approach to Large Document Collections

Alyssa Katz, LIS 551, March 23, 2003

Presentation Transcript


  1. Scatter/Gather : A Cluster Based Approach to Large Document Collections Alyssa Katz LIS 551 March 23, 2003

  2. Introduction • Alternate uses for document clustering • Give document clustering a second chance!

  3. Old Approach • Compare Document Clustering with Vector Space (VS) Models • Cluster searches are for the most part inferior to VS searches • Document clustering algorithms are SLOW • CONCLUSION: Document clustering should be used only to accelerate VS searches

  4. New Approach • Document Clustering is not bad, just misunderstood • The REAL question is: How can clustering be effective in its own right? • THE ANSWER: The “Scatter/Gather Method”

  5. Searching vs. Browsing • Searching: Specific information need • User has good idea of keywords or search terms • Faster, more pointed • Browsing: User wants more general info • Is not familiar with the vocabulary, or doesn’t want to commit to a specific set of words • User will sift through info to find what he wants

  6. Solution • Use clustering to browse a collection the way one would browse a table of contents • Let the user alternate between browsing and searching

  7. Scatter/Gather • User is presented with short summaries of a small number of document groups • User selects one or more groups for further study • The selected groups are merged and clustered again • The process repeats until the user reaches the individual document level
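The loop on slide 7 can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the function names (`scatter`, `summarize`, `scatter_gather`), the word-overlap similarity, and the "first k documents as seeds" clustering are all hypothetical stand-ins, not the algorithms used by the actual Scatter/Gather system. The `choose` callback plays the role of the user picking groups from their summaries.

```python
from collections import Counter


def tokens(doc):
    """Split a document into lowercase words."""
    return doc.lower().split()


def summarize(cluster, n_words=3):
    """Toy summary: the most frequent words in the cluster."""
    counts = Counter(w for doc in cluster for w in tokens(doc))
    # Sort by count (descending), then alphabetically, so output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n_words]]


def overlap(a, b):
    """Toy similarity: number of distinct words two documents share."""
    return len(set(tokens(a)) & set(tokens(b)))


def scatter(docs, k):
    """Toy clustering (assumption): the first k documents act as seeds."""
    seeds = docs[:k]
    clusters = [[] for _ in seeds]
    for doc in docs:
        best = max(range(len(seeds)), key=lambda i: overlap(doc, seeds[i]))
        clusters[best].append(doc)
    return [c for c in clusters if c]


def scatter_gather(docs, choose, k=2, stop_at=1):
    """Alternate scatter (cluster) and gather (merge chosen groups)."""
    while len(docs) > stop_at:
        clusters = scatter(docs, k)
        if len(clusters) < 2:
            break  # the collection cannot be split any further
        summaries = [summarize(c) for c in clusters]
        picked = choose(summaries)  # indices of the groups the user wants
        gathered = [doc for i in picked for doc in clusters[i]]
        if not gathered or len(gathered) == len(docs):
            break  # nothing selected, or no narrowing happened
        docs = gathered
    return docs


docs = ["oil prices rise", "soccer cup final", "oil market falls", "soccer team wins"]
# Simulated user: keep every group whose summary mentions "market".
remaining = scatter_gather(docs, lambda s: [i for i, words in enumerate(s) if "market" in words])
```

Each pass shrinks the working set, so the loop ends either at the individual document level or when the user's choice no longer narrows anything.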

  8. Example • 5000 Articles in the NYT News Service • International News • Kuwait and Germany and Oil • Articles about effect of invasion on oil market, U.S. Military deployment in Kuwait • Document

  9. Requirements • New Algorithms • One that can appropriately cluster large document collections • One that can generate adequate summaries of these document collections

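  10. Solution • Buckshot and Fractionation for the first requirement • Buckshot clusters a random sample of documents to seed the clustering of the full collection • Fractionation is slower but more accurate, which suits the initial offline clustering • Cluster digests (topical words and typical titles) for the second requirement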
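As a rough illustration of the Buckshot idea on slide 10, the sketch below draws a random sample of about sqrt(k·n) documents, clusters only that sample agglomeratively, and then uses the resulting centroids as centers for the whole collection. Everything here is an assumption made for the example: the helper names, the raw word-count vectors, and the simple "merge the two most similar centroids" step, which is a simplified stand-in for the agglomerative clustering of the published algorithm.

```python
import math
import random
from collections import Counter


def vectorize(doc):
    """Bag-of-words vector with raw term counts."""
    return Counter(doc.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of the vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def agglomerate(vectors, k):
    """Merge the two most similar groups until only k groups remain."""
    groups = [[v] for v in vectors]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = cosine(centroid(groups[i]), centroid(groups[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        groups[i].extend(groups.pop(j))  # j > i, so index i stays valid
    return groups


def buckshot(docs, k, seed=0):
    """Cluster a sqrt(k*n) sample, then assign every document to a center."""
    rng = random.Random(seed)
    n = len(docs)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = rng.sample(docs, sample_size)
    groups = agglomerate([vectorize(d) for d in sample], k)
    centers = [centroid(g) for g in groups]
    clusters = [[] for _ in centers]
    for doc in docs:
        v = vectorize(doc)
        best = max(range(len(centers)), key=lambda i: cosine(v, centers[i]))
        clusters[best].append(doc)
    return clusters
```

The point of sampling is cost: the quadratic agglomerative step runs on only about sqrt(k·n) documents, while the full collection needs just one linear assignment pass. Because the sample is random, the resulting clusters can vary with the `seed` argument.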

  11. Application to Scatter/Gather • Clustering is done beforehand, so real-time searches do not cluster from scratch • Real-time searches only refine the clusters that already exist
