
Dataware’s Document Clustering and Query-By-Example Toolkits

John Munson, Dataware Technologies. 1999 BRS User Group Conference.


Presentation Transcript


  1. Dataware’s Document Clustering and Query-By-Example Toolkits
     John Munson, Dataware Technologies
     1999 BRS User Group Conference

  2. Document Clustering
     • Automatically creates clusters of similar documents
     • General benefit: provides an overview of the range of topics in a set
     • Multiple specific uses
       • Familiarization with database before searching
       • Familiarization with a result set after searching
       • Assistance in category definition for other uses
         • Category tree construction
         • FAQ construction

  3. Dataware’s Clustering Toolkit
     • One API function
     • Source of documents is a BRS result set
       • which could be backref 0 for entire database
     • Can specify certain fields for analysis
     • Output indicates member documents for each cluster
     • Application can specify number and max/min size of clusters, etc.
     • US PTO (Patent and Trademark Office) plans to do category tree construction

  4. How It Works
     • Extracts keywords from each document
       • using our keyword-generation library
       • which is also in 6.3 keyword generation load filter
     • Repeats these steps:
       • Compare document and cluster pairs using the keyword lists
         • How many keywords do two lists share, and how similar are their weights?
       • Combine the most similar pair into one cluster
     • Stops when n clusters remain (n is configurable)
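The merge loop on slide 4 can be sketched in Python. This is a minimal illustration under stated assumptions, not Dataware's implementation: the slides say only that keyword lists are compared by shared keywords and weight similarity, so the weighted-overlap measure in `similarity`, the weight-summing rule in `merge`, and every function name here are hypothetical.

```python
from itertools import combinations


def similarity(a, b):
    """Assumed measure: shared keywords, scored by how close their weights are."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    overlap = sum(min(a[k], b[k]) for k in shared)
    total = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in set(a) | set(b))
    return overlap / total


def merge(a, b):
    """Assumed rule: a merged cluster's keyword list sums the weights of both lists."""
    merged = dict(a)
    for keyword, weight in b.items():
        merged[keyword] = merged.get(keyword, 0.0) + weight
    return merged


def cluster(docs, n):
    """Combine the most similar pair repeatedly until n clusters remain.

    docs maps a document id to its {keyword: weight} list.
    Returns a list of (member_ids, keyword_weights) tuples.
    """
    clusters = [([doc_id], dict(kws)) for doc_id, kws in docs.items()]
    while len(clusters) > max(n, 1):
        # Compare every pair of current clusters and pick the most similar one.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: similarity(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], merge(clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Every pass compares all remaining pairs, which is why this approach slows down quickly as the number of documents grows.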

  5. How It Works
     • Output is a list of clusters, including:
       • a cluster quality score
         • Measures how cohesive the cluster is
       • a ranked list of keywords describing the cluster
       • a ranked list of member documents
         • Highest-ranked docs are the most “central”
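One way to produce the three outputs listed on slide 5 is sketched below. The slides do not say how the quality score or document ranking is computed, so this sketch assumes a centroid-based approach: documents are ranked by cosine similarity to the cluster's average keyword list, and the quality score is the mean of those similarities. The names `cosine` and `describe_cluster` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {keyword: weight} lists."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def describe_cluster(members):
    """Summarize one cluster. members maps doc id -> {keyword: weight}, non-empty."""
    # Assumed centroid: the average weight of each keyword across member documents.
    centroid = {}
    for keywords in members.values():
        for keyword, weight in keywords.items():
            centroid[keyword] = centroid.get(keyword, 0.0) + weight / len(members)
    # Documents closest to the centroid are treated as the most "central".
    scores = {doc_id: cosine(kws, centroid) for doc_id, kws in members.items()}
    return {
        "quality": sum(scores.values()) / len(scores),
        "keywords": sorted(centroid, key=centroid.get, reverse=True),
        "documents": sorted(scores, key=scores.get, reverse=True),
    }
```

Under this assumption a tight cluster scores close to 1.0, and an outlying member both ranks last and pulls the score down.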

  6. Speed Tricks
     • Speed is a big issue in clustering
       • especially for interactive searching
     • Keyword extraction takes time
     • Pairwise comparisons don’t scale up well at all
     • Thus, we use a couple of speed tricks
       • One trick for database design
       • One trick inside the clustering function
     • Trick 1: Pre-generate keywords
       • Use the BRS 6.3 keyword generation load filter
       • The filter produces a keyword paragraph that looks like this...

  7. Speed Tricks
     ..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12). ...
     • At clustering time, we don’t need to do keyword analysis
       • Just retrieve keyword lists from engine
       • Cuts execution time in half
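A stored keyword paragraph like the sample on slide 7 can be turned back into a keyword list with a few lines of Python. The format here is inferred from that single sample only (a label ending in a colon, then `keyword (weight).` entries), so the pattern and the function name `parse_keyword_paragraph` are assumptions rather than a description of the load filter's actual output rules.

```python
import re

# One entry is assumed to look like "keyword (weight)", as in the slide sample.
ENTRY = re.compile(r"([^().:]+?)\s*\((\d+(?:\.\d+)?)\)")


def parse_keyword_paragraph(text):
    """Return {keyword: weight} from a '..Keywords: word (12.34). ...' paragraph."""
    # Drop the paragraph label (everything up to the first colon), if present.
    body = text.split(":", 1)[1] if ":" in text else text
    return {m.group(1).strip(): float(m.group(2)) for m in ENTRY.finditer(body)}


sample = "..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12). ..."
keywords = parse_keyword_paragraph(sample)
```

Reading weights back this way replaces the keyword analysis step at clustering time, which is the saving slide 7 describes.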

  8. Speed Tricks
     • Trick 2: Cluster a sample of the set (Cutting et al.)
       • Create the desired number of clusters from a small sample
       • Then compare the remaining documents only to those few clusters, not to all other documents
       • Saves a huge amount of execution time
     • Another trick for result-set clustering:
       • Cluster only the top-ranked 100 to 1000 docs
     • A final speed note: CPU speed helps a lot
       • Clustering is very processor-intensive
       • 2x CPU speed gives almost 2x clustering speed
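Trick 2 can be sketched as a two-phase function: cluster a small sample, then assign each remaining document to its most similar cluster. This is an illustration under assumptions, not the toolkit's code. Cosine similarity, the weight-summing merge, the choice to leave cluster keyword lists unchanged during assignment, and all names (`agglomerate`, `cluster_with_sample`, `sample_ids`) are hypothetical. The caller chooses the sample, for example with `random.sample`.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two {keyword: weight} lists."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def merge(a, b):
    """Assumed rule: sum the keyword weights of the two lists."""
    merged = dict(a)
    for keyword, weight in b.items():
        merged[keyword] = merged.get(keyword, 0.0) + weight
    return merged


def agglomerate(items, n):
    """Pairwise merging of (doc_id, keywords) items until n clusters remain."""
    clusters = [([doc_id], dict(kws)) for doc_id, kws in items]
    while len(clusters) > max(n, 1):
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], merge(clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


def cluster_with_sample(docs, n, sample_ids):
    """Cluster only the sample pairwise, then assign every other document."""
    sample = set(sample_ids)
    clusters = agglomerate([(d, docs[d]) for d in docs if d in sample], n)
    for doc_id, keywords in docs.items():
        if doc_id in sample:
            continue
        # Each remaining document is compared to the n clusters only.
        best = max(clusters, key=lambda c: cosine(keywords, c[1]))
        best[0].append(doc_id)
    return [members for members, _ in clusters]
```

The expensive all-pairs comparison runs only over the sample; each remaining document then costs just n comparisons.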

  9. Query-By-Example (QBE)
     • Allows an example passage or document to serve as a query
     • Useful when we already have some text or a document about our topic
       • “Find more like this”
     • No query formulation required
     • QBE analyzes the text, then constructs and executes a query

  10. Dataware’s QBE Toolkit
      • One API function
      • Source of example text can be:
        • a text buffer
          • e.g. text selected with mouse
        • a BRS document (or documents) from a result set
          • e.g. selected from a title list
          • Can specify certain fields for analysis
        • a word list with weights or occurrence counts
      • Output is a standard ranked document list

  11. How It Works
      • Extracts keywords from the example text
        • using ... all together now ... our keyword-generation library, yet again
      • Keyword selection process likes words that:
        • occur frequently in the example text
        • are rare in the database as a whole
      • Getting database statistics can be done:
        • using field qualification - most accurate but slow
        • using no qualification - still good, much faster
        • not at all - just use occurrence counts in example text - fastest, but trickier
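The selection preference on slide 11 (frequent in the example, rare in the database) resembles a term-frequency times inverse-document-frequency weight. The sketch below uses that familiar formula as a stand-in; the keyword-generation library's actual scoring is not given in the slides. The function name `select_keywords`, its parameters, and the simple lowercase tokenizer are all assumptions.

```python
import math
import re
from collections import Counter


def select_keywords(text, doc_freq=None, num_docs=None, top=10):
    """Rank words in the example text; returns [(word, weight), ...].

    doc_freq maps word -> number of database documents containing it.
    Without database statistics, falls back to occurrence counts alone.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    weights = {}
    for word, tf in counts.items():
        if doc_freq is None or not num_docs:
            # "Not at all" option: occurrence counts in the example text only.
            weights[word] = float(tf)
        else:
            # Assumed weighting: frequent in the example, rare in the database.
            df = doc_freq.get(word, 0)
            weights[word] = tf * math.log((num_docs + 1) / (df + 1))
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
```

Running it without `doc_freq` shows why the fastest option is trickier: common words such as "the" rank as high as topical ones unless something else filters them out.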

  12. How It Works
      • Performs a ranked search using the keywords and their weights
      • Flexible fielding:
        • Analysis of example document(s) can use one set of BRS paragraphs
        • Search can use a different set
      • Speed trick:
        • Generate keyword field for database (load filter)
        • Field-level index it
        • Use it for QBE searches
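The final step, a ranked search over a pre-generated keyword field, can be illustrated with a toy in-memory index. The slides do not describe the BRS engine's ranking formula, so the score used here (the sum of query weight times document weight over shared keywords) is only an assumed stand-in, and `ranked_search` and the `index` structure are hypothetical.

```python
def ranked_search(query, index, limit=10):
    """Rank documents against weighted query keywords.

    query maps keyword -> weight (from the example text).
    index maps doc id -> {keyword: weight} (the stored keyword field).
    Returns [(doc_id, score), ...], best first; non-matching docs are omitted.
    """
    scores = {}
    for doc_id, keywords in index.items():
        score = sum(w * keywords[k] for k, w in query.items() if k in keywords)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


index = {
    "doc1": {"methylhistidine": 118.12, "mass": 156.56},
    "doc2": {"mass": 40.0, "patent": 90.0},
    "doc3": {"patent": 75.0},
}
hits = ranked_search({"methylhistidine": 2.0, "mass": 1.0}, index)
```

Because the search touches only the short keyword field rather than full text, this matches the intent of the speed trick on slide 12.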

  13. That’s all, folks!
