Dataware’s Document Clustering and Query-By-Example Toolkits

Dataware’s Document Clustering and Query-By-Example Toolkits
John Munson
Dataware Technologies
1999 BRS User Group Conference

Document Clustering
• Automatically creates clusters of similar documents
• General benefit: provides an overview of the range of topics in a set
• Multiple specific uses
  • Familiarization with database before searching
  • Familiarization with a result set after searching
  • Assistance in category definition for other uses
    • Category tree construction
    • FAQ construction

Dataware’s Clustering Toolkit
• One API function
• Source of documents is a BRS result set
  • which could be backref 0 for entire database
• Can specify certain fields for analysis
• Output indicates member documents for each cluster
• Application can specify number and max/min size of clusters, etc.
• US PTO (Patent and Trademark Office) plans to do category tree construction

How It Works
• Extracts keywords from each document
  • using our keyword-generation library
  • which is also in the BRS 6.3 keyword generation load filter
• Repeats these steps:
  • Compare document and cluster pairs using the keyword lists
    • How many keywords do two lists share, and how similar are their weights?
  • Combine the most similar pair into one cluster
• Stops when n clusters remain (n is configurable)
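The repeat-until-n loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's code: the slide does not name the similarity measure, so cosine similarity over keyword weights stands in for "how many keywords are shared and how similar are their weights", and the names `cosine`, `merge`, and `cluster` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two {keyword: weight} dicts (assumed measure)."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def merge(a, b):
    """Combine two keyword lists by summing the weights of each keyword."""
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0.0) + v
    return out


def cluster(docs, n):
    """docs: {doc_id: {keyword: weight}}. Returns a list of (member_ids, keywords)."""
    # Every document starts as its own one-member cluster.
    clusters = [([doc_id], dict(kw)) for doc_id, kw in docs.items()]
    while len(clusters) > n:
        # Compare every pair and pick the most similar one.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], merge(clusters[i][1], clusters[j][1]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


docs = {
    "d1": {"mass": 2.0, "compartment": 1.0},
    "d2": {"mass": 2.0, "compartment": 1.5},
    "d3": {"patent": 3.0},
    "d4": {"patent": 2.0, "claim": 1.0},
}
result = cluster(docs, 2)
```

Each pass compares every remaining pair, which is why pairwise comparison becomes the bottleneck as the document set grows.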

How It Works
• Output is a list of clusters, including:
  • a cluster quality score
    • Measures how cohesive the cluster is
  • a ranked list of keywords describing the cluster
  • a ranked list of member documents
    • Highest-ranked docs are the most “central”
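One way to picture the three outputs is the sketch below. The slide does not say how the score or the rankings are computed, so everything here is an assumption: cohesion is taken as the mean pairwise cosine similarity of the members, keywords are ranked by summed weight, and a document's "centrality" is its similarity to the summed keyword profile of the cluster. `describe_cluster` is a hypothetical name.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two {keyword: weight} dicts."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def describe_cluster(members, top_k=5):
    """members: {doc_id: {keyword: weight}} for one cluster."""
    # Cluster profile: summed keyword weights over all member documents.
    profile = {}
    for kw in members.values():
        for k, v in kw.items():
            profile[k] = profile.get(k, 0.0) + v
    # Assumed quality score: mean pairwise similarity (1.0 for a single member).
    pairs = list(combinations(members.values(), 2))
    quality = sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
    keywords = sorted(profile, key=profile.get, reverse=True)[:top_k]
    # Documents closest to the cluster profile come first.
    documents = sorted(members, key=lambda d: cosine(members[d], profile), reverse=True)
    return {"quality": quality, "keywords": keywords, "documents": documents}


members = {
    "a": {"mass": 3.0, "compartment": 1.0},
    "b": {"mass": 2.0, "compartment": 2.0},
    "c": {"compartment": 3.0, "protein": 1.0},
}
summary = describe_cluster(members)
```

In this toy cluster, document "b" ranks first because its keyword weights sit between those of the other two members.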

Speed Tricks
• Speed is a big issue in clustering
  • especially for interactive searching
• Keyword extraction takes time
• Pairwise comparisons don’t scale up well at all
• Thus, we use a couple of speed tricks
  • One trick for database design
  • One trick inside the clustering function
• Trick 1: Pre-generate keywords
  • Use the BRS 6.3 keyword generation load filter
  • The filter produces a keyword paragraph that looks like this...

Speed Tricks
..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12). ...
• At clustering time, we don’t need to do keyword analysis
  • Just retrieve keyword lists from engine
• Cuts execution time in half
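Reading such a pre-generated paragraph back into a keyword list is cheap compared with re-analyzing the document. The sketch below assumes the syntax is exactly what the single example above shows (a `Keywords:` label followed by `keyword (weight).` entries); the real load filter output may differ, and `parse_keyword_paragraph` is a hypothetical helper, not a BRS function.

```python
import re

# One entry: a keyword, optional whitespace, then a numeric weight in parentheses.
KEYWORD_ENTRY = re.compile(r"([^\s().][^().]*?)\s*\((\d+(?:\.\d+)?)\)")


def parse_keyword_paragraph(text):
    """Return {keyword: weight} from a paragraph in the assumed format."""
    # Drop everything up to and including the "Keywords:" label, if present.
    body = text.split("Keywords:", 1)[-1]
    return {m.group(1).strip(): float(m.group(2)) for m in KEYWORD_ENTRY.finditer(body)}


paragraph = "..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12)."
keywords = parse_keyword_paragraph(paragraph)
```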

Speed Tricks
• Trick 2: Cluster a sample of the set (Cutting et al.)
  • Create the desired number of clusters from a small sample
  • Then compare the remaining documents only to those few clusters, not to all other documents
  • Saves a huge amount of execution time
• Another trick for result-set clustering:
  • Cluster only the top-ranked 100 to 1000 docs
• A final speed note: CPU speed helps a lot
  • Clustering is very processor-intensive
  • 2x CPU speed gives almost 2x clustering speed
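Trick 2 can be sketched as follows: run the expensive pairwise clustering on a random sample only, then place each remaining document in its most similar cluster. This is an illustration under stated assumptions, not the toolkit's implementation: the similarity measure (cosine), the fixed cluster profiles during assignment, and the names `agglomerate` and `cluster_by_sample` are all hypothetical.

```python
import random
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two {keyword: weight} dicts (assumed measure)."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def agglomerate(docs, n):
    """Pairwise merging until n clusters remain; returns [(member_ids, keywords)]."""
    clusters = [([d], dict(kw)) for d, kw in docs.items()]
    while len(clusters) > n:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        keywords = dict(clusters[i][1])
        for k, v in clusters[j][1].items():
            keywords[k] = keywords.get(k, 0.0) + v
        merged = (clusters[i][0] + clusters[j][0], keywords)
        clusters = [c for x, c in enumerate(clusters) if x not in (i, j)] + [merged]
    return clusters


def cluster_by_sample(docs, n, sample_size, seed=0):
    """Cluster a random sample, then assign every other document to its nearest cluster."""
    rng = random.Random(seed)
    sample_ids = set(rng.sample(sorted(docs), min(sample_size, len(docs))))
    clusters = agglomerate({d: docs[d] for d in sample_ids}, n)
    profiles = [kw for _, kw in clusters]  # kept fixed while assigning the rest
    for doc_id, kw in docs.items():
        if doc_id in sample_ids:
            continue
        # Only len(profiles) comparisons per document, not one per other document.
        best = max(range(len(profiles)), key=lambda c: cosine(kw, profiles[c]))
        clusters[best][0].append(doc_id)
    return [members for members, _ in clusters]


docs = {
    "a1": {"mass": 2.0, "compartment": 1.0},
    "a2": {"mass": 1.0, "compartment": 2.0},
    "a3": {"mass": 3.0},
    "p1": {"patent": 2.0, "claim": 1.0},
    "p2": {"patent": 1.0, "claim": 2.0},
    "p3": {"claim": 3.0},
}
groups = cluster_by_sample(docs, n=2, sample_size=4)
```

The pairwise work grows with the square of the sample size rather than of the whole set; the price is that cluster quality depends on how representative the sample is.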

Query-By-Example (QBE)
• Allows an example passage or document to serve as a query
• Useful when we already have some text or a document about our topic
  • “Find more like this”
• No query formulation required
  • QBE analyzes the text, then constructs and executes a query

Dataware’s QBE Toolkit
• One API function
• Source of example text can be:
  • a text buffer
    • e.g. text selected with mouse
  • a BRS document (or documents) from a result set
    • e.g. selected from a title list
    • Can specify certain fields for analysis
  • a word list with weights or occurrence counts
• Output is a standard ranked document list

How It Works
• Extracts keywords from the example text
  • using ... all together now ... our keyword-generation library, yet again
• Keyword selection process likes words that:
  • occur frequently in the example text
  • are rare in the database as a whole
• Getting database statistics can be done:
  • using field qualification - most accurate but slow
  • using no qualification - still good, much faster
  • not at all -- just use occurrence counts in example text -- fastest, but trickier
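"Frequent in the example, rare in the database" is the same intuition as TF-IDF weighting, which the sketch below uses as a stand-in. The slide does not give the actual formula, so the weight `tf * log((N + 1) / (df + 1))`, the simple tokenizer, and the name `select_keywords` are assumptions for illustration only.

```python
import re
from collections import Counter
from math import log


def select_keywords(example_text, doc_freq, num_docs, top_k=10):
    """Rank words of the example text by an assumed TF-IDF-style weight.

    doc_freq: {word: number of database documents containing the word}
    num_docs: total number of documents in the database
    """
    counts = Counter(re.findall(r"[a-z]+", example_text.lower()))
    weights = {
        word: tf * log((num_docs + 1) / (doc_freq.get(word, 0) + 1))
        for word, tf in counts.items()
    }
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


text = "the mass of the compartment and the mass of methylhistidine"
doc_freq = {"the": 1000, "of": 990, "and": 980,
            "mass": 50, "compartment": 20, "methylhistidine": 2}
top = select_keywords(text, doc_freq, num_docs=1000, top_k=3)
```

Passing an empty `doc_freq` corresponds to the "not at all" option: every word gets the same rarity factor, so the ranking falls back to occurrence counts in the example text, and common words such as "the" are no longer filtered out by the statistics.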

How It Works
• Performs a ranked search using the keywords and their weights
• Flexible fielding:
  • Analysis of example document(s) can use one set of BRS paragraphs
  • Search can use a different set
• Speed trick:
  • Generate keyword field for database (load filter)
  • Field-level index it
  • Use it for QBE searches
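The speed trick amounts to searching a small, pre-indexed keyword field instead of full text. The sketch below mimics that with an in-memory inverted index; it does not use or represent the BRS engine, and the scoring rule (sum of query weight times document keyword weight) and the names `build_keyword_index` and `ranked_search` are assumptions.

```python
from collections import defaultdict


def build_keyword_index(doc_keywords):
    """doc_keywords: {doc_id: {keyword: weight}} -> {keyword: [(doc_id, weight)]}."""
    index = defaultdict(list)
    for doc_id, keywords in doc_keywords.items():
        for keyword, weight in keywords.items():
            index[keyword].append((doc_id, weight))
    return index


def ranked_search(index, query, limit=10):
    """query: {keyword: weight}. Returns [(doc_id, score)], best match first."""
    scores = defaultdict(float)
    for keyword, query_weight in query.items():
        # Only documents whose keyword field contains the term are touched.
        for doc_id, doc_weight in index.get(keyword, ()):
            scores[doc_id] += query_weight * doc_weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


index = build_keyword_index({
    "d1": {"mass": 2.0, "compartment": 1.0},
    "d2": {"mass": 1.0},
    "d3": {"patent": 5.0},
})
hits = ranked_search(index, {"mass": 3.0, "compartment": 1.0})
```

Documents that share no keyword with the query never appear in the result, which keeps the ranked list short and the search fast.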

That’s all, folks!
