LORNET Theme 4
Text Mining: Fast Phrase-based Text Indexing and Matching
Khaled Hammouda, Ph.D. Student
PAMI Research Group
University of Waterloo
Waterloo, Ontario, Canada
The Problem
• Sources: Web / LOR (text documents, web documents, discussion articles, . . .)
• Automatic clustering/grouping into topics such as Pattern Recognition, Programming Languages, Data Mining, Database Systems
• How do we judge similarity?
Clustering Documents
• Group similar documents together
• Maximize intra-cluster similarity
• Minimize inter-cluster similarity
• Need to accurately calculate document similarity
Document Similarity
• How similar is each document to every other document?
• Very time consuming! O(n²) pairwise comparisons
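The quadratic cost comes from the number of unordered document pairs, n(n-1)/2. A minimal sketch of the brute-force approach follows. The sample documents and the word-overlap (Jaccard) score are assumptions made for illustration, not the similarity measure used in this talk.

```python
from itertools import combinations


def jaccard(a, b):
    """Placeholder similarity (assumption): overlap of the two word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def pairwise_similarities(docs, sim=jaccard):
    """Score every unordered pair of documents: n*(n-1)/2 calls to sim."""
    return {(i, j): sim(docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)}


# Illustrative documents (hypothetical, not taken from the slides).
docs = ["river rafting", "mild river rafting", "fishing trips", "river fishing"]
scores = pairwise_similarities(docs)
print(len(scores))  # 6 pairs for 4 documents
```

Doubling the number of documents roughly quadruples the number of calls to the similarity function, which is why the per-pair cost matters so much.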
Document Similarity
• Information-theoretic measure (Dekang '98)
• How do we intersect every pair of documents without sacrificing efficiency?
• What features should we intersect?
  • Words
  • Phrases
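The difference between the two feature choices can be sketched by intersecting two small documents both ways. The documents, the `tokenize` and `phrases` helpers, and the three-word phrase limit are all assumptions made for this illustration.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def words(text):
    """The set of distinct words in a text."""
    return set(tokenize(text))


def phrases(text, max_len=3):
    """All contiguous runs of 2..max_len words, never crossing a sentence."""
    found = set()
    for sentence in re.split(r"[.!?]+", text):
        tokens = tokenize(sentence)
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                found.add(tuple(tokens[i:i + n]))
    return found


# Illustrative documents (hypothetical).
doc_a = "River rafting trips. Booking fishing trips."
doc_b = "Fishing trips on the river. Rafting vacation plan."

shared_words = words(doc_a) & words(doc_b)
shared_phrases = phrases(doc_a) & phrases(doc_b)
print(sorted(shared_words))    # ['fishing', 'rafting', 'river', 'trips']
print(sorted(shared_phrases))  # [('fishing', 'trips')]
```

Four words are shared, but only one phrase is. Word overlap alone suggests both documents discuss "river rafting", although the second never uses those words together.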
Fast Phrase-based Document Indexing and Matching
• Document Index Graph Structure
  • A model based on a digraph representation of the phrases in the document set
  • Nodes correspond to unique terms
  • Edges maintain phrase representation
  • A phrase is a path in the graph
  • The model is an inverted list (terms → documents)
  • Nodes carry term weight information for each document in which they appear
  • Shared phrases can be matched efficiently
• Phrase-based Features
  • Phrases: a more informative feature than individual words (local context matching)
  • Represent sentences rather than words
  • Facilitate phrase matching between documents
  • Achieve accurate pair-wise document similarity
  • Avoid the high dimensionality of the vector space model
  • Allow incremental processing
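The structure described above can be sketched roughly as follows. This is a simplified toy, not the implementation behind the talk: the class layout, the use of raw occurrence counts as "term weights", and the sample documents are all assumptions. Nodes are unique terms, edges link adjacent terms and record where they occur, and adding a document reports the maximal phrases (two or more words) it shares with documents indexed earlier.

```python
from collections import defaultdict


class DocumentIndexGraph:
    """Toy phrase index: nodes are unique terms, edges link adjacent terms."""

    def __init__(self):
        # term -> {doc_id: occurrence count} (a stand-in for term weights)
        self.nodes = defaultdict(lambda: defaultdict(int))
        # (term, next_term) -> {doc_id: {(sentence_no, position), ...}}
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_document(self, doc_id, sentences):
        """Index a document given as a list of token lists.

        Returns {earlier_doc_id: set of maximal shared phrases}.
        """
        shared = defaultdict(set)
        for s_no, tokens in enumerate(sentences):
            for term in tokens:
                self.nodes[term][doc_id] += 1
            # Running matches: (other_doc, its sentence, its edge position)
            # -> index in `tokens` where the shared phrase started.
            active = {}
            for pos in range(len(tokens) - 1):
                edge = (tokens[pos], tokens[pos + 1])
                next_active = {}
                for other, places in self.edges.get(edge, {}).items():
                    if other == doc_id:
                        continue
                    for o_s, o_pos in places:
                        # Extend a match that ended on the previous edge of
                        # the same sentence, or start a new one here.
                        start = active.pop((other, o_s, o_pos - 1), pos)
                        next_active[(other, o_s, o_pos)] = start
                # Matches that were not extended are maximal: report them.
                for (other, _, _), start in active.items():
                    shared[other].add(tuple(tokens[start:pos + 1]))
                active = next_active
                self.edges[edge][doc_id].add((s_no, pos))
            # Matches still running at the end of the sentence.
            for (other, _, _), start in active.items():
                shared[other].add(tuple(tokens[start:]))
        return dict(shared)


# Illustrative documents (hypothetical), added one at a time.
dig = DocumentIndexGraph()
r1 = dig.add_document(1, [["river", "rafting", "trips"]])
r2 = dig.add_document(2, [["wild", "river", "rafting"], ["fishing", "trips"]])
r3 = dig.add_document(3, [["river", "rafting", "vacation", "plan"],
                          ["booking", "fishing", "trips"]])
print(r3)
```

Because matching happens while the document is inserted, each new document is compared only against documents that share at least one edge with it, and no previously indexed document needs to be rescanned.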
Document Index Graph
[Figure: example Document Index Graph; legible labels: river, vacation plan, river rafting, trips]
Phrase-based Document Indexing
[Figures: Document Index Graph (size scalability); Document Index Graph (internal structure); Document Index Graph (time performance)]
Effect of using phrase-based similarity over individual words
[Figures: Effect of using phrase similarity (F-measure); Effect of using phrase similarity (Entropy)]
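The slides do not spell out how the two quality measures are computed. The sketch below assumes the definitions commonly used in document clustering evaluation: a class-weighted best-match F-measure (higher is better) and a size-weighted cluster entropy (lower is better). The function names and sample labels are hypothetical.

```python
from collections import Counter
from math import log


def clustering_f_measure(labels, clusters):
    """Class-size-weighted average of each class's best F score."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision, recall = n_ij / n_j, n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_i / n * best
    return total


def clustering_entropy(labels, clusters):
    """Cluster-size-weighted entropy of the class mix inside each cluster."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for (cls, clu), n_ij in joint.items():
        p = n_ij / cluster_sizes[clu]
        total -= cluster_sizes[clu] / n * p * log(p)
    return total


# Hypothetical ground-truth classes and two candidate clusterings.
labels = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]
mixed = [0, 1, 0, 1]
print(clustering_f_measure(labels, perfect), clustering_entropy(labels, perfect))
print(clustering_f_measure(labels, mixed), clustering_entropy(labels, mixed))
```

A clustering that reproduces the classes exactly scores an F-measure of 1 and an entropy of 0, while the evenly mixed clustering scores 0.5 and log 2.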
Applications
• Grouping search engine results on the fly (incremental processing)
• Creating taxonomies of documents (Yahoo! and Open Directory style)
• Implementing “Find Related” or “Find Similar” features of information retrieval systems
• Automatic generation of descriptive phrases about a set of documents (i.e., labeling clusters)
• Detecting plagiarism
Collaboration
• Provide Data Mining services (primarily text mining) for other groups
• Opportunity for collaboration with U of Saskatchewan:
  • I-Help Discussion System
  • Course Delivery Tools
• Others are welcome
Questions
• Instant Messaging
  • MSN Messenger: lornet_uw@hotmail.com
• E-mail
  • lornet@pami.uwaterloo.ca