Fast Mining of Interesting Phrases from Subsets of Text Corpora

Deepak P, Atreyee Dey, Debapriyo Majumdar*
IBM Research - India, Bengaluru, India
EDBT 2014 Conference, Athens, Greece
*presently with Indian Statistical Institute, Kolkata, India

Problem Description
Given a text corpus D and a subset D’ specified by a keyword query, find the top-k interesting phrases for D’ with respect to D.
Example: the query “ukraine, crimea …” selects the chosen subset D’ of the text corpus D; the phrases returned are Crimea independence (0.90), USA Russia Relations (0.85), G8 Membership (0.81), …

Earlier Approaches
• Phrase Indexing (Simitsis et al., VLDB 2008): one list of documents per phrase, e.g. p1: d12, d13, d30, d9901 and p9876: d1, d11, d305, d8100. Cost: O(|P|)
• Document Indexing (Bedathur et al., VLDB 2010; Gao & Michel, EDBT 2012): one list of phrases per document, e.g. d1: p5, p43, p167, p8970 and d9998: p23, p49, p305, p9987. Cost: O(|D’|)
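The cost difference between the two earlier index layouts can be illustrated with a small sketch. Everything here is hypothetical: the toy IDs, the dictionary layout and the function names are assumptions for illustration, not the data structures of the cited papers. Both functions count, for each phrase, how many documents of the chosen subset D’ contain it; the first touches every phrase list, the second touches only the documents in D’.

```python
from collections import Counter

# Hypothetical toy indexes; the IDs are illustrative, not taken from the slides.
phrase_index = {            # phrase -> set of documents containing it
    "p1": {"d1", "d2", "d3"},
    "p2": {"d2", "d4"},
    "p3": {"d5"},
}
document_index = {          # document -> set of phrases it contains
    "d1": {"p1"},
    "d2": {"p1", "p2"},
    "d3": {"p1"},
    "d4": {"p2"},
    "d5": {"p3"},
}

def counts_via_phrase_index(subset):
    """Scan every phrase list once: work grows with the number of phrases |P|."""
    counts = {}
    for phrase, docs in phrase_index.items():
        hits = len(docs & subset)
        if hits:
            counts[phrase] = hits
    return counts

def counts_via_document_index(subset):
    """Scan only the documents in the subset: work grows with |D'|."""
    counts = Counter()
    for doc in subset:
        counts.update(document_index[doc])
    return dict(counts)

subset = {"d2", "d4"}       # the chosen subset D'
a = counts_via_phrase_index(subset)
b = counts_via_document_index(subset)
```

Both calls return the same per-phrase counts; only the amount of index data visited differs.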

Estimating Interestingness: AND Query
• Consider an AND query composed of k keywords
• Q = {Q1, Q2, …, Qk}

Query Word Independence Assumption
Consider an AND query of two words, Q1 and Q2.
We would like to estimate p(P1|Q1, Q2) as an estimate of the interestingness of P1.
Instead, we could estimate p(Q1, Q2|P1) (as shown in the previous slide).
For OR query handling details, refer to the paper.
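One way to write the assumption out is sketched below. The slide's own formula did not survive in this transcript, so the factorised form is an assumption about what is intended: Bayes' rule relates the two conditionals, and treating the query words as independent given the phrase turns the joint term into a product of per-word terms p(w|p), which is the quantity the indexes store.

```latex
p(P_1 \mid Q_1, Q_2) = \frac{p(Q_1, Q_2 \mid P_1)\, p(P_1)}{p(Q_1, Q_2)}
\qquad\qquad
p(Q_1, Q_2 \mid P_1) \approx p(Q_1 \mid P_1)\, p(Q_2 \mid P_1)
```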

Our Disk-Resident Indexes
w1: p30 (0.23), p12 (0.21), p990 (0.18), p13 (0.002)
w9876: p810 (0.1), p11 (0.08), p305 (0.007), p8 (0.0001)
The score stored along with each phrase is p(w|p).
All values are stored in sorted order of score, highest first.
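A minimal sketch of how such word-specific lists could be built is given here. It assumes p(w|p) is estimated as the fraction of documents containing phrase p that also contain word w; that estimator, the toy corpus and the function name `build_word_lists` are all assumptions for illustration, not details confirmed by the slides.

```python
from collections import defaultdict

# Hypothetical toy corpus: each document is (set of words, set of phrases).
docs = [
    ({"trade", "reserves"}, {"p1", "p2"}),
    ({"trade"}, {"p1"}),
    ({"reserves"}, {"p2"}),
    ({"trade", "reserves"}, {"p2"}),
]

def build_word_lists(docs):
    """Return word -> [(phrase, p(w|p)), ...] sorted by score, highest first."""
    phrase_df = defaultdict(int)   # number of documents containing phrase p
    joint_df = defaultdict(int)    # number of documents containing both w and p
    for words, phrases in docs:
        for p in phrases:
            phrase_df[p] += 1
            for w in words:
                joint_df[(w, p)] += 1
    lists = defaultdict(list)
    for (w, p), n in joint_df.items():
        # Assumed estimator: co-occurrence count over phrase document frequency.
        lists[w].append((p, n / phrase_df[p]))
    for w in lists:
        # Score-descending order, phrase ID as a deterministic tie-breaker.
        lists[w].sort(key=lambda entry: (-entry[1], entry[0]))
    return dict(lists)

index = build_word_lists(docs)
```

Each word ends up with one list, so the index grows with the vocabulary that queries are allowed to use.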

Aggregation Approach: NRA
• We use the well-known NRA (No Random Access) algorithm to aggregate the lists corresponding to the query words and arrive at the top phrases
• At any point, each phrase has an upper and a lower bound on its score. An example of sum-aggregation:
Entries read so far:
w1: P1 (0.04167), P5 (0.0333)
w2: P103 (0.26), P1 (0.113)
Bounds [lower, upper]:
P1 – [0.1547, 0.1547]
P5 – [0.0333, 0.1463]
P103 – [0.26, 0.2933]
…
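The bound bookkeeping can be sketched as follows. This is a simplified, unoptimised rendering of NRA under sum aggregation, not the authors' implementation: the function name `nra_top_k`, the round-robin reading order and the convention that a phrase missing from a list scores 0 there are assumptions. The first two entries of each list are taken from the slide; the tail entries P7 and P9 are hypothetical, added so the lists are not exhausted when the algorithm stops.

```python
def nra_top_k(lists, k):
    """Sketch of NRA (No Random Access) top-k under sum aggregation.

    lists: one list per query word, each [(phrase, score), ...] sorted by
    score, highest first. A phrase missing from a list is assumed to score 0.
    Returns (top, bounds): the top-k phrases and {phrase: (lower, upper)}.
    """
    m = len(lists)
    seen = {}                       # phrase -> {list index: score read}
    # ceiling[i]: largest score any still-unread entry of list i can have
    ceiling = [lst[0][1] if lst else 0.0 for lst in lists]

    def bound(phrase):
        scores = seen[phrase]
        lower = sum(scores.values())
        upper = lower + sum(ceiling[i] for i in range(m) if i not in scores)
        return lower, upper

    depth = 0
    max_depth = max((len(lst) for lst in lists), default=0)
    while depth < max_depth:
        # One round of sorted access: read the next entry of every list.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                phrase, score = lst[depth]
                seen.setdefault(phrase, {})[i] = score
                # After the final entry nothing unread remains in this list.
                ceiling[i] = score if depth + 1 < len(lst) else 0.0
        depth += 1

        ranked = sorted(seen, key=lambda p: bound(p)[0], reverse=True)
        if len(ranked) >= k:
            kth_lower = bound(ranked[k - 1])[0]
            others = [bound(p)[1] for p in ranked[k:]]
            others.append(sum(ceiling))     # best case for a phrase not seen yet
            if kth_lower >= max(others):
                break                       # no other phrase can overtake the top k

    ranked = sorted(seen, key=lambda p: bound(p)[0], reverse=True)
    return ranked[:k], {p: bound(p) for p in ranked}

w1 = [("P1", 0.04167), ("P5", 0.0333), ("P7", 0.01)]   # P7 is hypothetical
w2 = [("P103", 0.26), ("P1", 0.113), ("P9", 0.02)]     # P9 is hypothetical
top, bounds = nra_top_k([w1, w2], k=1)
```

With k = 1 the loop stops after two rounds, because the lower bound of P103 already exceeds the upper bound of every other seen or unseen phrase; the hypothetical tail entries are never read.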

Our In-Memory Indexes
w1: p12 (0.21), p13 (0.002), p30 (0.23), p990 (0.18)
w9876: p8 (0.0001), p11 (0.08), p305 (0.007), p810 (0.1)
The score stored along with each phrase is p(w|p).
All values are stored in PhraseID-sorted order.
Indexes may be created by preserving just the top 10% of the values of each list.
We use a simple Sort-Merge-Join on these lists for in-memory operation.
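A sort-merge join over two PhraseID-sorted lists might look like the sketch below. The function names, the choice to keep only phrases present in both lists for an AND query, and the use of a product to combine the stored p(w|p) values (following the independence assumption) are assumptions for illustration. The list `w1` uses the values from the slide; `w2` is hypothetical.

```python
def prune_top_fraction(entries, fraction=0.1):
    """Keep the highest-scoring fraction of a list, then restore PhraseID order."""
    keep = max(1, int(len(entries) * fraction))
    best = sorted(entries, key=lambda e: e[1], reverse=True)[:keep]
    return sorted(best)

def merge_join_and(list_a, list_b):
    """Sort-merge join of two PhraseID-sorted lists of (phrase_id, score).

    Assumed AND semantics: a phrase must occur in both lists, and its
    combined score is the product of the two stored scores.
    """
    i = j = 0
    joined = []
    while i < len(list_a) and j < len(list_b):
        (pa, sa), (pb, sb) = list_a[i], list_b[j]
        if pa == pb:
            joined.append((pa, sa * sb))
            i += 1
            j += 1
        elif pa < pb:
            i += 1      # advance whichever cursor points at the smaller ID
        else:
            j += 1
    return joined

def top_k(joined, k):
    """Rank joined phrases by combined score, highest first."""
    return sorted(joined, key=lambda e: e[1], reverse=True)[:k]

w1 = [(12, 0.21), (13, 0.002), (30, 0.23), (990, 0.18)]   # values from the slide
w2 = [(8, 0.05), (13, 0.4), (30, 0.1), (810, 0.3)]        # hypothetical list
joined = merge_join_and(w1, w2)
best = top_k(joined, 1)
halved = prune_top_fraction(w1, 0.5)
```

Each list is scanned once with two cursors, so the join is linear in the combined list length; pruning shortens the lists before the join at the cost of possibly dropping low-scoring phrases.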

Example Results
• Query: trade reserves (Reuters Dataset)
• economic minister
• reserves
• taiwan’s foreign exchange reserves
• economic planning
• economic planning and development

Result Quality Evaluation
[Figure: Prec, MRR, NDCG and MAP for the 20-AND and 50-AND settings on the PubMed dataset; vertical axis from 0.9 to 1]

Running Times: Disk-based Operation (NRA)
[Figure: running times of OR-GM, AND-GM, OR-NRA and AND-NRA on the PubMed dataset; x-axis: percentage of NRA lists traversed (0 to 100); y-axis ticks from 100 to 10000000]

Percentages of Lists Traversed (NRA)
[Figure: percentage of lists traversed for Pubmed-OR, Pubmed-AND, Reuters-OR and Reuters-AND; axis from 27 to 34]

Running Times: Mem-based Operation (SMJ)
[Figure: running times of OR-GM, AND-GM, OR-SMJ and AND-SMJ on the PubMed dataset; x-axis: percentage of entries stored (0 to 100); y-axis ticks from 1 to 10000000]

Shortcomings
• Index Sizes
  • Earlier approaches index only phrases and documents
  • Our method has word-specific indexes, with each word having a list in the index
  • Number of words across documents could be much more than the number of phrases
  • If we would like to support querying over all possible words, index sizes could get large
• Queries on Metadata Facets
  • Instead of using keyword queries, document subsets could also be chosen using metadata facets
  • E.g., venue:sigmod AND year:2007, on a set of scholarly publications
  • Our independence assumption has not yet been tested on metadata facets

Summary
• Proposed an approach for the problem of mining interesting phrases from subsets of text corpora
• Outlined the query word independence assumption that is seen to be empirically useful in accurately identifying interesting phrases
• Our approach is seen to be up to 90% accurate, while being able to achieve turnaround times that are orders of magnitude better than those of the current techniques
Future Work
• Other potential avenues for leveraging the independence assumption for phrase analytics
• Methods to speed up interesting phrase mining over metadata facets

Thank You
Questions, Comments, Suggestions?
