Fast Mining of Interesting Phrases from Subsets of Text Corpora

Deepak P, Atreyee Dey, Debapriyo Majumdar*. IBM Research - India, Bengaluru, India. EDBT 2014 Conference, Athens, Greece. *Presently with Indian Statistical Institute, Kolkata, India.

Presentation Transcript


  1. Fast Mining of Interesting Phrases from Subsets of Text Corpora
     Deepak P, Atreyee Dey, Debapriyo Majumdar*
     IBM Research - India, Bengaluru, India
     EDBT 2014 Conference, Athens, Greece
     *presently with Indian Statistical Institute, Kolkata, India

  2. Problem Description
     Given a text corpus D and a subset D’, specified by a keyword query, find the top-k interesting phrases for D’ with respect to D.
     [Figure: the query "ukraine, crimea" selects the chosen subset D’ from the text corpus D. Example output: Crimea independence, 0.90; USA Russia Relations, 0.85; G8 Membership, 0.81; …]

  3. Earlier Approaches
     • Phrase Indexing (Simitsis et al., VLDB 2008): each phrase is indexed with a list of documents; cost O(|P|)
         p1: d12, d13, d30, d9901
         p9876: d1, d11, d305, d8100
     • Document Indexing (Bedathur et al., VLDB 2010; Gao & Michel, EDBT 2012): each document is indexed with a list of phrases; cost O(|D’|)
         d1: p5, p43, p167, p8970
         d9998: p23, p49, p305, p9987
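The two index layouts above can be contrasted with a small sketch. The scoring rule used here (the fraction of a phrase's documents that fall inside D’) is an assumption made for illustration, not necessarily the measure used in the cited papers. The function names are hypothetical as well.

```python
from collections import defaultdict

def build_phrase_index(doc_index):
    """Invert a document index (doc -> phrases) into a phrase index (phrase -> docs)."""
    phrase_index = defaultdict(set)
    for doc_id, phrases in doc_index.items():
        for phrase in phrases:
            phrase_index[phrase].add(doc_id)
    return dict(phrase_index)

def score_with_phrase_index(phrase_index, subset):
    """Visit every phrase list once: work grows with the number of phrases |P|."""
    scores = {}
    for phrase, doc_ids in phrase_index.items():
        hits = len(doc_ids & subset)
        if hits:
            # Assumed score: share of the phrase's documents that lie in D'.
            scores[phrase] = hits / len(doc_ids)
    return scores

def score_with_document_index(doc_index, phrase_index, subset):
    """Visit only the documents in D': work grows with |D'|."""
    counts = defaultdict(int)
    for doc_id in subset:
        for phrase in doc_index[doc_id]:
            counts[phrase] += 1
    # The phrase index is used here only to look up each phrase's total frequency.
    return {p: c / len(phrase_index[p]) for p, c in counts.items()}

if __name__ == "__main__":
    doc_index = {
        "d1": {"crimea independence", "g8 membership"},
        "d2": {"crimea independence"},
        "d3": {"g8 membership", "trade talks"},
    }
    phrase_index = build_phrase_index(doc_index)
    subset = {"d1", "d2"}  # D', e.g. the documents matching a keyword query
    print(score_with_phrase_index(phrase_index, subset))
    print(score_with_document_index(doc_index, phrase_index, subset))
```

Both functions return the same scores. They differ only in which index they traverse, which is where the O(|P|) and O(|D’|) costs come from.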

  4. Estimating Interestingness: AND Query
     • Consider an AND query composed of k keywords
     • Q = {Q1, Q2, …, Qk}

  5. Query Word Independence Assumption
     • Consider an AND query of two words, Q1 and Q2
     • We would like to estimate p(P1|Q1, Q2) as an estimate of the interestingness of P1
     • Instead, we could estimate p(Q1, Q2|P1) (as shown in the previous slide)
     • For OR query handling details, refer to the paper
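The slide's own formulas are not preserved in this transcript. One plausible reading, stated here as an assumption, is that the query words are treated as independent given the phrase. Bayes' rule links the two quantities named above, and the independence assumption then factorizes the joint term into per-word terms p(Q_i | P), which correspond to the p(w|p) scores stored in the indexes:

```latex
p(P_1 \mid Q_1, Q_2) = \frac{p(Q_1, Q_2 \mid P_1)\, p(P_1)}{p(Q_1, Q_2)},
\qquad
p(Q_1, Q_2 \mid P_1) \approx p(Q_1 \mid P_1)\, p(Q_2 \mid P_1)

% Generalization to a k-word AND query Q = \{Q_1, \dots, Q_k\}:
p(Q_1, \dots, Q_k \mid P) \approx \prod_{i=1}^{k} p(Q_i \mid P)
```

Under this reading, each factor depends on one query word only, so it can be precomputed per word and combined at query time.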

  6. Our Disk-Resident Indexes
     w1: p30 (0.23), p12 (0.21), p990 (0.18), p13 (0.002)
     w9876: p810 (0.1), p11 (0.08), p305 (0.007), p8 (0.0001)
     • The score that is stored along with each phrase is p(w|p)
     • All values are stored in sorted order
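A minimal sketch of how such per-word lists might be built. It assumes that p(w|p) is estimated as the fraction of documents containing phrase p that also contain word w, and that lists are sorted by descending score. Both points are assumptions, and `build_word_index` is a hypothetical name.

```python
from collections import defaultdict

def build_word_index(docs):
    """docs: iterable of (words, phrases) pairs, one pair of sets per document.

    Returns word -> list of (phrase, p(w|p)) sorted by descending score.
    """
    phrase_df = defaultdict(int)                    # documents containing each phrase
    co_df = defaultdict(lambda: defaultdict(int))   # word -> phrase -> co-occurrence count
    for words, phrases in docs:
        for phrase in phrases:
            phrase_df[phrase] += 1
            for word in words:
                co_df[word][phrase] += 1

    index = {}
    for word, per_phrase in co_df.items():
        entries = [(p, c / phrase_df[p]) for p, c in per_phrase.items()]
        # Descending score; ties broken by phrase id for a stable layout.
        entries.sort(key=lambda e: (-e[1], e[0]))
        index[word] = entries
    return index

if __name__ == "__main__":
    docs = [
        ({"trade", "reserves"}, {"p1", "p2"}),
        ({"trade"}, {"p1"}),
        ({"reserves"}, {"p2"}),
        ({"trade"}, {"p2"}),
    ]
    for word, entries in build_word_index(docs).items():
        print(word, entries)
```

Sorting by score is what allows a threshold-style algorithm to read only a prefix of each list.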

  7. Aggregation Approach: NRA
     • We use the well-known NRA algorithm to aggregate the lists corresponding to the query words and arrive at the top phrases
     • At any point, we have upper and lower bounds for each phrase. An example of sum-aggregation:
         Entries read so far: w1: (P1, 0.04167), (P5, 0.0333); w2: (P103, 0.26), (P1, 0.113)
         Bounds: P1 – [0.1547, 0.1547]; P5 – [0.0333, 0.1433]; P103 – [0.26, 0.2933]; …
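The bound bookkeeping can be sketched as follows. This is a simplified version of NRA (No Random Access) with sum aggregation and round-robin sorted access, written for illustration and not taken from the authors' implementation. The names `bounds` and `nra_top_k` are hypothetical.

```python
def bounds(seen, last):
    """seen: phrase -> {list index: score}; last: last score read from each list.

    Lower bound: sum of the scores seen so far.
    Upper bound: lower bound plus the last-read score of every list
    in which the phrase has not been seen yet.
    """
    out = {}
    for phrase, scores in seen.items():
        lower = sum(scores.values())
        upper = lower + sum(s for i, s in enumerate(last) if i not in scores)
        out[phrase] = (lower, upper)
    return out

def nra_top_k(lists, k):
    """lists: one list per query word of (phrase, score), sorted by descending score."""
    seen = {}
    last = [0.0] * len(lists)
    max_len = max((len(lst) for lst in lists), default=0)
    for depth in range(max_len):
        for i, lst in enumerate(lists):
            if depth < len(lst):
                phrase, score = lst[depth]
                seen.setdefault(phrase, {})[i] = score
                last[i] = score
            else:
                last[i] = 0.0  # exhausted list: unseen phrases score 0 here
        b = bounds(seen, last)
        ranked = sorted(b, key=lambda p: b[p][0], reverse=True)
        if len(ranked) >= k:
            top, rest = ranked[:k], ranked[k:]
            threshold = b[top[-1]][0]
            # Best possible score of any other phrase, seen or still unseen.
            best_other = max([b[p][1] for p in rest] + [sum(last)])
            if threshold >= best_other:
                return top, b
    b = bounds(seen, [0.0] * len(lists))
    ranked = sorted(b, key=lambda p: b[p][0], reverse=True)
    return ranked[:k], b

if __name__ == "__main__":
    w1 = [("P1", 0.04167), ("P5", 0.0333)]
    w2 = [("P103", 0.26), ("P1", 0.113)]
    top, b = nra_top_k([w1, w2], k=1)
    print(top)
    for phrase, (lo, hi) in sorted(b.items()):
        print(phrase, round(lo, 4), round(hi, 4))
```

On the list entries from the example, two rounds of sorted access give P1 the interval [0.1547, 0.1547] and P103 the interval [0.26, 0.2933]. At that point P103's lower bound exceeds every other upper bound, so the scan can stop without reading the rest of the lists.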

  8. Our In-Memory Indexes
     w1: p12 (0.21), p13 (0.002), p30 (0.23), p990 (0.18)
     w9876: p8 (0.0001), p11 (0.08), p305 (0.007), p810 (0.1)
     • The score that is stored along with each phrase is p(w|p)
     • All values are stored in PhraseID sorted order
     • Indexes may be created by preserving just the top-10% values of each list
     • We will use simple Sort-Merge-Join on these lists for In-Memory operation
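A minimal sketch of the in-memory path, assuming integer phrase IDs and, for an AND query, a product of the per-word scores (an assumption that follows from treating query words as independent; OR handling is not shown). The second list in the usage example is hypothetical, and so are the function names.

```python
import heapq
import math

def prune_top_fraction(entries, fraction=0.1):
    """Keep the highest-scoring fraction of a list, then restore phrase-ID order."""
    keep = max(1, math.ceil(fraction * len(entries)))
    best = heapq.nlargest(keep, entries, key=lambda e: e[1])
    return sorted(best, key=lambda e: e[0])

def sort_merge_join(list_a, list_b):
    """Join two (phrase_id, score) lists that are sorted by ascending phrase_id.

    Only phrases present in both lists survive; their scores are multiplied.
    """
    i = j = 0
    joined = []
    while i < len(list_a) and j < len(list_b):
        id_a, score_a = list_a[i]
        id_b, score_b = list_b[j]
        if id_a == id_b:
            joined.append((id_a, score_a * score_b))
            i += 1
            j += 1
        elif id_a < id_b:
            i += 1
        else:
            j += 1
    return joined

def top_k(joined, k):
    """Return the k highest-scoring joined entries."""
    return heapq.nlargest(k, joined, key=lambda e: e[1])

if __name__ == "__main__":
    w1 = [(12, 0.21), (13, 0.002), (30, 0.23), (990, 0.18)]  # values from the slide
    w2 = [(11, 0.08), (12, 0.05), (30, 0.1), (810, 0.1)]     # hypothetical second list
    print(top_k(sort_merge_join(w1, w2), k=1))
    print(prune_top_fraction(w1, fraction=0.5))
```

Keeping the lists in phrase-ID order lets the join run in a single linear pass over both lists. Pruning trades some accuracy for smaller lists, since a phrase dropped from one list can no longer appear in the join.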

  9. Example Results
     • Query: trade reserves (Reuters Dataset)
         • economic minister
         • reserves
         • taiwan’s foreign exchange reserves
         • economic planning
         • economic planning and development

  10. Result Quality Evaluation
      [Figure: Prec, MRR, NDCG and MAP for 20-AND and 50-AND queries on the PubMed dataset; y-axis from 0.9 to 1]

  11. Running Times: Disk-based Operation (NRA)
      [Figure: running times of OR-GM, AND-GM, OR-NRA and AND-NRA on the PubMed dataset; x-axis: percentage of NRA lists traversed (0 to 100); y-axis ticks from 100 to 10000000]

  12. Percentages of Lists Traversed (NRA)
      [Figure: percentage of lists traversed for Pubmed-OR, Pubmed-AND, Reuters-OR and Reuters-AND; axis from 27 to 34]

  13. Running Times: Mem-based Operation (SMJ)
      [Figure: running times of OR-GM, AND-GM, OR-SMJ and AND-SMJ on the PubMed dataset; x-axis: percentage of entries stored (0 to 100); y-axis ticks from 1 to 10000000]

  14. Shortcomings
      • Index Sizes
          • Earlier approaches index only phrases and documents
          • Our method has word-specific indexes, with each word having a list in the index
          • The number of words across documents could be much larger than the number of phrases
          • If we would like to support querying over all possible words, index sizes could get large
      • Queries on Metadata Facets
          • Instead of using keyword queries, document subsets could also be chosen using metadata facets
          • E.g., venue:sigmod AND year:2007, on a set of scholarly publications
          • Our independence assumption has not yet been tested on metadata facets

  15. Summary
      • Proposed an approach for the problem of mining interesting phrases from subsets of text corpora
      • Outlined the query word independence assumption that is seen to be empirically useful in accurately identifying interesting phrases
      • Our approach is seen to be up to 90% accurate, while being able to achieve turnaround times that are orders of magnitude better than those of the current techniques
      • Future Work
          • Other potential avenues for leveraging the independence assumption for phrase analytics
          • Methods to speed up interesting phrase mining over metadata facets

  16. Thank You
      Questions, Comments, Suggestions?
