SamudraManthan Popular terms


Presentation Transcript


  1. SamudraManthan Popular Terms. Dinesh Bhirud, Prasad Kulkarni, Varada Kolhatkar

  2. Architecture [diagram: a MANAGER process feeds WORKER processors; each worker creates data structures, performs Ngram pruning, and does intra-process reduction; a REDUCTION phase then finds the top Ngrams]

  3. Data Distribution [diagram: handshake protocol between MANAGER P0 and WORKERs P1 .. PN via a handshake module] Handshake steps: (1) ready signal (W->M), (2) data msg (M->W), (3) next ready signal (W->M), ... (4) terminate msg (M->W).

  4. Data Distribution (contd.) The manager (processor 0) reads an article and passes it to the other processors (workers) in round-robin fashion. Before sending a new article to the same worker, the manager waits until that worker is ready to receive more data. A worker processes each article and builds its data structures before receiving the next one. Sends and receives are synchronous.
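A minimal sketch of this handshake in MPI C follows; read_article(), process_article(), the tag constants, and the buffer size are illustrative assumptions, not the project's actual code.

    /* Sketch of the manager/worker handshake; helpers and tags are made up. */
    #include <mpi.h>

    #define TAG_READY 1
    #define TAG_DATA  2
    #define TAG_DONE  3
    #define MAX_ARTICLE 65536

    int read_article(char *buf, int cap);        /* hypothetical input helper */
    void process_article(const char *article);   /* builds suffix array etc.  */

    void manager(int nworkers) {                 /* runs on processor 0 */
        char article[MAX_ARTICLE];
        int len, dummy, w = 1;
        MPI_Status st;
        while ((len = read_article(article, MAX_ARTICLE)) > 0) {
            /* Round-robin: wait for worker w's ready signal, then send. */
            MPI_Recv(&dummy, 1, MPI_INT, w, TAG_READY, MPI_COMM_WORLD, &st);
            MPI_Send(article, len, MPI_CHAR, w, TAG_DATA, MPI_COMM_WORLD);
            w = (w % nworkers) + 1;              /* workers are ranks 1..nworkers */
        }
        for (w = 1; w <= nworkers; w++) {        /* drain last ready, terminate */
            MPI_Recv(&dummy, 1, MPI_INT, w, TAG_READY, MPI_COMM_WORLD, &st);
            MPI_Send(&dummy, 0, MPI_INT, w, TAG_DONE, MPI_COMM_WORLD);
        }
    }

    void worker(void) {
        char article[MAX_ARTICLE];
        int dummy = 0;
        MPI_Status st;
        for (;;) {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_READY, MPI_COMM_WORLD);
            MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) {        /* no more articles */
                MPI_Recv(&dummy, 0, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD, &st);
                break;
            }
            MPI_Recv(article, MAX_ARTICLE, MPI_CHAR, 0, TAG_DATA,
                     MPI_COMM_WORLD, &st);
            process_article(article);            /* build data structures */
        }
    }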

  5. Suffix Array, LCP Vector and Equivalence Classes A suffix array is a sorted array of suffixes. The LCP vector keeps track of terms repeated across adjacent suffixes. We use the suffix array and LCP vector to partition each article into equivalence classes, where each class represents a group of Ngrams. Together the classes represent every Ngram in the article, and no Ngram is represented more than once.

  6. Example A ROSE IS A ROSE
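Laying this example out with the structures from the previous slide (word-position suffixes, sorted; LCP counts the words shared with the previous sorted suffix):

    Start  Sorted suffix        LCP
      3    A ROSE                -
      0    A ROSE IS A ROSE      2
      2    IS A ROSE             0
      4    ROSE                  0
      1    ROSE IS A ROSE        1

The LCP value 2 after "A ROSE" yields the classes for "A" (tf 2) and "A ROSE" (tf 2); the LCP value 1 yields the class for "ROSE" (tf 2); every other Ngram in the sentence occurs exactly once.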

  7. Advantages of Suffix Arrays • Time complexity: there can be at most 2N-1 classes, where N is the number of words in an article • Ngrams of any and all sizes can be identified, together with their term frequencies (tfs), in linear time • These data structures let us represent Ngrams of every size without actually storing them

  8. Intra-Processor Reduction Problem The suffix array data structure gives us article-level unique Ngrams with term frequencies, but a processor processes multiple articles. We need to identify unique Ngrams across articles, and we need a unique identifier for each word.

  9. Dictionary – Our Savior The dictionary is a sorted list of all unique words in the Gigaword corpus. Dictionary ids form a unified basis for both intra- and inter-process reduction.
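A minimal sketch of how a sorted dictionary can map a word to its id by binary search; the function name and array layout here are assumptions:

    #include <string.h>

    /* Hypothetical lookup: the index of a word in the sorted dictionary
       serves as its id; returns -1 if the word is absent. */
    int dict_lookup(char **dict, int dict_size, const char *word) {
        int lo = 0, hi = dict_size - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = strcmp(word, dict[mid]);
            if (cmp == 0) return mid;       /* the index is the word id */
            if (cmp < 0) hi = mid - 1;
            else lo = mid + 1;
        }
        return -1;                          /* word not in the dictionary */
    }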

  10. Intra-Processor Reduction • Used a hash table to store unique Ngrams with tf and df • Hashing function: a simple mod hash, H(t) = ∑ t(i) mod HASH_SIZE, where t(i) is the dictionary id of word i in Ngram t • Hash data structure:

    struct ngramstore {
        int *word_id;
        int cnt;
        int doc_freq;
        struct ngramstore *chain;
    };
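A hedged sketch of how this hash might be used in C; only the struct comes from the slide, while the HASH_SIZE value and insert_ngram() are illustrative. n is the Ngram size, fixed per run (see slide 19).

    #include <stdlib.h>
    #include <string.h>

    #define HASH_SIZE 1048573            /* illustrative table size */

    struct ngramstore {
        int *word_id;                    /* dictionary ids of the n words */
        int cnt;                         /* term frequency */
        int doc_freq;                    /* document frequency */
        struct ngramstore *chain;        /* collision chain */
    };

    static struct ngramstore *table[HASH_SIZE];

    /* H(t) = sum of the dictionary ids of t's words, mod HASH_SIZE */
    static unsigned hash_ngram(const int *ids, int n) {
        unsigned h = 0;
        for (int i = 0; i < n; i++)
            h += (unsigned)ids[i];
        return h % HASH_SIZE;
    }

    /* Record an Ngram seen tf times in one article; merge if present. */
    void insert_ngram(const int *ids, int n, int tf) {
        unsigned h = hash_ngram(ids, n);
        for (struct ngramstore *p = table[h]; p; p = p->chain)
            if (memcmp(p->word_id, ids, n * sizeof(int)) == 0) {
                p->cnt += tf;            /* known Ngram: bump tf */
                p->doc_freq += 1;        /* seen in one more document */
                return;
            }
        struct ngramstore *p = malloc(sizeof *p);
        p->word_id = malloc(n * sizeof(int));
        memcpy(p->word_id, ids, n * sizeof(int));
        p->cnt = tf;
        p->doc_freq = 1;
        p->chain = table[h];             /* push onto the collision chain */
        table[h] = p;
    }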

  11. Inter-Process Reduction: Binomial Tree Steps: i varies from 0 to log(n) - 1. At step i, each sender's id exceeds its receiver's by diff = 2^i; a process receives if id % 2^(i+1) == 0, otherwise it sends (in the code: max_recv = (reductions-1) * (int)pow((double)2, i+1);). Processors enter the next iteration by calling MPI_Barrier(). With 8 processes:

    Step i = 0: 1 -> 0, 3 -> 2, 5 -> 4, 7 -> 6
    Step i = 1: 2 -> 0, 6 -> 4
    Step i = 2: 4 -> 0
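A sketch of this collection loop in MPI C, assuming hypothetical send_hash() and receive_and_merge_hash() helpers for shipping and merging the Ngram hash; only the loop shape and the barrier follow the slide.

    #include <mpi.h>

    void send_hash(int dest);              /* serialize hash to dest (assumed) */
    void receive_and_merge_hash(int src);  /* merge src's Ngrams (assumed)     */

    void binomial_reduce(int id, int nprocs) {
        int done = 0;                      /* set once our hash is shipped */
        for (int i = 0; (1 << i) < nprocs; i++) {
            int diff = 1 << i;             /* partner ids differ by 2^i */
            if (!done) {
                if (id % (2 * diff) == 0) {
                    if (id + diff < nprocs)
                        receive_and_merge_hash(id + diff);
                } else {
                    send_hash(id - diff);  /* e.g. step 0: 1->0, 3->2, ... */
                    done = 1;
                }
            }
            MPI_Barrier(MPI_COMM_WORLD);   /* all enter the next step together */
        }
        /* After ceil(log2(nprocs)) steps, rank 0 holds every unique Ngram. */
    }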

  12. Inter-Process Reduction Using Hashing We reuse our hashing technique and code from intra-process reduction. All processes follow the binomial tree collection pattern to reduce unique Ngrams; after log n steps, process 0 holds the final hash with all unique Ngrams.

  13. Scaling up to GigaWord? Goal: reduce the per-processor memory requirement. Cutting off by term frequency: • Ngrams with a low tf are not going to score high • Observation: 66% of all trigrams have term frequency 1 in 1.2 GB of data • It is unnecessary to carry such Ngrams • Solution: eliminate Ngrams with very low term frequency, as sketched below
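A minimal sketch of such a pruning pass, reusing the ngramstore layout from slide 10; the cutoff value is a tuning parameter, not a number from the slides.

    #include <stdlib.h>

    /* Drop every Ngram whose term frequency falls below the cutoff,
       unlinking it from its collision chain and freeing it. */
    void prune_low_tf(struct ngramstore **table, int size, int cutoff) {
        for (int h = 0; h < size; h++) {
            struct ngramstore **pp = &table[h];
            while (*pp) {
                if ((*pp)->cnt < cutoff) {
                    struct ngramstore *dead = *pp;
                    *pp = dead->chain;     /* unlink from the chain */
                    free(dead->word_id);
                    free(dead);
                } else {
                    pp = &(*pp)->chain;    /* keep this node, advance */
                }
            }
        }
    }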

  14. Pruning – Stoplist Motivation • Similarly, Ngrams with a high df are not going to score high • They are a memory hotspot • This elimination can be done only after intra-process collection, which defeats the goal of per-processor memory reduction • Hence the need for an adaptive elimination

  15. Pruning – Stoplist Ngrams such as "IN THE FIRST" scored high under the TF*IDF measure; we eliminate them to extract genuinely interesting terms. The stoplist is a list of commonly occurring words such as "the", "a", "to", "from", "is", "first". It is based on our dictionary, is still evolving, and currently contains 160 words. We eliminate Ngrams whose words all come from the stoplist, as sketched below.
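A sketch of that test, under the assumption that the stoplist is kept as a sorted array of dictionary ids (an assumption, since the slide only says it is based on our dictionary):

    /* Return 1 if every word of the Ngram is a stopword, i.e. the
       Ngram should be eliminated. stoplist is sorted by dictionary id. */
    int all_stopwords(const int *ids, int n, const int *stoplist, int slen) {
        for (int i = 0; i < n; i++) {
            int lo = 0, hi = slen - 1, found = 0;
            while (lo <= hi) {             /* binary search for ids[i] */
                int mid = lo + (hi - lo) / 2;
                if (stoplist[mid] == ids[i]) { found = 1; break; }
                if (stoplist[mid] < ids[i]) lo = mid + 1;
                else hi = mid - 1;
            }
            if (!found) return 0;          /* a content word keeps the Ngram */
        }
        return 1;                          /* all words are stopwords: drop */
    }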

  16. Interesting 3-grams on GigaWord

  17. Performance Analysis - Speedup

  18. Space Complexity • The memory requirement increases for higher-order Ngrams. Why? • Suppose there are n unique Ngrams in each article and m such articles • For higher-order Ngrams, the number of unique Ngrams increases • We store each unique Ngram in our hash data structure • In the worst case all Ngrams across articles are unique, so we must store mn unique Ngrams per processor

  19. Current Limitations • Static dictionary • M-through-N interesting Ngrams: our hash data structure is designed to handle a single Ngram size at a time, so we provide M-through-N functionality by repeatedly rebuilding all the data structures, which is not a scalable approach

  20. Thanks
