
  1. Scaling up LDA - 2 William Cohen

  2. Speedup for Parallel LDA: Using AllReduce for Synchronization

  3. What if you try to parallelize? Split the document/term matrix randomly and distribute it to p processors, then run “Approximate Distributed LDA”. This is a common subtask in parallel versions of LDA, SGD, …

  4. Introduction • Common pattern: • do some learning in parallel • aggregate local changes from each processor to shared parameters • distribute the new shared parameters back to each processor • and repeat… • AllReduce is implemented in MPI and, more recently, in VW code (John Langford) in a Hadoop-compatible scheme: a MAP step followed by an ALLREDUCE step (see the sketch below)
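A minimal sketch of this learn/aggregate/redistribute loop, written here with mpi4py's Allreduce; the toy least-squares model, shard, and step size are placeholders for illustration, not VW's actual update:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p = comm.Get_size()
rng = np.random.default_rng(comm.Get_rank())

params = np.zeros(10)                  # shared parameters, replicated on every worker
X = rng.normal(size=(100, 10))         # this worker's locally cached shard
y = X @ np.ones(10) + 0.1 * rng.normal(size=100)

for it in range(20):
    # 1. do some learning in parallel: a local gradient step on this shard
    grad = X.T @ (X @ params - y) / len(y)
    local = params - 0.1 * grad
    # 2.-3. aggregate local changes and redistribute: Allreduce sums the
    # vectors across workers, so dividing by p leaves every worker holding
    # the same averaged parameters for the next round
    summed = np.empty_like(local)
    comm.Allreduce(local, summed, op=MPI.SUM)
    params = summed / p
```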

  5. Gory details of VW Hadoop-AllReduce • Spanning-tree server: • a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server • Worker nodes (“fake” mappers): • input for each worker is locally cached • workers all connect to the spanning-tree server • workers all execute the same code, which might contain AllReduce calls • workers synchronize whenever they reach an all-reduce
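To make the spanning-tree mechanics concrete, here is a toy single-process simulation (the tree shape and node values are invented): values are summed up the tree, then the root's total is broadcast back down, leaving every node with the global sum.

```python
# node -> children; node 0 is the root of the spanning tree
children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
value = {n: float(n + 1) for n in children}   # each node's local contribution

def reduce_up(node):
    # phase 1: sum this node's value with the reduced sums of its subtrees
    return value[node] + sum(reduce_up(c) for c in children[node])

def broadcast_down(node, total, out):
    # phase 2: push the root's total back down to every node
    out[node] = total
    for c in children[node]:
        broadcast_down(c, total, out)

total = reduce_up(0)
result = {}
broadcast_down(0, total, result)
print(result)   # every node holds the same sum: 15.0
```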

  6. Hadoop AllReduce and duplicate jobs: with speculative execution, Hadoop launches duplicate copies of slow tasks and takes whichever finishes first, so the computation doesn't wait on stragglers

  7. Second-order method - like Newton’s method

  8. 2^24 features, ~100 non-zeros/example, 2.3B examples. Each example is a (user, page, ad) triple plus conjunctions of these features, labeled positive if there was a click-through on the ad

  9. 50M examples; an explicitly constructed kernel expansion gives 11.7M features with ~3,300 non-zeros/example. Old method: SVM, 3 days. Times reported are time to reach a fixed test error

  10. More LDA Speedups. First: Recap of LDA Details

  11. More detail

  12. [Figure: sampling z by dropping a random point on a unit-height bar partitioned into segments for z=1, z=2, z=3, …]

  13. Speedup 1: Sparsity

  14. [Figure repeated: random point on a unit-height bar with segments for z=1, z=2, z=3, …]

  15. Keep a running total of P(z=k|…), i.e., the cumulative P(z≤k)

  16. Discussion…. • Where do you spend your time? • sampling the z's • each sampling step involves a loop over all K topics • this seems wasteful • even with many topics, a word is usually assigned to only a few distinct topics • low-frequency words appear < K times … and there are lots and lots of them! • even frequent words are not in every topic (a naive sampling step is sketched below)
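For concreteness, a minimal sketch of the O(K) per-token sampling step being criticized, with symmetric priors α, β and count arrays named after the slides' n_{t|d}, n_{w|t}, n_{·|t} (the function and variable names are mine):

```python
import numpy as np

def sample_topic_naive(alpha, beta, V, n_td, n_wt, n_t, rng):
    """One collapsed-Gibbs draw for a token of word w in document d.
    n_td[k]: tokens in d assigned topic k; n_wt[k]: times w assigned to k;
    n_t[k]: total tokens assigned to k (all counts exclude this token)."""
    K = len(n_td)
    p = np.empty(K)
    for k in range(K):   # the loop over all topics the slide calls wasteful
        p[k] = (alpha + n_td[k]) * (beta + n_wt[k]) / (beta * V + n_t[k])
    u = rng.random() * p.sum()   # U ~ Uniform(0, Z)
    cdf = 0.0
    for k in range(K):           # walk the running total P(z <= k)
        cdf += p[k]
        if u <= cdf:
            return k
    return K - 1
```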

  17. Discussion…. • What's the solution? • Idea: come up with a sequence of upper bounds Z_i ≥ Z on the normalizer at each stage; then you might be able to stop the loop early…

  18. Tricks • How do you compute and maintain the bound? • see the paper • What order do you go in? • you want to pick large P(k)'s first • … so you want large P(k|d) and P(k|w) • … so we maintain the k's in sorted order • the counts only change a little after each flip, so a bubble-sort pass will fix up the almost-sorted array (sketch below)
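A sketch of that fix-up (names are mine): after a single count changes by ±1, one local bubble pass restores descending order.

```python
def fix_up(order, count, i):
    """order: topic ids sorted by descending count[t]; after count[order[i]]
    changed by +/-1, swap it left or right until order is sorted again."""
    while i > 0 and count[order[i]] > count[order[i - 1]]:
        order[i], order[i - 1] = order[i - 1], order[i]   # grew: bubble left
        i -= 1
    while i + 1 < len(order) and count[order[i]] < count[order[i + 1]]:
        order[i], order[i + 1] = order[i + 1], order[i]   # shrank: bubble right
        i += 1
    return i
```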

  19.-21. Results [three slides of result figures, not reproduced in the transcript]

  22. Speedup 2: Another Approach for Using Sparsity

  23. KDD '09: “Efficient Methods for Topic Model Inference on Streaming Document Collections” (Yao, Mimno & McCallum)

  24. Decompose the normalizer: z = s + r + q (expanded below)
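Written out (this is the bucket decomposition from the KDD '09 paper, in the slides' notation; the slides use z for the normalizing constant): the full conditional factors so that its normalizer splits into a smoothing-only bucket s, a document bucket r, and a topic-word bucket q.

```latex
\Pr(z = t \mid \cdot) \;\propto\;
  \frac{(\alpha_t + n_{t|d})\,(\beta + n_{w|t})}{\beta V + n_{\cdot|t}}
\qquad\Longrightarrow\qquad
z = s + r + q,\ \text{where}
\quad
s = \sum_t \frac{\alpha_t\,\beta}{\beta V + n_{\cdot|t}},
\quad
r = \sum_t \frac{n_{t|d}\,\beta}{\beta V + n_{\cdot|t}},
\quad
q = \sum_t \frac{n_{w|t}\,(\alpha_t + n_{t|d})}{\beta V + n_{\cdot|t}}
```

Only the r sum has nonzero terms for topics present in document d, and only the q sum for topics in which word w appears, which is exactly what the next slides exploit.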

  25. If U < s: • look up U on the line segment with tic-marks at α₁β/(βV + n_{·|1}), α₂β/(βV + n_{·|2}), … • If s < U < s+r: • look up U on the line segment for r • Only need to check t such that n_{t|d} > 0 • z = s+r+q

  26. If U < s: • look up U on the line segment with tic-marks at α₁β/(βV + n_{·|1}), α₂β/(βV + n_{·|2}), … • If s < U < s+r: • look up U on the line segment for r • If s+r < U: • look up U on the line segment for q • Only need to check t such that n_{w|t} > 0 • z = s+r+q

  27. The s bucket only needs to be checked occasionally (< 10% of the time) • For r, only need to check t such that n_{t|d} > 0 • For q, only need to check t such that n_{w|t} > 0 • z = s+r+q (lookup sketched below)
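A minimal sketch of that three-bucket lookup, assuming the running totals (the "tic-marks") for each bucket have already been computed; the names and the use of binary search are my choices:

```python
import bisect

def lookup_topic(u, s_cum, r_cum, r_topics, q_cum, q_topics):
    """u ~ Uniform(0, z) with z = s + r + q.
    s_cum: running totals over all K topics (smoothing bucket);
    r_cum/r_topics: totals and topic ids for topics with n_{t|d} > 0;
    q_cum/q_topics: totals and topic ids for topics with n_{w|t} > 0."""
    s = s_cum[-1]
    r = r_cum[-1] if r_cum else 0.0
    if u < s:                                    # rare: smoothing-only bucket
        return bisect.bisect_right(s_cum, u)     # index of the tic-mark u falls in
    u -= s
    if u < r:                                    # document bucket: short list
        return r_topics[bisect.bisect_right(r_cum, u)]
    u -= r
    return q_topics[bisect.bisect_right(q_cum, u)]   # common case: topic-word bucket
```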

  28. Trick: count up n_{t|d} for d when you start working on d, and update it incrementally • Only need to store (and maintain) the total words per topic, plus the α's, β, and V • Only need to store n_{t|d} for the current d • Need to store n_{w|t} for each word-topic pair …??? • z = s+r+q

  29. 1. Precompute, for each t, … [formula on slide, not in transcript] 2. Quickly find the t's such that n_{w|t} is large for w • Most (>90%) of the time and space is here… • Need to store n_{w|t} for each word-topic pair …??? • z = s+r+q

  30. 1. Precompute, for each t, … [formula on slide, not in transcript] 2. Quickly find the t's such that n_{w|t} is large for w: • map w to an int array • no larger than the frequency of w • no larger than #topics • encode (t,n) as a bit vector • n in the high-order bits • t in the low-order bits • keep the ints sorted in descending order • Most (>90%) of the time and space is here… • Need to store n_{w|t} for each word-topic pair …??? (packing sketched below)
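A sketch of that encoding (the 10-bit topic field is an assumed width, enough for 1,024 topics): because the count n sits in the high-order bits, sorting the packed ints in descending order sorts the (topic, count) pairs by count.

```python
TOPIC_BITS = 10                        # assumed width: supports up to 2**10 topics
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(t, n):
    return (n << TOPIC_BITS) | t       # n in the high-order bits, t in the low

def decode(x):
    return x & TOPIC_MASK, x >> TOPIC_BITS   # -> (t, n)

# the int array for one word w: one packed entry per topic with n_{w|t} > 0,
# kept in descending order so the large-count topics come first
row = sorted((encode(t, n) for t, n in [(3, 7), (0, 2), (5, 11)]), reverse=True)
print([decode(x) for x in row])        # [(5, 11), (3, 7), (0, 2)]
```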

  31. Speedup 3: Online LDA

  32. Pilfered from… Matthew Hoffman, Francis Bach & David Blei, “Online Learning for Latent Dirichlet Allocation”, NIPS 2010

  33. Aside: Variational Inference for LDA