
Scaling up LDA




Presentation Transcript


  1. Scaling up LDA (Monday’s lecture)

  2. What if you try to parallelize? Split the document/term matrix randomly and distribute it to p processors … then run “Approximate Distributed LDA”. This is a common subtask in parallel versions of LDA, SGD, ….
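
A minimal sketch of the Approximate Distributed LDA idea, under simple assumptions: each worker runs collapsed Gibbs sweeps over its own document shard against a stale copy of the global topic-word counts, and the local count changes are merged afterwards. Names and update details are illustrative, not taken from any particular implementation.

    import numpy as np

    def local_gibbs_sweep(docs, z, topic_word, doc_topic, alpha, beta, V):
        """One collapsed Gibbs sweep over one worker's shard (counts updated locally)."""
        K = topic_word.shape[0]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the token's current assignment from the counts
                topic_word[k, w] -= 1
                doc_topic[d, k] -= 1
                # sample a new topic from the (approximate) full conditional
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) \
                    / (topic_word.sum(axis=1) + V * beta)
                k = np.random.choice(K, p=p / p.sum())
                z[d][i] = k
                topic_word[k, w] += 1
                doc_topic[d, k] += 1
        return topic_word

    def adlda_iteration(shards, states, global_topic_word, alpha, beta, V):
        """One AD-LDA round: independent local sweeps, then merge the count deltas."""
        deltas = []
        for docs, (z, doc_topic) in zip(shards, states):
            local = global_topic_word.copy()          # stale copy per worker
            local = local_gibbs_sweep(docs, z, local, doc_topic, alpha, beta, V)
            deltas.append(local - global_topic_word)  # this worker's changes
        return global_topic_word + sum(deltas)        # approximate global update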

  3. AllReduce

  4. Introduction
  • Common pattern:
  • do some learning in parallel
  • aggregate local changes from each processor to shared parameters
  • distribute the new shared parameters back to each processor
  • and repeat….
  (Diagram labels: MAP, REDUCE, “some sort of copy”.)

  5. Introduction
  • Common pattern:
  • do some learning in parallel
  • aggregate local changes from each processor to shared parameters
  • distribute the new shared parameters back to each processor
  • and repeat….
  • AllReduce: implemented in MPI, and recently in the VW code (John Langford) in a Hadoop-compatible scheme (see the sketch below).
  (Diagram labels: MAP, ALLREDUCE.)
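
The same four steps as a minimal sketch with MPI's AllReduce via mpi4py; the MPI setting is an assumption for illustration, since the VW scheme implements the same primitive over its own spanning tree rather than MPI. The local "gradient" here is a random stand-in.

    # Run with something like: mpiexec -n 4 python allreduce_pattern.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    dim = 10
    params = np.zeros(dim)        # shared parameters (replicated on every worker)

    for step in range(100):
        # 1. do some learning in parallel: each worker computes a local update
        #    from its own shard of the data (random stand-in gradient here)
        local_update = np.random.randn(dim) / comm.Get_size()

        # 2.+3. aggregate the local changes and redistribute the result in one
        #        call: every worker ends up with the same summed update
        total_update = np.empty_like(local_update)
        comm.Allreduce(local_update, total_update, op=MPI.SUM)

        # 4. apply the shared update and repeat
        params -= 0.01 * total_update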

  6. Gory details of VW Hadoop-AllReduce
  • Spanning-tree server: a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
  • Worker nodes (“fake” mappers): input for each worker is locally cached
  • Workers all connect to the spanning-tree server
  • Workers all execute the same code, which might contain AllReduce calls
  • Workers synchronize whenever they reach an all-reduce (dataflow sketched below)
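
Roughly what the spanning tree buys you: partial results are reduced up a tree of workers and the total is broadcast back down, so every worker ends up holding the same aggregate. A toy in-memory sketch of that dataflow (no sockets, no Hadoop):

    def tree_allreduce(values):
        """Sum one number per 'worker' over an implicit binary tree, then broadcast."""
        n = len(values)
        partial = list(values)

        # Reduce phase: each parent i absorbs its children 2i+1 and 2i+2.
        for i in reversed(range(n)):
            for child in (2 * i + 1, 2 * i + 2):
                if child < n:
                    partial[i] += partial[child]

        # Broadcast phase: the root's total flows back down to every worker.
        total = partial[0]
        return [total] * n

    print(tree_allreduce([1.0, 2.0, 3.0, 4.0]))   # every "worker" gets 10.0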

  7. Hadoop AllReduce: don’t wait for duplicate jobs

  8. Second-order method - like Newton’s method
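
For reference, a minimal Newton-step sketch for L2-regularized logistic regression; the system on the slides uses a quasi-Newton (L-BFGS-style) batch optimizer rather than exact Newton steps, so this is only a reminder of what a second-order update looks like.

    import numpy as np

    def newton_logistic(X, y, lam=1.0, iters=10):
        """X: (n, d) feature matrix, y: (n,) labels in {0, 1}."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
            grad = X.T @ (p - y) + lam * w          # gradient of the regularized loss
            H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(d)   # Hessian
            w -= np.linalg.solve(H, grad)           # Newton step: w <- w - H^{-1} g
        return w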

  9. 2^24 features, ~100 non-zeros/example, 2.3B examples. An example is a user/page/ad (and conjunctions of these), positive if there was a click-through on the ad.

  10. 50M examples; explicitly constructed kernel → 11.7M features, 3,300 nonzeros/example. Old method: SVM, 3 days. (Reporting the time to get to a fixed test error.)

  11. On-line LDA

  12. Pilfered from… NIPS 2010: Online Learning for Latent Dirichlet Allocation, Matthew Hoffman, Francis Bach & David Blei

  13. [Algorithm figure from the paper, annotated “uses λ” (corpus-level topic parameters) and “uses γ” (per-document parameters)]
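
In Hoffman, Bach & Blei’s online variational Bayes, γ are the per-document variational parameters fit in an E-step on a minibatch, and λ are the corpus-level topic parameters, blended toward the minibatch estimate with a decaying step size. A compressed sketch of that structure (hyperparameter values and dimensions are illustrative):

    import numpy as np
    from scipy.special import digamma

    def e_step(doc_counts, lam, alpha, iters=20):
        """Fit gamma/phi for one document; doc_counts is a (V,) vector of word counts."""
        K, V = lam.shape
        Elogbeta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        gamma = np.ones(K)
        words = np.nonzero(doc_counts)[0]
        counts = doc_counts[words]
        for _ in range(iters):
            Elogtheta = digamma(gamma) - digamma(gamma.sum())
            phi = np.exp(Elogtheta[:, None] + Elogbeta[:, words])   # (K, #unique words)
            phi /= phi.sum(axis=0, keepdims=True)
            gamma = alpha + phi @ counts
        sstats = np.zeros((K, V))
        sstats[:, words] = phi * counts      # expected topic-word counts for this doc
        return gamma, sstats

    def online_update(lam, minibatch, D, t, alpha=0.1, eta=0.01, tau0=1.0, kappa=0.7):
        """One online step; minibatch is a list of (V,) count vectors, D = corpus size."""
        rho_t = (tau0 + t) ** (-kappa)                             # decaying step size
        sstats = sum(e_step(doc, lam, alpha)[1] for doc in minibatch)
        lam_hat = eta + (D / len(minibatch)) * sstats              # minibatch estimate of lambda
        return (1 - rho_t) * lam + rho_t * lam_hat                 # blend old and new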

  14. Monday’s lecture

  15. recap

  16. recap: Compute expectations over the z’s any way you want….

  17. Technical details
  • Variational distrib: q(z_d), over the whole document’s topic assignments jointly (not factorized per token)
  • Approximate it using Gibbs: after sampling for a while, estimate the needed expectations from the samples
  • Evaluate using time and “coherence”, where D(w) = # docs containing word w (coherence sketched below)
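
One common document-co-occurrence (“UMass”-style) coherence score needs only the document frequencies defined on the slide: D(w) and the joint count D(w, w'). A minimal sketch, with the caveat that the exact variant used in the lecture may differ; it assumes the topic’s top words all occur somewhere in the corpus.

    import math
    from itertools import combinations

    def coherence(top_words, docs):
        """top_words: a topic's top words, ranked; docs: list of sets of words."""
        def D(*words):
            # number of documents containing all of the given words
            return sum(1 for doc in docs if all(w in doc for w in words))
        score = 0.0
        for w1, w2 in combinations(top_words, 2):          # w1 ranked above w2
            score += math.log((D(w1, w2) + 1.0) / D(w1))   # +1 avoids log(0)
        return score

    docs = [{"cat", "dog", "pet"}, {"dog", "bone"}, {"cat", "fish"}]
    print(coherence(["cat", "dog"], docs))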

  18. [Results figure; annotation: “better”]

  19. Summary of LDA speedup tricks
  • Gibbs sampler: O(N*K*T), and K grows with N; you need to keep the corpus (and the z’s) in memory
  • You can parallelize: each worker needs only a slice of the corpus, but you need to synchronize K multinomials over the vocabulary (AllReduce would help?)
  • You can sparsify the sampling and the topic-counts: Mimno’s trick greatly reduces memory (sketched below)
  • You can do the computation on-line: you only need to keep K multinomials and one document’s worth of corpus and z’s in memory
  • You can combine some of these methods: online sparsified LDA; parallel online sparsified LDA?
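
A rough sketch of the sparse-sampling idea (presumably the SparseLDA-style decomposition behind “Mimno’s trick”): the per-token sampling mass splits into a smoothing bucket, a document bucket that is sparse over the topics present in the document, and a topic-word bucket that is sparse over the topics in which the word occurs. Real implementations cache and incrementally update these masses; everything below is illustrative.

    import numpy as np

    def sparse_sample_token(w, n_dk, n_wk, n_k, alpha, beta, V):
        """Sample a topic for one token via the three-bucket split.
        n_dk: {topic: count} for this doc (sparse); n_wk: {topic: count} for word w (sparse);
        n_k: dense array of total tokens per topic. Counts exclude the current token."""
        K = len(n_k)
        denom = beta * V + n_k                                      # (K,)

        s = np.sum(alpha * beta / denom)                            # smoothing bucket
        r_terms = {k: c * beta / denom[k] for k, c in n_dk.items()}                      # doc-sparse
        q_terms = {k: (alpha + n_dk.get(k, 0)) * c / denom[k] for k, c in n_wk.items()}  # word-sparse
        r, q = sum(r_terms.values()), sum(q_terms.values())

        u = np.random.uniform(0.0, s + r + q)
        if u < q:                                                   # usually most of the mass
            for k, mass in q_terms.items():
                u -= mass
                if u <= 0:
                    return k
        elif u < q + r:
            u -= q
            for k, mass in r_terms.items():
                u -= mass
                if u <= 0:
                    return k
        else:                                                       # smoothing-only bucket
            u -= q + r
            for k in range(K):
                u -= alpha * beta / denom[k]
                if u <= 0:
                    return k
        return K - 1                                                # numerical fallback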
