
Machine Learning in the Cloud


Presentation Transcript


  1. Machine Learning in the Cloud Carlos Guestrin Joe Hellerstein David O’Hallaron Yucheng Low Aapo Kyrola Danny Bickson Joey Gonzalez

  2. Machine Learning in the Real World 13 Million Wikipedia Pages. 500 Million Facebook Users. 3.6 Billion Flickr Photos. 24 Hours of YouTube Video Uploaded per Minute.

  3. Parallelism is Difficult • Wide array of different parallel architectures: GPUs, Multicore, Clusters, Clouds, Supercomputers • Different challenges for each architecture • High-level abstractions to make things easier.

  4. MapReduce – Map Phase [figure: CPUs 1–4 each begin transforming their own data values independently] Embarrassingly parallel: independent computation, no communication needed.

  5. MapReduce – Map Phase [figure: the CPUs continue mapping further values, still with no interaction] Embarrassingly parallel: independent computation, no communication needed.

  6. MapReduce – Map Phase [figure: mapping continues until every data value has been processed by some CPU] Embarrassingly parallel: independent computation, no communication needed.

  7. MapReduce – Reduce Phase [figure: two CPUs fold the mapped values into aggregate results] Fold/aggregation.
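
A minimal sketch of the map/fold pattern illustrated on slides 4–7, written in plain C++ rather than an actual MapReduce framework; the per-element square-root transform and the final sum are placeholders chosen only for illustration:

    // Map/fold sketch: an embarrassingly parallel per-element transform
    // followed by a single aggregation step.
    #include <cmath>
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> data = {1.2, 2.9, 4.2, 2.3, 2.1, 2.5};

        // Map phase: each element is processed independently -- no
        // communication needed (each CPU could take its own slice).
        std::vector<double> mapped(data.size());
        for (std::size_t i = 0; i < data.size(); ++i) {
            mapped[i] = std::sqrt(data[i]);   // placeholder transform
        }

        // Reduce phase: fold/aggregate the mapped values into one result.
        double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);
        std::cout << "aggregate = " << total << "\n";
        return 0;
    }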

  8. MapReduce and ML • Excellent for large data-parallel tasks! [diagram: the Data-Parallel region (MapReduce) contains feature extraction, cross validation, and computing sufficient statistics; a Complex Parallel Structure region lies outside it] Is there more to Machine Learning?

  9. Iterative Algorithms? • We can implement iterative algorithms in MapReduce: [figure: each iteration maps over the data on CPUs 1–3 and ends at a barrier; a slow processor holds up every barrier]

  10. Iterative MapReduce • The system is not optimized for iteration: [figure: every iteration pays a startup penalty and disk penalties for reading and writing the data]

  11. Iterative MapReduce • Only a subset of the data needs computation (multi-phase iteration): [figure: in each iteration only some of the data items need recomputation, yet every phase still ends at a global barrier]

  12. MapReduce and ML (revisited) • Excellent for large data-parallel tasks! [same diagram as slide 8: Data-Parallel MapReduce tasks vs. a Complex Parallel Structure region] Is there more to Machine Learning?

  13. Structured Problems Example Problem: Will I be successful in research? Success depends on the success of others. May not be able to safely update neighboring nodes. [e.g., Gibbs Sampling] Interdependent Computation: Not Map-Reducible

  14. Space of Problems • Sparse Computation Dependencies • Can be decomposed into local “computation-kernels” • Asynchronous Iterative Computation • Repeated iterations over local kernel computations

  15. Parallel Computing and ML • Not all algorithms are efficiently data-parallel. [diagram: Data-Parallel (MapReduce) — feature extraction, cross validation, computing sufficient statistics; Structured Iterative Parallel (GraphLab) — tensor factorization, Lasso, kernel methods, belief propagation, learning graphical models, SVM, sampling, deep belief networks, neural networks]

  16. GraphLab Goals • Designed for ML needs • Express data dependencies • Iterative • Simplifies the design of parallel programs: • Abstract away hardware issues • Addresses multiple hardware architectures • Multicore • Distributed • GPU and others

  17. GraphLab Goals [chart: model complexity (simple → complex) vs. data size (small → large); "Now" covers simple models on small data, "Data-Parallel" covers simple models on large data, and the goal is complex models on large data]

  18. GraphLab Goals [same chart: GraphLab targets the complex-model, large-data region]

  19. GraphLab A Domain-Specific Abstraction for Machine Learning

  20. Everything on a Graph A graph with data associated with every vertex and edge.

  21. Update Functions Update functions: operations applied on a vertex that transform the data in the scope of that vertex.

  22. Update Functions An update function can schedule the computation of any other update function: FIFO scheduling, prioritized scheduling, randomized scheduling, etc. Scheduled computation is guaranteed to execute eventually.
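
A schematic sketch of scheduled update-function execution, using hypothetical Scheduler and UpdateFn types rather than GraphLab's actual API: an engine repeatedly pops a scheduled vertex and applies the update function, which may in turn schedule further work.

    // Hypothetical scheduler/engine sketch (not the GraphLab API).
    #include <deque>
    #include <functional>

    struct Scheduler {
        std::deque<int> tasks;                  // FIFO; could be a priority queue
        void add_task(int v) { tasks.push_back(v); }
    };

    // An update function works on one vertex and may schedule others.
    using UpdateFn = std::function<void(int /*vertex*/, Scheduler&)>;

    void run_engine(Scheduler& sched, const UpdateFn& update) {
        while (!sched.tasks.empty()) {          // every scheduled task eventually runs
            int v = sched.tasks.front();
            sched.tasks.pop_front();
            update(v, sched);                   // may call sched.add_task(...)
        }
    }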

  23. Example: PageRank Graph = the WWW. Update function: multiply each adjacent vertex's PageRank by the corresponding edge weight and sum the results to get the current vertex's PageRank. "Prioritized" PageRank computation? Skip converged vertices.
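
A hedged sketch of a PageRank-style update function on a GraphLab-like graph; the data structures, damping factor, and convergence tolerance below are assumptions for illustration, not GraphLab's actual implementation:

    // PageRank update sketch: recompute one vertex's rank from its
    // in-neighbors; the return value supports "skip converged vertices".
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Vertex { double rank = 1.0; };
    struct InEdge { std::size_t src; double weight; };  // e.g. 1 / out-degree of src

    struct PageRankGraph {
        std::vector<Vertex> vertices;
        std::vector<std::vector<InEdge>> in_edges;      // incoming edges per vertex
    };

    bool pagerank_update(std::size_t v, PageRankGraph& g,
                         double damping = 0.85, double tol = 1e-6) {
        double sum = 0.0;
        for (const InEdge& e : g.in_edges[v])
            sum += e.weight * g.vertices[e.src].rank;   // weighted neighbor ranks
        double new_rank = (1.0 - damping) + damping * sum;
        bool changed = std::fabs(new_rank - g.vertices[v].rank) > tol;
        g.vertices[v].rank = new_rank;
        return changed;                                 // false => converged, skip
    }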

  24. Example: K-Means Clustering A (fully connected?) bipartite graph: data vertices on one side, cluster vertices on the other. Update functions: Cluster update — compute the average of the data connected by a "marked" edge. Data update — pick the closest cluster and mark that edge; unmark the remaining edges.
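
A sketch of the bipartite k-means update functions described above, assuming hypothetical structs and one-dimensional data; the "marked" edge is represented by storing the assigned cluster index on the data vertex:

    // Bipartite k-means sketch (hypothetical types, 1-D data for brevity).
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct DataVertex    { double x; std::size_t assigned_cluster; };
    struct ClusterVertex { double center; };

    // Data update: pick the closest cluster and "mark" that edge.
    void data_update(DataVertex& d, const std::vector<ClusterVertex>& clusters) {
        std::size_t best = 0;
        for (std::size_t c = 1; c < clusters.size(); ++c)
            if (std::fabs(d.x - clusters[c].center) <
                std::fabs(d.x - clusters[best].center))
                best = c;
        d.assigned_cluster = best;
    }

    // Cluster update: average of the data connected by a marked edge.
    void cluster_update(std::size_t c, ClusterVertex& cl,
                        const std::vector<DataVertex>& data) {
        double sum = 0.0;
        std::size_t n = 0;
        for (const DataVertex& d : data)
            if (d.assigned_cluster == c) { sum += d.x; ++n; }
        if (n > 0) cl.center = sum / static_cast<double>(n);
    }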

  25. Example: MRF Sampling Graph = the MRF. Update function: read the samples on adjacent vertices, read the edge potentials, and compute a new sample for the current vertex.
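
A sketch of a Gibbs-sampling update on a binary pairwise MRF, following the recipe on the slide (read neighboring samples and edge potentials, then resample the current vertex); the binary states and the agreement-favoring potential are assumptions for illustration:

    // Gibbs update sketch for a binary pairwise MRF (hypothetical types).
    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    struct MRFVertex { int sample = 0; };                    // state in {0, 1}
    struct MRFEdge   { std::size_t nbr; double potential; }; // pairwise coupling

    struct MRF {
        std::vector<MRFVertex> vertices;
        std::vector<std::vector<MRFEdge>> edges;             // adjacency + potentials
    };

    void mrf_update(std::size_t v, MRF& g, std::mt19937& rng) {
        // Accumulate the (log) potential of each candidate state of vertex v,
        // rewarding agreement with each neighbor's current sample.
        double energy[2] = {0.0, 0.0};
        for (const MRFEdge& e : g.edges[v])
            for (int s = 0; s < 2; ++s)
                if (s == g.vertices[e.nbr].sample) energy[s] += e.potential;

        // Sample the new state from the conditional distribution.
        double p1 = std::exp(energy[1]) / (std::exp(energy[0]) + std::exp(energy[1]));
        std::bernoulli_distribution draw(p1);
        g.vertices[v].sample = draw(rng) ? 1 : 0;
    }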

  26. Not Message Passing! Graph is a data-structure. Update Functions perform parallel modifications to the data-structure.

  27. Safety What if adjacent update functions execute simultaneously?

  28. Safety What if adjacent update functions execute simultaneously?

  29. Importance of Consistency Is ML resilient to soft optimization? Can we permit races? "Best-effort" computation? True for some algorithms, but not for many; an inconsistent execution may work empirically on some datasets and fail on others.

  30. Importance of Consistency Many algorithms require strict consistency, or perform significantly better under strict consistency. Example: Alternating Least Squares.

  31. Importance of Consistency Fast ML algorithm development cycle: build → test → debug → tweak model. The framework must behave predictably and consistently, avoiding problems caused by non-determinism; otherwise, is the execution wrong, or is the model wrong?

  32. Sequential Consistency GraphLab guarantees sequential consistency: for every parallel execution, there is a sequential execution of the update functions that produces the same result. [figure: timeline comparing a parallel schedule on CPU1 and CPU2 with an equivalent sequential schedule on CPU1]

  33. Sequential Consistency GraphLab guarantees sequential consistency: for every parallel execution, there is a sequential execution of the update functions that produces the same result. This formalizes the intuitive notion of a "correct program": computation does not read outdated data from the past, and computation does not read the results of computation that occurs in the future. This is the primary property of GraphLab.

  34. Global Information What if we need global information? Algorithm Parameters? Sufficient Statistics? Sum of all the vertices?

  35. Shared Variables • Global aggregation through the Sync operation • A Sync is a global parallel reduction over the graph data • Synced variables are recomputed at defined intervals • Sync computation is sequentially consistent • Permits correct interleaving of Syncs and Updates. Examples: Sync: log-likelihood; Sync: sum of vertex values.
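
A minimal sketch of one of the Sync aggregates named on this slide, the sum of vertex values, written as a plain sequential reduction over hypothetical vertex data; a real engine would run it as a parallel reduction and recompute it at defined intervals, interleaved consistently with update functions:

    // Sync sketch: a global reduction over all vertex data.
    #include <numeric>
    #include <vector>

    struct SyncVertex { double value = 0.0; };

    double sync_sum_of_vertex_values(const std::vector<SyncVertex>& vertices) {
        return std::accumulate(vertices.begin(), vertices.end(), 0.0,
                               [](double acc, const SyncVertex& v) {
                                   return acc + v.value;
                               });
    }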

  36. Sequential Consistency GraphLab guarantees sequential consistency: for every parallel execution, there is a sequential execution of the update functions and Syncs that produces the same result. [figure: timeline comparing a parallel schedule on CPU1 and CPU2 with an equivalent sequential schedule on CPU1]

  37. GraphLab in the Cloud

  38. Moving towards the cloud… • Purchasing and maintaining computers is very expensive • Most computing resources are seldom used (only around deadlines…) • In the cloud, you buy time and get access to hundreds or thousands of processors • And pay only for the resources you need

  39. Distributed GL Implementation • Mixed multi-threaded / distributed implementation (each machine runs only one instance) • Requires all data to be in memory; move computation to the data • MPI for management + TCP/IP for communication • Asynchronous C++ RPC layer • Ran on 64 EC2 HPC nodes = 512 processors

  40. Underlying Network [architecture diagram — each machine runs: an RPC controller, an execution engine with execution threads, shared data, a cache-coherent distributed K-V store, a distributed graph partition, and distributed locks]

  41. GraphLab RPC

  42. Write distributed programs easily • Asynchronous communication • Multithreaded support • Fast • Scalable • Easy To Use (Every machine runs the same binary)

  43. I ♥ C++

  44. Features • Easy RPC capabilities:

    One-way calls:

        rpc.remote_call([target_machine ID], printf, "%s %d %d %d\n", "hello world", 1, 2, 3);

    Requests (calls with a return value):

        std::vector<int>& sort_vector(std::vector<int>& v) {
          std::sort(v.begin(), v.end());
          return v;
        }

        vec = rpc.remote_request([target_machine ID], sort_vector, vec);

  45. Features • Object instance context [diagram: a K-V object and RPC controller instance on each machine] • MPI-like primitives, with MPI-like safety:

        dc.barrier()
        dc.gather(...)
        dc.send_to([target machine], [arbitrary object])
        dc.recv_from([source machine], [arbitrary object ref])

  46. Request Latency Ping RTT = 90us

  47. One-Way Call Rate 1Gbps physical peak

  48. Serialization Performance Benchmark: 100,000 one-way calls, each carrying a vector of 10 × {"hello", 3.14, 100}.
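
For concreteness, a sketch of the benchmark payload described above (the struct and variable names are hypothetical; GraphLab RPC's own serialization API is not shown):

    // One {"hello", 3.14, 100} element; each one-way call in the benchmark
    // carries a vector of 10 of these, and 100,000 such calls are issued.
    #include <string>
    #include <vector>

    struct Payload {
        std::string s;
        double d;
        int i;
    };

    int main() {
        std::vector<Payload> message(10, Payload{"hello", 3.14, 100});
        (void)message;   // placeholder: hand off to the RPC layer here
        return 0;
    }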

  49. Distributed Computing Challenges Q1: How do we efficiently distribute the state? (with a potentially varying number of machines) Q2: How do we ensure sequential consistency? Keeping in mind: limited bandwidth, high latency, and performance.

  50. Distributed Graph
