## The Next Generation of the GraphLab Abstraction

Joseph Gonzalez, joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O'Hallaron, and Alex Smola.

**A Popular Answer: Map-Reduce / Hadoop**

Build learning algorithms on top of high-level parallel abstractions.

**MapReduce – Map Phase**

[Figure: four CPUs independently computing features over a collection of images]

Embarrassingly parallel: each CPU performs an independent computation, and no communication is needed.

**MapReduce – Reduce Phase**

[Figure: two CPUs aggregating the image features into per-group results, e.g. "attractive face" vs. "ugly face" statistics]

**Map-Reduce for Data-Parallel ML**

Map-Reduce is excellent for large data-parallel tasks. But is there more to machine learning?

| Data-Parallel (Map Reduce) | Graph-Parallel (?) |
| --- | --- |
| Feature extraction | Label propagation, belief propagation |
| Cross validation | PageRank, Lasso, SVM, kernel methods |
| Computing sufficient statistics | Tensor factorization, neural networks, deep belief networks |

**Concrete Example: Label Propagation**

**Label Propagation Algorithm**

- Social arithmetic: my interests are a weighted average of my own profile and my friends' interests. For example, weighting what I list on my profile (50% cameras, 50% biking) at 50%, what Sue Ann likes (80% cameras, 20% biking) at 40%, and what Carlos likes (30% cameras, 70% biking) at 10% gives: I like 60% cameras and 40% biking.
- Recurrence algorithm: Likes[i] = Σ_j W_ij × Likes[j]; iterate until convergence.
- Parallelism: compute all Likes[i] in parallel (a minimal sketch of one sweep follows below).
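The recurrence can be made concrete with a short, single-threaded C++ sketch of one sweep; the adjacency-list representation and all names here are illustrative assumptions, not GraphLab's API.

```cpp
#include <vector>

// Illustrative graph: each vertex holds an interest distribution (Likes)
// and a weighted adjacency list (W). Weights out of a vertex sum to 1.
struct Edge { int neighbor; double weight; };

// One synchronous sweep of Likes[i] = sum_j W_ij * Likes[j].
std::vector<std::vector<double>> sweep(
    const std::vector<std::vector<Edge>>& graph,
    const std::vector<std::vector<double>>& likes) {
  std::vector<std::vector<double>> next(likes.size());
  for (size_t i = 0; i < graph.size(); ++i) {
    std::vector<double> acc(likes[i].size(), 0.0);
    for (const Edge& e : graph[i])          // mix in each neighbor's interests
      for (size_t k = 0; k < acc.size(); ++k)
        acc[k] += e.weight * likes[e.neighbor][k];
    next[i] = acc;  // every Likes[i] depends only on the old values,
  }                 // so this outer loop is the parallel part
  return next;      // the caller iterates sweeps until convergence
}
```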
**Properties of Graph-Parallel Algorithms**

- Dependency graph: what I like depends on what my friends like.
- Factored computation.
- Iterative computation.

Can Map Reduce also handle the graph-parallel column: label propagation, belief propagation, PageRank, Lasso, SVM, kernel methods, tensor factorization, neural networks, deep belief networks?

**Data Dependencies**

- Map-Reduce does not efficiently express dependent data; it assumes independent data rows.
- The user must code substantial data transformations.
- Costly data replication.

**Iterative Algorithms**

Map-Reduce does not efficiently express iterative algorithms: every iteration ends at a global barrier, so a single slow processor stalls all the others.

[Figure: three Map-Reduce iterations over partitioned data, each ending in a barrier delayed by the slowest processor]

**MapAbuse: Iterative MapReduce**

- Only a subset of the data needs computation in each iteration, yet every record is reprocessed.
- The system is not optimized for iteration: each iteration pays a startup penalty and a disk penalty.

**Map-Reduce for Data-Parallel ML**

Map Reduce covers the data-parallel tasks. For the graph-parallel tasks: Pregel (Giraph)?

**Pregel (Giraph)**

Bulk Synchronous Parallel model: compute, communicate, barrier.

**Problem**

Bulk synchronous computation can be highly inefficient. Example: loopy belief propagation.

**Problem with Bulk Synchronous**

- Example algorithm: if a neighbor is red, then turn red.
- Bulk synchronous computation evaluates the condition on all vertices in every phase: 4 phases, each with 9 computations, for 36 computations total.
- Asynchronous (wave-front) computation evaluates the condition only when a neighbor changes: 4 phases, each with 2 computations, for 8 computations total.

**Data-Parallel Algorithms Can Be Inefficient**

[Figure: runtime of optimized in-memory bulk synchronous belief propagation vs. asynchronous Splash BP]

The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.

**The Need for a New Abstraction**

Map-Reduce is not well suited for graph-parallelism; Pregel (Giraph) targets the graph-parallel column instead.

**The GraphLab Framework**

- Graph-based data representation
- Update functions (user computation)
- Scheduler
- Consistency model

**Data Graph**

A graph with arbitrary data (C++ objects) associated with each vertex and edge.

- Graph: a social network.
- Vertex data: user profile text; current interest estimates.
- Edge data: similarity weights.

**Update Functions**

An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex:

```
label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) ← scope;

  // Update the vertex data
  Likes[i] ← Σ_j W_ij × Likes[j];

  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}
```

**The Scheduler**

The scheduler determines the order in which vertices are updated: CPUs repeatedly pull the next vertex from the scheduler and apply the update function to it. The process repeats until the scheduler is empty (see the execution-loop sketch below).
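A minimal C++ sketch of that execution loop, shown single-threaded for clarity; the Scheduler struct and UpdateFn alias are illustrative assumptions rather than GraphLab's actual interface.

```cpp
#include <deque>
#include <functional>

// Illustrative scheduler: a queue of vertex ids awaiting updates.
struct Scheduler {
  std::deque<int> queue;
  void reschedule(int v) { queue.push_back(v); }
};

// A user-defined update function receives the vertex to update and the
// scheduler, so it can reschedule neighbors whose data it changed.
using UpdateFn = std::function<void(int vertex, Scheduler& sched)>;

// The engine loop: pop a vertex, run the update, repeat until empty.
// A real engine runs this on many threads, with locks enforcing the
// chosen consistency model.
void run(Scheduler& sched, UpdateFn update) {
  while (!sched.queue.empty()) {
    int v = sched.queue.front();
    sched.queue.pop_front();
    update(v, sched);  // transforms the data in the scope of v
  }
}
```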
**Ensuring Race-Free Code**

How much can computation overlap?

**GraphLab Ensures Sequential Consistency**

For each parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel execution on two CPUs and an equivalent single-CPU sequential execution]

**Consistency Rules**

Full consistency over the data guarantees sequential consistency for all update functions.

**Full Consistency**

[Figure: full consistency: the scopes of concurrently executing update functions may not overlap]

**Obtaining More Parallelism**

Relaxing full consistency to edge consistency allows more update functions to execute concurrently.

**Edge Consistency**

[Figure: edge consistency: CPU 1 and CPU 2 update adjacent regions while safely reading the shared boundary vertex]

**Consistency Through R/W Locks**

- Read/write locks implement both models:
  - Full consistency: write locks on the center vertex and all adjacent vertices.
  - Edge consistency: a write lock on the center vertex and its edges, read locks on adjacent vertices.
- Locks are acquired in a canonical ordering to avoid deadlock.

**GraphLab for Natural Graphs**

The Achilles heel.

**Problem: High-Degree Vertices**

- Graphs with high-degree vertices are common:
  - Power-law graphs (social networks), which affect algorithms like label propagation.
  - Probabilistic graphical models, where hyper-parameters couple large sets of data.
  - Connectivity structure induced by natural phenomena.
- High-degree vertices kill parallelism: they pull a large amount of state, require heavy locking, and are processed sequentially.

**Proposed Four Solutions**

- Associative commutative update functors: transition from stateless to stateful update functions.
- Decomposable update functors: expose greater parallelism by further factoring update functions.
- Abelian group caching (concurrent revisions): allows controllable races through diff operations.
- Stochastic scopes: reduce degree through sampling.

**Associative Commutative Update Functors**

- Associative commutative functions, i.e., 1 + (2 + 3) = (1 + 2) + 3.
- Subsumes message passing, the traveler model, and Pregel (Giraph).
- Function composition is equivalent to a combiner in Pregel.

```
LabelPropUF(i, scope, Δ) {
  // Get neighborhood data
  (OldLikes[i], Likes[i], W_ij) ← scope;

  // Update the vertex data
  Likes[i] ← Likes[i] + Δ;

  // Reschedule neighbors if needed
  if |Likes[i] − OldLikes[i]| > ε then
    for all neighbors j:
      Δ_j ← (Likes[i] − OldLikes[i]) × W_ij;
      reschedule(j, Δ_j);
    OldLikes[i] ← Likes[i];
}
```

Update functor composition is achieved by defining an operator+ for update functors, and composed update functors may be computed in parallel (a C++ sketch follows the next two slides).

**Advantages of Update Functors**

- Eliminates the need to read neighbor values.
- Reduces locking overhead.
- Enables task de-duplication and hinting: the user-defined (+) can choose to eliminate or compose tasks.
- Dynamically controls execution parameters: priority(), consistency_level(), is_decomposable(), and so on.
- Needed to enable further extensions, such as decomposable update functors.
- Simplifies schedulers: many update functions become a single update functor.

**Disadvantages of Update Functors**

- Typically constructed sequentially.
- Requires access to edge weights.
- These problems exist in Pregel/Giraph as well.
- The change required substantial engineering effort, affecting 2/3 of the GraphLab library.
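Here is the promised sketch of functor composition in C++, assuming a LabelPropUF that carries its pending delta as state; it illustrates the operator+ idea and is not the actual GraphLab interface.

```cpp
#include <vector>

// An update functor carries its pending change as state, so two functors
// addressed to the same vertex can be merged before either runs; this is
// the role a combiner plays in Pregel.
struct LabelPropUF {
  std::vector<double> delta;  // pending change to fold into Likes[i]

  // Merging is element-wise addition: associative and commutative, so
  // the scheduler may compose functors in any order, even in parallel.
  LabelPropUF operator+(const LabelPropUF& other) const {
    LabelPropUF merged = *this;  // assumes both deltas have equal length
    for (size_t k = 0; k < merged.delta.size(); ++k)
      merged.delta[k] += other.delta[k];
    return merged;
  }
};
```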
**Decomposable Update Functors**

Breaking computation over the edges of the graph.

- Decompose update functions into three phases, each user-defined: Gather, Apply, and Scatter.
- Gather runs on each edge and emits a partial accumulator; the partials are combined with a parallel sum (Δ1 + Δ2 + …); Apply writes the accumulated value Δ to the center vertex; Scatter updates adjacent edges and vertices.
- Locks are acquired only for the region within a scope, giving relaxed consistency.

Implementing label propagation with factorized update functions, using the update functor's state as the accumulator and the same (+) operator to merge partial gathers:

```
Gather(i, j, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) ← scope;

  // Emit accumulator
  emit W_ij × Likes[j] as Δ;
}

Apply(i, scope, Δ) {
  // Get vertex data
  (Likes[i]) ← scope;

  // Update the vertex
  Likes[i] ← Δ;
}

Scatter(i, j, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) ← scope;

  // Reschedule if changed
  if Likes[i] changed then reschedule(j);
}
```

**Decomposable Functors**

- Fits many algorithms: loopy belief propagation, label propagation, PageRank, and others.
- Addresses the earlier concerns:
  - Large state → distributed gather and scatter.
  - Heavy locking → fine-grained locking.
  - Sequential processing → parallel gather and scatter.
- Problem: high-degree vertices will still get locked frequently and will need to be kept up to date.

**Abelian Group Caching**

Enabling eventually consistent data races.

- Issue: all of the earlier methods maintain a sequentially consistent view of data across all processors.
- Proposal: try to split the data instead of the computation.
- How can we split the graph without changing the update function?

[Figure: the same high-degree vertex replicated across CPUs 1 through 4]

**Insight from the WSDM Paper**

- Answer: allow eventually consistent data races.
- High-degree vertices admit slightly "stale" values: changes in a few elements have a negligible effect.
- High-degree vertex updates are typically a form of "sum" operation, which has an "inverse".
  - Examples: counts, averages, sufficient statistics.
  - Counterexample: max.
- Goal: lazily synchronize duplicate data, much like a version control system. Intermediate values are only partially consistent; the final value at termination must be consistent.

**Example**

Every processor initially has a copy of the same central value (a sketch of the lazy synchronization follows below):

[Figure: master value 10, with current copies of 10 on processors 1, 2, and 3]
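A minimal C++ sketch of that lazy synchronization, under the assumption that the cached quantity is a sum with an inverse (subtraction); the Master and CachedValue types are illustrative, not GraphLab code.

```cpp
#include <mutex>

// The shared central value, e.g. a running sum on a high-degree vertex.
struct Master {
  double value = 10.0;
  std::mutex mu;
};

// Each processor works on a lock-free local copy and remembers the master
// value it last saw, so its net change can be expressed as a diff.
struct CachedValue {
  double local = 10.0;     // processor-local copy, updated without locks
  double snapshot = 10.0;  // master value at the last synchronization

  // Lazy synchronization: fold our diff into the master and adopt the
  // merged value. Because addition has an inverse, diffs from different
  // processors can arrive in any order; copies are only eventually
  // consistent, and must agree at termination.
  void sync(Master& m) {
    std::lock_guard<std::mutex> lock(m.mu);
    m.value += local - snapshot;  // apply the net local change
    local = snapshot = m.value;
  }
};
```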