Large-scale Recommender Systems on Just a PC

Aapo Kyrölä, Ph.D. candidate @ CMU. http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov. LSRS 2013 keynote (RecSys’13 Hong Kong). Big Data – small machine.

My Background • Academic: 5th year Ph.D. @ Carnegie Mellon. Advisors: Guy Blelloch, Carlos Guestrin (UW). 2009 → 2012 → + Shotgun: Parallel L1-regularized regression solver (ICML 2011). + Internships at MSR Asia (2011) and Twitter (2012) • Startup Entrepreneur: Habbo, founded 2000

Outline of this talk • Why single-computer computing? • Introduction to graph computation and GraphChi • Recommender systems with GraphChi • Future directions & Conclusion

Large-Scale Recommender Systems on Just a PC Why on a single machine? Can’t we just use the Cloud?

Why use a cluster? Two reasons: • One computer cannot handle my problem in a reasonable time. • I need to solve the problem very fast.

Why use a cluster? • Our work expands the space of feasible (graph) problems on one machine: • Our experiments use the same graphs as previous papers on distributed graph computation, or bigger ones. (+ we can do the Twitter graph on a laptop) • Most data is not that “big”. Our work raises the bar on the performance required to justify a “complicated” system.

Benefits of single machine systems Assuming it can handle your big problems… • Programmer productivity • Global state • Can use “real data” for development • Inexpensive to install, administer, less power. • Scalability.

Efficient Scaling • Distributed graph system: (significantly) less than 2x throughput with 2x machines. • Single-computer system (capable of big tasks): exactly 2x throughput with 2x machines. [Figure: timelines of Tasks 1 to 12 over time T, comparing 6 machines and 12 machines.]

Graph computation and GraphChi

Why graphs for recommender systems? • Graph = matrix: edge(u,v) = M[u,v] • Note: always sparse graphs • Intuitive, human-understandable representation • Easy to visualize and explain. • Unifies collaborative filtering (typically matrix based) with recommendation in social networks. • Random walk algorithms. • Local view → vertex-centric computation

Vertex-Centric Computational Model • Graph G = (V, E) • directed edges: e = (source, destination) • each edge and vertex associated with a value (user-defined type) • vertex and edge values can be modified • (structure modification also supported) [Figure: example graph with vertices A and B; every vertex and edge carries a Data value.]

Vertex-centric Programming • “Think like a vertex” • Popularized by the Pregel and GraphLab projects • MyFunc(vertex) { // modify neighborhood } [Figure: a vertex and its neighborhood of Data values.]
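The "think like a vertex" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GraphChi's API: the `Vertex` and `Edge` classes, `build_graph`, `run`, and the update rule inside `my_func` are hypothetical names invented here.

```python
class Edge:
    """Directed edge e = (source, destination) carrying a user-defined value."""
    def __init__(self, source, destination, value=0):
        self.source = source
        self.destination = destination
        self.value = value


class Vertex:
    """Vertex with a user-defined value and references to its adjacent edges."""
    def __init__(self, vertex_id, value=0):
        self.id = vertex_id
        self.value = value
        self.in_edges = []
        self.out_edges = []


def build_graph(num_vertices, edge_list):
    """Create vertices 0..num_vertices-1 and share each Edge between its endpoints."""
    vertices = [Vertex(i) for i in range(num_vertices)]
    for source, destination in edge_list:
        edge = Edge(source, destination)
        vertices[source].out_edges.append(edge)
        vertices[destination].in_edges.append(edge)
    return vertices


def my_func(vertex):
    """Hypothetical update rule: read in-edge values, then modify the neighborhood."""
    vertex.value = 1 + sum(e.value for e in vertex.in_edges)
    for e in vertex.out_edges:
        e.value = vertex.value


def run(vertices, update):
    """Apply the update function to every vertex, in id order."""
    for v in vertices:
        update(v)


graph = build_graph(3, [(0, 1), (0, 2), (1, 2)])
run(graph, my_func)
values = [v.value for v in graph]
```

The update function only sees one vertex and its adjacent edges, which is what lets a framework decide where the data lives and in which order vertices are processed.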

What is GraphChi? [Figure] Both in OSDI’12!

The Main Challenge of Disk-based Graph Computation: Random Access • Storage delivers 100s of reads/writes per sec, ~100K reads/sec (commodity), or ~1M reads/sec (high-end arrays). • All of these are << the 5-10 M random edges/sec needed to achieve “reasonable performance”.

Parallel Sliding Windows • Only P large reads for each interval (sub-graph). • P^2 reads on one full pass. Details: Kyrola, Blelloch, Guestrin: “Large-scale graph computation on just a PC” (OSDI 2012)
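The read counts above can be illustrated with a small in-memory sketch. It assumes the layout described in the OSDI 2012 paper (shard p holds the edges whose destination falls in interval p, sorted by source) and models each "large read" as a list slice. The function names `build_shards`, `window`, and `one_pass` are hypothetical, and nothing here touches a disk.

```python
from bisect import bisect_left


def build_shards(edges, intervals):
    """Shard p gets every edge whose destination is in interval p, sorted by source."""
    shards = [[] for _ in intervals]
    for src, dst in edges:
        for p, (lo, hi) in enumerate(intervals):
            if lo <= dst <= hi:
                shards[p].append((src, dst))
                break
    for shard in shards:
        shard.sort()
    return shards


def window(shard, lo, hi):
    """Contiguous block of a source-sorted shard whose sources lie in [lo, hi]."""
    start = bisect_left(shard, (lo, -1))
    end = bisect_left(shard, (hi + 1, -1))
    return shard[start:end]


def one_pass(shards, intervals):
    """Assemble the sub-graph of every interval and count the large reads used."""
    reads = 0
    subgraphs = []
    for p, (lo, hi) in enumerate(intervals):
        in_edges = shards[p]                  # one read: the whole shard p
        reads += 1
        out_edges = list(window(shards[p], lo, hi))  # already in memory
        for q, shard in enumerate(shards):
            if q != p:
                out_edges.extend(window(shard, lo, hi))  # one read per other shard
                reads += 1
        subgraphs.append((in_edges, out_edges))
    return subgraphs, reads


intervals = [(0, 1), (2, 3), (4, 5)]
edges = [(0, 2), (1, 4), (2, 0), (3, 5), (4, 1), (5, 3), (0, 1), (2, 3)]
shards = build_shards(edges, intervals)
subgraphs, reads = one_pass(shards, intervals)
```

Because every shard is sorted by source, the out-edges of an interval form one contiguous block in each shard, so each interval costs P reads and a full pass over P intervals costs P^2.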

GraphChi Program Execution
For T iterations:
    For p=1 to P
        For v in interval(p)
            updateFunction(v)
For T iterations:
    For v=1 to V
        updateFunction(v)
“Asynchronous”: updates immediately visible (vs. bulk-synchronous).
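To make the "asynchronous" remark concrete, here is a small Python sketch comparing the two update disciplines on a toy problem (propagating the minimum label along a path). The algorithm choice and the names `min_label_sweep` and `sweeps_to_converge` are assumptions made for illustration; this is not GraphChi code.

```python
def min_label_sweep(neighbors, labels, asynchronous):
    """One pass over all vertices; returns True if any label changed.

    asynchronous=True reads the live label array, so an update made earlier in
    the sweep is immediately visible. asynchronous=False reads a snapshot taken
    at the start of the sweep (bulk-synchronous behavior).
    """
    source = labels if asynchronous else list(labels)
    changed = False
    for v in range(len(labels)):
        best = min([source[v]] + [source[u] for u in neighbors[v]])
        if best < labels[v]:
            labels[v] = best
            changed = True
    return changed


def sweeps_to_converge(neighbors, asynchronous):
    """Count the sweeps that changed something before labels stop changing."""
    labels = list(range(len(neighbors)))
    sweeps = 0
    while min_label_sweep(neighbors, labels, asynchronous):
        sweeps += 1
    return sweeps, labels


# Path graph 0 - 1 - 2 - 3 - 4
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
async_sweeps, async_labels = sweeps_to_converge(path, asynchronous=True)
sync_sweeps, sync_labels = sweeps_to_converge(path, asynchronous=False)
```

On this path the asynchronous version carries label 0 to the end in a single sweep, while the snapshot version moves it one hop per sweep. The gain depends on the processing order and the algorithm, so treat this as an illustration rather than a general speedup claim.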

Performance GraphChi can compute on the full Twitter follow-graph with just a standard laptop. ~ as fast as a very large Hadoop cluster! (size of the graph Fall 2013, > 20B edges [Gupta et al 2013])

GraphChi is Open Source • C++ and Java versions on GitHub: http://github.com/graphchi • Java version has a Hadoop/Pig wrapper. • If you really really want to use Hadoop.

RecSys model training with GraphChi

Overview of Recommender Systems for GraphChi • Collaborative Filtering toolkit (next slide) • Link prediction in large networks • Random-walk based approaches (Twitter) • Talk on Wednesday.

GraphChi’s Collaborative Filtering Toolkit • Developed by Danny Bickson (CMU / GraphLab Inc) • Includes: • Alternating Least Squares (ALS) • Sparse-ALS • SVD++ • LibFM (factorization machines) • GenSGD • Item-similarity based methods • PMF • CliMF (contributed by Mark Levy) • … See Danny’s blog for more information: http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html Note: In the C++ version. Java version in development by a CMU team.

Two examples: ALS and item-based CF

Example: Alternating Least Squares Matrix Factorization (ALS) • Task: Predict ratings for items (movies) by users. • Model: Latent factor model (see next slide) Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: “Large-Scale Parallel Collaborative Filtering for the Netflix Prize” (2008)

ALS: User – Item bipartite graph • User’s rating of a movie modeled as a dot-product: <factor(user), factor(movie)> [Figure: bipartite graph linking users to movies (Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita); edges carry ratings such as 4, 3, 2, 5.]

ALS: GraphChi implementation • Update function handles one vertex at a time (user or movie) • For each user: • Estimate latent(user): minimize least squares of dot-product predicted ratings • GraphChi executes the update function for each vertex (in parallel), and loads edges (ratings) from disk • Latent factors in memory: need O(V) memory. • If factors don’t fit in memory, can replicate them to the edges and thus store them on disk. Scales to very large problems!
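The per-vertex least-squares step can be sketched in plain Python as below. This is a simplified illustration, not the toolkit's implementation: it uses a plain ridge term `lam` (an assumption; the cited paper weights the regularization differently), and `solve`, `predict`, and `als_update` are hypothetical names.

```python
def solve(A, b):
    """Solve the small dense system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def predict(factors, user, movie):
    """Predicted rating = dot product of the two latent factor vectors."""
    return sum(a * b for a, b in zip(factors[user], factors[movie]))


def als_update(vertex, ratings, factors, lam=0.05):
    """Re-estimate one vertex's factor, holding its neighbors' factors fixed.

    ratings: list of (neighbor_id, rating) pairs, i.e. the edges of this vertex.
    Solves (sum_n f_n f_n^T + lam * I) x = sum_n rating_n * f_n.
    """
    d = len(factors[vertex])
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for neighbor, rating in ratings:
        f = factors[neighbor]
        for i in range(d):
            b[i] += rating * f[i]
            for j in range(d):
                A[i][j] += f[i] * f[j]
    factors[vertex] = solve(A, b)


factors = {"user": [0.0, 0.0], "movie1": [1.0, 0.0], "movie2": [0.0, 1.0]}
als_update("user", [("movie1", 2.0), ("movie2", 3.0)], factors, lam=0.0)
predicted = predict(factors, "user", "movie1")
```

The same function serves user and movie vertices, since each only needs its own edges (ratings) and the factors on the other side; alternating between the two sides gives the method its name.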

ALS: Performance • Matrix Factorization (Alternating Least Squares) • Remark: Netflix is not a big problem, but GraphChi will scale at most linearly with input size (ALS is CPU-bound, so should be sub-linear in #ratings).

Example: Item-Based CF • Task: compute a similarity score [e.g. Jaccard] for each movie pair that has at least one viewer in common. • Similarity(X, Y) ~ # common viewers • Output top K similar items for each item to a file. • … or: create edge between X, Y containing the similarity. • Problem: enumerating all pairs takes too much time.
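As a small in-memory illustration of the task (not the toolkit's disk-based method), the sketch below computes Jaccard similarity only for item pairs that share at least one viewer, by going through each viewer's item list instead of testing all pairs. The names `item_similarities` and `top_k` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def item_similarities(viewers_by_item):
    """Jaccard similarity for every item pair with at least one common viewer.

    viewers_by_item: dict mapping item -> set of viewer ids.
    """
    items_by_viewer = defaultdict(set)
    for item, viewers in viewers_by_item.items():
        for viewer in viewers:
            items_by_viewer[viewer].add(item)

    # Count common viewers per pair; pairs with no common viewer never appear.
    common = defaultdict(int)
    for items in items_by_viewer.values():
        for x, y in combinations(sorted(items), 2):
            common[(x, y)] += 1

    similarities = {}
    for (x, y), shared in common.items():
        union = len(viewers_by_item[x]) + len(viewers_by_item[y]) - shared
        similarities[(x, y)] = shared / union
    return similarities


def top_k(similarities, item, k):
    """The k most similar items to `item`, as (other_item, score) pairs."""
    scored = []
    for (x, y), score in similarities.items():
        if x == item:
            scored.append((y, score))
        elif y == item:
            scored.append((x, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


viewers_by_item = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3}, "D": {9}}
sims = item_similarities(viewers_by_item)
best_for_a = top_k(sims, "A", 1)
```

Item D shares no viewer with anything, so it produces no pairs at all. A viewer with a very long item list still generates many pairs, which is the cost the slide warns about.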

Solution: Enumerate all triangles of the graph. New problem: how to enumerate triangles if the graph does not fit in RAM? [Figure: graph over the movies Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita.]

Enumerating Triangles (Item-CF) • Triangles with edge (u, v) = intersection(neighbors(u), neighbors(v)) • Iterative memory efficient solution (next slide)

Algorithm: • Let pivots be a subset of the vertices; • Load all neighbor-lists (adjacency lists) of pivots into RAM • Now use GraphChi to load all vertices from disk, one by one, and compare their adjacency lists to the pivots’ adjacency lists (similar to merge). • Repeat with a new subset of pivots.
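The pivot loop above can be sketched as follows. Everything lives in memory here, so the "load from disk" step is only simulated by iterating over a dictionary; the names `sorted_intersection_size` and `common_neighbor_counts` and the batch-size parameter are assumptions made for illustration.

```python
def sorted_intersection_size(a, b):
    """Merge-style count of the elements shared by two sorted lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def common_neighbor_counts(adjacency, pivot_batch_size):
    """Count common neighbors for vertex pairs, one batch of pivots at a time.

    adjacency: dict mapping vertex -> sorted list of neighbors.
    Only the current batch of pivot adjacency lists is held "in RAM"; every
    other vertex is visited once per batch, standing in for a pass over disk.
    """
    vertices = sorted(adjacency)
    counts = {}
    for start in range(0, len(vertices), pivot_batch_size):
        batch = vertices[start:start + pivot_batch_size]
        pivots = {p: adjacency[p] for p in batch}
        for v in vertices:                      # one streaming pass per batch
            for p, pivot_neighbors in pivots.items():
                if p < v:                       # visit each unordered pair once
                    shared = sorted_intersection_size(pivot_neighbors, adjacency[v])
                    if shared:
                        counts[(p, v)] = shared
    return counts


# Items mapped to sorted viewer lists (one side of a bipartite graph).
adjacency = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [3], "D": [9]}
counts_small_batches = common_neighbor_counts(adjacency, pivot_batch_size=1)
counts_one_batch = common_neighbor_counts(adjacency, pivot_batch_size=4)
```

A smaller batch needs less memory but more passes over the data; the result is the same either way.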

Triangle Counting Performance

Future directions & Final remarks

Single-Machine Computing in Production? • GraphChi supports incremental computation with dynamic graphs: • Can keep on running indefinitely, adding new edges to the graph → Constantly fresh model. • However, requires engineering – not included in the toolkit. • Compare to a cluster-based system (such as Hadoop) that needs to compute from scratch.

Unified RecSys Platform for GraphChi? • Working with master’s students at CMU. • Goal: ability to easily compare different algorithms, parameters • Unified input, output. • General programmable API (not just file-based) • Evaluation process: Several evaluation metrics; Cross-validation, held-out data… • Run many algorithm instances in parallel, on same graph. • Java. • Scalable from the get-go.

Recent developments: Disk-based Graph Computation • Two disk-based graph computation systems were recently published: • TurboGraph (KDD’13) • X-Stream (SOSP’13 in October) • Significantly better performance than GraphChi on many problems • Avoid preprocessing (“sharding”) • But GraphChi can do some computation that X-Stream cannot (triangle counting and related); TurboGraph requires SSD • Hot research area!

Do you need GraphChi – or any system? • Heck, for many algorithms, you can just mmap() over your (binary) adjacency list / sparse matrix, and write a for-loop. • See Lin, Chau, Kang: “Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC” (Big Data ’13) • Obviously good to have a common API • And some algos need more advanced solutions (like GraphChi, X-Stream, TurboGraph) Beware of the hype!
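A minimal sketch of the "mmap() and a for-loop" approach, using Python's standard `mmap` and `struct` modules. The file layout is an assumption made for this example (a flat binary edge list of little-endian 32-bit `(source, destination)` pairs rather than a true adjacency list), and `write_edges` and `out_degrees` are hypothetical helper names.

```python
import mmap
import os
import struct
import tempfile

# Assumed record format: two little-endian unsigned 32-bit ints per edge.
EDGE = struct.Struct("<II")


def write_edges(path, edges):
    """Write (source, destination) pairs as fixed-size binary records."""
    with open(path, "wb") as f:
        for src, dst in edges:
            f.write(EDGE.pack(src, dst))


def out_degrees(path, num_vertices):
    """Memory-map the (non-empty) edge file and count out-degrees in one loop.

    The operating system pages data in on demand, so the file may be larger
    than RAM as long as the degree array itself fits in memory.
    """
    degrees = [0] * num_vertices
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), EDGE.size):
                src, _dst = EDGE.unpack_from(mm, offset)
                degrees[src] += 1
    return degrees


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edges.bin")
    write_edges(path, [(0, 1), (0, 2), (1, 2), (2, 0)])
    degrees = out_degrees(path, 3)
```

A sequential scan like this is the easy case. Algorithms that jump around the file depend on the page cache and are where the more advanced systems named above come in.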

Conclusion • Very large recommender algorithms can now be run on just your PC or laptop. • Additional performance from multi-core parallelism. • Great for productivity – scale by replicating. • In general, good single machine scalability requires care with data structures, memory management → natural with C/C++, with Java (etc.) need low-level byte massaging. • Frameworks like GraphChi hide the low-level. • More work needed to “productize” current work.

Thank you! Aapo Kyrölä Ph.D. candidate @ CMU – soon to graduate! (Currently visiting U.W) http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov
