
Graphs, Algorithms and Big Data: the Google AdWords Case study

Graphs, Algorithms and Big Data: the Google AdWords Case study. GDG DevFest Central Italy 2013. Alessandro Epasto. Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.


Presentation Transcript


  1. Graphs, Algorithms and Big Data: the Google AdWords Case study GDG DevFest Central Italy 2013 Alessandro Epasto

  2. Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.

  3.–7. The AdWords Problem [animated figure across slides 3–7, featuring the example query "Soccer Shoes"]

  8. Google Advertisement in Numbers • Over a billion queries a day. • A lot of advertisers. www.google.com/competition/howgooglesearchworks.html

  9. Challenges • Several scientific and technological challenges. • How to find the best ads in real time? • How to price each ad? • How to suggest new queries to advertisers? • The solutions to these problems involve some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism).

  10. Google Advertisement in Numbers 2012 revenues: 46 billion USD • 95% from advertisement: 43 billion USD. http://investor.google.com/financial/tables.html

  11. Goals of the Project • Mine AdWords data to automatically identify, for each advertiser, its main competitors, and to suggest relevant queries to each advertiser. • Goals: • Useful business information. • Improved advertisement. • More relevant performance benchmarks.

  12. Information Deluge Large advertisers (e.g. Amazon, Ask.com, etc.) compete in several market segments against very different sets of advertisers.

  13. Representing the data • How to represent the salient features of the data? • Relationships between advertisers and queries • Statistics: clicks, costs, etc. • Take into account the categories. • Efficient algorithms.

  14. Graphs: the lingua franca of Big Data Mathematical objects studied well before the invention of computers. Königsberg’s bridges problem (Euler, 1735).

  15. Graphs: the lingua franca of Big Data Graphs are everywhere! Technological Networks Social Networks Natural Networks

  16.–18. Graphs: the lingua franca of Big Data Formal definition: a graph is a set of Nodes and a set of Edges connecting them; the edges might have a weight. [figure, built up over slides 16–18: nodes A–D, edges between them, edge weights 1–4]
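The weighted-graph definition above can be sketched in code as a plain adjacency map (node names and weights are taken from the slide figure; the exact edge/weight pairing is illustrative):

```python
# A graph as an adjacency dict: node -> {neighbour: weight}.
# Undirected, so each edge appears in both directions.
graph = {
    "A": {"B": 2, "C": 3},
    "B": {"A": 2, "D": 1},
    "C": {"A": 3, "D": 4},
    "D": {"B": 1, "C": 4},
}

def edge_weight(g, u, v):
    """Return the weight of edge (u, v), or 0 if the edge is absent."""
    return g.get(u, {}).get(v, 0)

print(edge_weight(graph, "A", "B"))  # 2
print(edge_weight(graph, "A", "D"))  # 0 (no edge A-D in this figure)
```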

  19. AdWords data as a (Bipartite) Graph Hundreds of Labels A lot of Advertisers Billions of Queries

  20.–24. Semi-Formal Problem Definition [figure, built up over slides 20–24: a bipartite graph of Advertisers and Queries; advertiser A is highlighted and the queries carry category labels] Goal: Find the nodes most "similar" to A.

  25. How to Define Similarity? Several node similarity measures exist in the literature, based on the graph structure, random walks, etc. • What is the accuracy? • Can it scale to graphs with billions of nodes? • Can it be computed in real time?

  26. The three ingredients of Big Data • A lot of data… • A sophisticated infrastructure: MapReduce • Efficient algorithms: Graph mining

  27. MapReduce

  28. MapReduce The work is spread across several machines working in parallel, connected by fast links.
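The map/shuffle/reduce pattern described above can be sketched, for example, by counting how many queries each advertiser is connected to in the bipartite graph (function names and the toy data are illustrative; in a real MapReduce system the shuffle groups keys across machines):

```python
from collections import defaultdict
from itertools import chain

def map_edge(edge):
    """Map phase: emit (advertiser, 1) for each advertiser-query edge."""
    advertiser, query = edge
    yield advertiser, 1

def reduce_counts(advertiser, values):
    """Reduce phase: sum all values emitted for one advertiser."""
    yield advertiser, sum(values)

def run_mapreduce(records, mapper, reducer):
    # Shuffle phase: group mapper output by key. A real framework does
    # this across machines; here it is simulated with a dict.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(chain.from_iterable(reducer(k, v) for k, v in groups.items()))

edges = [("A", "shoes"), ("A", "soccer"), ("B", "shoes")]
print(run_mapreduce(edges, map_edge, reduce_counts))  # {'A': 2, 'B': 1}
```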

  29. Algorithms Personalized PageRank: • Random walks on the graph • Closely related to the celebrated Google PageRank™.

  30.–42. Personalized PageRank [animated figure across slides 30–42: a random walk spreading over the graph]

  43. Personalized PageRank • Idea: perform a very long random walk starting from v. • Ranking nodes by their probability of being visited assigns a similarity score to each node w.r.t. v. • Strong community bias (this can be formalized).
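The random-walk idea above can be sketched as a power iteration with a teleport back to the source node v (a minimal sketch, assuming an unweighted adjacency-list graph and a teleport probability of 0.15; this is not the approximation algorithm used in the project):

```python
def personalized_pagerank(graph, source, alpha=0.15, iters=100):
    """Power iteration: at each step, with probability alpha the walk
    teleports back to `source`, otherwise it follows a random out-edge."""
    nodes = list(graph)
    scores = {u: (1.0 if u == source else 0.0) for u in nodes}
    for _ in range(iters):
        nxt = {u: (alpha if u == source else 0.0) for u in nodes}
        for u in nodes:
            out = graph[u]
            for w in out:
                nxt[w] += (1 - alpha) * scores[u] / len(out)
        scores = nxt
    return scores

# Small undirected example (edges listed in both directions).
g = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
ppr = personalized_pagerank(g, "A")
# Sorting nodes by score ranks them by similarity to A; A itself comes first.
```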

  44. Personalized PageRank • Exact computation is infeasible (O(n^3)), but it can be approximated very well. • A very efficient MapReduce algorithm scales to large graphs (hundreds of millions of nodes). However…

  45. Algorithmic Bottleneck • Our graphs are simply too big (billions of nodes), even for large-scale systems. • MapReduce is not real-time. • We cannot precompute the results for all subsets of categories (exponential time!).

  46. 1st idea: Tackling Real Graph Structure • Data size is the main bottleneck. • Compressing the graph would speed up the computation.

  47.–48. 1st idea: Tackling Real Graph Structure [figure, slides 47–48: the bipartite graph of advertisers A, B and queries a–g is compressed into a smaller graph over the advertisers only; the ranking computed on the compressed graph yields a ranking of the entire graph]
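As a rough illustration of the compression idea, one can project the bipartite graph onto advertisers only, weighting each advertiser pair by the number of queries they share. This is only a simplified sketch: the actual construction relies on Markov chain state aggregation, which chooses the weights so that Personalized PageRank is preserved exactly.

```python
from collections import Counter
from itertools import combinations

def project_onto_advertisers(bipartite):
    """Project a bipartite graph onto advertisers: each query's advertiser
    set contributes 1 to the weight of every advertiser pair it contains.
    `bipartite` maps each query to the set of advertisers bidding on it."""
    weights = Counter()
    for advertisers in bipartite.values():
        for a, b in combinations(sorted(advertisers), 2):
            weights[(a, b)] += 1
    return dict(weights)

bip = {"shoes": {"A", "B"}, "soccer": {"A", "B"}, "books": {"A"}}
print(project_onto_advertisers(bip))  # {('A', 'B'): 2}
```

The projected graph has only advertiser nodes, so it is orders of magnitude smaller than the original advertiser-query graph.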

  49. 1st idea: Tackling Real Graph Structure Theorem: the ranking computed is the correct Personalized PageRank on the entire graph. Based on results from the mathematical theory of Markov chain state aggregation (Simon and Ando, ’61; Meyer, ’89; etc.).

  50. Algorithmic Bottleneck • Our graphs are too big (billions of nodes) even for large-scale systems. • MapReduce is not real-time. • We cannot precompute the results for all subsets of categories (exponential time!).
