
Analysing and Modelling Large-Scale Enterprise Data

Analysing and Modelling Large-Scale Enterprise Data. Thore Graepel Online Services and Advertising Group Microsoft Research Cambridge. Overview. Complex large-scale data in the enterprise What kind of data is available? What technologies are used? Tasks and enterprise-specific challenges?


Presentation Transcript


  1. Analysing and Modelling Large-Scale Enterprise Data Thore Graepel Online Services and Advertising Group Microsoft Research Cambridge

  2. Overview • Complex large-scale data in the enterprise • What kind of data is available? • What technologies are used? • Tasks and enterprise-specific challenges? • Methodology: • Bayesian Inference in Factor Graph Models • PQL: Using SQL to describe probability models • Applications: • Gamer Rating and Matchmaking: TrueSkill • Click-Through Rate Prediction: AdPredictor • Large-Scale Recommendations: Matchbox

  3. Complex Data Joint work with Tom Minka & Phillip Trelford

  4. Data Sources at Microsoft (External) • Online Services Division • Web index • Search and Ad click logs (12-15 TB / day) • Hotmail, Instant messaging, Internet Explorer (100s million users) • MSN portal and Bing maps • Xbox Live Gaming Service • User transaction log data • Ranking and matchmaking data • Game instrumentation for user testing

  5. Data Sources at Microsoft (Internal) • Development and Software Instrumentation • Watson (customer feedback data) • Source depot (MS source code, e.g., Office, Windows) • Multilingual technical documentation • Business • Customer databases • Sales and Marketing

  6. Data-Intensive Tasks at Microsoft • Prediction of user behaviour and preferences • Improve web search • Improve targeting for advertising • Spam filtering and content prioritisation • Improve user experience • Matchmaking for games • Multi-modal user interfaces (Natal, speech) • Improve software development process • Improve productivity of developers • Analyse software for defects

  7. Technical Infrastructure • Relational Databases/SQL • Great agility for analysis and reliability for business • Limited scalability • Need to import data into SQL • Windows HPC • Complex computations / fine grained parallelism • Need to move data to HPC cluster • Cosmos • Take the computation to the data • Super efficient stream based computations

  8. Cosmos Architecture [Diagram labels: SCOPE, DryadLINQ, Sputnik, Dryad, Cosmos; four cluster machines, each with a stream]

  9. Enterprise/Online specific challenges • Privacy • Privacy limits the ways in which data can be used • Interesting trade-offs (differential privacy) • Incentives • Data produced by self-interested agents • Need to design incentive-compatible mechanisms • Exploration/Exploitation • Results of inference feed back into the business process and determine future observations • Need to aim at long-term benefits

  10. Factor Graphs

  11. Factor Graphs / Trees • Definition: Graphical representation of product structure of a function (Wiberg, 1996) • Nodes: Factors and variables • Edges: Dependencies of factors on variables • Question: • What are the marginals of the function (all but one variable are summed out)? • What is the mode of the function?

  12. Factor Graphs and Bayesian Inference • Bayes’ law • Factorising prior • Factorising likelihood • Sum out latent variables [Figure labels: s1, s2, s, t1, t2, d, y]
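The equations on this slide did not survive extraction. As a hedged reminder of what the four bullets usually refer to, the generic forms are written out below. The symbols are assumptions chosen for illustration: s for the quantities of interest, t for latent variables, and y for observed data. The slide's own notation may differ.

```latex
\begin{align}
p(\mathbf{s} \mid \mathbf{y})
  &= \frac{p(\mathbf{y} \mid \mathbf{s})\, p(\mathbf{s})}{p(\mathbf{y})}
  && \text{(Bayes' law)} \\
p(\mathbf{s})
  &= \prod_{i} p(s_i)
  && \text{(factorising prior)} \\
p(\mathbf{y}, \mathbf{t} \mid \mathbf{s})
  &= \prod_{j} f_j(\mathbf{y}, \mathbf{t}, \mathbf{s})
  && \text{(factorising likelihood)} \\
p(\mathbf{y} \mid \mathbf{s})
  &= \sum_{\mathbf{t}} p(\mathbf{y}, \mathbf{t} \mid \mathbf{s})
  && \text{(sum out latent variables)}
\end{align}
```

Each factor f_j touches only a few variables, which is what makes the product representable as a factor graph.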

  13. Factor Trees: Separation • Observation: Sum of products becomes product of sums of all messages from neighbouring factors to variable! [Figure: factor tree with variables v, w, x, y, z and factors f1(v,w), f2(w,x), f3(x,y), f4(x,z)]

  14. Messages: From Factors To Variables • Observation: Factors only need to sum out all their local variables! [Figure: variables w, x, y, z and factors f2(w,x), f3(x,y), f4(x,z)]

  15. Messages: From Variables To Factors • Observation: Variables pass on the product of all incoming messages! [Figure: variables x, y, z and factors f2(w,x), f3(x,y), f4(x,z)]

  16. The Sum-Product Algorithm • Three update equations (Aji & McEliece, 1997) • Update equations can be directly derived from the distributive law. • Efficient for messages in the exponential family. • Calculate all marginals at the same time.
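The three observations on slides 13 to 15 can be checked numerically on the tree from those slides. The sketch below is only an illustration: the factor tables are made-up numbers and every variable is assumed binary. It computes the marginal of x once by brute force over all five variables and once by passing messages, where each factor sums out its local variable and x multiplies the incoming messages.

```python
from itertools import product

K = 2  # assumption: every variable takes values 0..K-1

# Hypothetical factor tables for f1(v,w), f2(w,x), f3(x,y), f4(x,z).
f1 = [[1.0, 2.0], [3.0, 1.0]]  # indexed f1[v][w]
f2 = [[2.0, 1.0], [1.0, 4.0]]  # indexed f2[w][x]
f3 = [[1.0, 3.0], [2.0, 1.0]]  # indexed f3[x][y]
f4 = [[5.0, 1.0], [1.0, 2.0]]  # indexed f4[x][z]


def normalise(p):
    total = sum(p)
    return [value / total for value in p]


def marginal_x_brute_force():
    """Sum the full product over all variables except x."""
    p = [0.0] * K
    for v, w, x, y, z in product(range(K), repeat=5):
        p[x] += f1[v][w] * f2[w][x] * f3[x][y] * f4[x][z]
    return normalise(p)


def marginal_x_sum_product():
    """Same marginal via local messages on the tree."""
    # Factor-to-variable: f1 sums out its local variable v.
    m_f1_w = [sum(f1[v][w] for v in range(K)) for w in range(K)]
    # Variable-to-factor: w passes on the product of its other incoming
    # messages (here just m_f1_w); f2 then sums out w.
    m_f2_x = [sum(f2[w][x] * m_f1_w[w] for w in range(K)) for x in range(K)]
    # Leaf variables y and z send the constant message 1.
    m_f3_x = [sum(f3[x][y] for y in range(K)) for x in range(K)]
    m_f4_x = [sum(f4[x][z] for z in range(K)) for x in range(K)]
    # Marginal: product of all messages arriving at x.
    return normalise([m_f2_x[x] * m_f3_x[x] * m_f4_x[x] for x in range(K)])
```

The brute-force version touches K^5 terms, while the message-passing version only ever sums over one variable at a time, which is the distributive law at work.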

  17. Approximate Message Passing • Problem: The exact messages from factors to variables may not be closed under products. • Solution: Approximate the marginal as well as possible in the sense of minimal KL divergence. • Expectation Propagation (Minka, 2001): Approximate the marginal by moment-matching resulting in
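A small example may help make moment matching concrete. The sketch below is not taken from the slides: it assumes a Gaussian N(x; mu, sigma^2) multiplied by a step factor that is 1 for x > 0 and 0 otherwise. The product is a truncated Gaussian, which is not Gaussian, so it is replaced by the Gaussian with the same mean and variance. The function names are hypothetical.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def match_moments_positive(mu, sigma):
    """Mean and variance of N(x; mu, sigma^2) restricted to x > 0.

    These are the standard truncated-Gaussian moments; a Gaussian with
    this mean and variance minimises KL(truncated || Gaussian).
    """
    t = mu / sigma
    v = normal_pdf(t) / normal_cdf(t)  # additive correction to the mean
    w = v * (v + t)                    # multiplicative correction to the variance
    return mu + sigma * v, sigma * sigma * (1.0 - w)
```

For example, `match_moments_positive(0.0, 1.0)` returns the moments of a half-normal distribution, whose mean is sqrt(2/pi) and whose variance is 1 - 2/pi.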

  18. Distributed Message Passing • Map-Reduce for IID data • Map: Nodes compute messages m_{f_i→s} from data y_i and m_{s→f_i} • Reduce: Combine messages m_{f_i→s} into p(s) by multiplication • Caveats: • All approximate data factors need the incoming message m_{s→f_i}! • All messages m_{f_i→s} need to be stored if the same data point is considered multiple times [Figure: variable s connected to data y1, y2, y3]
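The map and reduce steps can be sketched for the simplest case, where every message is exactly Gaussian. The model is an assumption made for illustration: a scalar s with a Gaussian prior and IID observations y_i drawn from N(s, noise_var). Messages are stored as natural parameters (precision, precision times mean), so multiplying Gaussians in the reduce step becomes addition.

```python
from functools import reduce


def map_message(y, noise_var):
    """Message from the data factor for y to s, as (precision, precision * mean)."""
    return (1.0 / noise_var, y / noise_var)


def multiply(a, b):
    """Product of two Gaussian messages: natural parameters add."""
    return (a[0] + b[0], a[1] + b[1])


def posterior(data, prior_mean, prior_var, noise_var):
    """Posterior mean and variance of s after combining all messages."""
    prior = (1.0 / prior_var, prior_mean / prior_var)
    messages = (map_message(y, noise_var) for y in data)  # map
    precision, precision_mean = reduce(multiply, messages, prior)  # reduce
    return precision_mean / precision, 1.0 / precision
```

Because this likelihood is exactly Gaussian, the map step does not need the incoming message from s. With approximate factors, as the first caveat on the slide notes, each mapper would also need that incoming message, and the outgoing messages would have to be stored for repeated passes.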

  19. PQL Joint work with Ralf Herbrich & Jurgen Van Gael

  20. PQL as a Platform

  21. PQL I – Augmenting Schemas • People = AUGMENT DB.People ADD weight FLOAT [Diagram: DB.People extended to People with an added weight column]

  22. PQL II – Factor Types Table 1 Table 2 Table 1 Table 1

  23. PQL III – Single Relation Factors • FACTOR Normal(p.weight, 75.0, 25.0) FROM People p [Diagram: People]

  24. PQL IV – Cross Relation Factors • FACTOR Normal(g.weight, p.weight, 1.0) FROM People p, DrVisit g WHERE p.PersonID = g.PersonID [Diagram: DrVisit, People]
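One way to read the two FACTOR statements is that each contributes one factor per row of its FROM/WHERE result, and the model is the product of all of them. The Python sketch below spells out that reading as a log-density. It is an interpretation, not PQL's documented semantics: the slides do not say whether the third argument of Normal is a variance or a standard deviation (a variance is assumed here), and the table contents are invented.

```python
import math


def log_normal(x, mean, variance):
    """Log density of N(x; mean, variance). Variance parameterisation is assumed."""
    return -0.5 * math.log(2.0 * math.pi * variance) - 0.5 * (x - mean) ** 2 / variance


# Hypothetical rows. People.weight is the latent column added by AUGMENT,
# filled here with candidate values; DrVisit.weight is observed.
people = [{"PersonID": 1, "weight": 70.0}, {"PersonID": 2, "weight": 82.0}]
dr_visits = [
    {"PersonID": 1, "weight": 71.0},
    {"PersonID": 1, "weight": 69.5},
    {"PersonID": 2, "weight": 83.0},
]


def log_joint(people, dr_visits):
    total = 0.0
    # FACTOR Normal(p.weight, 75.0, 25.0) FROM People p
    for p in people:
        total += log_normal(p["weight"], 75.0, 25.0)
    # FACTOR Normal(g.weight, p.weight, 1.0) FROM People p, DrVisit g
    #   WHERE p.PersonID = g.PersonID
    for p in people:
        for g in dr_visits:
            if p["PersonID"] == g["PersonID"]:
                total += log_normal(g["weight"], p["weight"], 1.0)
    return total
```

Under this reading the join condition decides which variables share a factor, so the relational schema directly determines the shape of the factor graph.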

  25. PQL as a Unifying Platform

  26. TrueSkill™ Joint work with Tom Minka & Phillip Trelford

  27. TrueSkill™ • Given: Match outcomes: Orderings among k teams consisting of n1, n2, ..., nk players, respectively • Questions: Skill si for each player such that • Global ranking among all players • Fair matches between teams of players

  28. Efficient Approximate Inference • Fast and efficient approximate message passing using Expectation Propagation [Figure: factor graph with Gaussian prior factors on s1, s2, s3, s4, variables t1, t2, t3, and ranking likelihood factors for outcomes y12, y23]
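For the simplest case, two single-player teams and no draws, the Expectation Propagation update has a closed form. The sketch below follows the published two-player TrueSkill update as I understand it and leaves out draws, teams, and skill dynamics. Skills are (mean, variance) pairs, and `beta` is the performance noise. The default values in the usage note are the commonly quoted ones, not values taken from these slides.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def update_two_player(winner, loser, beta):
    """Return updated (mean, variance) pairs for winner and loser."""
    mu_w, var_w = winner
    mu_l, var_l = loser
    c2 = 2.0 * beta ** 2 + var_w + var_l  # variance of the performance difference
    c = math.sqrt(c2)
    t = (mu_w - mu_l) / c
    v = normal_pdf(t) / normal_cdf(t)  # mean correction from moment matching
    w = v * (v + t)                    # variance correction, between 0 and 1
    new_winner = (mu_w + var_w / c * v, var_w * (1.0 - var_w / c2 * w))
    new_loser = (mu_l - var_l / c * v, var_l * (1.0 - var_l / c2 * w))
    return new_winner, new_loser
```

A typical call would be `update_two_player((25.0, (25.0 / 3) ** 2), (25.0, (25.0 / 3) ** 2), beta=25.0 / 6)`. The winner's mean rises, the loser's falls, and both variances shrink. An unexpected result (small or negative t) moves the means more than an expected one.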

  29. TrueSkill: Superfast convergence to True Skills [Figure: level (0 to 40) versus games played (0 to 400) for players char and SQLwildman, each shown under TrueSkill™ and under the Halo 2 Beta ranking]

  30. Applications to Online Gaming • Leaderboard • Global ranking of all players • Matchmaking • For gamers: Most uncertain outcome • For inference: Most informative • Both are equivalent!
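The matchmaking rule "most uncertain outcome" can be illustrated with a simplified criterion: pick the opponent whose predicted win probability is closest to 0.5. This is a sketch under assumptions, not the production match-quality measure. Skills are (mean, variance) pairs, `beta` is the performance noise, and draws are ignored.

```python
import math


def win_probability(player, opponent, beta):
    """P(player beats opponent) under Gaussian skills and performance noise."""
    mu_p, var_p = player
    mu_o, var_o = opponent
    c = math.sqrt(2.0 * beta ** 2 + var_p + var_o)
    return 0.5 * (1.0 + math.erf((mu_p - mu_o) / (c * math.sqrt(2.0))))


def most_uncertain_opponent(player, candidates, beta):
    """Name of the candidate whose match outcome is hardest to predict."""
    return min(
        candidates,
        key=lambda name: abs(win_probability(player, candidates[name], beta) - 0.5),
    )
```

A match whose outcome is close to a coin flip is both the most fun for the players and the one whose result carries the most information about their relative skills, which is the equivalence the slide points out.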

  31. TrueSkill in Xbox 360 and Halo 3

  32. AdPredictor Joint work with Joaquin Quiñonero Candela, Onno Zoeter, Tom Borchert, Phillip Trelford

  33. Why Predict Probability-of-Click? • Display (according to expected revenue) and charge (per click): • $1.00 * 10% = $0.10, charge $0.80 • $2.00 * 4% = $0.08, charge $1.25 • $0.10 * 50% = $0.05, charge $0.05 • Advantages of improved probability estimates: • Increase user satisfaction by better targeting • Fairer charges to advertisers • Increase revenue by showing ads with high click-thru rate
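The figures on slide 33 can be reproduced with a few lines of arithmetic. The slide does not state the pricing rule, so the following is an assumption: ads are ranked by bid times click probability, and each ad is charged the smallest per-click price that keeps its expected revenue at the level of the ad ranked below it (a generalised second-price rule). The last ad has nothing below it, so its charge is modelled as a hypothetical reserve price set to match the slide's $0.05. The ad names are invented.

```python
def rank_and_price(ads, reserve):
    """ads: list of (name, bid, p_click). Returns (name, expected_revenue, charge)."""
    ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
    results = []
    for i, (name, bid, p_click) in enumerate(ranked):
        if i + 1 < len(ranked):
            next_bid, next_p_click = ranked[i + 1][1], ranked[i + 1][2]
            # Smallest price per click that still matches the next ad's
            # expected revenue.
            charge = next_bid * next_p_click / p_click
        else:
            charge = reserve  # assumption: nothing ranked below, use reserve
        results.append((name, bid * p_click, charge))
    return results


# The three ads from the slide: (hypothetical name, bid in $, click probability).
ads = [("A", 1.00, 0.10), ("B", 2.00, 0.04), ("C", 0.10, 0.50)]
ranking = rank_and_price(ads, reserve=0.05)
```

Under this rule the charge depends on the estimated click probabilities of two ads, which is why miscalibrated estimates translate directly into unfair charges.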

  34. adPredictor Details [Diagram: input features Client IP (102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86), Match Type (Exact Match, Broad Match), and Position (ML-1, SB-1, SB-2), combined (+) into P(pClick)]

  35. Training Algorithm in Action w2 w1 + s c No Click Click
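The training step can be sketched as an online Bayesian probit update. The code below is my reading of the published adPredictor update for binary features and omits details such as prior correction and pruning, so treat it as an illustration rather than the deployed algorithm. Each active feature has a Gaussian weight stored as (mean, variance). `beta` is the noise standard deviation, and the feature names are invented.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def predict(weights, active, beta):
    """Predicted click probability for one impression."""
    total_mean = sum(weights[f][0] for f in active)
    total_var = beta ** 2 + sum(weights[f][1] for f in active)
    return normal_cdf(total_mean / math.sqrt(total_var))


def update(weights, active, clicked, beta):
    """Update the weights of the active features in place after one outcome."""
    y = 1.0 if clicked else -1.0
    total_var = beta ** 2 + sum(weights[f][1] for f in active)
    total_sd = math.sqrt(total_var)
    t = y * sum(weights[f][0] for f in active) / total_sd
    v = normal_pdf(t) / normal_cdf(t)
    w = v * (v + t)
    for f in active:
        mean, var = weights[f]
        weights[f] = (mean + y * var / total_sd * v, var * (1.0 - var / total_var * w))


# Hypothetical usage with two active features and standard-normal priors.
weights = {"ip:102.34.12.201": (0.0, 1.0), "match:exact": (0.0, 1.0)}
active = ["ip:102.34.12.201", "match:exact"]
before = predict(weights, active, beta=1.0)
update(weights, active, clicked=True, beta=1.0)
after = predict(weights, active, beta=1.0)
```

Features with large variance absorb most of each update, so rarely seen values (a new client IP, say) adapt quickly while well-established weights move little.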

  36. Client IP: Mean & Variance Low clickers High clickers

  37. UserAgent: Mean Posterior Effects

  38. AdPredictor in Bing Search Engine • AdPredictor is now running on 100% of Paid Search traffic in Microsoft’s Bing Search Engine • Relevance and Click-Through Rate of Ads improved • Calibrated CTR prediction provides solid foundation for further improvements • AdPredictor explored for other tasks such as contextual and display advertising

  39. Matchbox Joint work with David Stern and Ralf Herbrich

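  40. Collaborative Filtering [Figure: rating matrix with users A to D as rows and items 1 to 6 as columns; unknown entries (?) are to be predicted; metadata may be available]

  41. Map Sparse Features To ‘Trait’ Space [Figure: sparse features such as User ID (234566, 456457), Item ID (13456, 654777), Gender (Male, Female), Country (UK, USA), Height (1.2m), and Movie Genre (Horror, Drama, Comedy, Documentary), each mapped to trait values (34, 345, 64, 5474)]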

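  42. Message Passing For Matchbox • Message update functions powered by Infer.NET [Figure: factor graph with variables u11, u21, u12, u22, u01, u02, v11, v21, v12, v22, s1, s2, t1, t2 combined through sum (+) and product (*) factors into the rating r]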
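The idea of mapping sparse features into a shared trait space and scoring a user/item pair there can be sketched with point estimates. This is a deliberately simplified reading: the model on the slides keeps distributions over every weight and infers them by message passing, whereas the code below uses fixed numbers. Bias terms are omitted, and the feature names and trait values are invented.

```python
def to_traits(active_features, trait_table, num_traits):
    """Sum the trait vectors of the active (binary) features."""
    traits = [0.0] * num_traits
    for feature in active_features:
        for k, value in enumerate(trait_table[feature]):
            traits[k] += value
    return traits


def predict_affinity(user_features, item_features, user_table, item_table, num_traits):
    """Inner product of the user's and the item's trait vectors."""
    s = to_traits(user_features, user_table, num_traits)
    t = to_traits(item_features, item_table, num_traits)
    return sum(sk * tk for sk, tk in zip(s, t))


# Hypothetical two-dimensional trait tables.
user_table = {"user:234566": [1.0, 0.0], "gender:male": [0.5, -1.0]}
item_table = {"item:13456": [2.0, 1.0], "genre:horror": [0.0, 1.0]}
affinity = predict_affinity(
    ["user:234566", "gender:male"],
    ["item:13456", "genre:horror"],
    user_table,
    item_table,
    num_traits=2,
)
```

Because metadata features such as gender or genre have their own trait vectors, a brand-new user or item with no ratings still gets a meaningful position in trait space.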

  43. User/Item Taste Space ‘Preference Cone’ for user 145035

  44. Applications

  45. Conclusions

  46. Conclusions • Great variety of data sources and tasks • Challenges: privacy, incentives, exploration • Tools: SQL, No-SQL, HPC • Modelling platform (Factor Graphs & PQL): • Represent uncertainty • Composable models • Distributed, data-centric computation • Applications: TrueSkill, AdPredictor, Matchbox • Thanks!
