Analysing and Modelling Large-Scale Enterprise Data

Thore Graepel

Online Services and Advertising Group

Microsoft Research Cambridge

Overview
  • Complex large-scale data in the enterprise
    • What kind of data is available?
    • What technologies are used?
    • Tasks and enterprise-specific challenges?
  • Methodology:
    • Bayesian Inference in Factor Graph Models
    • PQL: Using SQL to describe probability models
  • Applications:
    • Gamer Rating and Matchmaking: TrueSkill
    • Click-Through Rate Prediction: AdPredictor
    • Large-Scale Recommendations: Matchbox
Complex Data

Joint work with Tom Minka & Phillip Trelford

Data Sources at Microsoft (External)
  • Online Services Division
    • Web index
    • Search and Ad click logs (12-15 TB / day)
    • Hotmail, Instant messaging, Internet Explorer (100s of millions of users)
    • MSN portal and Bing maps
  • Xbox Live Gaming Service
    • User transaction log data
    • Ranking and matchmaking data
    • Game instrumentation for user testing
Data Sources at Microsoft (Internal)
  • Development and Software Instrumentation
    • Watson (customer feedback data)
    • Source depot (MS source code, e.g., Office, Windows)
    • Multilingual technical documentation
  • Business
    • Customer databases
    • Sales and Marketing
Data-Intensive Tasks at Microsoft
  • Prediction of user behaviour and preferences
    • Improve web search
    • Improve targeting for advertising
    • Spam filtering and content prioritisation
  • Improve user experience
    • Matchmaking for games
    • Multi-modal user interfaces (Natal, speech)
  • Improve software development process
    • Improve productivity of developers
    • Analyse software for defects
Technical Infrastructure
  • Relational Databases/SQL
    • Great agility for analysis and reliability for business
    • Limited scalability
    • Need to import data into SQL
  • Windows HPC
    • Complex computations / fine-grained parallelism
    • Need to move data to HPC cluster
  • Cosmos
    • Take the computation to the data
    • Super-efficient stream-based computations
Cosmos Architecture

[Diagram labels, top to bottom: SCOPE, DryadLINQ, Sputnik; Dryad; Cosmos; four cluster machines; four streams.]

Enterprise/Online specific challenges
  • Privacy
    • Privacy requirements limit the ways in which data can be used
    • Interesting trade-offs (differential privacy)
  • Incentives
    • Data produced by self-interested agents
    • Need to design incentive-compatible mechanisms
  • Exploration/Exploitation
    • Results of inference feed back into business process and determine future observations.
    • Need to aim at long-term benefits
Factor Graphs / Trees
  • Definition: Graphical representation of product structure of a function (Wiberg, 1996)
    • Nodes: factors and variables
    • Edges: Dependencies of factors on variables.
  • Question:
    • What are the marginals of the function (all but one variable are summed out)?
    • What is the mode of the function?
Factor Graphs and Bayesian Inference
  • Bayes’ law
  • Factorising prior
  • Factorising likelihood
  • Sum out latent variables
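
The four steps listed above can be written out as follows. This is a generic sketch rather than the slide's own formulas, which did not survive in the transcript. The symbols are assumptions for illustration: $\mathbf{s}$ for the quantities of interest, $\mathbf{y}$ for the observations, $\mathbf{h}$ for latent variables, and $f_i$, $g_j$ for prior and likelihood factors (each $g_j$ typically touches only a few variables). For continuous latent variables the sum becomes an integral.

```latex
\begin{align}
p(\mathbf{s} \mid \mathbf{y}) &= \frac{p(\mathbf{y} \mid \mathbf{s})\, p(\mathbf{s})}{p(\mathbf{y})}
  && \text{(Bayes' law)} \\
p(\mathbf{s}) &= \prod_i f_i(s_i)
  && \text{(factorising prior)} \\
p(\mathbf{y}, \mathbf{h} \mid \mathbf{s}) &= \prod_j g_j(\mathbf{y}, \mathbf{h}, \mathbf{s})
  && \text{(factorising likelihood)} \\
p(\mathbf{s} \mid \mathbf{y}) &\propto \prod_i f_i(s_i) \sum_{\mathbf{h}} \prod_j g_j(\mathbf{y}, \mathbf{h}, \mathbf{s})
  && \text{(sum out latent variables)}
\end{align}
```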

[Factor graph with variable nodes s1, s2, s, t1, t2, d, y.]

Factor Trees: Separation

Observation: Sum of products becomes product of sums of all messages from neighbouring factors to variable!

[Factor tree: variables v, w, x, y, z; factors f1(v,w), f2(w,x), f3(x,y), f4(x,z).]
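
The observation can be checked numerically on the tree from this slide, with factors f1(v,w), f2(w,x), f3(x,y), f4(x,z). The sketch below is a minimal Python illustration, assuming discrete variables with three states each and random positive factor tables (both are assumptions made for the demo, not taken from the slides). It computes the unnormalised marginal of x twice: once by summing the full product, and once by multiplying the messages arriving at x.

```python
import itertools
import random

random.seed(0)
K = 3  # number of states per variable (assumption for the demo)

def random_table():
    """K x K table of positive values standing in for a factor f(a, b)."""
    return [[random.uniform(0.1, 1.0) for _ in range(K)] for _ in range(K)]

# Hypothetical factor tables: f1(v, w), f2(w, x), f3(x, y), f4(x, z)
f1, f2, f3, f4 = (random_table() for _ in range(4))

def brute_force_marginal_x():
    """Sum the full product over v, w, y, z for every value of x."""
    out = [0.0] * K
    for v, w, x, y, z in itertools.product(range(K), repeat=5):
        out[x] += f1[v][w] * f2[w][x] * f3[x][y] * f4[x][z]
    return out

def message_passing_marginal_x():
    """Push each sum inside the product (distributive law)."""
    # Message f1 -> w: sum out v. Variable w forwards it unchanged to f2,
    # because f1 is its only other neighbour.
    m_f1_w = [sum(f1[v][w] for v in range(K)) for w in range(K)]
    # Message f2 -> x: sum out w, weighted by the incoming message.
    m_f2_x = [sum(f2[w][x] * m_f1_w[w] for w in range(K)) for x in range(K)]
    # Messages f3 -> x and f4 -> x: sum out the leaf variables y and z.
    m_f3_x = [sum(f3[x][y] for y in range(K)) for x in range(K)]
    m_f4_x = [sum(f4[x][z] for z in range(K)) for x in range(K)]
    # The marginal of x is the product of all incoming factor messages.
    return [m_f2_x[x] * m_f3_x[x] * m_f4_x[x] for x in range(K)]
```

The brute-force version touches K^5 terms, while the message-passing version only ever sums over one variable at a time.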

Messages: From Factors To Variables

Observation: Factors only need to sum out all their local variables!

[Factor tree: variables w, x, y, z; factors f2(w,x), f3(x,y), f4(x,z).]

Messages: From Variables To Factors

Observation: Variables pass on the product of all incoming messages!

[Factor tree: variables x, y, z; factors f2(w,x), f3(x,y), f4(x,z).]

The Sum-Product Algorithm
  • Three update equations (Aji & McEliece, 1997)
  • Update equations can be directly derived from the distributive law.
  • Efficient for messages in the exponential family.
  • Calculate all marginals at the same time.
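
The equations themselves are missing from the transcript. In the standard textbook form of the sum-product algorithm they read as below; the notation is an assumption, with $N(x)$ the set of factors neighbouring variable $x$, $N(f)$ the set of variables neighbouring factor $f$, and $\mathbf{x}_f$ the variables in $N(f)$.

```latex
\begin{align}
p(x) &\propto \prod_{f \in N(x)} m_{f \to x}(x)
  && \text{(marginal)} \\
m_{f \to x}(x) &= \sum_{\mathbf{x}_f \setminus x} f(\mathbf{x}_f)
  \prod_{x' \in N(f) \setminus \{x\}} m_{x' \to f}(x')
  && \text{(factor to variable)} \\
m_{x \to f}(x) &= \prod_{g \in N(x) \setminus \{f\}} m_{g \to x}(x)
  && \text{(variable to factor)}
\end{align}
```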
Approximate Message Passing
  • Problem: The exact messages from factors to variables may not be closed under products.
  • Solution: Approximate the marginal as well as possible in the sense of minimal KL divergence.
  • Expectation Propagation (Minka, 2001): Approximate the marginal by moment-matching resulting in
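
As an illustration of moment matching, consider a Gaussian N(mu, sigma^2) multiplied by a step function that keeps only t > 0. The product is no longer Gaussian, but the Gaussian closest to it in KL divergence is the one with the same mean and variance. The Python sketch below uses the closed-form moments of a truncated Gaussian and compares them with numerical integration. The function names and the choice of example are assumptions for illustration, not taken from the slides.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def moment_match_truncated(mu, sigma):
    """Mean and variance of N(mu, sigma^2) restricted to t > 0 (closed form)."""
    z = mu / sigma
    v = normal_pdf(z) / normal_cdf(z)  # additive correction to the mean
    w = v * (v + z)                    # multiplicative correction to the variance
    return mu + sigma * v, sigma * sigma * (1.0 - w)

def numeric_moments(mu, sigma, steps=200_000):
    """The same two moments by midpoint-rule integration over (0, mu + 12 sigma)."""
    upper = mu + 12.0 * sigma
    dt = upper / steps
    m0 = m1 = m2 = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        density = math.exp(-0.5 * ((t - mu) / sigma) ** 2)  # unnormalised
        m0 += density
        m1 += t * density
        m2 += t * t * density
    mean = m1 / m0
    return mean, m2 / m0 - mean * mean

closed = moment_match_truncated(1.0, 2.0)
numeric = numeric_moments(1.0, 2.0)
```

The pair `closed` is what an approximate message-passing step would keep as the Gaussian stand-in for the truncated marginal.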
Distributed Message Passing
  • Map-Reduce for IID data
    • Map: Nodes compute messages m_{f_i→s} from data y_i and m_{s→f_i}
    • Reduce: Combine messages m_{f_i→s} into p(s) by multiplication
  • Caveats:
    • All approximate data factors need the incoming message m_{s→f_i}!
    • All messages m_{f_i→s} need to be stored if the same data point is considered multiple times

[Factor graph: shared variable s connected to data nodes y1, y2, y3.]
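
A minimal Python sketch of the map and reduce steps, assuming the simplest possible model: a Gaussian prior on s and exact Gaussian data factors y_i ~ N(s, noise_var). In that special case the factor-to-variable message does not depend on the incoming message, so the caveats about approximate factors do not arise. Gaussians are stored in natural parameters (precision, precision times mean), which turns multiplication of messages into addition. All names are illustrative.

```python
from functools import reduce

def to_natural(mean, var):
    """Gaussian as (precision, precision * mean)."""
    return (1.0 / var, mean / var)

def from_natural(nat):
    precision, precision_mean = nat
    return (precision_mean / precision, 1.0 / precision)

def multiply(a, b):
    """Product of two Gaussian messages: natural parameters add."""
    return (a[0] + b[0], a[1] + b[1])

def factor_to_variable(y, noise_var):
    """Map step: message from the data factor of observation y to s."""
    return to_natural(y, noise_var)

def shard_message(shard, noise_var):
    """Map plus local reduce on one machine; (0, 0) is the uniform message."""
    messages = (factor_to_variable(y, noise_var) for y in shard)
    return reduce(multiply, messages, (0.0, 0.0))

def posterior(prior, shards, noise_var):
    """Global reduce: multiply the prior with the message from every shard."""
    combined = reduce(multiply,
                      (shard_message(s, noise_var) for s in shards),
                      to_natural(*prior))
    return from_natural(combined)

# Prior N(0, 100); three observations split over two hypothetical machines.
mean, var = posterior((0.0, 100.0), [[1.0], [2.0, 3.0]], noise_var=1.0)
```

Because multiplication is associative and commutative, the result does not depend on how the data is split across shards.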

PQL

Joint work with Ralf Herbrich & Jurgen Van Gael

PQL I – Augmenting Schemas

People = AUGMENT DB.People ADD weight FLOAT

[Diagram: table DB.People and the augmented table People with the added weight column.]

PQL II – Factor Types

[Diagram of factor types over Table 1 and Table 2.]

PQL III – Single Relation Factors

FACTOR Normal(p.weight,75.0,25.0) FROM People p

[Diagram: factor attached to the People table.]

PQL IV – Cross Relation Factors

FACTOR Normal(g.weight, p.weight, 1.0)
FROM People p, DrVisit g
WHERE p.PersonID = g.PersonID

[Diagram: factor linking the DrVisit and People tables.]
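
To make the meaning of the two FACTOR statements concrete, here is a hedged Python sketch of the model they appear to describe: a Gaussian prior on each person's latent weight, and one Gaussian observation factor per doctor's visit joined on PersonID. It is not a PQL implementation. The sketch assumes that the third argument of Normal is a variance (the slides do not say whether it is a variance or a standard deviation), and the table contents are invented for illustration.

```python
# Hypothetical rows standing in for the People and DrVisit relations.
people = [{"PersonID": 1}, {"PersonID": 2}]
dr_visit = [
    {"PersonID": 1, "weight": 82.0},
    {"PersonID": 1, "weight": 80.0},
]

def infer_weights(people, dr_visit, prior_mean=75.0, prior_var=25.0, noise_var=1.0):
    """Posterior (mean, variance) of each person's latent weight."""
    posterior = {}
    for p in people:
        # FACTOR Normal(p.weight, 75.0, 25.0) FROM People p
        precision = 1.0 / prior_var
        precision_mean = prior_mean / prior_var
        for g in dr_visit:
            # FACTOR Normal(g.weight, p.weight, 1.0)
            # FROM People p, DrVisit g WHERE p.PersonID = g.PersonID
            if g["PersonID"] == p["PersonID"]:
                precision += 1.0 / noise_var
                precision_mean += g["weight"] / noise_var
        posterior[p["PersonID"]] = (precision_mean / precision, 1.0 / precision)
    return posterior

weights = infer_weights(people, dr_visit)
```

A person with no visits keeps the prior, while each matching visit adds one factor and tightens the posterior.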

TrueSkill™

Joint work with Tom Minka & Phillip Trelford

TrueSkill™

Given:
  • Match outcomes: Orderings among k teams consisting of n1, n2, ..., nk players, respectively

Questions:
  • Skill si for each player such that
  • Global ranking among all players
  • Fair matches between teams of players
Efficient Approximate Inference

Fast and efficient approximate message passing using Expectation Propagation

[Factor graph: Gaussian prior factors on skills s1, s2, s3, s4; variables t1, t2, t3; ranking likelihood factors with outcomes y12, y23.]
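
For the simplest case, two single players and no draws, the message-passing schedule collapses to a closed-form update. The Python sketch below follows the published two-player TrueSkill update with the draw margin set to zero and without skill dynamics. The default values (mean 25, standard deviation 25/3, beta 25/6) are the commonly quoted ones and are an assumption here, as are the function names. Team games with several ranked teams need the full factor graph.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def v(t):
    """Mean correction from moment-matching a truncated Gaussian."""
    return normal_pdf(t) / normal_cdf(t)

def w(t):
    """Variance correction from the same moment match."""
    vt = v(t)
    return vt * (vt + t)

def update(winner, loser, beta=25.0 / 6.0):
    """Return updated (mean, std) for winner and loser after one game."""
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c2 = 2.0 * beta ** 2 + sigma_w ** 2 + sigma_l ** 2
    c = math.sqrt(c2)
    t = (mu_w - mu_l) / c
    new_winner = (mu_w + sigma_w ** 2 / c * v(t),
                  sigma_w * math.sqrt(1.0 - sigma_w ** 2 / c2 * w(t)))
    new_loser = (mu_l - sigma_l ** 2 / c * v(t),
                 sigma_l * math.sqrt(1.0 - sigma_l ** 2 / c2 * w(t)))
    return new_winner, new_loser

# Two new players with the assumed default prior.
new_player = (25.0, 25.0 / 3.0)
after_win, after_loss = update(new_player, new_player)
```

An unexpected result (small or negative t) moves the means a lot, while an expected one barely changes them, which is what drives the fast convergence.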

TrueSkill: Superfast convergence to True Skills

[Chart: Level (0 to 40) against Games played (0 to 400). Series: char (TrueSkill™), SQLwildman (TrueSkill™), char (Halo 2 Beta), SQLwildman (Halo 2 Beta).]


Applications to Online Gaming
  • Leaderboard
    • Global ranking of all players
  • Matchmaking
    • For gamers: Most uncertain outcome
    • For inference: Most informative
    • Both are equivalent!
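
One simple way to read "most uncertain outcome" is to pick the opponent against whom the predicted win probability is closest to 50%. The Python sketch below assumes Gaussian skill beliefs (mean, standard deviation) and Gaussian performance noise with standard deviation beta. The function names, the value of beta and the candidate list are assumptions for illustration, not the deployed matchmaking criterion.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def win_probability(a, b, beta=25.0 / 6.0):
    """P(a's performance exceeds b's) under Gaussian skills and performances."""
    mu_a, sigma_a = a
    mu_b, sigma_b = b
    spread = math.sqrt(2.0 * beta ** 2 + sigma_a ** 2 + sigma_b ** 2)
    return normal_cdf((mu_a - mu_b) / spread)

def best_opponent(player, candidates):
    """Candidate whose match outcome is hardest to predict."""
    return min(candidates, key=lambda c: abs(win_probability(player, c) - 0.5))

player = (25.0, 2.0)
candidates = [(30.0, 2.0), (26.0, 2.0), (15.0, 2.0)]
choice = best_opponent(player, candidates)
```

A game whose outcome is a coin flip under the current beliefs is also the one whose result changes those beliefs the most, which is the equivalence the slide points out.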
AdPredictor

Joint work with Joaquin Quiñonero Candela, Onno Zoeter, Tom Borchert, Phillip Trelford

Why Predict Probability-of-Click?
  • Advantages of improved probability estimates:
    • Increase user satisfaction by better targeting
    • Fairer charges to advertisers
    • Increase revenue by showing ads with high click-through rate

Display (according to expected revenue) and charge (per click):

  Bid      P(click)   Expected revenue   Charge per click
  $1.00    10%        $0.10              $0.80
  $2.00    4%         $0.08              $1.25
  $0.10    50%        $0.05              $0.05
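
The arithmetic on this slide ($1.00 at 10% gives $0.10 with a charge of $0.80, $2.00 at 4% gives $0.08 with a charge of $1.25) is consistent with ranking ads by bid times click probability and charging each ad the smallest bid that would still keep its position. A minimal Python sketch under that assumption follows. The pricing rule and all names are illustrative and not confirmed by the slides. The charge for the last ad depends on a lower-ranked ad or reserve price that the slide does not show, so it is left as None.

```python
def rank_and_price(ads):
    """Order ads by expected revenue and compute a per-click charge."""
    ranked = sorted(ads, key=lambda ad: ad["bid"] * ad["p_click"], reverse=True)
    results = []
    for i, ad in enumerate(ranked):
        expected = ad["bid"] * ad["p_click"]
        if i + 1 < len(ranked):
            below = ranked[i + 1]
            # Smallest bid that still beats the expected revenue of the next ad.
            charge = below["bid"] * below["p_click"] / ad["p_click"]
        else:
            charge = None  # needs a reserve price or a further ad (not shown)
        results.append({"bid": ad["bid"], "expected": expected, "charge": charge})
    return results

# The three ads from the slide: bid in dollars, click probability.
ads = [
    {"bid": 2.00, "p_click": 0.04},
    {"bid": 0.10, "p_click": 0.50},
    {"bid": 1.00, "p_click": 0.10},
]
priced = rank_and_price(ads)
```

With poor click probability estimates both the ordering and the charges change, which is why calibration matters for users, advertisers and revenue alike.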

adPredictor Details

[Diagram: feature values feed a sum (+) that yields P(pClick). Features shown: Client IP (102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86), Match Type (Exact Match, Broad Match), Position (ML-1, SB-1, SB-2).]
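
A hedged Python sketch of the prediction step, following the published description of adPredictor as a probit model with an independent Gaussian belief over one weight per feature value. The weight values, the value of beta and the function names below are made up for illustration.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predict(weights, active, beta=1.0):
    """Click probability from Gaussian weight beliefs.

    weights: dict mapping a feature value to (mean, variance)
    active:  the feature values present in this impression
    """
    total_mean = sum(weights[f][0] for f in active)
    total_var = beta ** 2 + sum(weights[f][1] for f in active)
    return normal_cdf(total_mean / math.sqrt(total_var))

# Hypothetical beliefs for one value of each feature on the slide.
weights = {
    ("Client IP", "15.70.165.9"): (-0.8, 0.3),
    ("Match Type", "Exact Match"): (0.4, 0.1),
    ("Position", "ML-1"): (0.6, 0.1),
}
p_click = predict(weights, list(weights))
```

The variances enter the denominator, so uncertain weights pull the prediction towards 50%.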

Training Algorithm in Action

[Animation: weights w1 and w2 are summed (+) into s, which is linked to the outcome c (Click / No Click).]
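
The online update that such an animation illustrates can be sketched in Python as follows. It uses the moment-matching update from the published adPredictor description, with y = +1 for a click and y = -1 for no click. The prior values, beta and all names are assumptions for illustration, and refinements such as weight pruning or dynamics are left out.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predict(weights, active, beta=1.0):
    total_mean = sum(weights[f][0] for f in active)
    total_var = beta ** 2 + sum(weights[f][1] for f in active)
    return normal_cdf(total_mean / math.sqrt(total_var))

def update(weights, active, clicked, beta=1.0):
    """Return new beliefs after observing one impression."""
    y = 1.0 if clicked else -1.0
    total_var = beta ** 2 + sum(weights[f][1] for f in active)
    total_std = math.sqrt(total_var)
    t = y * sum(weights[f][0] for f in active) / total_std
    v = normal_pdf(t) / normal_cdf(t)  # mean correction
    w = v * (v + t)                    # variance correction
    new_weights = dict(weights)
    for f in active:
        mean, var = weights[f]
        new_weights[f] = (mean + y * var / total_std * v,
                          var * (1.0 - var / total_var * w))
    return new_weights

features = ["w1", "w2"]  # the two weights named on the slide
prior = {f: (0.0, 1.0) for f in features}
before = predict(prior, features)
after_click = update(prior, features, clicked=True)
after_no_click = update(prior, features, clicked=False)
```

Each weight moves in proportion to its own variance, so rarely seen feature values learn quickly while well-established ones stay stable.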

Client IP: Mean & Variance

[Plot: mean and variance of Client IP weights, with low clickers and high clickers marked.]

AdPredictor in Bing Search Engine
  • AdPredictor is now serving 100% of Paid Search traffic in Microsoft’s Bing Search Engine
  • Relevance and Click-Through Rate of Ads improved
  • Calibrated CTR prediction provides solid foundation for further improvements
  • AdPredictor explored for other tasks such as contextual and display advertising
Matchbox

Joint work with David Stern and Ralf Herbrich

Collaborative Filtering

[Diagram: ratings matrix of Users (A to D) by Items (1 to 6) with unknown entries marked "?", annotated "Metadata?".]

Map Sparse Features To ‘Trait’ Space

[Diagram: sparse features mapped to trait space. Features shown: User ID (234566, 456457, 13456, 654777), Item ID (34, 345, 64, 5474), Gender (Male, Female), Movie Genre (Horror, Drama, Comedy, Documentary), Country (UK, USA), Height (1.2m).]

Message Passing For Matchbox

[Factor graph: variables u11, u21, u01, v11, v21, s1, t1 and u12, u22, u02, v12, v22, s2, t2, combined through sum (+) and product (*) factors into the rating r.]

Message update functions powered by Infer.NET
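
The structure of sums and products in this graph can be illustrated with point estimates. The Python sketch below maps sparse user and item features into a two-dimensional trait space and scores the pair by the inner product of the two trait vectors plus a bias. It ignores the Gaussian uncertainty that message passing would maintain over every weight, and all weight values, feature names and function names are invented for illustration.

```python
def traits(weight_rows, sparse_features):
    """Weighted sum of the trait rows of the active features.

    weight_rows:     dict feature -> list of K trait weights
    sparse_features: dict feature -> feature value (only non-zeros listed)
    """
    k = len(next(iter(weight_rows.values())))
    out = [0.0] * k
    for feature, value in sparse_features.items():
        for j, weight in enumerate(weight_rows[feature]):
            out[j] += weight * value
    return out

def affinity(user_rows, item_rows, user_features, item_features, bias=0.0):
    """Inner product of user and item trait vectors plus a bias."""
    s = traits(user_rows, user_features)
    t = traits(item_rows, item_features)
    return sum(a * b for a, b in zip(s, t)) + bias

# Hypothetical weights for two traits.
U = {"UserID=234566": [0.5, -0.2], "Gender=Male": [0.1, 0.3]}
V = {"ItemID=34": [1.0, 0.0], "Genre=Horror": [0.2, 0.4]}

score = affinity(U, V,
                 {"UserID=234566": 1.0, "Gender=Male": 1.0},
                 {"ItemID=34": 1.0, "Genre=Horror": 1.0})
```

Because metadata features such as gender or genre share weights across users and items, a new user or item with no ratings still receives a sensible trait vector.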

User/Item Taste Space

‘Preference Cone’ for user 145035

Conclusions
  • Great variety of data sources and tasks
  • Challenges: privacy, incentives, exploration
  • Tools: SQL, No-SQL, HPC
  • Modelling platform (Factor Graphs & PQL):
    • Represent uncertainty
    • Composable models
    • Distributed, data-centric computation
  • Applications: TrueSkill, AdPredictor, Matchbox
  • Thanks!