
### Challenges in Computational Advertising

Deepayan Chakrabarti (deepay@yahoo-inc.com)

Online Advertising Overview

[Diagram: an ad network (examples: Yahoo, Google, MSN, RightMedia, …) picks ads from advertisers and shows them to a user alongside content from a content provider.]

Advertising Setting: Display

- Graphical display ads
- Mostly for brand awareness
- Revenue based on number of impressions (not clicks)

Advertising Setting: Content Match

- The user intent is unclear
- Revenue depends on number of clicks
- Query (webpage) is long and noisy

This presentation

- Content Match [KDD 2007]: How can we estimate the click-through rate (CTR) of an ad on a page?

[Figure: a page × ad matrix of CTRs (the CTR of ad j on page i), over ~10⁹ pages and ~10⁶ ads.]

This presentation

- Estimating CTR for Content Match [KDD '07]
- Traffic Shaping for Display Advertising [EC '12]

[Figure: a page with display ads and an article summary; a click on the summary leads to the article, with alternate summaries available.]

This presentation

- Estimating CTR for Content Match [KDD '07]
- Traffic Shaping for Display Advertising [EC '12]
  - Recommend articles (not ads)
  - Need high CTR on article summaries
  - Prefer articles on which under-delivering ads can be shown

This presentation

- Estimating CTR for Content Match [KDD '07]
- Traffic Shaping for Display Advertising [EC '12]
- Theoretical underpinnings [COLT '10 best student paper]
  - Represent relationships as a graph
  - Recommendation = Link Prediction
  - Many useful heuristics exist
  - Why do these heuristics work?

Goal: Suggest friends

Estimating CTR for Content Match

- Contextual Advertising
  - Show an ad on a webpage (an "impression")
  - Revenue is generated if a user clicks
- Problem: Estimate the click-through rate (CTR) of an ad on a page

[Figure: the page × ad CTR matrix, ~10⁹ pages × ~10⁶ ads.]

Estimating CTR for Content Match

- Why not use the MLE c/N (clicks over impressions)?
  - Few (page, ad) pairs have N > 0
  - Very few have c > 0 as well
  - The MLE does not differentiate between 0/10 and 0/100
- We have additional information: hierarchies

Estimating CTR for Content Match

- Use an existing, well-understood hierarchy
- Categorize ads and webpages to leaves of the hierarchy
- CTR estimates of siblings are correlated
- The hierarchy allows us to aggregate data
  - Coarser resolutions provide reliable estimates for rare events,
  - which then influence estimation at finer resolutions
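To make the aggregation idea concrete, here is a minimal back-off estimator (an illustrative sketch only, not the model presented in this talk): if a region has too few impressions of its own, fall back to its parent's pooled counts.

```python
# Minimal back-off CTR estimator over a region hierarchy.
# Illustrative sketch only; the talk's actual model is the
# tree-structured Markov chain described below.

def backoff_ctr(region, counts, parent, min_impressions=100):
    """Walk up the hierarchy until enough impressions are pooled."""
    node = region
    while node is not None:
        clicks, impressions = counts.get(node, (0, 0))
        if impressions >= min_impressions:
            return clicks / impressions
        node = parent.get(node)   # fall back to the parent region
    return 0.0                    # no data anywhere on the path

# counts[region] = (clicks, impressions), pooled over all descendants
counts = {
    ("sports", "autos"): (0, 10),    # leaf region: too sparse on its own
    ("sports", None): (50, 20000),   # parent region: reliable estimate
}
parent = {("sports", "autos"): ("sports", None), ("sports", None): None}

print(backoff_ctr(("sports", "autos"), counts, parent))  # 0.0025
```

The sparse leaf inherits the parent's pooled estimate instead of the uninformative MLE of 0.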

Estimating CTR for Content Match

- Region = (page node, ad node)
- Region hierarchy: a cross-product of the page hierarchy and the ad hierarchy

[Figure: the page hierarchy (page classes) crossed with the ad hierarchy (ad classes); a region at level i pairs a page node with an ad node, with level 0 as the root.]

Estimating CTR for Content Match

- Our Approach
- Data Transformation
- Model
- Model Fitting

Data Transformation

- Problem: the variance of the raw MLE depends on the unknown CTR, and all zero-click regions look alike
- Solution: the Freeman-Tukey transform
  - Differentiates regions with 0 clicks
  - Variance stabilization: the variance of the transformed value no longer depends on the underlying CTR
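For a region with c clicks in N impressions, a standard form of the transform is y = √(c/N) + √((c+1)/N) (quoted from the general Freeman-Tukey literature; the talk's exact scaling may differ). A quick sketch:

```python
import math

def freeman_tukey(clicks, impressions):
    """Freeman-Tukey transform of a rate: sqrt(c/N) + sqrt((c+1)/N).
    Unlike the raw MLE c/N, it distinguishes 0/10 from 0/100, and its
    variance is approximately independent of the underlying rate."""
    return (math.sqrt(clicks / impressions)
            + math.sqrt((clicks + 1) / impressions))

print(freeman_tukey(0, 10))    # ≈ 0.316  (0 clicks in 10 impressions)
print(freeman_tukey(0, 100))   # ≈ 0.100  (0 clicks in 100 impressions)
```

The MLE is 0 for both inputs; the transformed values differ, reflecting how much evidence each region actually carries.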

Model

- Goal: Smoothing across siblings in the hierarchy [Huang+Cressie 2000]
- Each region has a latent state Sr
- yr is independent of the hierarchy given Sr
- Sr is drawn from its parent Spa(r)

[Figure: a two-level tree, with a latent parent state Sparent at level i, latent children S1…S4 at level i+1, and observables y1…y4 attached to the children.]

Model

- However, learning Wr, Vr and βr for each region is clearly infeasible
- Assumptions:
  - All regions at the same level ℓ share the same W(ℓ) and β(ℓ)
  - Vr = V/Nr for some constant V, since the variance of the transformed observation shrinks with the number of impressions Nr

[Figure: the per-region graphical model: Spa(r) generates Sr with transition noise wr (variance Wr); Sr, together with covariates ur and coefficients βr, generates the observation yr with noise variance Vr.]

Model

- Implications:
  - W(ℓ) determines the degree of smoothing
  - Large W(ℓ):
    - Sr varies greatly from Spa(r)
    - Each region learns its own Sr
    - No smoothing
  - W(ℓ) → 0:
    - All Sr are identical
    - A regression model on the features ur is learnt
    - Maximum smoothing

Model

- Implications:
  - W(ℓ) determines the degree of smoothing
  - Var(Sr) increases from root to leaf, so estimates are better at coarser resolutions
Model

- Implications:
  - W(ℓ) determines the degree of smoothing
  - Var(Sr) increases from root to leaf
  - Correlations among siblings at level ℓ depend only on the level of their least common ancestor: the deeper the common ancestor, the higher the correlation

Estimating CTR for Content Match

- Our Approach
- Data Transformation (Freeman-Tukey)
- Model (Tree-structured Markov Chain)
- Model Fitting

Model Fitting

- Fitting using a Kalman filtering algorithm
  - Filtering: recursively aggregate data from leaves to root
  - Smoothing: propagate information from root to leaves
- Complexity: linear in the number of regions, in both time and space

Model Fitting

- Fitting using a Kalman filtering algorithm
  - Filtering: recursively aggregate data from leaves to root
  - Smoothing: propagate information from root to leaves
- The Kalman filter requires knowledge of β, V, and W
  - EM wrapped around the Kalman filter estimates them
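The core update inside such a filter can be illustrated by a single scalar shrinkage step (a simplified sketch under the model's stated assumptions, not the full leaves-to-root recursion): the estimate of Sr trades off the observation yr against the parent's state, with a gain set by W and Vr = V/Nr.

```python
def smooth_toward_parent(y_r, n_r, s_parent, W, V):
    """One shrinkage step of the tree model (illustrative scalar sketch,
    not the full Kalman filter/smoother): combine the noisy observation
    y_r ~ N(S_r, V/n_r) with the prior S_r ~ N(S_parent, W)."""
    v_r = V / n_r                 # observation noise shrinks with data
    gain = W / (W + v_r)          # how much to trust the observation
    return s_parent + gain * (y_r - s_parent)

# A rare region (10 impressions) is pulled strongly toward its parent;
# a well-observed region (10000 impressions) keeps most of its own signal.
print(smooth_toward_parent(0.30, 10, 0.10, W=1e-4, V=1.0))
print(smooth_toward_parent(0.30, 10000, 0.10, W=1e-4, V=1.0))
```

This is exactly the behavior described above: small W means maximum smoothing toward the parent, large W means each region keeps its own estimate.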

Experiments

- 503M impressions
- 7-level hierarchy, of which the top 3 levels were used
- Zero clicks in:
  - 76% of regions in level 2
  - 95% of regions in level 3
- Full dataset DFULL, and a 2/3 sample DSAMPLE

Experiments

- Estimate CTRs for all regions R in level 3 with zero clicks in DSAMPLE
- Some of these regions (the set R>0) get clicks in DFULL
- A good model should predict higher CTRs for R>0 than for the other regions in R

Experiments

- We compared 4 models
- TS: our tree-structured model
- LM (level-mean): each level smoothed independently
- NS (no smoothing): CTR proportional to 1/Nr
- Random: Assuming |R>0| is given, randomly predict the membership of R>0 out of R

Experiments

- MLE = 0 everywhere, since 0 clicks were observed
- What about the estimated CTR?

[Figure: estimated CTR vs. impressions, for No Smoothing (NS) and our model (TS). TS shows variability inherited from coarser resolutions at small N, and stays close to the MLE for large N.]

Estimating CTR for Content Match

- We presented a method to estimate
- rates of extremely rare events
- at multiple resolutions
- under severe sparsity constraints
- Key points:
- Tree-structured generative model
- Extremely fast parameter fitting

Traffic Shaping

- Estimating CTR for Content Match [KDD ‘07]
- Traffic Shaping for Display Advertising [EC ‘12]
- Theoretical underpinnings [COLT ‘10 best student paper]

Traffic Shaping

- Which article summary should be picked? Ans: the one with the highest expected CTR
- Which ad should be displayed? Ans: the ad that minimizes underdelivery

[Figure: a front page choosing a summary from an article pool; the article page then shows display ads.]

Underdelivery

- Advertisers are guaranteed some impressions (say, 1M) over some time period (say, 2 months)
  - only to users matching their specs
  - only when those users visit certain types of pages
  - only on certain positions on the page
- An underdelivering ad is one that is likely to miss its guarantee
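In its simplest form, underdelivery is the shortfall against each guarantee (illustrative definition only; the talk computes the delivered quantity from traffic forecasts and the other ads in the system):

```python
def underdelivery(demand, delivered):
    """Impressions an ad is projected to miss: max(0, guarantee - delivery).
    Illustrative sketch; in the talk this is computed over a forecast
    bipartite graph of supply and demand."""
    return {ad: max(0, demand[ad] - delivered.get(ad, 0)) for ad in demand}

# Guarantees (impressions owed) vs. forecast delivery under the current plan.
demand = {"ad_a": 1_000_000, "ad_b": 500_000}
delivered = {"ad_a": 800_000, "ad_b": 600_000}
print(underdelivery(demand, delivered))  # {'ad_a': 200000, 'ad_b': 0}
```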

Underdelivery

- How can underdelivery be computed?
  - Needs user traffic forecasts
  - Depends on the other ads in the system
- An ad-serving system will try to minimize under-delivery on this graph

[Figure: a bipartite graph from forecasted impressions ℓ = (user, article, position), each with supply sℓ, to the ad inventory j, each with demand dj.]

Traffic Shaping

- Which article summary should be picked? Ans: the one with the highest expected CTR
- Which ad should be displayed? Ans: the ad that minimizes underdelivery
- Goal: Combine the two

Traffic Shaping

- Goal: Bias the article summary selection to
  - reduce under-delivery,
  - with only an insignificant drop in CTR,
  - AND do this in real-time

Outline

- Formulation as an optimization problem
- Real-time solution
- Empirical results

Formulation

[Figure: a four-layer graph. User nodes k, with supply sk, connect via traffic shaping fractions wki and CTRs cki to (user, article) nodes i; these expand into (user, article, position) nodes ℓ ("fully qualified impressions"), which connect via ad delivery fractions φℓj to ad nodes j with demand dj.]

Goal: Infer the traffic shaping fractions wki

Formulation

- Full traffic shaping graph:
  - all forecasted user traffic × all available articles
  - users arriving at the homepage, or directly on an article page
- Goal: Infer wki
  - but we are forced to infer φℓj as well

[Figure: the full traffic shaping graph, annotated with the shaping fractions wki and CTRs cki.]

Formulation

- Objective: minimize total underdelivery while satisfying the demand constraints
- Demand for ad j is met by the total user traffic flowing to j, accounting for CTR loss

[Figure: the optimization over the k → i → ℓ → j graph, with supply sk, shaping fractions wki, CTRs cki, and the underdelivery and demand terms annotated.]

Formulation

- Constraints:
  - satisfy the demand constraints
  - bounds on the traffic shaping fractions
  - shape only available traffic
  - valid ad delivery fractions

[Figure: the constraint set annotated on the k → i → ℓ → j graph.]

Key Transformation

- Define zℓj = the fraction of supply ℓ that is shown ad j, assuming the user always clicks the article
- This allows a reformulation solely in terms of the new variables zℓj

Formulation

- The resulting convex program can be solved optimally

Formulation

- But we have another problem
- At runtime, we must shape every incoming user without looking at the entire graph
- Solution:
- Periodically solve the convex problem offline
- Store a cache derived from this solution
- Reconstruct the optimal solution for each user at runtime, using only the cache
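A hypothetical sketch of that serve-time pattern (the duals alpha[j] are assumed to come from the periodic offline solve, and the scoring rule below is illustrative only, not the paper's exact KKT reconstruction):

```python
def pick_article(user_ctr, eligible_ads, alpha):
    """Choose an article for one incoming user using only a small cache:
    per-article expected CTR plus cached dual prices alpha[j] of the
    under-delivering ads that article can serve. Hypothetical scoring
    rule for illustration; the paper reconstructs the exact optimal
    shaping fractions from the cached duals via KKT conditions."""
    def score(article):
        ctr = user_ctr[article]
        # credit for routing traffic toward under-delivering ads
        bonus = sum(alpha.get(ad, 0.0) for ad in eligible_ads[article])
        return ctr * (1.0 + bonus)
    return max(user_ctr, key=score)

user_ctr = {"art1": 0.050, "art2": 0.048}
eligible_ads = {"art1": [], "art2": ["ad_underdelivering"]}
alpha = {"ad_underdelivering": 0.5}   # cached from the offline solve
print(pick_article(user_ctr, eligible_ads, alpha))  # 'art2'
```

Even though art1 has the higher raw CTR, the cached dual price biases the choice toward art2, whose pages can serve the under-delivering ad.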

Outline

- Formulation as an optimization problem
- Real-time solution
- Empirical results

Real-time solution

- Cache the dual variables of the offline solution; reconstruct the per-user solution from them
- All constraints can be expressed as constraints on σℓ

Real-time solution

- Three KKT conditions link the cached duals to the per-user solution:
  1. The shape of Σj zℓj depends on the cached duals αj
  2. σℓ = 0 unless Σj zℓj = maxℓ Σj zℓj
  3. Σℓ σℓ is constant for all i connected to k

[Figure: the k → i → ℓ → j graph for one user, with bounds Li and Ui on Σj zℓj and multipliers σℓ attached to the ℓ nodes.]

Real-time solution

- Algorithm:
  - Initialize σℓ = 0
  - Compute Σj zℓj from KKT condition (1)
  - If constraints are unsatisfied, increase σℓ while maintaining conditions (2) and (3)
  - Repeat until the constraints hold
  - Extract wki from the zℓj
3

Results

- Data:
  - historical traffic logs from April 2011
  - 25K user nodes
  - total supply weight > 50B impressions
  - 100K ads
- We compare our model to a scheme that
  - picks articles to maximize expected CTR, and
  - picks ads to display via a separate greedy method

[Figure: lift in impressions delivered to underperforming ads vs. the fraction of traffic that is not shaped: nearly a threefold improvement via traffic shaping. A companion plot of average CTR (as a percentage of the maximum CTR) against the same x-axis shows a CTR drop of less than 10%.]

Summary

- 3x underdelivery reduction with <10% CTR drop
- 2.6x reduction with 4% CTR drop
- Runtime application needs only a small cache

Theoretical Underpinnings

- Estimating CTR for Content Match [KDD ‘07]
- Traffic Shaping for Display Advertising [EC ‘12]
- Theoretical underpinnings [COLT ‘10 best student paper]

Previous Empirical Studies*

- Simple link prediction heuristics perform well empirically, especially if the graph is sparse
- How do we justify these observations?

[Chart: link prediction accuracy of Random, Shortest Path, Common Neighbors, Adamic/Adar, and an ensemble of short paths.]

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

Link Prediction – Generative Model

- Model:
  - Nodes are uniformly distributed points in a latent space (a unit volume universe)
  - This space has a distance metric
  - Points close to each other are likely to be connected in the graph
  - Logistic distance function (Raftery et al., 2002)

Link Prediction – Generative Model

[Figure: link probability as a logistic function of distance: probability 1 at distance 0, falling through ½ at radius r, with α determining the steepness. Closer pairs have a higher probability of linking.]

- Link prediction ≈ find the nearest neighbor who is not currently linked to the node
- Equivalent to inferring distances in the latent space
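A toy instantiation of this generative model, with all parameter values chosen purely for illustration:

```python
import math
import random

def link_prob(d, r=0.3, alpha=20.0):
    """Logistic distance function: P(link) = 1 / (1 + exp(alpha*(d - r))).
    Pairs closer than radius r link with probability above 1/2; alpha
    sets the steepness. (Parameter values here are illustrative.)"""
    return 1.0 / (1.0 + math.exp(alpha * (d - r)))

random.seed(0)
# Nodes uniformly distributed in the unit square (a 2-D latent space).
pts = [(random.random(), random.random()) for _ in range(100)]
dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])

# Sample the graph: each pair links independently with link_prob(distance).
edges = {(i, j) for i in range(100) for j in range(i + 1, 100)
         if random.random() < link_prob(dist(pts[i], pts[j]))}
print(len(edges))   # edges concentrate among nearby points
```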

Common Neighbors

- Pr₂(i, j) = Pr(common neighbor | dij)
- This is a product of two logistic probabilities, integrated over a volume determined by dij

[Figure: nodes i and j whose neighborhoods overlap; common neighbors lie in the intersection.]

Common Neighbors

- OPT = the node closest to i
- MAX = the node with the most common neighbors with i
- Theorem: with high probability, dOPT ≤ dMAX ≤ dOPT + 2[ε/V(1)]^(1/D)
- Hence link prediction by common neighbors is asymptotically optimal

Common Neighbors: Distinct Radii

- Node k has radius rk; i→k if dik ≤ rk (a directed graph)
- rk captures the popularity of node k
- "Weighted" common neighbors: predict the (i, j) pairs with the highest Σ w(r)·η(r), where η(r) is the number of common neighbors of radius r and w(r) is the weight for nodes of radius r

[Figure: nodes i and j with common neighbors of differing radii rk; a "type 2" common neighbor k has i→k and k→j.]

Adamic/Adar

[Figure: the informativeness of a common neighbor as a function of its radius r: for small r, the presence of a common neighbor is very informative; for r close to the maximum radius, its absence is very informative; the 1/r weighting applies in the range where real-world graphs generally fall.]

ℓ-hop Paths

- Common neighbors = 2-hop paths
- For longer paths:
  - the bounds are weaker
  - for ℓ′ ≥ ℓ, we need ηℓ′ ≫ ηℓ to obtain similar bounds
- This justifies the exponentially decaying weight given to longer paths by the Katz measure
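A pure-Python sketch of the Katz measure mentioned above (the toy graph and β value are illustrative):

```python
def matmul(X, Y):
    """Multiply two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz_scores(A, beta=0.1, max_len=5):
    """Katz score: sum over path lengths l of beta^l * (number of
    length-l walks), so longer paths get exponentially smaller weight."""
    n = len(A)
    S = [[0.0] * n for _ in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    for l in range(1, max_len + 1):
        P = matmul(P, A)          # P now counts walks of length l
        for i in range(n):
            for j in range(n):
                S[i][j] += beta ** l * P[i][j]
    return S

# Path graph 0-1-2-3: nodes 0 and 2 are joined by one 2-hop path.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
S = katz_scores(A)
print(S[0][2])   # ≈ 0.0103: 0.1^2 * 1 (length 2) + 0.1^4 * 3 (length 4)
```

Each extra hop contributes a factor of β, which is exactly the exponential decay the bound above motivates.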

Summary

- Three key ingredients:
  - Closer points are likelier to be linked (the Small World Model: Watts & Strogatz 1998, Kleinberg 2001)
  - The triangle inequality holds (necessary to extend to ℓ-hop paths)
  - Points are spread uniformly at random (otherwise properties would depend on location as well as distance)

Summary

- In sparse graphs, paths of length 3 or more help in prediction: the number of paths matters, not the length
- Differentiating between different degrees is important
- For large dense graphs, common neighbors are enough

[Chart: the same link prediction accuracy comparison as before (Random, Shortest Path, Common Neighbors, Adamic/Adar, ensemble of short paths).]

Conclusions

- Discussed three problems
- Estimating CTR for Content Match
- Combat sparsity by hierarchical smoothing
- Traffic Shaping for Display Advertising
- Joint optimization of CTR and underdelivery-reduction
- Optimal traffic shaping at runtime using cached duals
- Theoretical underpinnings
- Latent space model
- Link prediction ≈ finding nearest neighbors in this space

Other Work

- Computational Advertising
- Combining IR with click feedback
- Multi-armed bandits using hierarchies
- Online learning under finite ad lifetimes

- Web Search
- Finding Quicklinks
- Titles for Quicklinks
- Incorporating tweets into search results
- Website clustering
- Webpage segmentation
- Template detection
- Finding hidden query aspects

- Graph Mining
- Epidemic thresholds
- Non-parametric prediction in dynamic graphs
- Graph sampling
- Graph generation models
- Community detection

Model

- Goal: Smoothing across siblings in the hierarchy
- Our approach:
  - Each region has a latent state Sr
  - yr is independent of the hierarchy given Sr
  - Sr is drawn from the parent region Spa(r)

Data Transformation

- Problem: the variance of the MLE depends on the CTR (N·Var(MLE) is not flat as a function of the MLE CTR)
- Solution: the Freeman-Tukey transform
  - Differentiates regions with 0 clicks
  - Variance stabilization: N·Var(yr) is approximately constant in the mean of yr

[Figure: N·Var(MLE) plotted against the MLE CTR, and N·Var(yr) against the mean of yr.]
