
### Challenges in Computational Advertising

Deepayan Chakrabarti ([email protected])

Online Advertising Overview

[Diagram: advertisers supply ads to an ad network, which picks ads to show alongside a content provider's content to the user. Examples: Yahoo, Google, MSN, RightMedia, …]

Advertising Setting: Display

- Graphical display ads
- Mostly for brand awareness
- Revenue based on number of impressions (not clicks)

Advertising Setting: Sponsored Search and Content Match

- Text ads
- Sponsored Search: pick ads matching the search query
- Content Match: match ads to the content of the webpage

Advertising Setting: Content Match

- The user intent is unclear
- Revenue depends on the number of clicks
- The query (the webpage) is long and noisy

This presentation

- Content Match [KDD 2007]:
- How can we estimate the click-through rate (CTR) of an ad on a page?

CTR for ad j on page i: ~10⁹ pages, ~10⁶ ads

This presentation

- Estimating CTR for Content Match [KDD ‘07]
- Traffic Shaping for Display Advertising [EC ‘12]

[Screenshot: display ads shown next to an article summary the user can click, with alternate articles available.]

This presentation

- Estimating CTR for Content Match [KDD ‘07]
- Traffic Shaping for Display Advertising [EC ‘12]
- Recommend articles (not ads)
- Need high CTR on article summaries
- Prefer articles on which under-delivering ads can be shown

This presentation

- Estimating CTR for Content Match [KDD ‘07]
- Traffic Shaping for Display Advertising [EC ‘12]
- Theoretical underpinnings [COLT ‘10 best student paper]
- Represent relationships as a graph
- Recommendation = Link Prediction
- Many useful heuristics exist
- Why do these heuristics work?

[Figure: a social graph; goal: suggest friends.]

Estimating CTR for Content Match

- Contextual Advertising
- Show an ad on a webpage (“impression”)
- Revenue is generated if a user clicks
- Problem: Estimate the click-through rate (CTR) of an ad on a page

CTR for ad j on page i: ~10⁹ pages, ~10⁶ ads

Estimating CTR for Content Match

- Why not use the MLE?
- Few (page, ad) pairs have N>0
- Very few have c>0 as well
- MLE does not differentiate between 0/10 and 0/100
- We have additional information: hierarchies
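A toy sketch of why the MLE fails here (numbers invented): the MLE is simply clicks over impressions, so all zero-click regions get the same estimate, no matter how much evidence backs them.

```python
# MLE of CTR: clicks / impressions (toy illustration).
def mle_ctr(clicks, impressions):
    return clicks / impressions if impressions > 0 else 0.0

# Both regions get an estimate of 0, yet 0 clicks in 100 impressions
# is far stronger evidence of a low CTR than 0 clicks in 10.
print(mle_ctr(0, 10))   # -> 0.0
print(mle_ctr(0, 100))  # -> 0.0
```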

Estimating CTR for Content Match

- Use an existing, well-understood hierarchy
- Categorize ads and webpages to leaves of the hierarchy
- CTR estimates of siblings are correlated
- The hierarchy allows us to aggregate data

- Coarser resolutions
- provide reliable estimates for rare events
- which then influence estimation at finer resolutions

Estimating CTR for Content Match

- Region = (page node, ad node)
- Region Hierarchy: a cross-product of the page hierarchy and the ad hierarchy

[Figure: page classes and ad classes from the two hierarchies are crossed to form regions, from Level 0 (the root) down to Level i.]

Estimating CTR for Content Match

- Our Approach
- Data Transformation
- Model
- Model Fitting

Data Transformation

- Problem: the variance of the raw MLE depends on the unknown CTR
- Solution: Freeman-Tukey transform
- Differentiates regions with 0 clicks
- Variance stabilization: the variance of the transformed value is roughly independent of the CTR
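A sketch of the transform, assuming the common form y = √(c/N) + √((c+1)/N) for c clicks out of N impressions (the paper's exact variant may differ):

```python
import math

def freeman_tukey(clicks, impressions):
    # y = sqrt(c/N) + sqrt((c+1)/N): approximately variance-stabilizing,
    # and strictly decreasing in N when clicks = 0.
    return (math.sqrt(clicks / impressions)
            + math.sqrt((clicks + 1) / impressions))

# Unlike the MLE, the transform separates 0/10 from 0/100:
print(freeman_tukey(0, 10))   # -> ~0.316
print(freeman_tukey(0, 100))  # -> ~0.100
```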

Model

- Goal: Smoothing across siblings in the hierarchy [Huang+Cressie/2000]
- Each region has a latent state Sr
- yr is independent of the hierarchy given Sr
- Sr is drawn from its parent Spa(r)

[Figure: latent states S1…S4 at level i+1 are each drawn from Sparent at level i; each observable yr depends only on its own Sr.]
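The generative story can be sketched as a toy simulation (the level variances W, the observation-variance constant V, and the tree shape are invented values, and the regression term on features ur is omitted):

```python
import random

# Toy generative process: each region's latent state S_r is drawn
# around its parent's state with a level-specific variance W, and the
# observation y_r is S_r plus noise with variance V / N_r.
random.seed(0)

W = {1: 0.05, 2: 0.02}  # level-specific transition variances (assumed)
V = 1.0                 # observation-variance constant (assumed)

def sample_tree(s_root, n_children=3, n_leaf_imps=50):
    tree = {"root": s_root, "leaves": []}
    for _ in range(n_children):
        s_mid = random.gauss(s_root, W[1] ** 0.5)      # level-1 state
        for _ in range(n_children):
            s_leaf = random.gauss(s_mid, W[2] ** 0.5)  # level-2 state
            y = random.gauss(s_leaf, (V / n_leaf_imps) ** 0.5)
            tree["leaves"].append(y)
    return tree

t = sample_tree(0.3)
print(len(t["leaves"]))  # -> 9 leaf observations
```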

Model

- However, learning Wr, Vr and βr for each region is clearly infeasible
- Assumptions:
- All regions at the same level ℓ share the same W(ℓ) and β(ℓ)
- Vr = V/Nr for some constant V, since the variance of the transformed observation yr scales as 1/Nr

[Graphical model: Sr is drawn from Spa(r) with variance Wr; yr is drawn from Sr and features ur (coefficients βr) with variance Vr.]

Model

- Implications:
- W(ℓ) determines the degree of smoothing
- W(ℓ) → ∞:
- Sr varies greatly from Spa(r)
- Each region learns its own Sr
- No smoothing
- W(ℓ) → 0:
- All Sr are identical
- A regression model on features ur is learnt
- Maximum smoothing

[Graphical model as before: Spa(r) → Sr → yr, with parameters Wr, Vr, βr and features ur.]
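The two extremes can be checked with a one-step precision-weighted update, which is the basic smoothing operation behind the model (a sketch with arbitrary numbers, not the full Kalman recursion):

```python
def smoothed_state(y, s_parent, W, V):
    # Precision-weighted combination of the observation (variance V)
    # and the parent's state (variance W).
    return (y / V + s_parent / W) / (1.0 / V + 1.0 / W)

y, s_pa, V = 0.8, 0.2, 0.1
print(smoothed_state(y, s_pa, W=1e6, V=V))   # -> ~0.8: no smoothing
print(smoothed_state(y, s_pa, W=1e-6, V=V))  # -> ~0.2: maximum smoothing
```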

Model

- Implications:
- W(ℓ) determines the degree of smoothing
- Var(Sr) increases from root to leaf
- Better estimates at coarser resolutions

[Graphical model as before.]

Model

- Implications:
- W(ℓ) determines the degree of smoothing
- Var(Sr) increases from root to leaf
- Correlations among siblings at level ℓ:
- Depend only on the level of the least common ancestor

[Figure: regions with a deeper least common ancestor are more correlated, e.g. Corr(siblings) > Corr(cousins).]

Estimating CTR for Content Match

- Our Approach
- Data Transformation (Freeman-Tukey)
- Model (Tree-structured Markov Chain)
- Model Fitting

Model Fitting

- Fitting using a Kalman filtering algorithm
- Filtering: Recursively aggregate data from leaves to root
- Smoothing: Propagate information from root to leaves

- Complexity: linear in the number of regions, for both time and space

[Figure: filtering passes information up the tree; smoothing passes it back down.]

Model Fitting

- Fitting using a Kalman filtering algorithm
- Filtering: Recursively aggregate data from leaves to root
- Smoothing: Propagate information from root to leaves

- Kalman filter requires knowledge of β, V, and W
- EM wrapped around the Kalman filter


Experiments

- 503M impressions
- 7-level hierarchy of which the top 3 levels were used
- Zero clicks in
- 76% regions in level 2
- 95% regions in level 3

- Full dataset D_FULL, and a 2/3 sample D_SAMPLE

Experiments

- Estimate CTRs for all regions R in level 3 with zero clicks in D_SAMPLE
- A subset R>0 of these regions gets clicks in D_FULL
- A good model should predict higher CTRs for R>0 than for the other regions in R

Experiments

- We compared 4 models
- TS: our tree-structured model
- LM (level-mean): each level smoothed independently
- NS (no smoothing): CTR proportional to 1/Nr
- Random: Assuming |R>0| is given, randomly predict the membership of R>0 out of R
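The evaluation amounts to a ranking check, sketched here with invented estimates and labels:

```python
# Rank the zero-click regions by estimated CTR and test whether the
# regions that later received clicks (R_>0 in D_FULL) rank highest.
# Estimates and labels below are made up for illustration.
est_ctr = {"r1": 0.004, "r2": 0.0007, "r3": 0.003, "r4": 0.0001}
got_clicks = {"r1", "r3"}  # hypothetical R_>0

ranked = sorted(est_ctr, key=est_ctr.get, reverse=True)
top_k = set(ranked[: len(got_clicks)])
precision = len(top_k & got_clicks) / len(got_clicks)
print(ranked)     # -> ['r1', 'r3', 'r2', 'r4']
print(precision)  # -> 1.0
```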

Experiments

- MLE=0 everywhere, since 0 clicks were observed
- What about estimated CTR?

[Figure: estimated CTR vs. impressions under No Smoothing (NS) and under our model (TS); the TS estimates show variability inherited from coarser resolutions and come close to the MLE for large N.]

Estimating CTR for Content Match

- We presented a method to estimate
- rates of extremely rare events
- at multiple resolutions
- under severe sparsity constraints

- Key points:
- Tree-structured generative model
- Extremely fast parameter fitting

Traffic Shaping

- Estimating CTR for Content Match [KDD ‘07]
- Traffic Shaping for Display Advertising [EC ‘12]
- Theoretical underpinnings [COLT ‘10 best student paper]

Traffic Shaping

Which article summary should be picked?

Ans: The one with the highest expected CTR

Which ad should be displayed?

Ans: The ad that minimizes underdelivery

[Figure: the article pool from which summaries are chosen.]

Underdelivery

- Advertisers are guaranteed some impressions (say, 1M) over some time (say, 2 months)
- only to users matching their specs
- only when they visit certain types of pages
- only on certain positions on the page

- An underdelivering ad is one that is likely to miss its guarantee

Underdelivery

- How can underdelivery be computed?
- Need user traffic forecasts
- Depends on other ads in the system
- An ad-serving system will try to minimize under-delivery on this graph

[Figure: bipartite graph between forecasted impressions ℓ = (user, article, position) with supply sℓ, and ads j in the ad inventory with demand dj.]
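A toy computation of underdelivery on such a bipartite graph (supply, demand, and eligibility are invented, and a simple greedy allocation stands in for a real ad server):

```python
# Each supply node's impressions go, one at a time, to the eligible
# ad with the largest remaining demand; whatever demand is left over
# at the end is the underdelivery.
supply = {"s1": 100, "s2": 80}
demand = {"a": 120, "b": 50}
eligible = {"s1": ["a", "b"], "s2": ["a"]}

remaining = dict(demand)
for s, imps in supply.items():
    for _ in range(imps):
        j = max(eligible[s], key=lambda ad: remaining[ad])
        if remaining[j] > 0:
            remaining[j] -= 1

underdelivery = sum(v for v in remaining.values() if v > 0)
print(remaining, underdelivery)  # -> {'a': 0, 'b': 35} 35
```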

Traffic Shaping

Which article summary should be picked?

Ans: The one with the highest expected CTR

Which ad should be displayed?

Ans: The ad that minimizes underdelivery

Goal: Combine the two

Traffic Shaping

- Goal: Bias the article summary selection to
- reduce under-delivery
- with only an insignificant drop in CTR
- AND do this in real-time

Outline

- Formulation as an optimization problem
- Real-time solution
- Empirical results

Formulation

Nodes in the traffic shaping graph:

- k: (user), with supply sk
- i: (user, article), reached with traffic shaping fraction wki and CTR cki
- ℓ: (user, article, position), a “Fully Qualified Impression”
- j: (ads), with demand dj, reached with ad delivery fraction φℓj

Goal: Infer the traffic shaping fractions wki

Formulation

- Full traffic shaping graph:
- All forecasted user traffic × all available articles
- arriving at the homepage,
- or directly on an article page
- Goal: Infer wki
- But forced to infer φℓj as well

[Figure: the full traffic shaping graph, with traffic shaping fractions wki and CTRs cki on its edges.]

Formulation

[Optimization problem over the graph k → i → ℓ → j: minimize total underdelivery, subject to the demand constraints; the total user traffic flowing to ad j (accounting for CTR loss through sk, wki, and cki) must satisfy its demand.]

Formulation

Constraints on the graph k → i → ℓ → j:

- Satisfy demand constraints
- Bounds on the traffic shaping fractions
- Shape only available traffic
- Ad delivery fractions

Key Transformation

- This allows a reformulation solely in terms of new variables zℓj
- zℓj = fraction of supply sℓ that is shown ad j, assuming the user always clicks the article

Formulation

- The resulting convex program can be solved optimally

Formulation

- But we have another problem
- At runtime, we must shape every incoming user without looking at the entire graph

- Solution:
- Periodically solve the convex problem offline
- Store a cache derived from this solution
- Reconstruct the optimal solution for each user at runtime, using only the cache
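The offline/online split can be sketched as a skeleton (every name here is hypothetical, and the offline convex solver is stubbed out with fixed dual values):

```python
def solve_offline(graph):
    # Periodically solve the convex program offline; cache only the
    # dual variables (stub values stand in for a real solver).
    return {"alpha": {"ad1": 0.7, "ad2": 0.2}}

CACHE = solve_offline(graph=None)

def shape_user(eligible_ads):
    # At runtime, reconstruct the decision for one user from the
    # cache alone: here, prefer the eligible ad with the largest dual.
    return max(eligible_ads, key=lambda j: CACHE["alpha"].get(j, 0.0))

print(shape_user(["ad1", "ad2"]))  # -> ad1
```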

Outline

- Formulation as an optimization problem
- Real-time solution
- Empirical results

Real-time solution

[Figure: the duals from the offline solution are cached; the rest is reconstructed from them at runtime.]

All constraints can be expressed as constraints on σℓ

Real-time solution

3 KKT conditions:

1. The shape depends on the cached duals αj
2. σℓ = 0 unless Σzℓj = maxℓ Σzℓj
3. Σℓ σℓ = constant for all i connected to k

[Figure: graph k → i → ℓ → j, with bounds Li and Ui on Σzℓj at node i.]

Real-time solution

- Algo:
- Initialize σℓ = 0
- Compute Σzℓj from (1)
- If constraints are unsatisfied, increase σℓ while satisfying (2) and (3)
- Repeat
- Extract wki from zℓj

[Figure: the same graph and KKT conditions as on the previous slide.]

Results

- Data:
- Historical traffic logs from April 2011
- 25K user nodes
- Total supply weight > 50B impressions

- 100K ads

- We compare our model to a scheme that
- picks articles to maximize expected CTR, and
- picks ads to display via a separate greedy method

Lift in impressions

[Figure: lift in impressions delivered to underperforming ads vs. the fraction of traffic that is not shaped; traffic shaping yields a nearly threefold improvement.]

Average CTR

[Figure: average CTR (as a percentage of the maximum CTR) vs. the fraction of traffic that is not shaped; the CTR drop is below 10%.]

Summary

- 3x underdelivery reduction with <10% CTR drop
- 2.6x reduction with 4% CTR drop
- Runtime application needs only a small cache

Theoretical Underpinnings

- Estimating CTR for Content Match [KDD ‘07]
- Traffic Shaping for Display Advertising [EC ‘12]
- Theoretical underpinnings [COLT ‘10 best student paper]

Link Prediction

- Which pair of nodes {i, j} should be connected?

[Figure: a small graph of users (Alice, Bob, Charlie); goal: recommend a movie.]

Previous Empirical Studies*

[Figure: link prediction accuracy for Random, Shortest Path, Common Neighbors, Adamic/Adar, and an ensemble of short paths; the neighborhood-based heuristics do well, especially if the graph is sparse.]

How do we justify these observations?

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
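Two of these heuristics are easy to state in code; the toy graph below is invented for illustration:

```python
import math

# An undirected graph given as adjacency sets.
graph = {
    "Alice":   {"Bob", "Charlie", "Dave"},
    "Bob":     {"Alice", "Charlie"},
    "Charlie": {"Alice", "Bob", "Dave"},
    "Dave":    {"Alice", "Charlie"},
}

def common_neighbors(i, j):
    return len(graph[i] & graph[j])

def adamic_adar(i, j):
    # Down-weights high-degree common neighbors by 1 / log(degree).
    return sum(1.0 / math.log(len(graph[z])) for z in graph[i] & graph[j])

print(common_neighbors("Bob", "Dave"))  # -> 2 (Alice and Charlie)
print(round(adamic_adar("Bob", "Dave"), 3))
```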

Link Prediction – Generative Model

Model:

- Nodes are uniformly distributed points in a latent space (a unit-volume universe)
- This space has a distance metric
- Points close to each other are likely to be connected in the graph
- Logistic distance function (Raftery+/2002)

Link Prediction – Generative Model

[Figure: the logistic link probability as a function of distance, falling from 1 through ½ near radius r; α determines the steepness, so closer pairs have a higher probability of linking.]

- Link prediction ≈ find the nearest neighbor who is not currently linked to the node
- Equivalent to inferring distances in the latent space
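The generative model can be simulated directly (α, r, the 2-D unit square, and the sample size are illustrative choices):

```python
import math
import random

# Points are uniform in the latent space; each pair links with a
# logistic function of its distance.
random.seed(1)
alpha, r = 10.0, 0.3
pts = [(random.random(), random.random()) for _ in range(200)]

def p_link(d):
    return 1.0 / (1.0 + math.exp(alpha * (d - r)))

linked, unlinked = [], []
for a in range(len(pts)):
    for b in range(a + 1, len(pts)):
        d = math.dist(pts[a], pts[b])
        (linked if random.random() < p_link(d) else unlinked).append(d)

# Linked pairs should be closer on average than unlinked pairs.
print(sum(linked) / len(linked) < sum(unlinked) / len(unlinked))  # -> True
```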

Common Neighbors

- Pr2(i, j) = Pr(common neighbor | dij)
- Product of two logistic probabilities, integrated over a volume determined by dij

[Figure: nodes i and j with a common neighbor in the overlap of their neighborhoods.]

Common Neighbors

- OPT = node closest to i
- MAX = node with the most common neighbors with i
- Theorem: w.h.p., dOPT ≤ dMAX ≤ dOPT + 2[ε/V(1)]^(1/D)
- Link prediction by common neighbors is asymptotically optimal

Common Neighbors: Distinct Radii

- Node k has radius rk
- i → k if dik ≤ rk (directed graph)
- rk captures the popularity of node k
- “Weighted” common neighbors: predict the (i, j) pairs with the highest Σr w(r) η(r), where η(r) is the number of common neighbors of radius r and w(r) is the weight given to nodes of radius r

[Figure: nodes i, j, m and a “Type 2” common neighbor k with radius rk.]

Adamic/Adar

[Table: two regimes, one where the presence of a common neighbor is very informative and one where its absence is; a 1/r weighting arises when r is close to the max radius, and real-world graphs generally fall in this range.]

ℓ-hop Paths

- Common neighbors = 2-hop paths
- For longer paths:
- Bounds are weaker
- For ℓ′ ≥ ℓ we need ηℓ′ ≫ ηℓ to obtain similar bounds
- This justifies the exponentially decaying weight given to longer paths by the Katz measure
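The Katz measure referred to here can be sketched as follows (the graph, β, and the maximum path length are illustrative; in closed form the score matrix is (I − βA)⁻¹ − I):

```python
# Katz score: sum of l-hop path counts weighted by beta**l.
# Powers of the adjacency matrix A count paths of each length.
beta, max_len = 0.1, 4
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz(A, beta, max_len):
    n = len(A)
    score = [[0.0] * n for _ in range(n)]
    P = A
    for l in range(1, max_len + 1):
        for i in range(n):
            for j in range(n):
                score[i][j] += (beta ** l) * P[i][j]
        P = matmul(P, A)
    return score

S = katz(A, beta, max_len)
print(round(S[0][3], 5))  # -> 0.0114 for the non-adjacent pair (0, 3)
```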

Summary

- Three key ingredients:
- Closer points are likelier to be linked (Small World Model: Watts & Strogatz 1998; Kleinberg 2001)
- The triangle inequality holds (necessary to extend to ℓ-hop paths)
- Points are spread uniformly at random (otherwise properties depend on location as well as distance)

Summary

- For large dense graphs, common neighbors are enough
- In sparse graphs, paths of length 3 or more help in prediction
- Differentiating between different degrees is important
- The number of paths matters, not the length

[Figure: link prediction accuracy for Random, Shortest Path, Common Neighbors, Adamic/Adar, and an ensemble of short paths.*]

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

Conclusions

- Discussed three problems
- Estimating CTR for Content Match
- Combat sparsity by hierarchical smoothing

- Traffic Shaping for Display Advertising
- Joint optimization of CTR and underdelivery-reduction
- Optimal traffic shaping at runtime using cached duals

- Theoretical underpinnings
- Latent space model
- Link prediction ≈ finding nearest neighbors in this space


Other Work

- Computational Advertising
- Combining IR with click feedback
- Multi-armed bandits using hierarchies
- Online learning under finite ad lifetimes

- Web Search
- Finding Quicklinks
- Titles for Quicklinks
- Incorporating tweets into search results
- Website clustering
- Webpage segmentation
- Template detection
- Finding hidden query aspects

- Graph Mining
- Epidemic thresholds
- Non-parametric prediction in dynamic graphs
- Graph sampling
- Graph generation models
- Community detection

Model

- Goal: Smoothing across siblings in the hierarchy
- Our approach:
- Each region has a latent state Sr
- yr is independent of the hierarchy given Sr
- Sr is drawn from the parent region Spa(r)

[Figure: latent states at levels i and i+1.]

Data Transformation

- Problem: N · Var(MLE) varies with the MLE CTR
- Solution: Freeman-Tukey transform
- Differentiates regions with 0 clicks
- Variance stabilization: N · Var(yr) is roughly constant across mean yr

[Figure: N·Var(MLE) plotted against the MLE CTR, and N·Var(yr) plotted against mean yr.]
