
Contextual Advertising by Combining Relevance with Click Feedback


Presentation Transcript


  1. Contextual Advertising by Combining Relevance with Click Feedback Deepak Agarwal Joint work with Deepayan Chakrabarti & Vanja Josifovski Yahoo! Research WWW’08, Beijing, China 24th April, 2008

  2. Outline • Motivating Application, Challenges • Contextual Advertising • Semantic versus Predictive models • Pros, Cons • Our Approach: Blend Semantic with Predictive • Model Description • Logistic Regression, Feature Selection • Model structure amenable to fast scoring at run time • Experimental Results • Ongoing work

  3. Outline 1 Motivating Application, Background and Challenges

  4. Motivating Application • Problem: Match ads to queries • Sponsored Search: • The query is a short piece of text input by the user • User intent better expressed; less noisy • Contextual Advertising: • The query is a webpage • Generally long, noisy, user intent less clear • Harder matching problem

  5. Challenges • Serve ads to maximize revenue (CTR) • Serve the most relevant ads in a given context • User feedback in the form of clicks in different contexts • Automation is a must for profitability • Billions of opportunities; millions of ads • High volume, low marginal cost → lucrative business • Automation through Algorithms/Models • Accuracy: massive data; scalable procedures • Structure of models: scoring ads under strict latency requirements (~few ms)

  6. Classical Approach: Semantic • Serve shoe ads on shoe pages • Models: Information Retrieval • Get relevant docs (ads) for a query (webpage) • Simple vector space model • q = (t1, w1; …; tn, wn); a = (t1, v1; …; tm, vm) • Cos(q, a) = Σ_{s ∈ q ∩ a} w_s v_s / (|q| |a|) • w’s, v’s: tf-idf weights • Frequency: reward in doc; penalize in corpus • Higher score → more relevance
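To make the vector-space scoring above concrete, here is a minimal Python sketch of tf-idf weighting and the cosine score. The function names (tfidf_vector, cosine) and the toy idf table are illustrative, not from the paper.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Build a tf-idf weighted term vector from a token list."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(q, a):
    """Cos(q, a) = sum over shared terms of w_s * v_s / (|q| |a|)."""
    shared = set(q) & set(a)
    dot = sum(q[s] * a[s] for s in shared)
    nq = math.sqrt(sum(w * w for w in q.values()))
    na = math.sqrt(sum(v * v for v in a.values()))
    return dot / (nq * na) if nq and na else 0.0

# Toy example; in practice idf comes from the page/ad corpus.
idf = {"running": 1.2, "shoes": 0.9, "buy": 0.4}
page = tfidf_vector(["running", "shoes", "shoes"], idf)
ad = tfidf_vector(["buy", "running", "shoes"], idf)
print(cosine(page, ad))
```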

  7. Semantic: Pros & Cons • Pros: • Training: simple, scalable • Vocabulary (stop-words; stemming); corpus • Serving with low latency: evaluates millions of candidate ads in a few ms • Clever algorithms (Broder et al.) • Cons: • Does not always capture context • Clicks? Better? Active user feedback; can we use it?

  8. Predictive Approach: Clicks • New challenging research area • Learn from historic clicks on ads • Indicator of overall relevance • Rank ads by CTR = P(Click|Ad,context) • Estimating CTR difficult statistical problem • High-dim, sparseness (too many combinations) • (Page,Ad)→(Page Features, Ad Features) • Bias-Variance Tradeoff when selecting features • Coarse is stable but less precise; fine has high variance

  9. Statistical Challenges (contd.) • Retrospective data is biased • If I never showed ads with the word “Rolex” on pages with the word “Golf”, how will I learn this match? • What is irrelevant? Labeling negatives. • “I never click on ads no matter what” • Good models may be complex • Scalability while training (grid computing helps) • Serving: not all models are index friendly • Quick evaluation during serve time improves the system

  10. When Semantic meets Predictive • Semantic provides domain knowledge • Feature selection driven by semantic knowledge • Predictive “enhances” semantic • “Correction” terms added to the semantic score to match click feedback • Fall back on semantic when the click signal is weak • Model scalable (grid computing) • Fast to evaluate at run time • Faster → more candidates evaluated at serve time • Accuracy versus coverage

  11. Outline 2 Modeling Approach

  12. Predictive Regression model • Region-specific splitting for page and ad • Page “regions”: title, headers, boldface text, metadata, etc. • Ad “regions”: title, body, etc. • Features: words, phrases, classes in different regions • Word matches in the title are more important than in the body • Illustration: word features; title regions • Extension to multiple regions and multiple feature types is routine • Experiments to appear in a future version

  13. Logistic Regression: Word features • Model clicks/non-clicks: Logistic Regression • Training & test data: events with clicks only • y_ij ~ Ber(p_ij), where p_ij is the CTR of ad j on page i • logit(p_ij) = main effect for page (overall popularity) + main effect for ad (overall popularity) + interaction effect (words shared by page and ad) • Gaussian priors on model parameters: penalizes sparse features

  14. Feature weights “correct” relevance • M_{p,w} = tf_{p,w} · 1(w ∈ p) • M_{a,w} = tf_{a,w} · 1(w ∈ a) • I_{p,a,w} = tf_{p,w} · tf_{a,w} · 1(w ∈ p) · 1(w ∈ a) • So, IR-based term frequency measures are taken into account
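A small sketch of how the word features of slide 14 feed the logistic model of slide 13. The helper names (word_features, click_probability) and the coefficient dictionaries (omega, alpha, beta, gamma) are hypothetical stand-ins for the learned parameters.

```python
import math
from collections import Counter

def word_features(page_tokens, ad_tokens, vocab):
    """Per-word features: M_p,w and M_a,w (main effects) and I_p,a,w (interaction)."""
    tf_p, tf_a = Counter(page_tokens), Counter(ad_tokens)
    M_p = {w: tf_p[w] for w in vocab if w in tf_p}
    M_a = {w: tf_a[w] for w in vocab if w in tf_a}
    I = {w: tf_p[w] * tf_a[w] for w in vocab if w in tf_p and w in tf_a}
    return M_p, M_a, I

def click_probability(M_p, M_a, I, omega, alpha, beta, gamma):
    """logit(p) = omega + sum_w alpha_w*M_p,w + sum_w beta_w*M_a,w + sum_w gamma_w*I_w."""
    z = omega
    z += sum(alpha.get(w, 0.0) * x for w, x in M_p.items())  # page main effects
    z += sum(beta.get(w, 0.0) * x for w, x in M_a.items())   # ad main effects
    z += sum(gamma.get(w, 0.0) * x for w, x in I.items())    # interaction effects
    return 1.0 / (1.0 + math.exp(-z))
```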

  15. How to select words? • Word selection • Overall, nearly 110k words in our training data • Stop word removal, stemming • Learning parameters for each word would be: • Expensive, overfits • We use simple feature selection strategies • Select top-k

  16. Word Selection: data based • Define an interaction measure for each word • Higher values for words which have higher-than-expected CTR when they occur on both page and ad • Remove words served or clicked few times for robustness
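One plausible reading of this data-based measure, as a sketch: compare observed clicks to expected clicks over impressions where a word appears on both the page and the ad. The exact formula, the function name interaction_measure, and the robustness thresholds (min_views, min_clicks) are illustrative assumptions.

```python
from collections import Counter

def interaction_measure(events, min_views=100, min_clicks=5):
    """Score each word by observed vs. expected clicks on impressions where
    the word occurs on both the page and the ad (hypothetical formulation)."""
    overall_views, overall_clicks = 0, 0
    views, clicks = Counter(), Counter()
    for page_words, ad_words, clicked in events:  # clicked is 0 or 1
        overall_views += 1
        overall_clicks += clicked
        for w in set(page_words) & set(ad_words):
            views[w] += 1
            clicks[w] += clicked
    base_ctr = overall_clicks / overall_views
    scores = {}
    for w in views:
        if views[w] >= min_views and clicks[w] >= min_clicks:  # robustness filter
            scores[w] = clicks[w] / (base_ctr * views[w])      # observed / expected
    return scores
```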

  17. Word selection (contd.) • Word selection: relevance based • Average tf-idf score of each word over pages and over ads • Higher values imply higher relevance • Variant 1: rank by the geometric mean of the page and ad tf-idf scores • Variant 2: rank by tf-idf on pages and on ads separately; take the union
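A minimal sketch of the geometric-mean variant, assuming per-word average tf-idf scores over pages and over ads are already computed; the name relevance_ranking and the default k=1000 are illustrative.

```python
import math

def relevance_ranking(page_tfidf_avg, ad_tfidf_avg, k=1000):
    """Rank words by the geometric mean of their average tf-idf over pages and
    over ads, and keep the top k. (The union variant would instead rank each
    list separately and take the union of the two top-k sets.)"""
    common = set(page_tfidf_avg) & set(ad_tfidf_avg)
    scored = {w: math.sqrt(page_tfidf_avg[w] * ad_tfidf_avg[w]) for w in common}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```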

  18. Best Word Selection scheme • Word selection: two methods • Data based • Relevance based • We picked the top 1000 words by each measure • Data-based methods give better results • [Precision–recall plot comparing the two selection methods]

  19. Semantic similarity score • Word features have low coverage; fallback mechanism to semantic similarity • Map cosine onto the logit scale? • Create score bins (100 points per bin) • Mean score vs. logit(CTR) • Quadratic relationship • [Plot: logit(p_ij) vs. cosine score]
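A sketch of the binning-and-fitting step, assuming numpy arrays of per-(page, ad) cosine scores, click counts, and view counts. The smoothing constants and the name fit_cosine_to_logit are assumptions, not the paper's implementation.

```python
import numpy as np

def fit_cosine_to_logit(cos_scores, clicks, views, points_per_bin=100):
    """Bin (page, ad) pairs by cosine score, compute the empirical logit(CTR)
    per bin, and fit a quadratic: logit(CTR) ~ a + b*cos + c*cos^2."""
    order = np.argsort(cos_scores)
    xs, ys = [], []
    for start in range(0, len(order), points_per_bin):
        idx = order[start:start + points_per_bin]
        ctr = (clicks[idx].sum() + 0.5) / (views[idx].sum() + 1.0)  # smoothed CTR
        xs.append(cos_scores[idx].mean())
        ys.append(np.log(ctr / (1.0 - ctr)))
    c, b, a = np.polyfit(xs, ys, deg=2)  # highest-degree coefficient first
    return a, b, c
```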

  20. Incorporating similarity • Quadratic relationship used in two ways • Put in cosine and cosine² as features • Add as an offset: prior log-odds • Similar results

  21. Scalable Training • Fast Implementation • Training: Hadoop implementation of Logistic Regression • [Diagram: data is partitioned into random splits; each split is fit by iterative Newton–Raphson, yielding mean and variance estimates; the per-split estimates are combined into the learned model parameters]
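The diagram only says that per-split estimates are combined; inverse-variance weighting is one plausible way to do that, sketched here as an assumption (combine_split_estimates is a hypothetical helper, not the paper's code).

```python
import numpy as np

def combine_split_estimates(means, variances):
    """means, variances: one array of per-parameter estimates (and variances)
    per random data split. Merge with inverse-variance (precision) weighting."""
    means = np.asarray(means)                  # shape: (num_splits, num_params)
    precisions = 1.0 / np.asarray(variances)
    combined_var = 1.0 / precisions.sum(axis=0)
    combined_mean = combined_var * (precisions * means).sum(axis=0)
    return combined_mean, combined_var
```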

  22. Outline 3 Fast Evaluation at Serve Time

  23. Efficient Score Evaluation • Problem: for a page visit, select the top-n ads using the scoring formula • Why hard: only a few ms; too many ads to evaluate • Rich literature in IR to solve this problem • Efficient solutions for vector space models through “posting lists” • <term, sorted list of doc IDs containing the term> • Interaction terms in the regression model are motivated by this • Document-at-a-time (DAAT) strategy • Posting lists: sorted doc IDs for each query term • Evaluates each doc containing at least one query term, one at a time • Stops prematurely if it is clear the doc can’t make it into the top n • System is sparse, few correlations; efficiency through approximations

  24. Efficient evaluation through a two-stage procedure (Broder et al.) • [Diagram: a heap holds the current top-n ads with threshold θ = min-score; each query term t has weight x_t and an upper bound U_t on its contribution; a candidate doc is fully evaluated only if the optimistic bound x1·U1 + x2·U2 + x3·U3 + x4·U4 > θ] • The WAND iterator traverses posting lists very efficiently by skipping unnecessary docs • Efficiency depends on the upper bounds for the terms
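A minimal sketch of the pruning idea (not the full WAND iterator over posting lists): a min-heap keeps the running top-n, and a candidate is fully scored only if a cheap upper bound on its score clears the heap's minimum θ. The names top_n_ads, score_fn, and bound_fn are hypothetical.

```python
import heapq

def top_n_ads(candidates, score_fn, bound_fn, n):
    """Keep the top-n ads; skip any ad whose cheap upper bound cannot beat
    the current heap minimum (theta), mirroring the WAND pruning idea."""
    heap = []                                   # min-heap of (score, idx, ad)
    for i, ad in enumerate(candidates):
        theta = heap[0][0] if len(heap) == n else float("-inf")
        if bound_fn(ad) <= theta:
            continue                            # skip: cannot enter the top n
        s = score_fn(ad)                        # full (expensive) evaluation
        if len(heap) < n:
            heapq.heappush(heap, (s, i, ad))
        elif s > theta:
            heapq.heapreplace(heap, (s, i, ad))
    return [ad for _, _, ad in sorted(heap, reverse=True)]
```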

  25. Efficiency of procedure • Efficiency comes from document skipping • Must be able to compute upper bounds quickly • The match scoring formula should not use arbitrary features • (“word X in query AND word Y in ad”) • Such pairwise (“cross-product”) checks may get costly • Large posting lists; too many evaluations • Upper bounds are crucial to performance • Too large → false positives; too small → false negatives • We use the upper bounds recommended in the literature • More efficient implementation is a subject of future research

  26. System Architecture: scoring at serve time • Fast Implementation • Testing • The main effect for ads is used in the ordering of ads in the postings list (static) • The interaction effect is used to modify the idf-table of words (static) • The main effect for pages plays no role in ad serving (the page is given) • [Diagram: building postings lists]
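A hypothetical sketch of index construction along those lines: posting lists ordered statically by the ad main effect, and per-word weights that fold the interaction coefficient into the idf table. How the coefficient is folded in (here, additively), as well as build_posting_lists and its arguments, are assumptions.

```python
def build_posting_lists(ads, beta, gamma, idf):
    """ads: ad_id -> words. Build per-word posting lists ordered by the ad main
    effect (beta), plus per-word weights that adjust the idf table with the
    interaction coefficient (gamma)."""
    postings, weights = {}, {}
    for ad_id, words in ads.items():
        for w in set(words):
            postings.setdefault(w, []).append(ad_id)
    for w, ad_ids in postings.items():
        ad_ids.sort(key=lambda a: beta.get(a, 0.0), reverse=True)  # static ordering
        weights[w] = idf.get(w, 0.0) + gamma.get(w, 0.0)           # adjusted idf (assumed additive)
    return postings, weights
```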

  27. Outline 4 Experiments and Results, Summary and Ongoing Work

  28. Experiments • [Precision–recall curve] • 25% lift in precision at 10% recall

  29. Experiments • [Precision–recall curve, low-recall region] • 25% lift in precision at 10% recall • Computed precision–recall for several splits • Results statistically significant

  30. Experiments • Increasing the number of words from 1000 to 3400 led to only marginal improvement • Diminishing returns • System already performs close to its limit, without needing more training • Changing the training time period changes the word list; we update our posting lists periodically

  31. Summary • Matching ads to pages is a challenging problem • We provide an approach that blends semantic similarity and predictive models in a scalable fashion • Our approach is index friendly • Experimental results on a large-scale system show significant improvement • We can only improve on the relevance models

  32. Ongoing Work • Change in training data changes word set • Working on more robust word feature selection • Clustering words • Efficient indexing strategies through better upper bound estimates for WAND • Expanding feature sets to include neighborhoods of words in posting lists • Balance between accuracy and WAND efficiency • Isotonic regression on cosine similarity
