Estimating the ImpressionRank of Web Pages

Ziv Bar-Yossef (Google and Technion), Maxim Gurevich (Technion).


Presentation Transcript


  1. Estimating the ImpressionRank of Web Pages • Ziv Bar-Yossef (Google and Technion) • Maxim Gurevich (Technion)

  2. Impressions and ImpressionRank • Impression of page/site x on a keyword w: • A user sends w to a search engine • The search engine returns x as one of the results • The user sees the result x • ImpressionRank of x: • # of impressions of x • Within a certain time frame • Measure of page/site visibility in a search engine • Each result has an impression on the keyword “www 2009”: • www.2009.org • www2009.org/calls.html • www.loginconference.com • ...
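The definition on this slide can be sketched as a count over a query log. This is only an illustration of the definition, assuming a hypothetical in-memory log (`TOY_LOG`) with made-up URLs; real query logs are internal to the search engine, which is exactly the obstacle the talk addresses.

```python
from collections import Counter

# Hypothetical log: (keyword sent by a user, results that user saw).
TOY_LOG = [
    ("www 2009", ["conf.example", "conf.example/calls.html"]),
    ("www 2009", ["conf.example"]),
    ("web conference", ["conf.example", "other.example"]),
]


def impression_rank(log):
    """Return a Counter mapping each page to its number of impressions."""
    counts = Counter()
    for _keyword, seen in log:
        # One impression per page per query submission.
        counts.update(set(seen))
    return counts


ranks = impression_rank(TOY_LOG)
print(ranks["conf.example"])  # 3
```

Restricting the log to a time window before counting would give the "within a certain time frame" variant.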

  3. Popular Keyword Extraction • The Popular Keyword Extraction problem: • Input: web page x, int k • Output: k keywords on which x has the most impressions among all keywords • Example: x = www.johnmccain.com • sarah palin • john mccain • cindy mccain
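With access to a query log, the problem statement above reduces to a top-k count. The sketch below assumes a hypothetical toy log and made-up page names; it illustrates the input/output contract only, not the external algorithm presented later in the talk.

```python
from collections import Counter

# Hypothetical log: (keyword, results shown for that submission).
TOY_LOG = [
    ("john mccain", ["candidate.example"]),
    ("john mccain", ["candidate.example", "news.example"]),
    ("john mccain", ["candidate.example"]),
    ("sarah palin", ["candidate.example"]),
    ("sarah palin", ["candidate.example"]),
    ("weather", ["news.example"]),
]


def popular_keywords(log, x, k):
    """Return the k keywords on which page x has the most impressions."""
    per_keyword = Counter(kw for kw, seen in log if x in seen)
    return [kw for kw, _count in per_keyword.most_common(k)]


top = popular_keywords(TOY_LOG, "candidate.example", 2)
print(top)  # ['john mccain', 'sarah palin']
```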

  4. Motivation • Popularity rating of pages and sites • Site analytics • Enable site owners to determine their visibility in different search engines • Combine with traffic data to derive click-through rates • Compare to other sites • Keyword suggestions for online advertising • Social analysis • Search engine evaluation • Finding similar pages

  5. Internal Measurements of ImpressionRank and Popular Keyword Extraction • Search engines can compute both ImpressionRank and popular keywords based on their query logs • Query logs are not publicly released due to privacy concerns • Caveats: • Only search engines can do this • Non-transparent

  6. External Measurements of ImpressionRank and Popular Keyword Extraction • Main cost measure: # of requests to the search engine and to the suggestion server • [Diagram: Target page URL → ImpressionRank estimator / Popular keyword extractor → ImpressionRank / Popular Keywords]

  7. Our Contributions • Reduce ImpressionRank Estimation to Popular Keyword Extraction • First external algorithm for popular keyword extraction • Accurate • Uses relatively few search engine requests • Applies to: • Single web pages (www.cnn.com) • Web sites (www.cnn.com/*) • Domains (*.cnn.com/*)

  8. Related Work • Keyword extraction [Frank et al. 99, Turney 00, …] • Keyword suggestions (for online advertising) [Yih et al. 06, Fuxman et al. 08] • Query by Document [Yang et al. 09] • Commercial traffic reporting [Google Trends, comScore, Nielsen, Compete]

  9. Roadmap • The naïve popular keyword extraction algorithm • The improved popular keyword extraction algorithm • Best-First Search • Experimental results

  10. Popular Keyword Extraction: The Naïve Algorithm • Verification procedure for keyword w: • Submit w to the search engine and the suggestion server • Verify that w returns the target page • Verify that the popularity of w > 0 [BG08] • Recall problem: • Target page may have impressions on keywords that do not occur in its text • Efficiency problem: • 10^3 terms → 10^9 3-term candidates • [Diagram components: Target Page, Term Extractor, Term Pool (…, weather, mp3, tag, song, …), Candidate keyword generator, Candidate keyword TRIE (mp3, mp3 tag, …), Candidate Verifier, Search Engine, Suggestion Server, Popular Keywords]
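A minimal sketch of the naïve approach, under stated assumptions: `search` and `popularity` are hypothetical stand-ins for the search engine and the suggestion-server-based popularity estimate, backed here by toy dictionaries. Candidates are ordered term sequences of up to three terms, which is where the 10^3 → 10^9 blow-up comes from.

```python
from itertools import product

# Hypothetical stand-ins for the two external services.
TOY_INDEX = {
    "mp3": ["music.example"],
    "mp3 tag": ["music.example"],
    "weather": ["other.example"],
}
TOY_POPULARITY = {"mp3": 8, "mp3 tag": 5, "weather": 3}


def search(w):
    """Pages returned for keyword w (toy search engine)."""
    return TOY_INDEX.get(w, [])


def popularity(w):
    """Estimated popularity of keyword w (toy suggestion server)."""
    return TOY_POPULARITY.get(w, 0)


def naive_extract(terms, target, max_len=3):
    """Enumerate every candidate of up to max_len terms and verify each one."""
    found = []
    for n in range(1, max_len + 1):
        for combo in product(terms, repeat=n):
            w = " ".join(combo)
            # Verification: w returns the target page and has popularity > 0.
            if target in search(w) and popularity(w) > 0:
                found.append(w)
    return found


result = naive_extract(["mp3", "tag", "weather"], "music.example")
print(result)     # ['mp3', 'mp3 tag']
print(1000 ** 3)  # 1000000000 three-term candidates from a 1000-term pool
```

Every candidate costs at least one external request here, which is why exhaustive enumeration does not scale.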

  11. Popular Keyword Extraction: The Improved Algorithm • [Diagram components: Target Page, Similar Pages, Anchor Text, Term Extractor, Term Pool, Candidate keyword generator, Candidate keyword TRIE, Best-First Search, Candidate Verifier, Search Engine, Suggestion Server, Popular Keywords]

  12. Best-First Search • Goals: • Prune as many candidates as possible • Verify the most promising candidates first • Algorithm: • Start with single-term candidates • Score candidates • While the search engine request budget is not exceeded: • w = top-scoring candidate • Send w to the verifier • Decide whether to prune w • If w is not pruned, expand w: generate and score the children of w • [Diagram: Best-First Search and Candidate Verifier, connected to the Search Engine and Suggestion Server; candidate keyword TRIE with scored nodes (…, weather, mp3, tag, song, …; scores 3, 5, 8)]
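The loop on this slide can be sketched with a priority queue. Everything concrete below is an assumption made for illustration: the toy scores, the toy set of verified keywords, the prefix-based pruning stub, the three-term depth limit, and the accounting of one request per examined candidate.

```python
import heapq

# Hypothetical toy data standing in for the scorer, verifier, and pruner.
TOY_SCORES = {"mp3": 8, "weather": 5, "tag": 3, "song": 2,
              "mp3 tag": 6, "mp3 song": 4}
TOY_POPULAR = {"mp3", "mp3 tag", "mp3 song"}


def score(w):
    return TOY_SCORES.get(w, 1)


def verify(w):
    return w in TOY_POPULAR


def should_prune(w):
    # Prune w when no verified keyword equals w or extends it.
    return not any(p == w or p.startswith(w + " ") for p in TOY_POPULAR)


def best_first_search(terms, budget, max_terms=3):
    # heapq is a min-heap, so scores are negated to pop the best candidate first.
    heap = [(-score(t), t) for t in terms]
    heapq.heapify(heap)
    popular, requests = [], 0
    while heap and requests < budget:
        _neg_score, w = heapq.heappop(heap)
        requests += 1  # assumed cost: one request per examined candidate
        if verify(w):
            popular.append(w)
        if should_prune(w) or len(w.split()) >= max_terms:
            continue  # drop w's whole subtree by never generating it
        for t in terms:  # expand w: generate and score its children
            child = f"{w} {t}"
            heapq.heappush(heap, (-score(child), child))
    return popular


found = best_first_search(["mp3", "tag", "song", "weather"], budget=6)
print(found)  # ['mp3', 'mp3 tag', 'mp3 song']
```

Children are generated lazily, so a pruned candidate's descendants never enter the queue at all.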

  13. Pruning • Pruning decision for keyword w: • Submit query inurl:<target url> w • If no results, prune w and all its descendants • Retrieve suggestions for w • If no results, prune w and all its descendants • Pruning eliminates the vast majority of candidates • A single search/suggestion request may eliminate thousands of candidates
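The two-step pruning decision can be sketched as follows. `search` and `suggest` are hypothetical callables standing in for the search engine and the suggestion server; the toy dictionaries and URLs are made up. The helper `pruned_subtree_size` is an added illustration of why one request can eliminate so many candidates, assuming children are formed by appending one term from a fixed pool.

```python
# Hypothetical toy responses from the two external services.
TOY_RESULTS = {"inurl:music.example mp3": ["music.example/mp3"]}
TOY_SUGGESTIONS = {"mp3": ["mp3 song", "mp3 tag"], "weather": ["weather today"]}


def search(query):
    return TOY_RESULTS.get(query, [])


def suggest(w):
    return TOY_SUGGESTIONS.get(w, [])


def should_prune(w, target_url):
    """Return True if w and all its descendants can be discarded."""
    # Step 1: no page under the target URL matches w.
    if not search(f"inurl:{target_url} {w}"):
        return True
    # Step 2: the suggestion server returns nothing for w.
    if not suggest(w):
        return True
    return False


def pruned_subtree_size(num_terms, depth_of_w, max_depth):
    """Candidates removed by pruning a node: itself plus all its descendants."""
    return sum(num_terms ** d for d in range(max_depth - depth_of_w + 1))


print(should_prune("mp3", "music.example"))      # False
print(should_prune("weather", "music.example"))  # True
print(pruned_subtree_size(1000, 1, 3))           # 1001001
```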

  14. Scoring • The Best-First search algorithm considers only the top-scoring candidates given the budget • Want to predict • Whether the search engine returns the target page on w • Whether w is a popular keyword • score(w) = tf(w)^α · idf(w)^β · popularity_score(w)^γ • α, β, and γ: relative weights of the scoring components • tf(w) and idf(w): predict whether the search engine returns the target page on w • popularity_score(w): predicts the popularity of w
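A small sketch of a three-component score. The slide says only that the components carry relative weights; combining them as a product of powers, and the parameter names `alpha`, `beta`, `gamma`, are assumptions made here for illustration. The component values are made up.

```python
def score(tf, idf, popularity, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted product of the three components (assumed functional form)."""
    return (tf ** alpha) * (idf ** beta) * (popularity ** gamma)


# Hypothetical (tf, idf, popularity_score) values for three candidates.
candidates = {
    "mp3 tag": (3, 2.0, 5),
    "mp3 song": (2, 1.0, 8),
    "mp3 table": (1, 4.0, 1),
}

ranked = sorted(candidates, key=lambda w: score(*candidates[w]), reverse=True)
print(ranked)  # ['mp3 tag', 'mp3 song', 'mp3 table']
```

Raising `gamma` shifts the ranking toward candidates with a high popularity estimate, while raising `alpha` or `beta` favors candidates likely to return the target page.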

  15. How to Compute Candidate Scores • Every time the algorithm expands a keyword, it needs to compute scores for all its children • There could be thousands of such children • TF Score • Straightforward. No search requests needed. • IDF Score • Approximated based on an offline corpus. No search requests needed. • Popularity Score • [BarYossefGurevich 08]: Algorithm for estimating keyword popularity using the query suggestion service • Too costly: may use dozens of suggestion requests per estimate • We present a new algorithm that estimates popularity for all the children in bulk • Uses hundreds of suggestion requests to estimate the popularity of all the children • Estimates are less accurate

  16. Cheap Popularity Estimation • Input: a keyword w • Goal: Estimate popularity of all w’s children • Bucket children according to their first character • Estimate relative popularity of each bucket • Estimate the relative popularity within each bucket • Example: w = “mp3”, children: “mp3 song”, “mp3 tag”, “mp3 table”, … • [Diagram: trie rooted at “mp3_” with first-character buckets (a, …, s, t, …); the BG08 Popularity Estimator supplies an estimate of popularity_score(prefix) for the prefixes “mp3 s” and “mp3 t”; leaves mp3 song, mp3 tag, mp3 table; values shown: 5, 6, 4, 5, 2]
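The bucketing idea can be sketched as below. The slide does not say how relative popularity within a bucket is obtained, so `within_weight` is a hypothetical placeholder, as is `prefix_popularity`, which stands in for a per-prefix popularity estimator; all numbers are made up. The point of the sketch is that the expensive estimator is called once per bucket rather than once per child.

```python
from collections import defaultdict

# Hypothetical per-prefix popularity estimates and within-bucket weights.
TOY_PREFIX_POPULARITY = {"mp3 s": 5, "mp3 t": 6}
TOY_WITHIN_WEIGHT = {"song": 1, "tag": 2, "table": 1}


def prefix_popularity(prefix):
    return TOY_PREFIX_POPULARITY.get(prefix, 0)


def within_weight(term):
    return TOY_WITHIN_WEIGHT.get(term, 1)


def estimate_children(w, child_terms):
    """Estimate the popularity of every child 'w term' in bulk."""
    buckets = defaultdict(list)
    for term in child_terms:
        buckets[term[0]].append(term)  # bucket by first character
    estimates = {}
    for first_char, members in buckets.items():
        # One estimator call per bucket, shared by all its members.
        bucket_pop = prefix_popularity(f"{w} {first_char}")
        total = sum(within_weight(t) for t in members)
        for t in members:
            estimates[f"{w} {t}"] = bucket_pop * within_weight(t) / total
    return estimates


est = estimate_children("mp3", ["song", "tag", "table"])
print(est)  # {'mp3 song': 5.0, 'mp3 tag': 4.0, 'mp3 table': 2.0}
```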

  17. Popular Keyword Extraction Algorithm: Quality Analysis • Precision: 100% • All extracted keywords return the target page • Recall: do we miss some popular keywords? • More difficult to measure – no ground truth to compare to • Estimate lower bound on the recall • Google: recall > 90% • Yahoo!: recall = 70% - 80%

  18. Resource Usage • ~10,000 suggestion server requests per page • ~1,000 search engine requests per page • 85% (Google), 75% (Yahoo!) after 25% of resources spent

  19. ImpressionRank of News Sites (March 2009) • [Chart not preserved; keyword labels shown, in order: weather cnn bristol palin news weather cnn video obama stimulus package new york times barack obama amazon movies barack obama]

  20. ImpressionRank of Social Sites (March 2009)

  21. Conclusions • First external algorithms for • ImpressionRank estimation • Popular keyword extraction • Future work • Improve efficiency • Improve recall

  22. Thank You
