
Entropy of Search Logs - How Big is the Web? - How Hard is Search? - With Personalization? With Backoff?

Qiaozhu Mei†, Kenneth Church‡. †University of Illinois at Urbana-Champaign, ‡Microsoft Research.




Presentation Transcript


  1. Entropy of Search Logs - How Big is the Web? - How Hard is Search? - With Personalization? With Backoff? Qiaozhu Mei†, Kenneth Church‡. †University of Illinois at Urbana-Champaign, ‡Microsoft Research

  2. Small: How Big is the Web? 5B? 20B? More? Less? • What if a small cache of millions of pages could capture much of the value of billions? • Could a big bet on a cluster in the clouds turn into a big liability? • Examples of Big Bets • Computer Centers & Clusters • Capital (Hardware) • Expense (Power) • Dev (MapReduce, GFS, Big Table, etc.) • Sales & Marketing >> Production & Distribution

  3. Millions (Not Billions)

  4. Population Bound • With all the talk about the Long Tail, you’d think that the Web was astronomical • Carl Sagan: Billions and Billions… • Lower Distribution $$ → Sell Less of More • But there are limits to this process • NetFlix: 55k movies (not even millions) • Amazon: 8M products • Vanity Searches: Infinite??? • Personal Home Pages << Phone Book < Population • Business Home Pages << Yellow Pages < Population • Millions, not Billions (until market saturates)

  5. It Will Take Decades to Reach Population Bound • Most people (and products) don’t have a web page (yet) • Currently, I can find famous people (and academics), but not my neighbors • There aren’t that many famous people (and academics)… • Millions, not billions (for the foreseeable future)

  6. If there is a page on the web, and no one sees it, did it make a sound? • How big is the web? Should we count “silent” pages that don’t make a sound? • How many products are there? Do we count “silent” flops that no one buys? • Equilibrium: Supply = Demand

  7. Demand Side Accounting • Consumers have limited time • Telephone Usage: 1 hour per line per day • TV: 4 hours per day • Web: ??? hours per day • Suppliers will post as many pages as consumers can consume (and no more) • Size of Web: O(Consumers)

  8. How Big is the Web? • Related questions come up in language: How big is English? • Dictionary Marketing • Education (Testing of Vocabulary Size) • Psychology • Statistics • Linguistics • Two Very Different Answers • Chomsky: language is infinite • Shannon: 1.25 bits per character • How many words do people know? What is a word? Person? Know?

  9. Chomskian Argument: Web is Infinite • One could write a malicious spider trap: • http://successor.aspx?x=0, http://successor.aspx?x=1, http://successor.aspx?x=2, … • Not just an academic exercise • The Web is full of benign examples like • http://calendar.duke.edu/ • Infinitely many months • Each month has a link to the next
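The spider trap above can be sketched in a few lines. This is an illustrative reconstruction, not code from the talk: the page for x simply links to the page for x+1, so a naive crawler that follows every link never terminates.

```python
def successor_page(x: int) -> str:
    # Hypothetical handler for a URL like http://successor.aspx?x=N:
    # each page contains a single link to the page for N+1, so a crawler
    # that follows every link it sees crawls "pages" forever.
    return f'<html><a href="/successor.aspx?x={x + 1}">next</a></html>'

print(successor_page(0))  # links to x=1, which links to x=2, and so on
```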

  10. How Big is the Web? 5B? 20B? More? Less? • More (Chomsky): http://successor?x=0 • Less (Shannon): MSN Search Log, 1 month → x18 • More Practical Answer: Comp Ctr ($$$$) → Walk in the Park ($) • Millions (not Billions) • Cluster in Cloud → Desktop → Flash

  11. Entropy (H) • Difficulty of encoding information (a distribution) • Size of search space; difficulty of a task • H = 20 → 1 million items distributed uniformly • Powerful tool for sizing challenges and opportunities • How hard is search? • How much does personalization help?
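The rule of thumb "H = 20 bits ↔ one million equally likely items" follows directly from the definition of Shannon entropy. A small sketch (not from the talk) makes the correspondence concrete:

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# A uniform distribution over 2**20 (about a million) items has H = 20 bits,
# so "H = 20" means the task is as hard as picking one item out of a million.
print(entropy_bits([1] * 2 ** 20))  # -> 20.0
```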

  12. How Hard Is Search? • Traditional Search: H(URL | Query) = 2.8 (= 23.9 – 21.1) • Personalized Search: H(URL | Query, IP) = 1.2 (= 27.2 – 26.0) • Personalization cuts H in half!
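The conditional entropies above come from the chain rule H(URL | Query) = H(Query, URL) − H(Query), which is the arithmetic shown on the slide (2.8 = 23.9 − 21.1). A toy version of the computation, with illustrative data standing in for the MSN log:

```python
import math
from collections import Counter

def H(events):
    """Entropy in bits of the empirical distribution of a sequence of events."""
    counts = Counter(events)
    n = len(events)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy (query, url) click log; not real search data.
log = [("msg", "msg.com"), ("msg", "thegarden.com"),
       ("ebay", "ebay.com"), ("ebay", "ebay.com")]

# Chain rule: H(URL | Query) = H(Query, URL) - H(Query)
h_cond = H(log) - H([q for q, _ in log])
print(h_cond)  # half a bit of uncertainty remains once the query is known
```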

  13. Difficulty of Queries • Easy queries (low H(URL|Q)): • google, yahoo, myspace, ebay, … • Hard queries (high H(URL|Q)): • dictionary, yellow pages, movies, “what is may day?”

  14. How Hard are Query Suggestions? The Wild Thing? C* Rice → Condoleezza Rice • Traditional Suggestions: H(Query) = 21 bits • Personalized: H(Query | IP) = 5 bits (= 26 – 21) • Personalization cuts H in half! Twice

  15. Personalization with Backoff • Ambiguous query: MSG • Madison Square Garden • Monosodium Glutamate • Disambiguate based on user’s prior clicks • When we don’t have data • Backoff to classes of users • Proof of Concept: • Classes defined by IP addresses • Better: • Market Segmentation (Demographics) • Collaborative Filtering (Other users who click like me)

  16. Backoff • Proof of concept: bytes of the IP address define classes of users • If we only know some of the IP address, does it help? • Cuts H in half even using just the first two bytes of the IP • Some of the IP is better than none

  17. Backing Off by IP (Sparse Data vs. Missed Opportunity) • Personalization with Backoff • λs estimated with EM and CV • A little bit of personalization is better than too much or too little • λ4: weights for first 4 bytes of IP; λ3: weights for first 3 bytes of IP; λ2: weights for first 2 bytes of IP; …

  18. Personalization with Backoff Market Segmentation • Traditional Goal of Marketing: • Segment Customers (e.g., Business v. Consumer) • By Need & Value Proposition • Need: Segments ask different questions at different times • Value: Different advertising opportunities • Segmentation Variables • Queries, URL Clicks, IP Addresses • Geography & Demographics (Age, Gender, Income) • Time of day & Day of Week

  19. Business Queries on Business Days; Consumer Queries (Weekends & Every Day)

  20. Business Days v. Weekends: More Clicks and Easier Queries

  21. Day vs. Night: More Queries, More Diversified Queries • Day: more clicks and more diversified queries • Night: fewer clicks, more uniform queries

  22. Harder Queries at TV Time • Weekends are harder

  23. Conclusions: Millions (not Billions) • How Big is the Web? • Upper bound: O(Population) • Not Billions • Not Infinite • Shannon >> Chomsky • How hard is search? • Query Suggestions? • Personalization? • Cluster in Cloud ($$$$) → Walk-in-the-Park ($) • Entropy is a great hammer

  24. Conclusions: Personalization with Backoff • Personalization with Backoff cuts the search space (entropy) in half • Backoff → Market Segmentation • Example: Business v. Consumer • Need: Segments ask different questions at different times • Value: Different advertising opportunities • Demographics: partition by IP, day, hour, business/consumer query… • Future Work: • Model combinations of surrogate variables • Group users by similarity → collaborative search

  25. Thanks!

  26. Coverage Prediction Task: Historical Logs → Coverage of Future Demand • Training: Estimate Pr(x) • Given what we know today, estimate Pr(x), tomorrow’s demand for URL x • (There are infinitely many URLs x.) • Test: Score Pr(x) by Cross Entropy • Given tomorrow’s demand, x1…xk • Score ≡ −log2 of the geometric mean of Pr(x1) … Pr(xk) • One forecast is better than another if it has lower cross entropy • Cross entropy ≥ Entropy (H) • Entropy is the score for the best possible forecast (that only God knows)
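The scoring rule on this slide (−log2 of the geometric mean of the predicted probabilities) is the same thing as the average number of bits the model spends per test URL. A minimal sketch, with hypothetical forecast models:

```python
import math

def cross_entropy_score(prob, test_items):
    """-log2 of the geometric mean of prob(x) over the test items,
    i.e. the average bits per item; a lower score is a better forecast."""
    return -sum(math.log2(prob(x)) for x in test_items) / len(test_items)

# Hypothetical forecast: uniform over 8 candidate URLs -> 3 bits per URL.
uniform8 = lambda x: 1 / 8
print(cross_entropy_score(uniform8, ["u1", "u2", "u3"]))  # -> 3.0

# A model that concentrates mass on what actually happens scores lower (better).
sharp = lambda x: 1 / 2
print(cross_entropy_score(sharp, ["u1", "u2", "u3"]))  # -> 1.0
```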

  27. Millions, Not Billions(Until Market Saturates) • Telephones are a Mature Market • Saturated • Universal Service is limited by population • Loops (telephone numbers) ≈ population • Everyone (and every business) is listed in phonebook • (unless they have opted out) • Web is Growth Market • Decades from saturation • When everybody and every product has a page • The number of pages will be bounded by the population • In the meantime, millions are good enough

  28. Smoothing (Cont.) • Use interpolation smoothing: Pr(URL | Query, IP) = λ4 Pr(URL | Query, IP4) + λ3 Pr(URL | Query, IP3) + λ2 Pr(URL | Query, IP2) + λ1 Pr(URL | Query, IP1) + λ0 Pr(URL | Query), where IPi is the first i bytes of an IP address, e.g., IP4 = 156.111.188.243; IP2 = 156.111.*.* • Use one month’s search log (Jan 06) as training data and the next month’s log (Feb 06) as the test set • λi determined by the EM algorithm, minimizing the cross conditional entropy on held-out data.
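The interpolation can be sketched as a mixture over IP prefixes. Everything below is illustrative: the component models and the λ weights are hand-picked toy values (the talk estimates the λs with EM on held-out data), and the distributions are made up.

```python
def backoff_prob(url, query, ip, models, lambdas):
    """Personalization with backoff as an interpolated mixture:
    Pr(url | query, ip) = sum_i lambda_i * Pr(url | query, first i bytes of ip).
    models[i] maps (query, i-byte IP prefix) -> {url: prob};
    models[0] is the unpersonalized Pr(url | query)."""
    total = 0.0
    for i, lam in enumerate(lambdas):  # i = 0 (no IP) .. 4 (full IP)
        prefix = ".".join(ip.split(".")[:i])
        dist = models[i].get((query, prefix), {})  # unseen prefix -> no mass
        total += lam * dist.get(url, 0.0)
    return total

# Toy component models for the ambiguous query "msg" (illustrative values).
models = [
    {("msg", ""): {"msg.com": 0.5, "thegarden.com": 0.5}},     # Pr(url | query)
    {("msg", "156"): {"thegarden.com": 0.9, "msg.com": 0.1}},  # first byte
    {("msg", "156.111"): {"thegarden.com": 1.0}},              # first two bytes
    {},                                                        # 3 bytes: unseen
    {},                                                        # 4 bytes: unseen
]
lambdas = [0.2, 0.3, 0.5, 0.0, 0.0]  # hand-picked, not EM-estimated
p = backoff_prob("thegarden.com", "msg", "156.111.188.243", models, lambdas)
print(p)  # ≈ 0.87 = 0.2*0.5 + 0.3*0.9 + 0.5*1.0
```

Because unseen prefixes simply contribute no mass, a user whose full IP was never observed still gets a personalized estimate from the coarser prefix classes, which is the point of backing off.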

  29. Cross Validation • An IP address in the future might not have been seen in the history • But parts of it are seen in the history • Cross Entropy: H(future | history) • Compared settings: no personalization, complete personalization (knows every byte of the IP), and personalization with backoff (knows at least the first two bytes)

  30. Cross Validation (Cont.) • Weekends are harder to predict • Weekdays predicting weekdays >> weekends predicting weekdays • Daytime predicting daytime >> nights predicting daytime

  31. Partition by Day-Week (CV) • Training data: weekdays and weekends; test data: weekdays and weekends • Weekends are more difficult to predict • Weekdays are more easily predicted by a history of weekdays
