
Entropy of Search Logs-How Big is the Web?- How Hard is Search? - With Personalization? With Backoff?

Qiaozhu Mei†, Kenneth Church‡

† University of Illinois at Urbana-Champaign

‡ Microsoft Research



How Big is the Web? 5B? 20B? More? Less?
  • What if a small cache of millions of pages
    • Could capture much of the value of billions?
  • Could a Big bet on a cluster in the clouds
    • Turn into a big liability?
  • Examples of Big Bets
    • Computer Centers & Clusters
      • Capital (Hardware)
      • Expense (Power)
      • Dev (Mapreduce, GFS, Big Table, etc.)
    • Sales & Marketing >> Production & Distribution
Population Bound
  • With all the talk about the Long Tail
    • You’d think that the Web was astronomical
    • Carl Sagan: Billions and Billions…
  • Lower Distribution $$ → Sell Less of More
  • But there are limits to this process
    • NetFlix: 55k movies (not even millions)
    • Amazon: 8M products
    • Vanity Searches: Infinite???
      • Personal Home Pages << Phone Book < Population
      • Business Home Pages << Yellow Pages < Population
  • Millions, not Billions (until market saturates)
It Will Take Decades to Reach Population Bound
  • Most people (and products) don’t have a web page (yet)
  • Currently, I can find famous people
      • (and academics)
      • but not my neighbors
    • There aren’t that many famous people
      • (and academics)…
    • Millions, not billions
      • (for the foreseeable future)
Equilibrium: Supply = Demand

If there is a page on the web, and no one sees it, did it make a sound?

How big is the web? Should we count “silent” pages that don’t make a sound?

How many products are there? Do we count “silent” flops that no one buys?
Demand Side Accounting
  • Consumers have limited time
    • Telephone Usage: 1 hour per line per day
    • TV: 4 hours per day
    • Web: ??? hours per day
  • Suppliers will post as many pages as consumers can consume (and no more)
  • Size of Web: O(Consumers)
How Big is the Web?
  • Related questions come up in language
  • How big is English?
    • Dictionary Marketing
    • Education (Testing of Vocabulary Size)
    • Psychology
    • Statistics
    • Linguistics
  • Two Very Different Answers
    • Chomsky: language is infinite
    • Shannon: 1.25 bits per character

How many words do people know?

What is a word? Person? Know?

Chomskian Argument: Web is Infinite
  • One could write a malicious spider trap
    • http://successor.aspx?x=0
    • http://successor.aspx?x=1
    • http://successor.aspx?x=2
  • Not just an academic exercise
  • The Web is full of benign examples, like
    • http://calendar.duke.edu/
    • Infinitely many months
    • Each month has a link to the next
How Big is the Web? 5B? 20B? More? Less?
  • More (Chomsky)
    • http://successor?x=0
  • Less (Shannon)
    • MSN Search Log: 1 month → x18
  • More Practical Answer
    • Comp Ctr ($$$$) → Walk in the Park ($)
    • Millions (not Billions)
    • Cluster in Cloud → Desktop → Flash

Entropy (H)
  • Difficulty of encoding information (a distribution)
    • Size of search space; difficulty of a task
  • H = 20 → 1 million items distributed uniformly (2^20 ≈ 1 million)
  • Powerful tool for sizing challenges and opportunities
    • How hard is search?
    • How much does personalization help?
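The slide’s rule of thumb (H = 20 bits ↔ about a million equally likely items) can be checked in a few lines. This is a generic Shannon-entropy sketch, not code from the talk:

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# A uniform distribution over 2**20 items has exactly 20 bits of entropy.
uniform = [1] * (2 ** 20)
print(entropy(uniform))  # 20.0

# Skewed distributions have lower entropy: the "search space" is
# effectively smaller than the number of distinct items.
print(entropy([97, 1, 1, 1]))  # well under 2 bits
```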
How Hard Is Search?
  • Traditional Search
    • H(URL | Query)
    • 2.8 (= 23.9 – 21.1)
  • Personalized Search
    • H(URL | Query, IP)
    • 1.2 (= 27.2 – 26.0)

Personalization cuts H in Half!
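The conditional entropies above follow from the chain rule, H(URL | Query) = H(URL, Query) − H(Query). A sketch on a hypothetical four-click log (the 23.9/21.1/27.2/26.0 figures come from a month of MSN search logs, not from this toy):

```python
import math
from collections import Counter

def H(events):
    """Entropy (bits) of a list of hashable events."""
    n = len(events)
    return -sum((c / n) * math.log2(c / n) for c in Counter(events).values())

# Invented click log of (query, ip, url) triples, for illustration only.
log = [
    ("msg", "156.111.1.2", "msg.com"),
    ("msg", "156.111.1.2", "msg.com"),
    ("msg", "204.9.8.7", "thegarden.com"),
    ("maps", "156.111.1.2", "maps.com"),
]

# Traditional search: H(URL | Query) = H(URL, Query) - H(Query)
h_url_given_q = H([(q, u) for q, _, u in log]) - H([q for q, _, _ in log])
# Personalized search: H(URL | Query, IP) = H(URL, Query, IP) - H(Query, IP)
h_url_given_q_ip = H(log) - H([(q, ip) for q, ip, _ in log])
print(h_url_given_q, h_url_given_q_ip)
```

In this toy log, knowing the IP resolves the ambiguous query completely, so the personalized conditional entropy drops to zero.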

Difficulty of Queries
  • Easy queries (low H(URL|Q)):
    • google, yahoo, myspace, ebay, …
  • Hard queries (high H(URL|Q)):
    • dictionary, yellow pages, movies, “what is may day?”
How Hard are Query Suggestions? The Wild Thing? C* Rice → Condoleezza Rice
  • Traditional Suggestions
    • H(Query)
    • 21 bits
  • Personalized
    • H(Query | IP)
    • 5 bits (= 26 – 21)

Personalization cuts H in Half!
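A toy version of personalized suggestion: condition the query distribution on the user’s IP (or IP class), then complete from that class’s own history. The log is invented, and the wildcard “C* Rice” is simplified here to a plain prefix match:

```python
from collections import Counter, defaultdict

# Hypothetical (ip, query) log; with the IP known, the suggestion
# distribution is far more concentrated (lower entropy).
log = [("156.111.1.2", "condoleezza rice"),
       ("156.111.1.2", "condoleezza rice"),
       ("156.111.1.2", "white house"),
       ("210.75.33.9", "chris rice"),
       ("210.75.33.9", "rice recipes")]

by_ip = defaultdict(Counter)
for ip, q in log:
    by_ip[ip][q] += 1

def suggest(prefix, ip):
    """Personalized completion: the most frequent query starting with
    `prefix` in this user's (or user class's) history."""
    matches = [(c, q) for q, c in by_ip[ip].items() if q.startswith(prefix)]
    return max(matches)[1] if matches else None

print(suggest("c", "156.111.1.2"))  # → condoleezza rice
print(suggest("c", "210.75.33.9"))  # → chris rice
```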


Personalization with Backoff
  • Ambiguous query: MSG
    • Madison Square Garden
    • Monosodium Glutamate
  • Disambiguate based on user’s prior clicks
  • When we don’t have data
    • Backoff to classes of users
  • Proof of Concept:
    • Classes defined by IP addresses
  • Better:
    • Market Segmentation (Demographics)
    • Collaborative Filtering (Other users who click like me)
  • Proof of concept: bytes of IP define classes of users
  • If we only know some of the IP address, does it help?

Cuts H in half even using only the first two bytes of the IP

Some of the IP is better than none
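A minimal sketch of what “backing off to classes of users” means in practice: truncate each IP to its first bytes and pool clicks within the class. The log entries and URLs below are made up for illustration:

```python
from collections import defaultdict

def ip_prefix(ip, nbytes):
    """Class of users: the first `nbytes` bytes of the IP (e.g. 156.111.*.*)."""
    return ".".join(ip.split(".")[:nbytes])

# Hypothetical clicks. Pooling by a 2-byte prefix gives denser counts
# than full IPs, which is why "some of the IP is better than none".
clicks = [("msg", "156.111.188.243", "thegarden.com"),
          ("msg", "156.111.188.1", "thegarden.com"),
          ("msg", "210.75.33.9", "msg-recipes.example")]

by_class = defaultdict(list)
for query, ip, url in clicks:
    by_class[ip_prefix(ip, 2)].append(url)
print(dict(by_class))
# → {'156.111': ['thegarden.com', 'thegarden.com'], '210.75': ['msg-recipes.example']}
```

Two different users in 156.111.*.* fall into one class, so the sparse-data problem at the full-IP level becomes a dense estimate at the class level.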

Backing Off by IP

Sparse Data

Missed Opportunity

  • Personalization with Backoff
  • λs estimated with EM and cross-validation (CV)
  • A little bit of personalization
    • Better than too much
    • Or too little

λ4: weights for first 4 bytes of IP
λ3: weights for first 3 bytes of IP
λ2: weights for first 2 bytes of IP


Personalization with Backoff Market Segmentation
  • Traditional Goal of Marketing:
    • Segment Customers (e.g., Business v. Consumer)
    • By Need & Value Proposition
      • Need: Segments ask different questions at different times
      • Value: Different advertising opportunities
  • Segmentation Variables
    • Queries, URL Clicks, IP Addresses
    • Geography & Demographics (Age, Gender, Income)
    • Time of day & Day of Week

Business Queries on Business Days

Consumer Queries

(Weekends & Every Day)

Day vs. Night: More Queries, More Diversified Queries

Day: more clicks, more diversified queries

Night: fewer clicks, more uniform queries

Harder Queries at TV Time

Harder queries

Weekends are harder

Conclusions: Millions (not Billions)
  • How Big is the Web?
    • Upper bound: O(Population)
      • Not Billions
      • Not Infinite
  • Shannon >> Chomsky
    • How hard is search?
    • Query Suggestions?
    • Personalization?
  • Cluster in Cloud ($$$$) → Walk-in-the-Park ($)

Entropy is a great hammer

Conclusions: Personalization with Backoff
  • Personalization with Backoff
    • Cuts search space (entropy) in half
    • Backoff → Market Segmentation
      • Example: Business v. Consumer
        • Need: Segments ask different questions at different times
        • Value: Different advertising opportunities
  • Demographics:
    • Partition by IP, day, hour, business/consumer query, …
  • Future Work:
    • Model combinations of surrogate variables
    • Group users by similarity → collaborative search


Prediction Task: Historical Logs → Coverage of Future Demand
  • Training: Estimate Pr(x)
    • Given what we know today,
    • Estimate, Pr(x), tomorrow’s demand for url x
    • (There are an infinite set of urls x.)
  • Test: Score Pr(x) by Cross Entropy
    • Given tomorrow’s demand, x1…xk
    • Score ≡ −log2 of the geometric mean of Pr(x1) … Pr(xk)
  • One forecast is better than another
    • If it has better (less) cross entropy
  • Cross entropy ≥ Entropy (H)
    • Entropy is the score of the best possible forecast (that only God knows)
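The scoring rule above can be written directly: cross entropy is the average of −log2 Pr(x) over tomorrow’s observed demand, i.e. −log2 of the geometric mean. A sketch with two hypothetical forecasts:

```python
import math

def cross_entropy(pr, outcomes):
    """Score a forecast Pr(x) on observed outcomes x1..xk:
    -log2 of the geometric mean of Pr(x1)..Pr(xk)."""
    return -sum(math.log2(pr[x]) for x in outcomes) / len(outcomes)

# Two invented forecasts of tomorrow's URL demand; the better one
# concentrates probability mass on what actually gets clicked.
forecast_a = {"msn.com": 0.5, "google.com": 0.25, "ebay.com": 0.25}
forecast_b = {"msn.com": 0.25, "google.com": 0.25, "ebay.com": 0.5}
tomorrow = ["msn.com", "msn.com", "google.com", "ebay.com"]

print(cross_entropy(forecast_a, tomorrow))  # 1.5 bits
print(cross_entropy(forecast_b, tomorrow))  # 1.75 bits
```

Forecast A scores better (lower cross entropy) because it assigns more mass to the URL that was clicked twice.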
Millions, Not Billions(Until Market Saturates)
  • Telephones are a Mature Market
    • Saturated
      • Universal Service is limited by population
      • Loops (telephone numbers) ≈ population
      • Everyone (and every business) is listed in phonebook
        • (unless they have opted out)
  • Web is Growth Market
    • Decades from saturation
    • When everybody and every product has a page
      • The number of pages will be bounded by the population
    • In the meantime, millions are good enough
Smoothing (Cont.)
  • Use interpolation smoothing:

Pr(url | query, IP) = λ4 Pr(url | query, IP4) + λ3 Pr(url | query, IP3) + λ2 Pr(url | query, IP2) + λ1 Pr(url | query, IP1) + λ0 Pr(url | query)

where IPi is the first i bytes of the IP address (e.g., IP4 is the full address; IP2 = 156.111.*.*).

  • Use one month’s search log (Jan 06) as training data and the next month’s log (Feb 06) as the test set
  • λi determined by the EM algorithm, minimizing the conditional cross entropy on held-out data.
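One way to fit the λs by EM, sketched on hypothetical held-out events. Each event is reduced to the probabilities the component models assign it (the real system conditions on query and IP prefix; the numbers here are invented):

```python
def em_weights(component_probs, iters=50):
    """Estimate interpolation weights λ by EM on held-out events.
    component_probs: one row per event, each row the per-component
    probabilities, e.g. [Pr4(x), Pr3(x), Pr2(x), Pr0(x)]."""
    k = len(component_probs[0])
    lam = [1.0 / k] * k  # start from uniform weights
    for _ in range(iters):
        resp_sums = [0.0] * k
        for probs in component_probs:
            mix = sum(l * p for l, p in zip(lam, probs))
            for i in range(k):  # E-step: responsibility of component i
                resp_sums[i] += lam[i] * probs[i] / mix
        # M-step: new weight = average responsibility
        lam = [s / len(component_probs) for s in resp_sums]
    return lam

# Invented held-out events: the full-IP model (first column) is
# sharpest here, so EM should give it the largest weight.
events = [[0.9, 0.5, 0.3, 0.1],
          [0.8, 0.6, 0.2, 0.1],
          [0.7, 0.4, 0.3, 0.2]]
lam = em_weights(events)
print(lam)  # λ4 gets the most weight
```

Maximizing held-out likelihood this way is equivalent to minimizing the cross entropy described above; the weights always sum to one.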
Cross Validation
  • An IP seen in the future might not have been seen in the history
  • But parts of it have been

Figure: Cross entropy H(future | history) under no personalization, complete personalization, and personalization with backoff (knowing every byte of the IP vs. at least two bytes).

Cross Validations
  • Weekends are harder to predict
  • Weekdays to predict weekdays >> weekends to predict weekdays
  • Daytime to predict daytime >> night to predict daytime
Partition by Day-Week (CV)
  • Training data: weekdays and weekends
  • Test data: weekdays and weekends
  • Weekends are more difficult to predict
  • Weekdays are easier to predict from a history of weekdays