
To the Internet and Beyond: Database Challenges for New/Advanced Applications



Presentation Transcript


  1. To the Internet and Beyond: Database Challenges for New/Advanced Applications May 21, 2001 Propel Confidential

  2. Agenda • The story of Infoseek • Why Propel? • The problems that arise for us

  3. Scoring Framework • Classification Problem • Separate relevant from non-relevant documents • Bayes’ Decision rule: Relevant if $P(x(d) \mid R)\,P(R) > P(x(d) \mid \neg R)\,P(\neg R)$, where $x(d)$ is the observed representation of $d$ • Independence assumption leads to $S(d) = \sum_{t} \log \frac{p(t)\,(1 - q(t))}{(1 - p(t))\,q(t)}$, where $p(t) = P(t \mid R)$ and $q(t) = P(t \mid \neg R)$
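
The log-odds score on the Scoring Framework slide can be sketched in a few lines. This is only an illustration, not the Infoseek implementation: it assumes the sum runs over the terms present in the document, and the function name `bim_score` and the probability tables `p` and `q` are hypothetical values made up for the example.

```python
import math

def bim_score(doc_terms, p, q):
    """Sum the log-odds weight of every document term that has estimates.

    p[t] stands for P(t | R) and q[t] for P(t | not R); both tables are
    hypothetical inputs, not values from the talk.
    """
    score = 0.0
    for t in doc_terms:
        if t in p and t in q:
            score += math.log(p[t] * (1 - q[t]) / ((1 - p[t]) * q[t]))
    return score

# Made-up probabilities, for illustration only.
p = {"barney": 0.8, "dinosaur": 0.6}
q = {"barney": 0.1, "dinosaur": 0.2}
doc = {"barney", "dinosaur", "purple"}  # "purple" has no estimate and is skipped

score = bim_score(doc, p, q)
print(round(score, 4))
```

Terms that are much more likely in relevant documents than in non-relevant ones contribute large positive weights; terms with p(t) = q(t) contribute zero.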

  4. The original Infoseek vision • Stolen from Bill Gates… “Information at your fingertips” • To find any piece of information on any computer in the world within 1 second

  5. How we got started • Finding information was too expensive and too hard • Our field of dreams • “If you provide useful information at bargain prices, they will come” • In January 1995 we launched Infoseek • Register with a credit card • First month free • 10 cents a transaction

  6. What happened... • “I thought you said it was FREE to try it?” • “You’ve got to be kidding!” • “I already pay $10 a month for my access!” • “I can’t afford it.” • “Go to …” • “Why should I pay when the information is available free elsewhere on the net?” • “I don’t like to be nickeled and dimed.”

  7. Even more advice... • “You should only charge me per query” • “You should only charge for document” • “I’ll only sign up for a flat fee” • “I refuse to pay a flat fee” • “I don’t have a credit card” • “Your legal agreement is too long”

  8. What we did • Dropped the credit card registration for a free trial • Made it very clear you can’t get most of this stuff for free anywhere on the net • Made the pricing easier to understand • Advertised it on our free Net Search

  9. “So…. How would you like to provide a free Net Search?” • First reactions • “Are you joking?” • “How would we make money? By making it up in volume?” • Strategy • It would be free advertising for Pro • Limit the search results to 100 hits • Want more? Refer to Infoseek Pro

  10. Infoseek Guide • 25M hits/day (200 queries/sec at times) • #1 search engine on the Net • 1,000 signups/day for Infoseek Pro • Discovered advertising sponsorship • 1.5 cents per query • Discovered TV math • we make more money giving away information than selling it

  11. Four years later…

  12. How to find Barney pages suitable for your kids • +Barney +dinosaur -bash -kill -maim -destroy -hate
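
A rough sketch of how the `+` and `-` operators in the Barney query could be applied to a page. The helper names `parse_query` and `matches`, and the whole-word, case-insensitive matching, are assumptions made for this example; the talk does not describe the actual matching rules.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and plain terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:].lower())
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:].lower())
        else:
            optional.add(token.lower())
    return required, excluded, optional

def matches(page_text, query):
    """True if the page has every + term and none of the - terms."""
    words = set(page_text.lower().split())
    required, excluded, _ = parse_query(query)
    return required <= words and not (excluded & words)

QUERY = "+Barney +dinosaur -bash -kill -maim -destroy -hate"
ok = matches("Barney the purple dinosaur sings songs", QUERY)
bad = matches("Why I hate Barney the dinosaur", QUERY)
print(ok, bad)
```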

  13. What people ask about (and why)

  14. Unofficial SIGMOD survey question How many people here search the web for “adult sites”?

  15. Top 15 queries on the WWW* • sex • Playboy • Penthouse • chat • Hustler • nude • porn • erotica • games • pornography • porno • adult • ESPN • pussy • Pamela Anderson • * I am not making this up! This list is real!

  16. What does that mean? • “Uhh… I was just testing!”

  17. Unofficial SIGMOD trivia question • Q: What famous IR researcher asked in 1995 “Is this because of the Communications Decency Act (CDA)?” • A: Bruce Croft

  18. Why it happens (possible explanations) • Research on CDA • Curious what others are looking at • Many new technologies are driven by sex: • VCR • Hotel movies on demand • People are naturally horny

  19. What it means • Human race in no danger of extinction • Corporate libraries doing a great job in technical areas • Traditional sex education inadequate • Some of you are not telling the truth • Audience surveys are not always accurate • Bill Gates should admit to Congress that Pamela Anderson is more important than he is • If you didn’t raise your hand, you may need professional help!

  20. The secret Infoseek backup bizplan • Selling our list of porn sites

  21. We never pursued it… … But other companies did! • Sinfoseek • Infoseak • Nymfoseek • Infopeek • ...

  22. Relevance ranking Web sites

  23. Facts about Queries • Most queries are short • Average length approx. 2.2 • 10% use query syntax (usually incorrectly) • 1% used advanced search • Noun phrases only • Precision more important than recall • Users expect precision in top results

  24. Relevance ranking objectives Must use several techniques to determine “relevance”: • Page has query term(s) • Popular usage of the term, e.g., penthouse, java, adult, “evil empire”, ... • Page quality • Page/site popularity • Spam reduction/elimination • Porn reduction

  25. Relevance ranking factors • Query terms: tf*idf • Usage: Hyperlink text, thesaurus • Quality: site quality, dates, depth, … • Popularity: External link count, proxy stats • Spam: word/phrase unusual statistics (tf limiting) • Porn: site exclusion list, naughty phrase list
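
To make the first and fifth factors concrete, here is a minimal tf*idf scorer with a cap on term frequency. It is a sketch under stated assumptions: the slide only names "tf*idf" and "tf limiting", so the logarithmic idf, the cap value `tf_cap`, and the function name `tf_idf_score` are illustrative choices, not the formula Infoseek used.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, doc_freq, num_docs, tf_cap=3):
    """Sum tf * idf over query terms, capping tf to blunt keyword stuffing."""
    counts = Counter(doc_tokens)
    score = 0.0
    for t in query_terms:
        df = doc_freq.get(t, 0)
        if df == 0 or counts[t] == 0:
            continue  # term unknown to the collection or absent from the page
        tf = min(counts[t], tf_cap)     # assumed form of "tf limiting"
        idf = math.log(num_docs / df)   # rarer terms weigh more
        score += tf * idf
    return score

# Made-up collection statistics, for illustration only.
doc_freq = {"java": 10, "coffee": 100}
num_docs = 1000

honest_page = ["java", "tutorial", "coffee"]
spam_page = ["java"] * 50  # keyword-stuffed page

honest_score = tf_idf_score(["java", "coffee"], honest_page, doc_freq, num_docs)
spam_score = tf_idf_score(["java", "coffee"], spam_page, doc_freq, num_docs)
uncapped_spam = tf_idf_score(["java", "coffee"], spam_page, doc_freq, num_docs, tf_cap=50)
print(honest_score, spam_score, uncapped_spam)
```

With the cap, repeating a word 50 times earns no more than repeating it three times, which limits (but does not remove) the advantage of the stuffed page.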

  26. Relative weighting of these factors is tricky and subjective Should “evil empire” return Microsoft as the top hit?

  27. Living in a world of an infinite number of documents

  28. The problem (user view) • Too hard to find things even though only 100M documents indexed • Often precision and relevance, NOT recall • “intel” in the title search gives over 200 hits just like this: Index of /CPAN-local/authors/id/GSAR/x86/intel/ix86/intel/ix86/intel/intel/ix86/intel/ix86/ • Query ambiguity, e.g., “baby Bells”

  29. The problem (vendor view) • Speed • Size • Cost • Freshness • Load on the Internet/bandwidth (both sides) • Quality (Spam/porn) • Will people be able to find what they are looking for as the net grows?

  30. Today’s approach sucks • Suck all content into a centralized search engine • [Diagram: Infoseek; all the world’s content]

  31. Is there a better way? • We might start by asking the question: “How do people find information today?”

  32. Centralized searching techniques are rarely used in real life... • Ask God (and pray for an answer) • Ask DIALOG …and pray... • WWW search (new!)

  33. What people DO use is decentralized searching • [Diagram: Question; Source 1, Source 2, ..., Source N; Answers and more sources]

  34. Human distributed searching attributes • Faster than a computer!!! • Complete • Accurate • Can be used to validate an answer • Will always find an answer (eventually) • No specialized hardware • All humans had the same CPU speed/RAM

  35. So can’t we design a computer distributed search network that is as fast and accurate and complete as our human distributed search network?

  36. Our goal • Don’t necessarily mimic the process, but adapt the process to the medium

  37. One approach • User types query • System searches databases of popular pages as well as meta descriptions of other databases • Repeat until all websites have been searched NOTE: This is the fastest way to search an infinite amount of data
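
The steps on the "One approach" slide can be sketched as a breadth-first walk over databases. Everything concrete here is hypothetical: the `SOURCES` table, the `.example` URLs, and the rule that a database is visited only when its meta description shares a word with the query. The slide itself says to repeat until all websites have been searched; this sketch prunes databases whose descriptions do not match, which is one possible reading, not the documented design.

```python
from collections import deque

# Hypothetical network: each database holds pages plus meta descriptions
# of other databases it knows about.
SOURCES = {
    "popular": {
        "pages": {"popular.example/news": "daily news headlines"},
        "meta": {"kids-db": "kids dinosaur cartoons", "sports-db": "baseball scores"},
    },
    "kids-db": {
        "pages": {
            "kids.example/barney": "barney the purple dinosaur",
            "kids.example/trains": "toy trains",
        },
        "meta": {"toys-db": "dinosaur toys"},
    },
    "toys-db": {"pages": {"toys.example/rex": "plush dinosaur rex"}, "meta": {}},
    "sports-db": {"pages": {"sports.example/mlb": "baseball scores"}, "meta": {}},
}

def distributed_search(query, sources, start="popular"):
    """Search the start database, then follow matching meta descriptions."""
    terms = set(query.lower().split())
    hits, seen, frontier = [], {start}, deque([start])
    while frontier:
        source = sources[frontier.popleft()]
        for url, text in source["pages"].items():
            if terms & set(text.lower().split()):
                hits.append(url)
        for other, description in source["meta"].items():
            if other not in seen and terms & set(description.lower().split()):
                seen.add(other)
                frontier.append(other)
    return hits

results = distributed_search("dinosaur", SOURCES)
print(results)
```

In this toy run the sports database is never contacted, because nothing in its description matches the query.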

  38. What we learned • Relatively weak engines with no proximity got a wide following: people couldn’t see through the hype • Bigger was better

  39. People lie • We have “concept searching” • We’re growing faster than the net • We’ve indexed 95% of the net • We have more URLs than anyone else

  40. What we learned • Competing for the Internet customer is not always a case of who really has: • the best engine • the highest quality content or the most content • the best price, best GUI, or the best product • It’s more about: • brand name • convincing the customer you are the best

  41. What we learned • Ads don’t sell themselves • If you do 1M ads per day, you’re cookin’ • Lots of competition • Switching costs are low • User behavior can be tracked • Seemingly identical pages can have dramatically different click through

  42. Mistakes we made • Not pressing for branding • Slow to recognize ad model • No ! at the end of our name

  43. Ultra required new thinking • The traditional IDF formula breaks down for 1 billion documents • Existing data structures would never work • “Managing Gigabytes” didn’t go far enough • Inktomi approach was too inefficient • No sacred cows

  44. Ultra: Designed for speed • Speed/space tradeoff • Architected from the ground up for 1 billion docs and 1,000 queries/sec • Everything is done in parallel and multi-threaded • Limited disk I/O • Small in-RAM tables • Stable connections

  45. Ultra size • Parallel worms (multi-process, multi-threaded) • Proprietary database required; OODB’s too slow • Change frequency monitored • >50M URLs

  46. Feature set • Natural language queries • +, -, “phrases” • Fields (link, url, site, title) • Case sensitivity • Stemming • Approximate matching for phrases • Gets faster the longer the query • 8 shortest term lists • Space invariant, i.e., CD-ROM = CDROM
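
One way to read "gets faster the longer the query" and "8 shortest term lists" is that the engine intersects posting lists starting from the shortest ones and looks at no more than eight of them. The sketch below follows that reading; it is an assumption, as are the names `normalize` and `search`, the in-memory `index`, and the choice to treat hyphens like spaces so that CD-ROM and CDROM collide.

```python
import re

def normalize(term):
    """Lowercase and drop spaces/hyphens so 'CD-ROM' and 'CDROM' match."""
    return re.sub(r"[\s\-]+", "", term).lower()

def search(index, query_terms, max_lists=8):
    """Intersect posting sets, shortest first, using at most max_lists lists."""
    lists = []
    for term in query_terms:
        postings = index.get(normalize(term))
        if postings is None:
            return set()  # a term with no postings means no document matches
        lists.append(postings)
    if not lists:
        return set()
    lists.sort(key=len)            # rare terms first keep the candidate set small
    result = set(lists[0])         # copy so the index is never mutated
    for postings in lists[1:max_lists]:
        result &= postings
        if not result:
            break                  # nothing left to intersect
    return result

# Tiny made-up index: normalized term -> set of document ids.
index = {
    "cdrom": {1, 2, 3},
    "drive": {2, 3, 4, 5},
    "the": set(range(100)),
}

found = search(index, ["CD-ROM", "drive", "the"])
print(found)
```

Each extra rare term can only shrink the candidate set, so longer queries tend to finish sooner. Note that with more than `max_lists` terms, the longest lists are skipped entirely in this sketch, so results are not checked against those terms.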
