Searching the Internet

Steve Kirsch, Chairman, Infoseek. Searching the Internet. Scoring Framework. Classification problem: separate relevant from non-relevant documents. Bayes’ decision rule: relevant if P(x(d)|R) P(R) > P(x(d)|~R) P(~R), where x(d) is the observed representation of d.



  1. Searching the Internet • Steve Kirsch, Chairman, Infoseek

  2. Scoring Framework • Classification problem • Separate relevant from non-relevant documents • Bayes’ decision rule: relevant if P(x(d)|R) P(R) > P(x(d)|~R) P(~R), where x(d) is the observed representation of d • Independence assumption leads to S(d) = Σ_t log [ p(t)(1-q(t)) / ((1-p(t)) q(t)) ], where p(t) = P(t|R) and q(t) = P(t|~R)
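
     The scoring formula on this slide can be illustrated with a small sketch. It assumes the sum runs over query terms that occur in the document, and that p(t) and q(t) are supplied as plain dictionaries; the function names and probability values are hypothetical, not taken from Infoseek.

     ```python
     import math

     def term_weight(p, q):
         """Log-odds weight of one term: log [p(1-q) / ((1-p)q)]."""
         return math.log(p * (1 - q) / ((1 - p) * q))

     def score(doc_terms, query_terms, p, q):
         """S(d): sum of term weights over query terms present in the document.

         p[t] stands for P(t|R), q[t] for P(t|~R). Summing only over matching
         query terms is an assumption made for this sketch.
         """
         return sum(term_weight(p[t], q[t]) for t in query_terms if t in doc_terms)

     # Illustrative (made-up) probabilities.
     p = {"java": 0.8, "coffee": 0.5}
     q = {"java": 0.2, "coffee": 0.5}
     print(score({"java", "coffee", "island"}, ["java", "coffee"], p, q))
     ```

     A term that is equally likely in relevant and non-relevant documents (here "coffee") contributes a weight of zero, so only discriminating terms move the score.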

  3. Agenda • How Infoseek got started • How Infoseek works • How do I … ? • What people ask about; Why • Relevance ranking of web pages • Searching an infinite number of web pages • How people search • Video of IR experts • Top 7 habits of experts • Distributed search fusion algorithm • Why write a search engine in Java?

  4. How Infoseek got started (1994) • DIALOG was too hard and too expensive …so... Internet + Natural Language Query Engine + Low price “If you build it, they will come”

  5. What things look like today

  6. Popular questions • Do we sell placement? • How do I get to the top of the results? • How do I find good Barney pages?

  7. Placement • Based on popularity and statistics

  8. Why people can’t find you • Infinite number of web pages • >150M static pages • Chances of being found on the net are a lot less than being found in the phone book because: • there are a million times more web pages • these pages are not organized

  9. How do I get to the top of the search result listings? Pray

  10. How to find Barney pages suitable for your kids • +Barney +dinosaur -bash -kill -maim -destroy -hate
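
     The query on this slide can be read as a filter, sketched below. The sketch assumes that a leading "+" marks a term that must appear and a leading "-" marks a term that must not appear; the helper names and the whitespace tokenization are hypothetical simplifications.

     ```python
     def parse_query(query):
         """Split a query into required (+), excluded (-) and optional terms."""
         required, excluded, optional = set(), set(), set()
         for token in query.lower().split():
             if token.startswith("+") and len(token) > 1:
                 required.add(token[1:])
             elif token.startswith("-") and len(token) > 1:
                 excluded.add(token[1:])
             else:
                 optional.add(token)
         return required, excluded, optional

     def matches(page_text, query):
         """True if the page has every required term and no excluded term."""
         words = set(page_text.lower().split())
         required, excluded, _ = parse_query(query)
         return required <= words and not (excluded & words)

     QUERY = "+Barney +dinosaur -bash -kill -maim -destroy -hate"
     print(matches("Barney the purple dinosaur sings", QUERY))
     ```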

  11. (and why) What people ask about

  12. Unofficial SIGIR survey question How many people here search the web for “adult sites”?

  13. Top 15 queries on the WWW* • sex • Playboy • Penthouse • chat • Hustler • nude • porn • erotica • games • pornography • porno • adult • ESPN • pussy • Pamela Anderson • [Slide also shows: Sony NightShot HandyCam] • * I am not making this up! This list is real!

  14. What does that mean? • “Uhh… I was just testing!”

  15. Unofficial SIGIR trivia question • Q: What famous IR researcher asked “Is this because of the CDA?” • A: Bruce Croft in 1995

  16. Why it happens (possible explanations) • Research on CDA • Curious what others are looking at • “I read Playboy for the articles” • Many new technologies are driven by sex: • VCR • Hotel movies on demand • People are naturally horny

  17. What it means • Human race in no danger of extinction • Corporate libraries doing a great job in technical areas • Traditional sex education inadequate • Some of you are not telling the truth • Audience surveys are not always accurate • Bill Gates should admit to Congress that Pamela Anderson is more important than he is • If you didn’t raise your hand, you may need professional help!

  18. Finding “relevant” sites • Try these engines: • Sinfoseek • Infoseak • Nymfoseek • Infopeek • ...

  19. Relevance ranking Web sites

  20. Facts about Queries • Most queries are short • Average length approx. 2.2 • 10% use query syntax (usually incorrectly) • 1% use advanced search • Noun phrases only • Precision more important than recall • Users expect precision in top results

  21. Relevance ranking objectives Must use several techniques to determine “relevance”: • Page has query term(s) • Popular usage of the term, e.g., penthouse, java, adult, “evil empire”, ... • Page quality • Page/site popularity • Spam reduction/elimination • Porn reduction

  22. Relevance ranking factors • Query terms: tf*idf • Usage: Hyperlink text, thesaurus • Quality: site quality, dates, depth, … • Popularity: External link count, proxy stats • Spam: word/phrase unusual statistics (tf limiting) • Porn: site exclusion list, naughty phrase list
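
     Two of the factors above, tf*idf and tf limiting, can be sketched together. The sketch assumes idf = log(N / df) and a fixed cap on term frequency; the cap value, the function name and the sample numbers are hypothetical, and the slide does not say how Infoseek combined these factors.

     ```python
     import math
     from collections import Counter

     def tfidf_score(query_terms, doc_tokens, doc_freq, num_docs, tf_cap=3):
         """Sum of capped tf * idf over the query terms.

         doc_freq[t] is the number of documents containing t. Capping tf
         means repeating a word many times (a common spam trick) stops
         paying off beyond tf_cap occurrences.
         """
         counts = Counter(doc_tokens)
         total = 0.0
         for term in query_terms:
             tf = min(counts[term], tf_cap)
             df = doc_freq.get(term, 0)
             if tf == 0 or df == 0:
                 continue
             total += tf * math.log(num_docs / df)
         return total

     doc_freq = {"intel": 10}
     spam_page = ["intel"] * 50
     normal_page = ["intel", "cpu", "intel"]
     print(tfidf_score(["intel"], spam_page, doc_freq, num_docs=1000))
     print(tfidf_score(["intel"], normal_page, doc_freq, num_docs=1000))
     ```

     With the cap at 3, the page that repeats "intel" fifty times scores only 1.5 times the page that mentions it twice.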

  23. Relative weighting of these factors is tricky and subjective Should “evil empire” return Microsoft as the top hit?

  24. Living in a world of an infinite number of documents

  25. The problem (user view) • Too hard to find things even though only 100M documents are indexed • The issue is often precision and relevance, NOT recall • A title search for “intel” gives over 200 hits just like this: Index of /CPAN-local/authors/id/GSAR/x86/intel/ix86/intel/ix86/intel/intel/ix86/intel/ix86/ • Query ambiguity, e.g., “baby Bells”

  26. The problem (vendor view) • Speed • Size • Cost • Freshness • Load on the Internet/bandwidth (both sides) • Quality (Spam/porn) • Will people be able to find what they are looking for as the net grows?

  27. Today’s approach sucks • Suck all content into a centralized search engine • [Diagram: all the world’s content flows into Infoseek]

  28. Is there a better way? • We might start by asking the question: “How do people find information today?”

  29. Today’s retrieval techniques Let’s observe two “information retrieval professionals” at work...

  30. Top 7 information retrieval techniques used by the “pros” • Witchcraft • Dead relatives • CIA • Psychic powers • Ouija board • Magic 8 ball • Space aliens

  31. My favorite IR story • On crutches • Right leg in a brace • Nurse uses this highly sophisticated IR technique to determine where the problem was

  32. Centralized searching techniques are rarely used in real life... • Ask God (and pray for an answer) • Ask DIALOG …and pray... • WWW search (new!)

  33. What people DO use is decentralized searching • [Diagram: Question goes to Source 1, Source 2, ..., Source N, which return answers and more sources]

  34. How well does centralized searching work? • I need a volunteer...

  35. How well does human distributed searching work? • Let’s find out… • Name two films directed by James Cameron

  36. Human distributed searching attributes • Faster than a computer!!! • Complete • Accurate • Can be used to validate an answer • Will always find an answer (eventually) • No specialized hardware • All humans have the same CPU speed/RAM

  37. So can’t we design a computer distributed search network that is as fast and accurate and complete as our human distributed search network?

  38. Our goal • Don’t necessarily mimic the process, but adapt the process to the medium

  39. One approach • User types query • System queries a “meta” database of collection descriptors to determine best collections for that query • System routes query to the best N of those collections in parallel • System merges results and presents to user • For deeper search, use bigger N
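
     The steps on this slide can be sketched as follows. The meta index here is a hypothetical dictionary mapping each collection to per-term document counts, each collection is represented by a callable returning (doc_id, score) pairs, and the merge simply sorts by score, which assumes the scores are comparable across collections.

     ```python
     from concurrent.futures import ThreadPoolExecutor

     def select_collections(query_terms, meta_index, n):
         """Rank collections by how often their descriptors mention the query
         terms and keep the best n that match at all (hypothetical heuristic)."""
         def descriptor_score(name):
             return sum(meta_index[name].get(t, 0) for t in query_terms)
         ranked = sorted(meta_index, key=descriptor_score, reverse=True)
         return [name for name in ranked[:n] if descriptor_score(name) > 0]

     def distributed_search(query_terms, meta_index, search_fns, n=2):
         """Route the query to the best n collections in parallel and merge."""
         chosen = select_collections(query_terms, meta_index, n)
         if not chosen:
             return []
         with ThreadPoolExecutor(max_workers=len(chosen)) as pool:
             result_lists = list(pool.map(lambda name: search_fns[name](query_terms), chosen))
         merged = [hit for hits in result_lists for hit in hits]
         # Naive merge: assumes scores from different collections are comparable.
         merged.sort(key=lambda hit: hit[1], reverse=True)
         return merged
     ```

     Raising n widens the search to more collections, at the cost of more parallel queries.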

  40. Distributed search query • Subject area (optional): Infoseek • Query: Steve Kirsch • Collection selection: Use BOTH • Send only the Query to each collection

  41. Distributed searching demo • [Diagram labels: User’s Java applet; Query; Meta index; Best collections; Internet; DB #34, DB #564, DB #54, ...; Results; Merge results and present]

  42. Distributed vs. meta searching • [Table comparing Meta and Distributed on two rows, Auto collection and Fusion; cell values not recoverable] • Distributed = “As if one collection”

  43. The meta index • The meta index contains characterizations of each collection • words and phrases • The meta index can be updated incrementally, in real-time, whenever a document is added/removed from a collection
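
     A meta index that is updated incrementally might look like the sketch below. It assumes each collection is characterized by per-word document counts only (phrases are left out for brevity); the class and method names are hypothetical.

     ```python
     from collections import Counter, defaultdict

     class MetaIndex:
         """Per-collection word statistics, updated as documents come and go."""

         def __init__(self):
             # collection name -> word -> number of documents containing it
             self.term_counts = defaultdict(Counter)

         def add_document(self, collection, tokens):
             self.term_counts[collection].update(set(tokens))

         def remove_document(self, collection, tokens):
             counts = self.term_counts[collection]
             counts.subtract(set(tokens))
             # Drop words that no remaining document contains.
             for term in [t for t, c in counts.items() if c <= 0]:
                 del counts[term]

         def describe(self, collection):
             return dict(self.term_counts[collection])
     ```

     Each update touches only the words of the document being added or removed, so no collection has to be re-scanned.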

  44. Results fusion issue • Search results scores from each collection are not comparable • even if each site is using the same search engine

  45. Traditional fusion technique • Pass 1: Gather statistics from all engines • Pass 2: Ask each engine to compute scores • Requirement: all search engines must use the same scoring algorithm
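
     The two-pass technique can be sketched with toy in-memory engines. The sketch assumes every engine scores with the same tf*idf formula and that the statistics gathered in pass 1 are the collection size and per-term document frequency; the Engine class and its methods are hypothetical.

     ```python
     import math
     from collections import Counter

     class Engine:
         """Toy search engine over docs: dict of doc_id -> list of tokens."""

         def __init__(self, docs):
             self.docs = docs

         def stats(self, query_terms):
             """Pass 1: local document count and per-term document frequency."""
             df = Counter()
             for tokens in self.docs.values():
                 present = set(tokens)
                 for t in query_terms:
                     if t in present:
                         df[t] += 1
             return len(self.docs), df

         def score(self, query_terms, global_n, global_df):
             """Pass 2: tf*idf computed with the global statistics."""
             hits = []
             for doc_id, tokens in self.docs.items():
                 counts = Counter(tokens)
                 s = sum(counts[t] * math.log(global_n / global_df[t])
                         for t in query_terms if counts[t])
                 if s > 0:
                     hits.append((doc_id, s))
             return hits

     def two_pass_search(engines, query_terms):
         global_n, global_df = 0, Counter()
         for engine in engines:            # pass 1: gather statistics
             n, df = engine.stats(query_terms)
             global_n += n
             global_df += df
         hits = []
         for engine in engines:            # pass 2: score with shared statistics
             hits.extend(engine.score(query_terms, global_n, global_df))
         return sorted(hits, key=lambda h: h[1], reverse=True)
     ```

     Because every engine scores against the same global statistics, the merged ranking matches what a single engine holding all the documents would produce, at the cost of two round trips per query.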
