
Computer Science 1000



Presentation Transcript


  1. Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

  2. Search Engine • a collection of computer programs designed to help us find information on the Web • typically served through a website • different search providers exist, but basic functionality is consistent • type keywords into a text box • page returns links to other pages

  3. Search Engine • why is a search engine like an index? • recall that an index maps keywords to a location in some medium (like a page number in a book) • a search engine does a very similar thing • takes keywords of interest from a user • maps these keywords to relevant web pages • in fact, one of the key components of a search engine is its index

  4. Search Engine • what differentiates a search engine from other indexes (like a book index)? • the ability to quickly combine keywords in searches • e.g. search for information on ducks and foxes • result ranking • personalization • among others …

  5. Search Engine – How it Works • different search engines employ different technologies • the full details of commercial search engines are typically not public • however, some of the basics are consistent • crawling • indexing • query processing

  6. Crawling • for a search engine to be able to link to a web page, it must know about its existence • search engines find pages by crawling the web • programs called crawlers or spiders • e.g. Googlebot • a crawler visits web pages, in much the same way that you do • as each page is visited, information is remembered about the page (indexing)

  7. Crawling – Todo List • the todo list is the list of pages that are still to be visited by the crawler • the crawling process starts with an initial todo list, populated with sites from previous crawls • however, the list is updated as the crawl takes place • hyperlinks on visited sites are added to the list • Todo List: http://www.uleth.ca http://www.tsn.ca http://www.usask.ca ...

  8. Crawling – Example • suppose that this page was being processed by a crawler • as a consequence of this page being crawled, its links would be added to the todo list (if they aren't already there) • those pages would subsequently be checked by the crawler at some point Kev's Page • Favorite Stuff: • New York Islanders • Saskatchewan Roughriders • John Deere
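
The todo list mechanics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real crawler is built: the `LINKS` table is a made-up stand-in for the web (its page names are invented, loosely following Kev's Page), and `crawl` is a hypothetical helper.

```python
from collections import deque

# Hypothetical stand-in for the web: each page maps to the pages it links to.
LINKS = {
    "kevs-page": ["islanders", "roughriders", "john-deere"],
    "islanders": ["kevs-page"],
    "roughriders": [],
    "john-deere": ["roughriders"],
}

def crawl(seed_pages):
    """Visit pages in todo list order, adding newly found links to the list."""
    todo = deque(seed_pages)   # the todo list
    seen = set(seed_pages)     # pages that are already on the list or visited
    visited = []               # order in which pages were crawled
    while todo:
        page = todo.popleft()
        visited.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:   # only add links that aren't already there
                seen.add(link)
                todo.append(link)
    return visited

print(crawl(["kevs-page"]))
```

Starting from Kev's Page alone, the three linked pages are added to the list and crawled in turn; the link back to Kev's Page is skipped because it is already known.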

  9. The "Invisible Web" • not all information is crawled, which means it is not visible to search engines • some pages are new, and haven't yet had a chance to be crawled • however, there are other reasons that certain information does not get crawled

  10. The "Invisible Web" • 1) No hyperlinks to that page • recall that in order for a page to be crawled, it must either: • be on the todo list, or • be linked from a page that appears on the todo list • without a hyperlink, that page will never be found • figure: the todo list holds Page 1, Page 2 and Page 3; the web holds Pages 1 to 6 • Page 5 will not be crawled, as it is not on the to-do list, and no other pages link to it.

  11. The "Invisible Web" • 2) The Page is synthetic • a synthetic page is created on demand, depending on user input • e.g. the results of a search on another search engine My personal search for "New York Islanders" on Bing results in an on-demand page that is not stored. Hence, it will not be crawled.

  12. The "Invisible Web" • 3) The content is unreadable to the crawler • search engines are primarily text-based • certain data, such as movie content, is not crawlable The webpage containing the movie might be crawled, but not the movie itself. http://support.google.com/webmasters/bin/answer.py?hl=en&answer=72746

  13. The "Invisible Web" • 4) The content is password-protected • if you require a password to access a page, then so does a search engine*

  14. The "Invisible Web" • 5) You ask the search engine to ignore your site • the presence of certain files stored with your website can restrict your site from being crawled • e.g. The Robots Exclusion Protocol • a file called robots.txt can be stored that requests that your site (or just certain pages) not be crawled • unlike the previous four examples, this does not prevent search engines from crawling your site • they can choose to ignore robots.txt • Example (allows the agent named Google everywhere, asks all other crawlers to stay out): User-agent: Google Disallow: User-agent: * Disallow: / • Source: http://www.robotstxt.org/
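
To see how a well-behaved crawler might interpret the robots.txt example above, here is a small sketch using Python's standard `urllib.robotparser` module. The rules are the ones from the slide, written one directive per line; the URL `http://www.example.com/page.html` and the agent name `SomeOtherBot` are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the slide, one directive per line.
ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL and agent names, used only for illustration.
url = "http://www.example.com/page.html"
google_allowed = parser.can_fetch("Google", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print(google_allowed, other_allowed)
```

The empty `Disallow:` line leaves the whole site open to the named agent, while `Disallow: /` asks every other agent to stay away. Nothing enforces this: the check only matters to crawlers that choose to perform it.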

  15. Indexing • the primary role of the crawler is to build an index • an index is a list of tokens • words • phrases (not considered here)* • each token is associated with a list of URLs • in other words, like a book index, but with page URLs instead of page numbers • other information might be stored with URLs (e.g. page location of token) • these indexes are saved by the search provider • search queries use information from the indexes (fast), rather than crawling the web for each query (slow) *http://www.google.com/patents/US7536408
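
As a rough sketch of the idea above, an index can be modelled as a dictionary from each token to a list of URLs. The pages in `PAGES` and the helper `build_index` are hypothetical, and treating every lowercase word as a token is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
PAGES = {
    "www.example.com/a": "red fish swim",
    "www.example.com/b": "the red guppy is a fish",
    "www.example.com/c": "cat videos",
}

def build_index(pages):
    """Map each token (here: each lowercase word) to a sorted list of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in text.lower().split():
            index[token].add(url)
    # Sorted lists make later intersection of index lists easier.
    return {token: sorted(urls) for token, urls in index.items()}

index = build_index(PAGES)
print(index["red"])
print(index["cat"])
```

A query for "red" then becomes a dictionary lookup rather than a fresh crawl of every page.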

  16. Index Lists – Example * from text – Figure number might be different

  17. Indexing – What Makes a Token? • page text • a common approach • search providers differ on which text is selected* • some may use all text • others may only use certain text, such as: • titles and headings • frequently occurring words • words occurring early in a page • sometimes, stop words (a, an, the) are ignored • hyperlink text • the term from a hyperlink on another page may be used to describe the page that it links to *http://computer.howstuffworks.com/internet/basics/search-engine1.htm
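
One possible way to turn page text into tokens while ignoring stop words is sketched below. The function `tokens` is hypothetical, and the stop word set contains only the three words named on the slide; real providers make their own choices here.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # the stop words named on the slide

def tokens(text, skip_stop_words=True):
    """Split text into lowercase word tokens, optionally dropping stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if skip_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

print(tokens("The Guppy: a Red Fish"))
print(tokens("The Guppy: a Red Fish", skip_stop_words=False))
```

The same function could be applied to a page title, a heading, or the text of a hyperlink pointing at the page, depending on which text a provider decides to index.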

  18. Query Processing • the part of the search engine that we see • the query processor: • reads words/phrases from the user interface • returns pages that are relevant to that query • modern query processors: • are extremely fast • are very accurate • allow a considerable variety in their capabilities • how does this all work?

  19. Query Processing – How it works • let's start simple: suppose we search for a single word (e.g. cat) • in a nutshell: • the search engine finds the list for the token 'cat' • contains list of pages that contain 'cat' in the appropriate text (e.g. title) • this list is ranked according to perceived relevance • the ranked list is returned as an ordered set of hyperlinks

  20. Query Processing – How it works • Step 1: the search engine finds the list for the token 'cat'

  21. Query Processing – How it works • Step 2: this list is ranked according to perceived relevance www.cat.com en.wikipedia.org/wiki/Cat www.youtube.com/watch?v=J---aiyznGQ ...

  22. Query Processing – How it works • Step 3: the ranked list is returned as an ordered set of hyperlinks www.cat.com en.wikipedia.org/wiki/Cat www.youtube.com/watch?v=J---aiyznGQ ...
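
The three steps above can be sketched as follows. The index entry and the relevance scores are invented for illustration (the slides do not say how relevance is computed at this point), and `search` is a hypothetical helper.

```python
# Hypothetical index entry for the token 'cat' (stored in no particular order).
INDEX = {
    "cat": [
        "en.wikipedia.org/wiki/Cat",
        "www.youtube.com/watch?v=J---aiyznGQ",
        "www.cat.com",
    ],
}

# Made-up relevance scores: higher means more relevant.
SCORES = {
    "www.cat.com": 0.9,
    "en.wikipedia.org/wiki/Cat": 0.8,
    "www.youtube.com/watch?v=J---aiyznGQ": 0.5,
}

def search(word):
    """Answer a single-word query from the index."""
    urls = INDEX.get(word.lower(), [])                        # Step 1: find the list
    ranked = sorted(urls, key=lambda u: SCORES.get(u, 0.0),
                    reverse=True)                             # Step 2: rank it
    return ranked                                             # Step 3: return in order

for url in search("cat"):
    print(url)
```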

  23. Query Processing • what about multi-word searching? • as mentioned, some search engines index phrases as well • however, what if a particular phrase is not indexed? • e.g. (text) red fish guppy • solution: intersecting queries • the webpages that are common to all of the search words are returned

  24. Intersecting Queries • example (text): suppose the query was “red fish guppy” • further suppose that the indexes for each word were as follows: guppy: en.wikipedia.org/wiki/guppy www.ifga.org www.fullredguppy.com www.sciencedaily.com www.tropicalfish.com red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com • result is the set of sites that contain all of the keywords • in other words, the sites that are found on all three lists Result: www.fullredguppy.com www.sciencedaily.com
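
The example above can be reproduced with Python's built-in sets. Note that set intersection is a convenient shortcut chosen for this sketch, not the list-walking methods discussed on later slides; `intersect_query` is a hypothetical helper and `INDEX` holds the three lists from the example.

```python
# Index lists from the "red fish guppy" example.
INDEX = {
    "guppy": ["en.wikipedia.org/wiki/guppy", "www.ifga.org", "www.fullredguppy.com",
              "www.sciencedaily.com", "www.tropicalfish.com"],
    "red": ["en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
            "www.red.com", "www.sciencedaily.com"],
    "fish": ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
             "www.fullredguppy.com", "www.sciencedaily.com"],
}

def intersect_query(query):
    """Return the URLs that appear on the index list of every word in the query."""
    url_sets = [set(INDEX.get(word, [])) for word in query.lower().split()]
    if not url_sets:
        return []
    return sorted(set.intersection(*url_sets))

print(intersect_query("red fish guppy"))
```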

  25. Intersecting Queries - Efficiency • the size of index lists can be large • 'cat' returns over 2.3 billion results • modern search engines are fast • hence, clever algorithms must be developed for optimizing queries • example: intersecting queries

  26. Intersecting Queries - Efficiency • suppose you had two search terms • e.g. red and fish • the query processor has an index list for each token • suppose each list contained 1 billion URLs • let's consider a method for performing the intersecting query • that is, how do we find all pages that occur on both lists?

  27. The Naive Approach • for each entry in the 'red' list • search through the entire 'fish' list • if we find the entry from the red list, then add that to our result red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result:

  28. The Naive Approach • First search: www.sciencedaily.com • do we find it in second list? • yes – add it to result red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com

  29. The Naive Approach • Second search: en.wikipedia.org/wiki/red • do we find it in second list? • no red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com

  30. The Naive Approach • Third search: newsroom.urc.edu • do we find it in second list? • yes, add it to list red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com newsroom.urc.edu

  31. The Naive Approach • Fourth search: www.red.com • do we find it in second list? • no red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com newsroom.urc.edu

  32. The Naive Approach • Fifth search: www.fullredguppy.com • do we find it in second list? • yes – add it to list red: www.sciencedaily.com en.wikipedia.org/wiki/red newsroom.urc.edu www.red.com www.fullredguppy.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: www.sciencedaily.com newsroom.urc.edu www.fullredguppy.com

  33. The Naive Approach • problems? • slow!! • for each URL in left list, we potentially had to compare it to every URL in right list • under our previous assumption (billion size lists), we may have to do up to 1 billion x 1 billion comparisons • even for a powerful computer, this would require a considerable amount of time
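
A minimal sketch of the naive approach, with a counter added to show how many comparisons it performs on the example lists. The function name `naive_intersect` is made up, and one small liberty is taken: the inner search stops as soon as a match is found, rather than always scanning the entire second list.

```python
def naive_intersect(left, right):
    """For each URL in left, scan right from the top looking for it."""
    result = []
    comparisons = 0
    for url in left:
        for candidate in right:
            comparisons += 1
            if url == candidate:
                result.append(url)
                break  # found it, no need to look further down the list
    return result, comparisons

red = ["www.sciencedaily.com", "en.wikipedia.org/wiki/red", "newsroom.urc.edu",
       "www.red.com", "www.fullredguppy.com"]
fish = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]

result, comparisons = naive_intersect(red, fish)
print(result)
print(comparisons)
```

Every URL that is missing from the second list costs a full scan of that list, which is why the worst case grows with the product of the two list lengths.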

  34. Alphabetized Lists • suppose that each list was maintained alphabetically • then we could employ the following approach • place a marker at start of each list • if markers point to same URL: • add URL to result list • move both markers down • otherwise, move the marker whose URL is lexicographically smaller • stop when at least one marker goes off the end of the list

  35. The Sorted Approach • place markers at the start of each list red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result:

  36. The Sorted Approach • do markers point to same URL? • no • since right marker's URL is less than left marker's URL, move right marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result:

  37. The Sorted Approach • do markers point to same URL? • no • since left marker's URL is less than right marker's URL, move left marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result:

  38. The Sorted Approach • do markers point to same URL? • yes • add URL to result • move both markers red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu

  39. The Sorted Approach • do markers point to same URL? • no • since right marker's URL is less than left marker's URL, move right marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu

  40. The Sorted Approach • do markers point to same URL? • yes • add URL to result • move both markers red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu www.fullredguppy.com

  41. The Sorted Approach • do markers point to same URL? • no • since left marker's URL is less than right marker's URL, move left marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu www.fullredguppy.com

  42. The Sorted Approach • do markers point to same URL? • yes • add URL to result • move both markers red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu www.fullredguppy.com www.sciencedaily.com

  43. The Sorted Approach • at least one marker has completed its list, so we can stop • notice that our result contains correct values red: en.wikipedia.org/wiki/red newsroom.urc.edu www.fullredguppy.com www.red.com www.sciencedaily.com fish: en.wikipedia.org/wiki/fish newsroom.urc.edu www.fish.com www.fullredguppy.com www.sciencedaily.com result: newsroom.urc.edu www.fullredguppy.com www.sciencedaily.com
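
The marker walk traced on the preceding slides can be written as a short Python function. This is a sketch of the two-marker idea only; the name `sorted_intersect` and the step counter are additions for illustration, and both input lists are assumed to be in alphabetical order already.

```python
def sorted_intersect(left, right):
    """Intersect two alphabetically sorted lists using one marker per list."""
    result = []
    i = j = 0      # markers: i for the left list, j for the right list
    steps = 0
    while i < len(left) and j < len(right):   # stop when a marker runs off the end
        steps += 1
        if left[i] == right[j]:
            result.append(left[i])            # same URL: keep it, move both markers
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1                            # left URL is smaller: move left marker
        else:
            j += 1                            # right URL is smaller: move right marker
    return result, steps

red = ["en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
       "www.red.com", "www.sciencedaily.com"]
fish = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]

result, steps = sorted_intersect(red, fish)
print(result)
print(steps)
```

On the example lists the loop runs seven times, one for each question-and-move slide in the walkthrough.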

  44. The Sorted Approach • how many comparisons are done? • note that every step involves moving at least one marker • with 1 billion entries in each list, the maximum number of steps is therefore 2 billion • this is considerably less than (1 billion) squared • result: a massive speedup
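
The arithmetic behind that comparison, worked through in Python under the slides' assumption of 1 billion entries per list:

```python
# Assumption from the slides: each of the two lists holds 1 billion URLs.
n = m = 10**9

naive_worst = n * m    # every URL in one list compared with every URL in the other
sorted_worst = n + m   # every step moves at least one marker past one entry
speedup = naive_worst // sorted_worst

print(f"{naive_worst:,}")
print(f"{sorted_worst:,}")
print(f"{speedup:,}")
```

In the worst case the sorted approach needs about 500 million times fewer steps than the naive one.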

  45. The Sorted Approach – Notes • remember: commercial search engines don't fully publicize strategies • hence, some search engines may use alternate approaches for efficient intersections • the previous strategy applies to more than two lists simultaneously • hence, we can search for multiple tokens, rather than just two
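
One way the marker strategy might be extended to any number of lists is sketched below: keep one marker per list, and whenever the markers disagree, advance every marker that is behind the largest URL currently pointed to. The function `intersect_sorted_lists` is a hypothetical illustration, not a description of what any commercial engine does.

```python
def intersect_sorted_lists(lists):
    """Intersect any number of alphabetically sorted lists, one marker per list."""
    if not lists:
        return []
    markers = [0] * len(lists)
    result = []
    while all(m < len(lst) for m, lst in zip(markers, lists)):
        current = [lst[m] for m, lst in zip(markers, lists)]
        largest = max(current)
        if all(url == largest for url in current):
            result.append(largest)                 # all markers agree: keep the URL
            markers = [m + 1 for m in markers]
        else:
            # A marker behind the largest URL cannot be part of a match yet.
            markers = [m + 1 if url < largest else m
                       for m, url in zip(markers, current)]
    return result

# The three example lists, sorted here because the method requires sorted input.
red = sorted(["en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
              "www.red.com", "www.sciencedaily.com"])
fish = sorted(["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
               "www.fullredguppy.com", "www.sciencedaily.com"])
guppy = sorted(["en.wikipedia.org/wiki/guppy", "www.ifga.org", "www.fullredguppy.com",
                "www.sciencedaily.com", "www.tropicalfish.com"])

print(intersect_sorted_lists([red, fish, guppy]))
```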

  46. Example (from text):

  47. Ranking Results • a typical search can produce millions of results • however, we often find what we are looking for in the first few results • according to Optify, first returned result from Google gets clicked 36.4% of time • first page gets clicked through 90% of the time • how does this occur? • via a page ranking system http://searchenginewatch.com/article/2049695/Top-Google-Result-Gets-36.4-of-Clicks-Study

  48. Ranking Results • search providers have different ways of ranking the results of the search • Google: PageRank • proprietary (not all details available) • some details are public (considered next) • the higher the PageRank score, the closer to the top of the search results a page will be http://support.google.com/webmasters/bin/answer.py?hl=en&answer=70897

  49. PageRank • a scoring system • links from other pages add to a page's score • web pages in the figure: Page 1 links to Pages 4 and 5; Page 2 links to Pages 5 and 6; Page 3 links to Pages 5 and 6 • the link from Page 1 adds to Page 4's score • the links from Pages 1,2,3 add to Page 5's score • the links from Page 2 and 3 add to Page 6's score

  50. PageRank • the score from each page is not weighted equally • the higher a page's PageRank, the more important its contribution is • suppose that Page 3 has one incoming link (from Page 1), and Page 4 has one incoming link (from Page 2) • since Page 2's rank is higher than Page 1's, then Page 4's rank will be higher than Page 3's • figure: Page 1 (low rank) links to Page 3; Page 2 (high rank) links to Page 4
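
The weighting idea can be illustrated with a simplified score calculation in the style of the publicly described PageRank formulation. Google's actual ranking is proprietary, so this is only a sketch: the damping factor of 0.85, the fixed number of iterations, and the extra pages "Page A" and "Page B" (added so that Page 2 has more incoming links than Page 1) are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's score along its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share         # a link passes on part of the score
            else:
                # A page with no outgoing links spreads its score evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: Page 2 is linked from two pages, Page 1 from none.
LINKS = {
    "Page 1": ["Page 3"],
    "Page 2": ["Page 4"],
    "Page A": ["Page 2"],
    "Page B": ["Page 2"],
}

rank = pagerank(LINKS)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page 3 and Page 4 each have exactly one incoming link, yet Page 4 ends up with the higher score because the page linking to it is itself ranked higher.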
