
Google’s Search Algorithm



Presentation Transcript


  1. Google’s Search Algorithm

  2. Internet circa 1994 • Problem #1 • Internet small but growing, and already too much information to know where to go • Solution #1 • Centralized list of popular, high-quality websites by subject => Yahoo.com! • Weaknesses of Solution #1 • Subjective • Expensive to build and maintain • Slow to improve • Cannot cover esoteric topics • http://infolab.stanford.edu/~backrub/google.html

  3. Internet circa 1994 • Problem #2 • Search engines based on keyword matching yield many low-quality matches • Also, it is easy for advertisers to create worthless pages that are “perfect matches” for popular search terms • Solution #2 (how Google was born) • Create a better search algorithm that combines keyword matching with a website quality score

  4. Google’s “Quality Score” • How do you measure the quality of a webpage? • Ask people to provide ratings? • Too expensive • Can be gamed • Use the underlying link structure of the internet’s hypertext? • YES!!

  5. PageRank • Webpage A has pages T1, . . . , Tn that point to it. C(A) is the number of outbound links from page A • PR(A) = PR(T1)/C(T1) + . . . + PR(Tn)/C(Tn) • Interpretation: drop a random web surfer on a random webpage, then have him randomly click forward links (without hitting the “back” button). PageRank is the probability that he lands on a given webpage
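
The formula on this slide can be evaluated by repeated substitution: start every page at the same rank and keep recomputing PR(A) from the ranks of the pages that link to A. The following is a minimal sketch of that idea, not Google's implementation. The three-page graph `links` is made up for illustration, and the code assumes every page has at least one outbound link. The formula as published also includes a damping factor, which the slide omits and this sketch leaves out too.

```python
def pagerank(links, iterations=100):
    """Iterate PR(A) = sum of PR(T)/C(T) over every page T that links to A.

    `links` maps each page to the list of pages it links to.
    Assumption: every page has at least one outbound link.
    """
    pages = sorted(links)
    # Start the random surfer on a uniformly random page.
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for source, targets in links.items():
            # Each page splits its rank evenly over its C(source) outbound links.
            share = rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this toy graph B ends up with the lowest rank: its only inbound link comes from A, which splits its rank between two outbound links.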

  6. How to Increase PageRank • Get a link from Yahoo • PR(A) = PR(T1)/C(T1) + . . . + PR(Tn)/C(Tn) • If a site pointing to you has high PageRank (e.g. PR(T1) is high), your PageRank goes up • Get lots of links from “regular” websites • More links mean more terms to add • Be careful of “black hat” strategies

  7. Search Engine Step-by-Step: Crawling and Updating

  8. Search Engine Step-by-Step • URL Server directs crawlers where to crawl • Crawlers fetch the pages at those URLs and hand them to the Store Server • Store Server compresses them into the Repository for storage • Every web page (not web site) is given a unique docID • Indexer and Sorter perform the key work

  9. Search Engine Step-by-Step: Translating into Search Engine Language

  10. Search Engine Step-by-Step • Indexer . . . • . . .Reads the Repository, uncompresses documents, and parses them • . . .Converts each document into a set of word occurrences called hits, which record word, position in document, font size, capitalization • . . .Distributes hits to Barrels • . . .Parses links and stores the ‘to’ and ‘from’ of each link, plus the link text, in the Anchors file

  11. Search Engine Step-by-Step • Sorter . . . • . . .Re-sorts Barrels from docID order to wordID order to generate the inverted index • . . .Produces a list of wordIDs that are filed into the Lexicon so they can be matched to server queries
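
The re-sort the Sorter performs can be pictured as turning a docID-keyed table into a wordID-keyed one. Here is a minimal sketch under simplifying assumptions: barrels are plain Python dictionaries, a hit is reduced to just a word position, and the docIDs and wordIDs are invented for illustration.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {docID: [(wordID, position), ...]} into {wordID: [(docID, position), ...]}."""
    inverted = defaultdict(list)
    # Visit documents in docID order so each word's list stays sorted by docID.
    for doc_id in sorted(forward_index):
        for word_id, position in forward_index[doc_id]:
            inverted[word_id].append((doc_id, position))
    return dict(inverted)


# Hypothetical forward barrel: two documents, three distinct words.
forward = {
    1: [(10, 0), (11, 1)],
    2: [(11, 0), (12, 1), (10, 2)],
}
inverted = invert(forward)
for word_id in sorted(inverted):
    print(word_id, inverted[word_id])
```

With the table keyed by wordID, answering a query for a word is a single lookup instead of a scan over every document.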

  12. Search Engine Step-by-Step: Mapping the Network

  13. Search Engine Step-by-Step • URL Resolver . . . • . . .Reads the Anchors file and converts the URLs in it to docIDs • . . .Generates a database of links (paired docIDs) • . . .Puts anchor text into the Doc Index, associated with the docID it points to
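
A rough sketch of the URL Resolver step follows. It assumes the Anchors file can be modeled as a list of (from URL, to URL, link text) tuples and that a URL-to-docID table already exists. The function name, the `.example` URLs, and the data layout are all hypothetical.

```python
from collections import defaultdict


def resolve_links(anchors, doc_ids):
    """Convert anchor records to docID pairs and group anchor text by target.

    `anchors` is a list of (from_url, to_url, link_text) tuples.
    `doc_ids` maps each URL to its docID (assumed to be known already).
    """
    link_pairs = []
    anchor_text = defaultdict(list)
    for from_url, to_url, text in anchors:
        source, target = doc_ids[from_url], doc_ids[to_url]
        link_pairs.append((source, target))
        # Anchor text is filed under the page the link points TO.
        anchor_text[target].append(text)
    return link_pairs, dict(anchor_text)


# Hypothetical data for illustration only.
doc_ids = {"a.example": 1, "b.example": 2, "c.example": 3}
anchors = [
    ("a.example", "b.example", "butterfly facts"),
    ("c.example", "b.example", "metamorphosis"),
    ("b.example", "a.example", "home"),
]
link_pairs, anchor_text = resolve_links(anchors, doc_ids)
print(link_pairs)
print(anchor_text)
```

The list of docID pairs is the link graph that a PageRank computation needs as input.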

  14. Okay, so how does it work? • Hit Lists • Every web page is re-coded as a series of hits • Plain hits • Capitalized? • Font size (0-6) • Position (1-4096) • Fancy hits • Capitalized? • Font size = 7 • Type: (URL? Title? Metatag? Anchor text?) • Position: (1-256 or 1-16 for anchor text)

  15. Re-coded Web Page: Forward Barrels • Google has re-coded the whole internet. Can these codes give fast search results? • Example hit: Cap: 0, font: 3, position: 173
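
The plain-hit fields from slide 14 are small enough to pack into a single integer. The sketch below assumes a 16-bit layout of 1 capitalization bit, 3 font-size bits, and 12 position bits. That layout is an assumption chosen to fit the ranges on the slide, and clamping positions to 4095 so they fit in 12 bits is also an assumption of this sketch.

```python
MAX_POSITION = 4095  # largest value that fits in 12 bits (assumed clamp)


def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: [cap:1][font:3][position:12]."""
    if not 0 <= font_size <= 6:
        # The slide reserves font size 7 to mark fancy hits.
        raise ValueError("plain hits use font sizes 0-6")
    position = min(position, MAX_POSITION)
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Recover (capitalized, font_size, position) from a packed plain hit."""
    capitalized = bool(hit >> 15)
    font_size = (hit >> 12) & 0b111
    position = hit & 0xFFF
    return capitalized, font_size, position


# The example hit from the slide: Cap: 0, font: 3, position: 173.
packed = pack_plain_hit(False, 3, 173)
print(packed, unpack_plain_hit(packed))
```

Storing each hit in two bytes keeps the per-word cost low, which matters when every word occurrence on every page is recorded.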

  16. Re-sorted to Be Searchable: Inverted Barrels

  17. Searching on Google . . . Finally!

  18. Create IR Score • Hit 1: Title, type-weight = 100 • Hit 2: URL, type-weight = 100 • Hit 3: Large font, type-weight = 40 • Hit 4: Small font, type-weight = 10 • Hit 5: Small font, type-weight = 10 • IR Score = 100 + 100 + 40 + 10 + 10 = 260

  19. IR Score Aside • In actuality, the type-weight is combined with a count-weight • Suppose we have five mentions of “metamorphosis”, all of the same type (with a 10-point type-weight) • This would not yield 50 points. Count-weights decrease after a certain number of mentions, so the IR Score would increase by (10 + 10 + 8 + 5 + 2) = 35 • This prevents gaming, so that Wikipedia can’t repeat “Metamorphosis” 500 times and increase the IR Score by 5000
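
The arithmetic on slides 18 and 19 can be reproduced with a small scoring function. This is only a sketch: the type-weights and the 10 + 10 + 8 + 5 + 2 taper are taken from the slides, but applying the same taper to every hit type, and counting nothing at all after the fifth mention, are assumptions made here for illustration.

```python
from collections import Counter

# Type-weights from slide 18.
TYPE_WEIGHT = {"title": 100, "url": 100, "large_font": 40, "small_font": 10}

# Count factors in tenths, read off the slide's 10 + 10 + 8 + 5 + 2 example.
# Assumption: mentions after the fifth add nothing.
COUNT_FACTOR_TENTHS = [10, 10, 8, 5, 2]


def ir_score(hit_types):
    """Sum type-weight times a tapering count factor for each hit type."""
    total_tenths = 0
    for hit_type, count in Counter(hit_types).items():
        # Slicing past the end is safe, so large counts simply stop adding.
        factor_sum = sum(COUNT_FACTOR_TENTHS[:count])
        total_tenths += TYPE_WEIGHT[hit_type] * factor_sum
    return total_tenths / 10


slide_18 = ["title", "url", "large_font", "small_font", "small_font"]
print(ir_score(slide_18))              # 260.0, matching slide 18
print(ir_score(["small_font"] * 5))    # 35.0, matching slide 19
print(ir_score(["small_font"] * 500))  # still 35.0 under the assumed cutoff
```

Under these assumptions, repeating a word 500 times earns no more than repeating it five times, which is the anti-gaming effect the slide describes.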

  20. Final Results Ranking • Combine IR Score with PageRank to rank web pages matching the search • IR Score ensures that the web page matches what the user searched • The count-weight/type-weight combination ensures a good match without gaming • In general, 5% or fewer of the words on a page should be the actual search word, but that word should appear in the title, the URL (if possible), and in large-font, bolded text • PageRank ensures that the web page is reputable • Typically there are not a lot of links to useless web pages • Google does a lot of work to ensure no cheating on this dimension
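
The slides do not say how the IR Score and PageRank are combined. Purely to illustrate the idea of ranking on both signals at once, the sketch below multiplies them. The product rule, the page names, and the numbers are all assumptions, not Google's method.

```python
def rank_results(candidates):
    """Order candidate pages by IR score times PageRank (assumed combination).

    `candidates` is a list of (page, ir_score, pagerank) tuples.
    """
    return sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)


# Hypothetical candidates: (page, IR score, PageRank).
candidates = [
    ("page-a", 260, 0.2),  # strong match, modest reputation
    ("page-b", 35, 0.4),   # weak match, good reputation
    ("page-c", 260, 0.4),  # strong match, good reputation
]
ordered = rank_results(candidates)
print([page for page, _, _ in ordered])
```

With a product, a page has to do reasonably well on both dimensions to rank highly: a perfect keyword match on a page nobody links to, or a reputable page that barely mentions the word, both fall behind.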

  21. What about Multiple Word Searches? “butterfly” “metamorphosis” “butterfly metamorphosis”

  22. Your Turn • Perform five searches • Make them as obscure or general, as long or as short as you like • Look at the top three results from each search • What made its IR Score high for that search? • Why do you think its PageRank is high? • For a one-word search, count how many times that word (or a derivative of that word) is listed. • Is it in the title? URL? In large font? • What percent of the total words is that word?
