230 likes | 325 Views
’s Search Algorithm. Internet circa 1994. Problem #1 Internet small but growing, and already too much information to know where to go Solution #1 Centralized list of popular, high-quality websites by subject => Yahoo.com! Weaknesses of Solution #1 Subjective Expensive to build and maintain
E N D
Internet circa 1994 • Problem #1 • Internet small but growing, and already too much information to know where to go • Solution #1 • Centralized list of popular, high-quality websites by subject => Yahoo.com! • Weaknesses of Solution #1 • Subjective • Expensive to build and maintain • Slow to improve • Cannot cover esoteric topics • http://infolab.stanford.edu/~backrub/google.html
Internet circa 1994 • Problem #2 • Search engines based on keyword matching yield many low quality matches • Also, easy for advertisers to create “perfect matches” for popular search terms that don’t have any value • Solution #2 (how Google was born) • Create a better search algorithm that combines keyword matching with a website quality score
Google’s “Quality Score” • How do you measure the quality of a webpage? • Ask people to provide ratings? • Too expensive • Can be gamed • Utilize underlying structure of internet from the hyper text? • YES!!
PageRank • Webpage A has T1, . . . , Tn pages that point to it. C(A) is the number of outbound links from page A • PR(A) = PR(T1)/C(T1) + . . . + PR(Tn)/C(Tn) • Interpretation: drop a random web surfer on a random webpage, than have him randomly click forward links (without hitting the ‘back” button). PageRank is the probability that he lands on a given website
How to Increase PageRank • Get a link from Yahoo • PR(A) = PR(T1)/C(T1) + . . . + PR(Tn)/C(Tn) • If a site pointing to you has high page rank (e.g. PR(T1) is high), your page rank goes up • Get lots of links from “regular” websites • More links means more to add • Be careful of “black hat” strategies
Search Engine Step-by-Step Crawling and Updating
Search Engine Step-by-Step • URL Server directs crawlers where to crawl • Crawlers fetch URLs, which are given to Store Server • Store Server compresses them to Repository for storage • Every web page (not web site) is given unique docID • Indexer and Sorter perform the key work
Search Engine Step-by-Step Translating into Search Engine Language
Search Engine Step-by-Step • INDEXER . . . • . . .Reads repository, uncompresses documents, and parses them • . . .Converts documents into set of word occurrences called hits, which record word, position in document, font size, capitalization • . . .Distributes hits to Barrels • . . .Parses links to store ‘to’ and ‘from’ and text of link into the Anchors file
Search Engine Step-by-Step • Sorter . . . • . . .Resorts Barrels from docID to wordID to generate inverted index • . . .Produces list of wordIDs that are filed into the Lexicon so they can be matched to server queries
Search Engine Step-by-Step Mapping the Network
Search Engine Step-by-Step • URL Resolver . . . • . . .Reads Anchors file to convert them to docIDs • . . .Generates database of links (paired docIDs) • . . .Puts anchor text to Doc Index associated with docID it points to
Okay, so how does it work? • Hit Lists • Every web page is re-coded as a series of hits • Plain hits • Capitalized? • Font size (0-6) • Position (1-4096) • Fancy hits • Capitalized? • Font size = 7 • Type: (URL? Title? Metatag? Anchor text?) • Position: (1-256 or 1-16 for anchor text)
Re-coded Web Page Forward Barrels • Google has re-coded the whole internet. Can these codes give fast search results? Cap: 0, font: 3, position: 173
Resorted to be Searchable Inverted Barrels
Create IR Score Hit 1: Title type-weight = 100 Hit 2: URL type-weight = 100 Hit 3: Large font type-weight = 40 Hit 4: Small font type-weight = 10 Hit 5: Small font type-weight = 10 IR Score = 100 + 100 + 40 + 10 + 10 = 260
IR Score Aside • In actuality, type-weight combined with count-weight • Here, we have five mentions of “metamorphosis” in the same type (of 10-point type-weight) • This would not yield 50 points. Count-weights decrease after a certain number. So the IR Score would increase by (10 + 10 + 8 + 5 +2) = 35 • This prevents gaming, so that Wikipedia can’t repeat “Metamorphosis” 500 times and increase the IR Score by 5000
Final Results Ranking • Combine IR Score with PageRank to rank web pages matching the search • IR Score ensures that the web page matches what the user searched • The count-weight/type-weight combination ensures good match without gaming • In general 5% or fewer of the words should be the actual search word, but should be included in the title, the URL (if possible), and in large-font, bolded words • PageRank ensures that the web page is reputable • Typically not a lot of links to useless web pages • Google does a lot of work to ensure no cheating on this dimension
What about Multiple Word Searches? “butterfly” “metamorphosis” “butterfly metamorphosis”
Your Turn • Perform five searches • Make them as obscure or general, as long or as short as you like • Look at the top three results from each search • What made its IR Score high for that search? • Why do you think its PageRank is high? • For a one-word search, count how many times that word (or a derivative of that word) is listed. • Is it in the title? URL? In large font? • What percent of the total words is that word?