230 likes | 325 Views
Explore the birth of Google's PageRank algorithm in response to search engine challenges in 1994. Learn about the revolutionary concept behind PageRank and its role in revolutionizing online search. Discover the step-by-step process on how Google transformed search engine capabilities, making search results more relevant and reliable. Uncover the intricate workings of indexing, sorting, and mapping the network, culminating in the development of the PageRank system that has become a cornerstone of modern internet search. Get insights into the essential components of IR Score and how it collaborates with PageRank to deliver accurate search outcomes while thwarting manipulation attempts. Engage in an immersive journey through the inner workings of search engine technology, revealing the complexities behind web page ranking and the delicate balance between relevance and reputation.
E N D
Internet circa 1994 • Problem #1 • Internet small but growing, and already too much information to know where to go • Solution #1 • Centralized list of popular, high-quality websites by subject => Yahoo.com! • Weaknesses of Solution #1 • Subjective • Expensive to build and maintain • Slow to improve • Cannot cover esoteric topics • http://infolab.stanford.edu/~backrub/google.html
Internet circa 1994 • Problem #2 • Search engines based on keyword matching yield many low quality matches • Also, easy for advertisers to create “perfect matches” for popular search terms that don’t have any value • Solution #2 (how Google was born) • Create a better search algorithm that combines keyword matching with a website quality score
Google’s “Quality Score” • How do you measure the quality of a webpage? • Ask people to provide ratings? • Too expensive • Can be gamed • Utilize underlying structure of internet from the hyper text? • YES!!
PageRank • Webpage A has T1, . . . , Tn pages that point to it. C(A) is the number of outbound links from page A • PR(A) = PR(T1)/C(T1) + . . . + PR(Tn)/C(Tn) • Interpretation: drop a random web surfer on a random webpage, than have him randomly click forward links (without hitting the ‘back” button). PageRank is the probability that he lands on a given website
How to Increase PageRank • Get a link from Yahoo • PR(A) = PR(T1)/C(T1) + . . . + PR(Tn)/C(Tn) • If a site pointing to you has high page rank (e.g. PR(T1) is high), your page rank goes up • Get lots of links from “regular” websites • More links means more to add • Be careful of “black hat” strategies
Search Engine Step-by-Step Crawling and Updating
Search Engine Step-by-Step • URL Server directs crawlers where to crawl • Crawlers fetch URLs, which are given to Store Server • Store Server compresses them to Repository for storage • Every web page (not web site) is given unique docID • Indexer and Sorter perform the key work
Search Engine Step-by-Step Translating into Search Engine Language
Search Engine Step-by-Step • INDEXER . . . • . . .Reads repository, uncompresses documents, and parses them • . . .Converts documents into set of word occurrences called hits, which record word, position in document, font size, capitalization • . . .Distributes hits to Barrels • . . .Parses links to store ‘to’ and ‘from’ and text of link into the Anchors file
Search Engine Step-by-Step • Sorter . . . • . . .Resorts Barrels from docID to wordID to generate inverted index • . . .Produces list of wordIDs that are filed into the Lexicon so they can be matched to server queries
Search Engine Step-by-Step Mapping the Network
Search Engine Step-by-Step • URL Resolver . . . • . . .Reads Anchors file to convert them to docIDs • . . .Generates database of links (paired docIDs) • . . .Puts anchor text to Doc Index associated with docID it points to
Okay, so how does it work? • Hit Lists • Every web page is re-coded as a series of hits • Plain hits • Capitalized? • Font size (0-6) • Position (1-4096) • Fancy hits • Capitalized? • Font size = 7 • Type: (URL? Title? Metatag? Anchor text?) • Position: (1-256 or 1-16 for anchor text)
Re-coded Web Page Forward Barrels • Google has re-coded the whole internet. Can these codes give fast search results? Cap: 0, font: 3, position: 173
Resorted to be Searchable Inverted Barrels
Create IR Score Hit 1: Title type-weight = 100 Hit 2: URL type-weight = 100 Hit 3: Large font type-weight = 40 Hit 4: Small font type-weight = 10 Hit 5: Small font type-weight = 10 IR Score = 100 + 100 + 40 + 10 + 10 = 260
IR Score Aside • In actuality, type-weight combined with count-weight • Here, we have five mentions of “metamorphosis” in the same type (of 10-point type-weight) • This would not yield 50 points. Count-weights decrease after a certain number. So the IR Score would increase by (10 + 10 + 8 + 5 +2) = 35 • This prevents gaming, so that Wikipedia can’t repeat “Metamorphosis” 500 times and increase the IR Score by 5000
Final Results Ranking • Combine IR Score with PageRank to rank web pages matching the search • IR Score ensures that the web page matches what the user searched • The count-weight/type-weight combination ensures good match without gaming • In general 5% or fewer of the words should be the actual search word, but should be included in the title, the URL (if possible), and in large-font, bolded words • PageRank ensures that the web page is reputable • Typically not a lot of links to useless web pages • Google does a lot of work to ensure no cheating on this dimension
What about Multiple Word Searches? “butterfly” “metamorphosis” “butterfly metamorphosis”
Your Turn • Perform five searches • Make them as obscure or general, as long or as short as you like • Look at the top three results from each search • What made its IR Score high for that search? • Why do you think its PageRank is high? • For a one-word search, count how many times that word (or a derivative of that word) is listed. • Is it in the title? URL? In large font? • What percent of the total words is that word?