
The Anatomy of a Large-Scale Hypertextual Web Search Engine (Sergey Brin & Lawrence Page)


Presentation Transcript


  1. The Anatomy of a Large-Scale Hypertextual Web Search Engine (Sergey Brin & Lawrence Page) Anushka Anand & Paul Varkey (Spring 05) CS583 University of Illinois at Chicago

  2. The Challenge • High-quality search results • high quality ≠ completeness of the index (why?) • scaling up traditional techniques – crawling, indexing and query answering • an uncontrolled, heterogeneous collection of documents

  3. The Approach • exploiting additional information present in hypertext • link structure • link text

  4. The Problems • People surf the Web's link graph – often starting from popular human-maintained indices (e.g. Yahoo!) • subjective • expensive to build and maintain • slow to improve • not exhaustive • Existing techniques were based on keyword matching • too many low-quality (irrelevant) matches • easily exploited by advertisers

  5. System Features • PageRank • utilizes the link structure of the Web • calculates a quality ranking for each Web page • Anchor text • the text of a link (anchor) is associated with the page that the link points to • often describes a page more accurately than the page itself • anchors may exist for documents that cannot be indexed – images, programs, databases • makes it possible to return pages that have never been crawled • related problem – pages that never existed may be returned • solution – sort (rank) the results well

  6. System Features • Other Features • proximity (hit location) information • font information (size, bold etc.) • full raw HTML repository

  7. PageRank • an objective measure of a page's citation importance • corresponds well with people's subjective idea of importance • hence, an excellent way to prioritize results • calculated using the Web's citation (link) graph • maps of hyperlinks allow rapid PageRank calculation • the citation (backlink) count is used to approximate importance or quality • PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • where • page ‘A’ has pages T1 … Tn which point to it, • ‘d’ is a damping factor (usually set to 0.85), and • ‘C(A)’ is the number of links going out of A

  8. PageRank • intuitive justifications • model of user behavior – “random surfer” model • high PageRank for a page corresponds to large number and/or high rank of pages pointing to it • PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • where • page ‘A’ has pages T1 … Tn which point to it, • ‘d’ is a damping factor, and • ‘C(A)’ is the number of links going out of A
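
The formula above can be computed iteratively. Below is a minimal sketch in Python of that iteration, assuming a toy link graph, d = 0.85 and a fixed number of iterations; these values and the dict-based graph representation are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of the iterative PageRank computation implied by the
# formula above. The graph, damping factor, and iteration count are
# illustrative assumptions.
def pagerank(links, d=0.85, iterations=20):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}                 # initial guess
    for _ in range(iterations):
        new_pr = {p: (1 - d) for p in pages}     # the (1-d) term
        for page, targets in links.items():
            if targets:
                share = pr[page] / len(targets)  # PR(T)/C(T)
                for t in targets:
                    new_pr[t] += d * share       # d * sum of shares
        pr = new_pr
    return pr

# Example: A is pointed to by B and C; A points back to both.
print(pagerank({"B": ["A"], "C": ["A"], "A": ["B", "C"]}))
```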

  9. Google Architecture Overview • [Architecture diagram showing the main components and data stores: URL Server, Crawler, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Lexicon, Barrels, Doc Index, Sorter, Searcher, PageRank]

  10. Google Architecture (1) • URL Server • Sends lists of URLs to the distributed Crawlers • Crawler • Downloads webpages • Store Server • Compresses the fetched pages • Stores them in the Repository

  11. Google Architecture (2) • Indexer • Reads the Repository, uncompresses pages and parses them • Assigns a docID whenever a new URL is parsed • Distributes hits into Barrels, creating a partially sorted forward index • Parses out all links in every webpage and stores important information about them in the Anchors file

  12. Google Architecture (3) • URL Resolver • Converts relative URLs to absolute URLs and then to docIDs • Puts the anchor text into the forward index, associated with the docID that the anchor points to • Generates a database of Links (pairs of docIDs) • Links • Used to compute PageRank for all documents
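
A minimal sketch of the URL Resolver's first job, converting relative URLs into absolute URLs and docIDs. It leans on Python's urljoin; the plain dict used for docID assignment is an illustrative stand-in for the checksum-sorted URL index described on slide 16.

```python
# Sketch only: urljoin handles the relative -> absolute conversion;
# doc_ids assigns sequential docIDs to previously unseen URLs.
from urllib.parse import urljoin

doc_ids = {}          # absolute URL -> docID (illustrative)

def resolve(base_url, relative_url):
    absolute = urljoin(base_url, relative_url)
    doc_id = doc_ids.setdefault(absolute, len(doc_ids))
    return absolute, doc_id

print(resolve("http://www.uic.edu/courses/", "cs583.html"))
# -> ('http://www.uic.edu/courses/cs583.html', 0)
```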

  13. Google Architecture (4) • Sorter • Sorts the Barrels (in place) by wordID and generates the inverted index • Lexicon • Built by a program called DumpLexicon from the sorter's output together with the lexicon produced by the indexer • The list of unique words found • Searcher • Run by a web server • Uses the Lexicon, the inverted index and PageRank to answer queries

  14. Repository • Contains the full HTML of every webpage • Compressed using zlib (about a 3:1 compression ratio) • 53.5 GB compressed = 147.8 GB uncompressed • Documents are stored sequentially as packets, each prefixed with docID, ecode, urllen, pagelen and the URL • [Diagram: each record in the Repository is a sync marker, a length, and the compressed packet]
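
A hedged sketch of how such a repository packet might be packed and compressed with zlib. The field names (docID, ecode, urllen, pagelen, URL, page) follow the slide; the exact byte widths and the sync marker value are assumptions for illustration, not the production format.

```python
# Sketch of one repository record: header + URL + page, zlib-compressed,
# prefixed with a sync marker and the compressed length.
import struct, zlib

def pack_packet(doc_id, ecode, url, page_html):
    url_bytes, page_bytes = url.encode(), page_html.encode()
    # assumed widths: docID 4 bytes, ecode 4, urllen 2, pagelen 4
    header = struct.pack("<IIHI", doc_id, ecode, len(url_bytes), len(page_bytes))
    packet = header + url_bytes + page_bytes
    compressed = zlib.compress(packet)            # roughly 3:1 on typical HTML
    return b"SYNC" + struct.pack("<I", len(compressed)) + compressed

record = pack_packet(1, 0, "www.uic.edu", "<html>...</html>")
print(len(record))
```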

  15. Document Index (1) • Stores the current document status, a pointer into the Repository, a document checksum and various statistics, ordered by docID • Status of a document: • Crawled – contains a pointer to a variable-length file called docinfo that holds its URL and title • Not yet crawled – contains a pointer to the URLlist, with just the URL • A record can be fetched in 1 disk seek during a search

  16. Document Index (2) • [Diagram: entries for docID 1 (www.uic.edu) and docID 2 (www.stanford.edu) pointing into the Repository; crawled entries point to docinfo records holding the URL and title ("U of Illinois Chicago", "Stanford University"), uncrawled entries point to the URLlist] • Converting URLs -> docIDs: a list of URL checksums with their corresponding docIDs, sorted by checksum • To find a docID, compute the URL's checksum and do a binary search • URLs are converted to docIDs in batch mode by merging with this file – this is what the URL Resolver does
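
A minimal sketch of the checksum-plus-binary-search lookup described above. crc32 here stands in for whatever checksum function was actually used; the two example URLs come from the slide's diagram.

```python
# Keep (checksum, docID) pairs sorted by checksum and binary-search them.
import bisect
from zlib import crc32

checksum_index = sorted([(crc32(b"www.uic.edu"), 1),
                         (crc32(b"www.stanford.edu"), 2)])

def url_to_docid(url):
    key = crc32(url.encode())
    i = bisect.bisect_left(checksum_index, (key,))
    if i < len(checksum_index) and checksum_index[i][0] == key:
        return checksum_index[i][1]
    return None   # URL not yet in the index (not crawled)

print(url_to_docid("www.stanford.edu"))   # -> 2
```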

  17. Hit Lists • A list of occurrences of a particular word in a particular document, with position, font and capitalization information • Hand-optimized compact encoding – less space than a simple encoding and less bit manipulation than Huffman coding • Each hit takes 2 bytes: • Plain: cap (1 bit), imp (3 bits), position (12 bits) • Fancy: cap (1 bit), imp = 7 (3 bits), type (4 bits), position (8 bits) • Anchor: cap (1 bit), imp = 7 (3 bits), type (4 bits), docID hash (4 bits), position (4 bits) • To save space, the length of the hit list is combined with the wordID in the forward index (8 bits) and with the docID in the inverted index (5 bits) • Types of hit: • Fancy – hits in a URL, title, anchor text or meta tag • Anchor – the last 8 position bits are split into 4 bits for the position within the anchor and 4 bits for a hash of the docID the anchor occurs in • Plain – everything else • Font size (imp) is relative to the rest of the document and uses 3 bits; the value 7 (111) signals a fancy or anchor hit
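
As a concrete illustration of the 2-byte layout, here is a sketch that packs and unpacks a plain hit using the bit widths above (1 bit capitalization, 3 bits font importance, 12 bits position). It demonstrates the layout only; it is not the production encoder.

```python
# Pack a "plain" hit into 16 bits: cap | imp | position.
def pack_plain_hit(capitalized, importance, position):
    assert 0 <= importance < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (importance << 12) | position

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

h = pack_plain_hit(True, 3, 42)
print(hex(h), unpack_plain_hit(h))   # -> 0xb02a (True, 3, 42)
```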

  18. Forward Index • Forward Barrels: 43 GB in total • Partially sorted and stored in 64 barrels • Each barrel holds a range of wordIDs • If a document contains words that fall into a particular barrel, that barrel stores the docID followed by a list of wordIDs with a hit list for each word, terminated by a null wordID • Each wordID is stored as a relative difference from the minimum wordID of the barrel it falls into • This leaves 24 bits for the wordIDs in the unsorted barrels, plus 8 bits for the hit list length
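
A minimal sketch of the barrel routing and relative wordID encoding described above, assuming each barrel covers a contiguous range of 2^24 wordIDs; the in-memory dict stands in for the on-disk barrels.

```python
# Route each wordID to a barrel by range and store it relative to the
# barrel's minimum wordID so 24 bits suffice.
NUM_BARRELS = 64
BARREL_RANGE = 1 << 24          # wordIDs per barrel (assumed)

def add_to_forward_index(barrels, doc_id, word_hits):
    """word_hits maps wordID -> list of packed hits for this document."""
    for word_id, hits in word_hits.items():
        b = word_id // BARREL_RANGE            # which barrel this word falls into
        relative_id = word_id % BARREL_RANGE   # fits in 24 bits
        barrels.setdefault(b, []).append((doc_id, relative_id, len(hits), hits))

barrels = {}
add_to_forward_index(barrels, doc_id=7,
                     word_hits={123: [0xB02A], (1 << 24) + 5: [0x1000]})
print(barrels)
```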

  19. Inverted Index • Inverted Barrels: 41 GB in total; Lexicon: 293 MB • The same barrels as the forward index, after being processed by the Sorter • For every valid wordID, the Lexicon contains a pointer into the barrel that the wordID falls into • The pointer points to a doclist of docIDs (27 bits each) together with their corresponding hit lists (hit count in 5 bits); this doclist represents all occurrences of that word in all documents • Two sets of inverted barrels are kept: • one for hit lists that include title or anchor hits, ordered by a ranking of the word occurrence • one for all hit lists, ordered by docID
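
A minimal sketch of the lookup path implied above: the lexicon maps a wordID to a pointer into an inverted barrel, and the barrel entry is the doclist of (docID, hit list) pairs. The in-memory dict/list layout is an illustrative stand-in for the on-disk format.

```python
# lexicon: wordID -> (barrel, offset); barrel entry: doclist of (docID, hits).
lexicon = {123: ("barrel_0", 0)}
inverted_barrels = {"barrel_0": [
    [(7, [0xB02A]), (42, [0x1000, 0x1001])],   # doclist for word 123
]}

def doclist(word_id):
    barrel, offset = lexicon[word_id]
    return inverted_barrels[barrel][offset]

print(doclist(123))   # all occurrences of word 123 across documents
```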

  20. Crawling • Distributed crawling system – the URL Server sends lists of URLs to several crawlers (typically 3) fetching from the Internet in parallel • Each crawler uses asynchronous I/O and a number of queues, moving page fetches from state to state • Challenges – performance & reliability • DNS lookup is a major performance stress, so each crawler keeps its own DNS cache
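
A minimal sketch of the per-crawler DNS cache idea: memoize hostname lookups so repeated fetches from the same host skip the resolver. The use of socket.gethostbyname is an assumption for illustration.

```python
# Memoized hostname -> IP lookups for a single crawler process.
import socket

dns_cache = {}

def resolve_host(hostname):
    if hostname not in dns_cache:
        dns_cache[hostname] = socket.gethostbyname(hostname)
    return dns_cache[hostname]

# resolve_host("www.uic.edu")  # a second call returns the cached address
```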

  21. Indexing • Parsing • must handle many kinds of errors • Indexing • after parsing, documents are encoded into barrels • every word is converted to a wordID using an in-memory hash table – the lexicon • new additions to the lexicon are logged to a file • after conversion, word occurrences are translated into hit lists and written into the forward barrels • Sorting • the barrels are sorted by wordID to produce the inverted indices
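
A minimal sketch of the in-memory lexicon used during indexing: each previously unseen word gets the next wordID, and new additions are appended to a log file. The log file name is an illustrative assumption.

```python
# Assign wordIDs via an in-memory hash table; log new words for later merging.
lexicon = {}

def word_to_id(word, log_path="new_words.log"):
    if word not in lexicon:
        lexicon[word] = len(lexicon)
        with open(log_path, "a") as log:          # new additions are logged
            log.write(f"{word}\t{lexicon[word]}\n")
    return lexicon[word]

print([word_to_id(w) for w in "the anatomy of the web".split()])
# -> [0, 1, 2, 0, 3]
```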

  22. Searching – Google Query Evaluation • 1. Parse the query • 2. Convert the words into wordIDs • 3. Seek to the start of the doclist in the short barrel for every word • 4. Scan through the doclists until there is a document that matches all the search terms • 5. Compute the rank of that document for the query • 6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4 • 7. If we are not at the end of any doclist, go to step 4 • 8. Sort the documents that have been matched by rank and return the top k
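
A minimal sketch of steps 4–5: scan the doclists in parallel until a docID appears in all of them, then hand that document off for ranking. The doclists are assumed to be sorted by docID here, which matches the full barrels; the short barrels are actually ordered by word-occurrence ranking.

```python
# Yield docIDs that appear in every sorted doclist (one list per query word).
def matching_docs(doclists):
    iters = [iter(dl) for dl in doclists]
    current = [next(it, None) for it in iters]
    while None not in current:
        top = max(current)
        if all(c == top for c in current):
            yield top                              # document matches every term
            current = [next(it, None) for it in iters]
        else:
            # advance every list that is behind the largest docID seen so far
            current = [c if c >= top else next(it, None)
                       for c, it in zip(current, iters)]

print(list(matching_docs([[1, 3, 7, 9], [3, 4, 7], [2, 3, 7, 8]])))  # -> [3, 7]
```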

  23. The Ranking System • Single-word query • calculated using the document’s hit list for that word • type-weight vector: weights assigned to the different types of hits (title, anchor, URL, large font, etc.) • count-weight vector: • count = number of hits of each type • additional hits increase the count-weight with diminishing returns • IR score = type-weight ∙ count-weight • Final Rank = IR score combined with PageRank
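
A minimal sketch of the single-word IR score as a dot product of a type-weight vector and a count-weight vector. The specific weights and the log-shaped damping of counts are assumptions; the slide only states that additional hits help with diminishing returns.

```python
# IR score = type-weight . count-weight, with illustrative weights.
import math

TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 3.0, "plain": 1.0}

def ir_score(hit_counts):
    """hit_counts maps hit type -> number of hits of that type."""
    score = 0.0
    for hit_type, count in hit_counts.items():
        count_weight = math.log1p(count)      # more hits help, but less and less
        score += TYPE_WEIGHTS[hit_type] * count_weight
    return score

print(ir_score({"title": 1, "plain": 12}))
```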

  24. The Ranking System • Multiple word query • multiple hit-lists scanned simultaneously • proximity – nearby hits are weighted higher • count-weight vector: computed using counts for every type and proximity • type-prox-weight vector: for every type and proximity pair • IR score = type-prox-weight ∙ count-weight
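
A minimal sketch of the proximity component: bin the distance between two query words' hit positions and weight nearby pairs more heavily. The bin boundaries and weights are illustrative assumptions.

```python
# Nearby hits get a larger proximity weight; far-apart hits get less.
PROX_WEIGHTS = [8.0, 4.0, 2.0, 1.0]     # phrase-like, close, same region, far apart

def proximity_weight(pos_a, pos_b):
    distance = abs(pos_a - pos_b)
    if distance <= 1:
        return PROX_WEIGHTS[0]
    if distance <= 4:
        return PROX_WEIGHTS[1]
    if distance <= 16:
        return PROX_WEIGHTS[2]
    return PROX_WEIGHTS[3]

print(proximity_weight(10, 11), proximity_weight(10, 200))   # -> 8.0 1.0
```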

  25. Performance • Storage • Compressed repository: 53.5 GB • Total without repository: 55.2 GB • Short inverted index: 4.1 GB • System performance • crawling time is difficult to quantify • downloading 26 million pages took roughly 9 days • the indexer and the crawlers ran in parallel • the indexer was optimized enough to run faster than the crawlers, so it was not a bottleneck • the sorters run completely in parallel – using four machines, sorting took 24 hours

  26. Questions
