The Anatomy of a Large-Scale Hypertextual Web Search Engine
S. Brin, L. Page

Presenter: Abhishek Taneja

Why was Google introduced?
• Existing search engines had problems. For example:
• Human-maintained lists and indices
-- subjective
-- expensive to build and maintain
-- slow to improve
-- cannot cover all esoteric topics
• Automated search engines
-- rely on keyword matching
-- are easy to mislead

Some facts about Google
• Why is Google called Google?
-- The name is a common spelling of googol, or 10^100, which fits the goal of building very large-scale search engines.
• This presentation covers the Google of 1997. The paper describes many of the modules it used then, so a lot is known about that system. Much less is known about the Google of 2010, because most of its modules are proprietary.

Goals behind Google
• Scalability
-- Number of pages indexed
-- Number of queries handled
• Quality
-- Provide high-quality search results
• Eliminating junk results
-- Use link structure and anchor text for quality filtering
• Push more development and understanding into the academic realm
• Increase usability
• Set up a Spacelab-like environment where researchers or even students can propose and run interesting experiments on Google's large-scale web data

Features of the Google Search Engine
• Uses the link structure of the web to calculate a quality ranking for each web page, called PageRank.
• The probability that a random surfer visits a page is its PageRank. It approximates the page's importance and quality.
• PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
-- PR(A) is the PageRank of page A.
-- T1…Tn are the pages that link to page A.
-- PR(Ti) is the PageRank of page Ti.
-- C(Ti) is the number of links going out of page Ti.
-- PR(Ti)/C(Ti) is summed over every page Ti that links to A.
-- d is a damping factor, nominally set to 0.85. At each page the random surfer follows a link with probability d and, with probability 1 - d, gets bored and requests another random page.
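The PageRank formula can be evaluated by simple iteration. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the dictionary-based graph, the starting value of 1.0 per page and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical input format).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # incoming[p] lists the pages that link to p; duplicate links count once.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            incoming[target].append(source)

    # C(T): number of distinct links going out of page T.
    out_count = {p: len(set(links.get(p, []))) for p in pages}

    pr = {p: 1.0 for p in pages}  # assumed starting value
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in incoming[p])
            for p in pages
        }
    return pr


if __name__ == "__main__":
    graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 3))
```

In this small graph, page C receives links from both A and B and ends up with the highest rank, while B, which nothing links to, stays at the minimum value of 1 - d.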

Features of Google (cont.)
• Anchor text
-- Google associates the text of a link with the page the link points to, not only with the page that contains it. As a result, anchors can describe documents that a text-based search engine cannot index, such as images, programs and databases.
• The search engine has location information for all hits, so it makes extensive use of proximity in search.
• Google keeps track of some visual presentation details, such as the font size of words. Words in <h1> or <b> tags are weighted higher than other words.
• The full raw HTML of pages is available in a repository.
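The anchor text idea can be illustrated with a small sketch. It assumes links have already been extracted as (target URL, anchor text) pairs; the function name `index_anchor_text` and the input format are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def index_anchor_text(pages):
    """Credit each anchor word to the page the link points to.

    `pages` maps a source URL to a list of (target_url, anchor_text) pairs
    (an assumed, already-parsed representation of the links on that page).
    """
    anchor_index = defaultdict(list)
    for source_url, links in pages.items():
        for target_url, anchor_text in links:
            for word in anchor_text.lower().split():
                # The word is stored under the target, with the source kept for reference.
                anchor_index[target_url].append((word, source_url))
    return dict(anchor_index)


if __name__ == "__main__":
    pages = {"a.html": [("logo.png", "Company logo")]}
    print(index_anchor_text(pages))
```

Here the image logo.png contains no text of its own, yet it becomes searchable under the words "company" and "logo" because a page links to it with that anchor text.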

Google Architecture

Google Architecture (cont.)
• Crawlers: Multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and about 300 connections open at once.
• URL server: Sends lists of URLs to be fetched to the crawlers.
• Store server: Compresses and stores web pages in the repository.
• Anchors file: The indexer parses out all links in every web page and stores important information about them in this file.
• URL resolver: Converts relative URLs into absolute URLs and then into doc IDs.
• Indexer: Reads the repository, uncompresses the documents and parses them. Stores link information in the anchors file and builds hit lists.
• Repository: Contains the entire HTML of every web page. Each document is prefixed by doc ID, length, and URL.
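A repository record of this kind (a prefix followed by the compressed page) might be sketched as below. The exact byte layout is an assumption: the `HEADER` format, the field order, and the function names `pack_record` and `unpack_record` are hypothetical and chosen only to show the idea of a length-prefixed, zlib-compressed record.

```python
import struct
import zlib

# Hypothetical header layout: doc ID (8 bytes), URL length (4 bytes),
# compressed body length (4 bytes), little-endian with no padding.
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Serialize one document as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(buf, offset=0):
    """Read one record at `offset`; return (doc_id, url, html, next_offset)."""
    doc_id, url_len, body_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    body_start = start + url_len
    html = zlib.decompress(buf[body_start:body_start + body_len]).decode("utf-8")
    return doc_id, url, html, body_start + body_len


if __name__ == "__main__":
    repo = pack_record(7, "http://example.com/", "<html>hi</html>")
    repo += pack_record(8, "http://example.com/b", "<p>x</p>")
    offset = 0
    while offset < len(repo):
        doc_id, url, html, offset = unpack_record(repo, offset)
        print(doc_id, url, html)
```

Because every record carries its own lengths, documents can be stored one after another and read back sequentially without a separate index.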

Google Architecture (cont.)
• Indexer: Parses documents and distributes hit lists into “barrels”.
• URL resolver: Maps absolute URLs into doc IDs stored in the doc index. Stores anchor text in the barrels. Generates a database of links (pairs of doc IDs).
• Barrels: Partially sorted forward indexes, sorted by doc ID. Each barrel stores hit lists for a given range of word IDs.
• Lexicon: In-memory hash table that maps words to word IDs. Each entry contains a pointer to the doc list in the barrel that its word ID falls into.
• Sorter: Creates the inverted index, from which the document list (doc IDs and hit lists) can be retrieved given a word ID.
• Doc index: Index keyed by doc ID. Each entry includes information such as a pointer to the document in the repository, a checksum, statistics, and status. If the document has been crawled, the entry also contains URL information; if not, it contains just the URL.
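The relationship between the lexicon, the forward index and the inverted index can be shown with a toy sketch. It keeps everything in ordinary dictionaries and uses word positions as hits; the function names `build_forward_index` and `invert` are hypothetical, and the real system stores these structures in compact on-disk barrels rather than in memory.

```python
from collections import defaultdict


def build_forward_index(docs):
    """Return (lexicon, forward) for `docs`, a dict of doc ID -> text.

    lexicon maps word -> word ID; forward maps doc ID -> {word ID: [positions]}.
    """
    lexicon = {}
    forward = {}
    for doc_id in sorted(docs):
        hits = defaultdict(list)
        for position, word in enumerate(docs[doc_id].lower().split()):
            # Assign the next free word ID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return lexicon, forward


def invert(forward):
    """Regroup the forward index by word ID: word ID -> [(doc ID, [positions])]."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(inverted)


if __name__ == "__main__":
    lexicon, forward = build_forward_index({1: "web search engine", 2: "search the web"})
    inverted = invert(forward)
    print(inverted[lexicon["search"]])
```

Looking up a query word then takes two steps: the lexicon turns the word into a word ID, and the inverted index returns every doc ID and hit list for that word ID.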

Google Architecture (cont.)
• Inverted barrels: There are two kinds. Short barrels contain hit lists that include title or anchor hits. Long barrels contain all hit lists.
• DumpLexicon: Combines the list of word IDs produced by the sorter with the lexicon created by the indexer to build the new lexicon used by the searcher. The lexicon stores about 14 million words.

Results and Performance
• The performance of a search engine depends on the quality of its search results, and that quality is judged by its users.
-- Feedback collected from users and researchers indicated that the results were of good quality. For example, Google at that time was able to produce top search results with no broken links.
• Google also placed heavy importance on the proximity of word occurrences. For example, a search for Bill Clinton does not return independent results for Bill and for Clinton.
• Storage efficiency was achieved with compression. The paper weighs zlib against bzip and chooses zlib for its speed.
• System performance was improved by optimizing the indexer, running sorters in parallel, and optimizing the data structures that store the information.
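One way to picture the proximity idea is to measure how close two query words occur in a document, using the word positions stored in hit lists. The sketch below is only an illustration under that assumption: the function name `min_distance` is hypothetical, and the slides do not say how Google turns proximity into a score.

```python
def min_distance(positions_a, positions_b):
    """Return the smallest gap between two sorted lists of word positions.

    Returns None if either word does not occur in the document.
    """
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance the pointer at the smaller position; moving the other one
        # could only widen the gap to the current element.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


if __name__ == "__main__":
    # Hypothetical positions of "bill" and "clinton" in two documents.
    print(min_distance([0, 10], [1, 50]))  # words are adjacent in this document
    print(min_distance([3], [40]))         # words are far apart in this one
```

A ranking function could then favor the first document, where the two words are adjacent, over the second, where they merely both appear.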

Conclusions
• Google is designed to be a scalable search engine.
• Its primary goal is to provide high-quality search results over a rapidly growing World Wide Web.
• Google employs a number of techniques to improve search quality, including PageRank, anchor text and proximity information.
• Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

Pros of the paper
• A landmark paper that gives insight into Google's search engine architecture.
• The first known public description of PageRank.
• Proposes new ways of ranking based on link structure, which come very close to the notion of “relevant” documents.

Cons of the paper
• The paper describes the Google of 1997, and a number of the goals it proposed were not implemented afterwards. For example, Google did not become part of the academic realm.
• Judging the quality of a web page only by PageRank and anchor text data is not sufficient.
