
“The Anatomy of a Large-Scale Hypertextual Web Search Engine” ‘98

Google case. Angela Fogarolli, afogarol@dit.unitn.it, 07/06/2006.


Presentation Transcript


  1. “The Anatomy of a Large-Scale Hypertextual Web Search Engine” ’98. Google case. Angela Fogarolli, afogarol@dit.unitn.it, 07/06/2006

  2. Roadmap • Google design goals • System features: PageRank, anchor text, others • System architecture • System functionalities: crawling, indexing, searching • Conclusion

  3. Google design goals • Improve search quality • Improve search engine usability • Improve scalability on large web data.

  4. System feature: PageRank PageRank is the probability that a random surfer visits a page. It is based on the citation (link) graph of the web. • It does not count links from all pages equally: each link is normalized by the number of links on the page it comes from. • PageRank recursively propagates weights through the link structure of the web.

  5. PageRank Calculation PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • Page A has pages T1…Tn which point to it. • d is a damping factor, usually set to 0.85. • C(A) is the number of links going out of page A, so C(Ti) counts the outgoing links of Ti. • A page has a high PageRank if many pages point to it, or if some pages that point to it have a high PageRank themselves.
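The formula on slide 5 can be evaluated by simple iteration. The following is a minimal sketch, not the presentation's own code: the three-page graph (A, B, C), the function name `pagerank`, and the fixed iteration count are assumptions made for illustration, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link (no dangling pages).
    """
    pr = {page: 1.0 for page in links}  # initial guess for every page
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Sum the contributions PR(T) / C(T) of every page T pointing to `page`.
            incoming = sum(
                pr[t] / len(out) for t, out in links.items() if page in out
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A -> B, C; B -> C; C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph C is linked from both A and B, so it ends up with a higher rank than B, which is linked only from A.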

  6. System feature: Anchor Text Most search engines associate the text of a link only with the page the link is on. Google additionally associates it with the page the link points to. Advantages: • Anchors often provide more accurate descriptions of web pages than the pages themselves. • Anchors may exist for documents which cannot be indexed by a text-based search engine (images, programs and databases).
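The idea on slide 6 can be sketched as grouping anchor text by link target. This is only an illustration: the function name `collect_anchor_text`, the tuple layout, and the example URLs are assumptions, not part of the presentation.

```python
from collections import defaultdict


def collect_anchor_text(links):
    """Group anchor text by the page each link points to.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    anchors_for_target = defaultdict(list)
    for _source_url, target_url, anchor_text in links:
        anchors_for_target[target_url].append(anchor_text)
    return dict(anchors_for_target)


# Hypothetical links pointing at an image, which has no indexable text itself.
links = [
    ("http://a.example/", "http://img.example/logo.png", "company logo"),
    ("http://b.example/", "http://img.example/logo.png", "example logo image"),
]
anchor_index = collect_anchor_text(links)
```

The image can now be found through the words other pages used to describe it, even though it was never parsed.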

  7. System features: Others • Extensive use of proximity in search: Google keeps location information for all hits. • Presentation details such as font size are taken into account when weighting hits.

  8. System Architecture • Web pages are fetched by several distributed crawlers. • The fetched web pages are sent to the storeserver, which compresses them and stores them in a repository. • Each parsed web page has an ID number called a docID. • The indexer reads the repository, uncompresses the documents and parses them. Each doc is converted into a set of hits. The indexer distributes the hits into a set of barrels. It also extracts the links in each web page and stores them in an anchors file.
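The storeserver and repository steps on slide 8 might look roughly like the sketch below. The class name `Repository`, its methods, the in-memory dictionaries, and the use of `zlib` as the compressor are all assumptions chosen for illustration.

```python
import zlib


class Repository:
    """Toy repository: compresses pages and keys them by docID."""

    def __init__(self):
        self._docs = {}     # docID -> (url, compressed page bytes)
        self._doc_ids = {}  # url -> docID

    def store(self, url, html):
        # Reuse the docID if the URL was seen before, otherwise assign the next one.
        doc_id = self._doc_ids.setdefault(url, len(self._doc_ids))
        self._docs[doc_id] = (url, zlib.compress(html.encode("utf-8")))
        return doc_id

    def load(self, doc_id):
        # What an indexer would do first: fetch and uncompress the document.
        url, blob = self._docs[doc_id]
        return url, zlib.decompress(blob).decode("utf-8")


repo = Repository()
page = "<html><body>hello web</body></html>"
doc_id = repo.store("http://example.com/", page)
```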

  9. System Architecture • The URLresolver reads the anchors file and converts relative URLs into absolute URLs and then into docIDs. It generates a database of links, which are pairs of docIDs. • The link database is used to compute PageRank. • The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index.
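The URLresolver step on slide 9 can be illustrated with the standard library's `urllib.parse.urljoin`. The function name `resolve_links`, the docID assignment scheme, and the example URLs are assumptions for this sketch.

```python
from urllib.parse import urljoin


def resolve_links(anchors, doc_ids):
    """Turn (source_url, href) pairs into (source docID, target docID) pairs.

    `doc_ids` maps absolute URLs to docIDs and is extended in place.
    """
    link_pairs = []
    for source_url, href in anchors:
        # Assign a new docID the first time a URL is seen.
        source_id = doc_ids.setdefault(source_url, len(doc_ids))
        absolute = urljoin(source_url, href)  # relative URL -> absolute URL
        target_id = doc_ids.setdefault(absolute, len(doc_ids))
        link_pairs.append((source_id, target_id))
    return link_pairs


doc_ids = {"http://example.com/a/index.html": 0}
anchors = [
    ("http://example.com/a/index.html", "../b/page.html"),
    ("http://example.com/a/index.html", "other.html"),
]
pairs = resolve_links(anchors, doc_ids)
```

The resulting list of docID pairs is the kind of link database that a PageRank computation would take as input.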

  10. System functionalities: Crawling Google has a fast distributed crawling system that can crawl over 100 web pages per second at peak using 4 crawlers. A single URLServer serves lists of URLs to a number of crawlers (typically 3).
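The division of labour on slide 10, one server handing out URL lists to several crawlers, can be sketched as a shared queue. The class below, its `next_batch` method, the batch size, and the crawler names are hypothetical; no real fetching is performed.

```python
from collections import deque


class URLServer:
    """Toy URL server: hands out batches of URLs from a single queue."""

    def __init__(self, urls):
        self._queue = deque(urls)

    def next_batch(self, size):
        # Pop up to `size` URLs; returns a shorter list when the queue runs dry.
        batch = []
        while self._queue and len(batch) < size:
            batch.append(self._queue.popleft())
        return batch


server = URLServer([f"http://example.com/page{i}" for i in range(7)])
crawlers = ["crawler-0", "crawler-1", "crawler-2"]
# Each crawler asks the single server for its own list of URLs.
assignments = {name: server.next_batch(2) for name in crawlers}
```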

  11. System functionalities: Indexing • Parsing: the parser must handle a huge range of possible errors. • Indexing docs into barrels: each doc is parsed and encoded into a number of barrels. Every word is converted into a wordID using an in-memory hash table, the lexicon. The word occurrences are then translated into hit lists and written into the forward barrels. • Sorting: to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID, producing an inverted barrel for title and anchor hits and a full-text inverted barrel.
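The indexing and sorting steps on slide 11 can be sketched in a few lines. This is a simplified illustration under stated assumptions: a single barrel, whitespace tokenization, hits reduced to word positions, and hypothetical function names `build_forward_index` and `invert`.

```python
from collections import defaultdict


def build_forward_index(docs, lexicon):
    """Map docID -> {wordID: [positions]}, a simplified forward barrel."""
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            # The lexicon is an in-memory hash table from word to wordID.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """Re-sort the forward index by wordID: wordID -> [(docID, positions)]."""
    postings = [
        (word_id, doc_id, positions)
        for doc_id, hits in forward.items()
        for word_id, positions in hits.items()
    ]
    postings.sort(key=lambda p: (p[0], p[1]))  # by wordID, then docID
    inverted = defaultdict(list)
    for word_id, doc_id, positions in postings:
        inverted[word_id].append((doc_id, positions))
    return dict(inverted)


docs = {0: "web search engine", 1: "search the web"}
lexicon = {}
inverted = invert(build_forward_index(docs, lexicon))
```

Looking up `lexicon["web"]` gives the wordID whose entry in `inverted` lists every document containing the word together with its positions.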

  12. System functionalities: Searching The searcher is run by a web server and uses the lexicon together with the inverted index and PageRank to answer a query. Single word search • Google looks at the document’s hit list for that word. • It calculates the IR weight of the doc: count weight (based on the number of occurrences) × type weight. • It computes the final rank by combining the IR weight with PageRank. Multiple word search • Hits that occur close together in a doc are weighted higher than hits occurring far apart. • For every set of matched hits a proximity is computed, based on how far apart the hits are in the doc. • IR weight = count weight × type-prox weight, where each type and proximity pair has its own weight.
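The scoring steps on slide 12 can be made concrete with a small sketch. All numbers and names here are assumptions for illustration: the type weights, the cap in `count_weight`, the linear mix in `final_rank`, and the use of minimum distance as the proximity measure are not specified by the presentation.

```python
def count_weight(count, cap=10):
    """Count weight that grows with the number of hits and then levels off (assumed cap)."""
    return min(count, cap)


def ir_score(hit_counts, type_weights):
    """Sum of count weight x type weight over all hit types."""
    return sum(
        count_weight(count) * type_weights[hit_type]
        for hit_type, count in hit_counts.items()
    )


def min_distance(positions_a, positions_b):
    """Proximity of two words: smallest distance between any pair of their hits."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


def final_rank(ir, pagerank, alpha=0.5):
    """Hypothetical linear mix of IR weight and PageRank."""
    return alpha * ir + (1 - alpha) * pagerank


# Hypothetical type weights and hit counts for one document.
type_weights = {"title": 3.0, "anchor": 2.0, "plain": 1.0}
ir = ir_score({"title": 1, "plain": 4}, type_weights)
rank = final_rank(ir, pagerank=1.5)
closest = min_distance([2, 10], [3, 40])
```

For a multiple word query, the same dot product would be taken over (type, proximity) pairs instead of types alone, with `min_distance` or a similar measure deciding which proximity bin a set of hits falls into.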

  13. Conclusion Google is a scalable architecture for: • gathering • indexing • searching web pages. It aims at high search quality by using PageRank, anchor text and proximity information.
