
Google: Case Study

Google: Case Study. cs430 lecture 15 03/13/01 Kamen Yotov. Introduction: What’s new?. Amount of web information growing Amount of inexperienced users growing Surfers willing to start from indices like Yahoo! Expensive to build and maintain; Slow to improve; Cannot cover all topics!


Presentation Transcript


  1. Google: Case Study cs430 lecture 15 03/13/01 Kamen Yotov

  2. Introduction: What’s new? • Amount of web information growing • Amount of inexperienced users growing • Surfers willing to start from indices like Yahoo! • Expensive to build and maintain; • Slow to improve; • Cannot cover all topics! • Google – large-scale search engine • Name from “googol” = 10^100 • Heavy use of additional structure = quality results

  3. Introduction (continued…) • Search engine technology has to scale • Server requests have to scale similarly… • Technology advances help… but not so much! • E.g. disk seek time, operating system problems • Expect cost of indexing/storing text/HTML to drop relative to amount of information available!

  4. Main goals: Quality, Quality,… • Completeness of index is just one factor • Lots of junk in the results • Number of documents increases exponentially, but user ability does not! • High precision very important! • Link structure & link text are valuable • … Not much information; Commercial!

  5. Features: PageRank • Heavy use of the link structure • Performs well even when indexing only the titles • Counting links to a page • Weights on the sources • Page A has pages T1…Tn pointing to it • PR(A) = (1 − d) + d · (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • d: damping factor • C(A): # of links out of A
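
A minimal sketch of how the PageRank definition on this slide could be evaluated by simple iteration. The function name `pagerank`, the toy graph, the damping value d = 0.85 and the fixed iteration count are illustrative assumptions, not details taken from the lecture. The sketch also assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical input
    format). d=0.85 and iterations=50 are assumed defaults for illustration.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # For each page, collect the pages T that point to it.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)
    pr = {p: 1.0 for p in pages}  # start every page at rank 1
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming[p])
            for p in pages
        }
    return pr


# Toy web of three pages: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph C ends up ranked above B, because C is linked from both A and B while B is linked only from A.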

  6. Related Work: Applicability • Information retrieval • Size does matter! Large corpora are small by Web search standards (20GB vs. 147GB) • Vector methods often tend to return short documents • Argument: Users should specify more concretely what they search for! Google: disagree! • Other differences from controlled collections • No format, language restrictions, control • Extended meta information

  7. From Inside… • Mostly C/C++ • Solaris/Linux • Module-based architecture • Multi-machine • Multi-threaded • Resource dedication

  8. Major Structures • BigFiles • Span several file systems • 64-bit addressed • Descriptor management • Compression • Document index • ISAM (indexed sequential access method), ordered by docID • Pointer to Repository, Status, Statistics • Pointer to URL and Title in docinfo file if crawled • URL to docID conversion (checksum)
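
One plausible reading of "URL to docID conversion (checksum)" is a table of (checksum, docID) pairs kept sorted by checksum and probed by binary search. The sketch below assumes that layout. CRC-32 stands in for the checksum because the slide does not name one, and the helper names `url_checksum`, `build_table` and `lookup` are hypothetical.

```python
import bisect
import zlib


def url_checksum(url):
    # Assumption: CRC-32 as a stand-in; the slide does not say which checksum is used.
    return zlib.crc32(url.encode("utf-8"))


def build_table(urls):
    """Assign docIDs in input order and return (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(u), doc_id) for doc_id, u in enumerate(urls))


def lookup(table, url):
    """Binary-search the sorted table; return the docID, or None if the URL is unknown."""
    key = url_checksum(url)
    # (key, -1) sorts before every real entry with the same checksum (docIDs are >= 0).
    i = bisect.bisect_left(table, (key, -1))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


urls = ["http://www.whitehouse.gov/", "http://zpub.com/un/un-bc.html"]
table = build_table(urls)
```

This sketch ignores checksum collisions; a real table would need a policy for two URLs that hash to the same value.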

  9. Major Structures (continued) • Repository • Zlib compressed • docID, Length, URL • Self-consistent data • Lexicon • Memory resident • List of words and a hash table of pointers • Other auxiliary information… (out of scope)
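
A rough sketch of what a self-contained repository record might look like: a small header carrying docID and lengths, the URL, then the zlib-compressed page. The field order, the field widths in `HEADER`, and the function names are assumptions made for illustration; the slide lists only the fields.

```python
import struct
import zlib

# Assumed header layout: docID (4 bytes), compressed length (4 bytes), URL length (2 bytes).
HEADER = struct.Struct("<IIH")


def pack_record(doc_id, url, html):
    """Serialize one document as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(body), len(url_bytes)) + url_bytes + body


def read_records(blob):
    """Walk a concatenation of records, yielding (docID, URL, HTML) tuples."""
    offset = 0
    while offset < len(blob):
        doc_id, body_len, url_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repository = (
    pack_record(7, "http://a.example/", "<html>first page</html>")
    + pack_record(8, "http://b.example/", "<html>second page</html>")
)
documents = list(read_records(repository))
```

Because each record carries its own docID, length and URL, the file can be scanned from start to end without consulting any other structure, which is one way to read "self-consistent data".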

  10. Major Structures (continued 2) • Hit Lists • Word in a document + typesetting information (hand-encoded) • Take most of the space of all indices
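
To make "hand-encoded" concrete, here is a sketch that packs one hit into 16 bits. The 1/3/12-bit split (capitalization, relative font size, word position) follows the plain-hit layout described in the original Google paper by Brin and Page; the slide itself gives no widths. The function names and the clamping of large positions to 4095 are assumptions.

```python
def encode_hit(capitalized, font_size, position):
    """Pack one hit into 16 bits: 1 bit capitalization, 3 bits font size, 12 bits position."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumption: clamp positions that overflow 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_hit(hit):
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


hit = encode_hit(capitalized=True, font_size=3, position=42)
```

Two bytes per occurrence keeps the hit lists compact, which matters because they take most of the space of all indices.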

  11. Major Structures (continued 3) • Forward Index • Partially sorted • Stored in a number of barrels • Each barrel holds a range of wordIDs + hitlist

  12. Major Structures (continued 4) • Inverted Index • Same barrels, but processed by the sorter • Not stored by ranking of occurrence, for the sake of speed • Two sets of inverted barrels
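
The relationship between a forward barrel and its inverted counterpart can be sketched as a regrouping by wordID followed by a sort. The in-memory data shapes and the name `invert` are illustrative assumptions. Ordering each doclist by docID is one reading of "not stored by ranking".

```python
from collections import defaultdict


def invert(forward_barrel):
    """Regroup a forward barrel by wordID.

    forward_barrel: iterable of (docID, [(wordID, hit_list), ...]) entries
    (hypothetical in-memory shape). Returns {wordID: [(docID, hit_list), ...]}
    with wordIDs in ascending order and each doclist sorted by docID.
    """
    inverted = defaultdict(list)
    for doc_id, words in forward_barrel:
        for word_id, hits in words:
            inverted[word_id].append((doc_id, hits))
    return {w: sorted(inverted[w], key=lambda entry: entry[0]) for w in sorted(inverted)}


forward = [
    (2, [(10, [5]), (11, [1, 9])]),
    (1, [(11, [3]), (12, [7])]),
]
inverted_barrel = invert(forward)
```

Keeping doclists in docID order makes it cheap to merge the lists of several query words, at the cost of not having the best-ranked documents first.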

  13. Crawling the Web • We talked before… • Fragile, beyond our control • Implemented in Python • Internal DNS cache for each crawler • Social issues • Phone calls, support • Preventing indexing • Virtually unable to debug… just test!

  14. Indexing the Web • Parsing problems • Errors in HTML • Non-ASCII characters • Home-grown parser (not YACC) • Indexing documents into barrels • Shared lexicon – too much locking • Log file of new words… processed at end • Sorting

  15. Searching • Parse the query • Convert words to wordIDs • Seek to start of doclist in the short barrel for every word • Scan through until a document that matches all terms is encountered • Compute the rank of that document • Repeat the same thing for the full barrel • Sort the documents matched by rank and return the first few
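
The query steps above can be sketched as follows. This is a simplification under stated assumptions: each barrel is modelled as a dict from wordID to {docID: hit count}, the `rank` callback is a hypothetical scoring function supplied by the caller, and a set intersection replaces the sequential doclist scan described on the slide.

```python
def search(query, lexicon, short_barrel, full_barrel, rank, limit=10):
    """Return up to `limit` docIDs matching every query word, best rank first."""
    words = query.lower().split()                      # parse the query
    if not words or any(w not in lexicon for w in words):
        return []                                      # an unknown word matches nothing
    word_ids = [lexicon[w] for w in words]             # convert words to wordIDs
    scores = {}
    for barrel in (short_barrel, full_barrel):         # short barrel first, then full
        doclists = [barrel.get(wid, {}) for wid in word_ids]
        # Documents that appear in the doclist of every query word.
        matching = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in matching:
            hits = [doclist[doc_id] for doclist in doclists]
            scores[doc_id] = max(scores.get(doc_id, 0.0), rank(doc_id, hits))
    # Sort by rank (descending), breaking ties by docID, and keep the first few.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:limit]


lexicon = {"bill": 1, "clinton": 2}
short_barrel = {1: {10: 2}, 2: {10: 1, 11: 1}}
full_barrel = {1: {10: 5, 11: 1, 12: 3}, 2: {10: 4, 11: 2, 12: 1}}


def toy_rank(doc_id, hits):
    # Hypothetical scoring: total number of hits across the query words.
    return float(sum(hits))


results = search("bill clinton", lexicon, short_barrel, full_barrel, toy_rank)
```

With this toy data, document 10 matches in both barrels and ranks first, followed by 12 and then 11.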

  16. Results and Performance • Quality of results • Manual ranking • Sorting • PageRank • Anchor text • Proximity • Broken links
      Query: bill clinton
      http://www.whitehouse.gov/
        100.00% (no date) (0K) http://www.whitehouse.gov/
      Office of the President
        99.67% (Dec 23 1996) (2K) http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
      Welcome To The White House
        99.98% (Nov 09 1997) (5K) http://www.whitehouse.gov/WH/Welcome.html
      Send Electronic Mail to the President
        99.86% (Jul 14 1997) (5K) http://www.whitehouse.gov/WH/Mail/html/Mail_President.html
      mailto:president@whitehouse.gov
        99.98%
      mailto:President@whitehouse.gov
        99.27%
      The "Unofficial" Bill Clinton
        94.06% (Nov 11 1997) (14K) http://zpub.com/un/un-bc.html
      Bill Clinton Meets The Shrinks
        86.27% (Jun 29 1997) (63K) http://zpub.com/un/un-bc9.html
      President Bill Clinton - The Dark Side
        97.27% (Nov 10 1997) (15K) http://www.realchange.org/clinton.htm
      $3 Bill Clinton
        94.73% (no date) (4K) http://www.gatewy.net/~tjohnson/clinton1.html

  17. Performance • Storage • Scales with the size of the Web • Repository is comparatively small • Good/fast compression/decompression • System • Crawling, Indexing, Sorting • Last two run simultaneously • Searching • Bounded by disk I/O over LAN (NFS)

  18. Conclusion • Google: • Scalable search engine • Complete architecture • Many research ideas arise • Always something to improve • Matter of time • High quality search is the dominant factor
