
Google: Case Study

cs430 lecture 15

03/13/01

Kamen Yotov

Introduction: What’s new?
  • Amount of web information growing
  • Amount of inexperienced users growing
  • Surfers willing to start from indices like Yahoo!
    • Expensive to build and maintain;
    • Slow to improve;
    • Cannot cover all topics!
  • Google – large scale search engine
    • Name from “googol” = 10^100
    • Heavy use of additional structure (links, link text) = quality results

Introduction (continued…)
  • Search engine technology has to scale
  • Server requests have to scale similarly…
  • Technology advances help… but not so much!
    • E.g. disk seek time, operating system problems
    • Expect cost of indexing/storing text/html to drop relative to amount of information available!

Main goals: Quality, Quality,…
  • Completeness of index is just one factor
  • Lots of junk in the results
  • Number of documents increases exponentially, but users' ability to look through them does not!
  • High precision very important!
  • Link structure & Link text are valuable
  • … Not much published information; search engines are mostly commercial!

Features: PageRank
  • Heavy use of the link structure
  • Performs well even indexing only the titles
  • Counting links to a page
    • Weights on the sources
  • Page A has pages Ti pointing to it.
    • d: damping factor
    • C(A): # of links out of A
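
The formula this slide refers to did not survive in the transcript. In the original Google paper (Brin and Page) it reads PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). Below is a minimal sketch of iterating that formula on a toy graph. The function name `pagerank`, the toy graph, d = 0.85, and the iteration count are illustrative assumptions, not part of the lecture.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (toy data, assumed).
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each source T contributes its rank divided by its out-link count C(T).
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph C receives links from both A and B, so it ends up with the highest rank, and B, which only gets half of A's vote, ends up with the lowest.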

Related Work: Applicability
  • Information retrieval
    • Size does matter! Even large IR corpora are small by Web search standards (20GB vs. 147GB)
    • Vector methods often tend to return short documents
    • Argument: Users should specify more concretely what they search for! Google: disagree!
  • Other differences from controlled collections
    • No format, language restrictions, control
    • Extended meta information

From Inside…
  • Mostly C/C++
  • Solaris/Linux
  • Module-based architecture
  • Multi-machine
  • Multi-thread
  • Resource dedication

Major Structures
  • BigFiles
    • Span several file systems
    • 64-bit addressed
    • Descriptor management
    • Compression
  • Document index
    • ISAM (Indexed Sequential Access Method), ordered by docID
    • Pointer to Repository, Status, Statistics
    • Pointer to URL and Title in docinfo file if crawled
    • URL to docID conversion (checksum)
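
The URL to docID conversion can be sketched as a table of (checksum, docID) pairs sorted by checksum and searched with binary search. This is only an illustration: CRC-32 stands in for whatever checksum was actually used, the function names are hypothetical, and checksum collisions are ignored.

```python
import bisect
import zlib


def url_checksum(url):
    # CRC-32 is a stand-in; the slides do not say which checksum was used.
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_table(url_to_docid):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(url), docid) for url, docid in url_to_docid.items())


def lookup_docid(table, url):
    """Binary-search the sorted table; return None if the URL is unknown."""
    target = url_checksum(url)
    # docIDs are non-negative, so (target, -1) sorts before any real entry.
    i = bisect.bisect_left(table, (target, -1))
    if i < len(table) and table[i][0] == target:
        return table[i][1]
    return None


table = build_checksum_table({
    "http://www.whitehouse.gov/": 1,
    "http://zpub.com/un/un-bc.html": 2,
})
```

Keeping the table sorted means a lookup costs one checksum computation plus a logarithmic search, which also makes it possible to convert many URLs in one merge pass over the table.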

Major Structures (continued)
  • Repository
    • Zlib compressed
    • docID, Length, URL
    • Self-consistent data
  • Lexicon
    • Memory resident
    • List of words and a hash-table of pointers
    • Other auxiliary information… (out of scope)
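
The repository layout (docID, length, URL, zlib-compressed page) can be sketched as a sequence of self-describing records. The field widths in `HEADER` and the function names are assumptions made for illustration; the slides give only the field list.

```python
import struct
import zlib

# docID, compressed length, URL length (little-endian; widths are assumed)
HEADER = struct.Struct("<IIH")


def pack_record(docid, url, html):
    """Serialize one document as header + URL + zlib-compressed body."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(docid, len(body), len(url_bytes)) + url_bytes + body


def unpack_records(blob):
    """Walk the repository sequentially; no other structure is needed."""
    offset = 0
    while offset < len(blob):
        docid, body_len, url_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield docid, url, html


repository = (
    pack_record(1, "http://a.example/", "<html>one</html>")
    + pack_record(2, "http://b.example/", "<html>two</html>")
)
records = list(unpack_records(repository))
```

Because every record carries its own lengths, the file can be read front to back without any index, which is one way to read "self-consistent data": the other structures can be rebuilt from the repository alone.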

Major Structures (continued 2)
  • Hit Lists
    • Each occurrence of a word in a document + position and typesetting information (hand-optimized compact encoding)
    • Take most of the space of all indices
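
As an illustration of such a compact encoding, the sketch below packs one hit into 16 bits: 1 capitalization bit, 3 bits of font size, and 12 bits of word position. This follows the plain-hit layout described in the original Google paper; the slide itself gives no widths, so treat them, the function names, and the clamping of large positions as assumptions.

```python
def encode_plain_hit(capitalized, font_size, position):
    """Pack one hit into 16 bits: 1 bit caps, 3 bits font size, 12 bits position."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    position = min(position, 0xFFF)  # positions beyond 12 bits are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_plain_hit(hit):
    """Inverse of encode_plain_hit: return (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


hit = encode_plain_hit(True, 3, 42)
```

At two bytes per occurrence, the hit lists still dominate index size simply because there is one entry for every word occurrence in every document.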

Major Structures (continued 3)
  • Forward Index
    • Partially sorted
    • Stored in a number of barrels
    • Each barrel holds range of wordIDs + hitlist
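
A minimal sketch of how a forward index might be split into barrels by wordID range. `BARREL_WIDTH`, the in-memory lists, and the function names are illustrative assumptions; the real barrels are on-disk structures.

```python
from collections import defaultdict

BARREL_WIDTH = 1000  # wordIDs per barrel (illustrative value)


def barrel_for(word_id):
    """Map a wordID to the barrel that owns its range."""
    return word_id // BARREL_WIDTH


def add_document(barrels, doc_id, hits_by_word):
    """Append (docID, wordID, hitlist) entries to the barrel owning each wordID."""
    for word_id, hitlist in sorted(hits_by_word.items()):
        barrels[barrel_for(word_id)].append((doc_id, word_id, hitlist))


forward_barrels = defaultdict(list)
add_document(forward_barrels, 7, {12: [3, 40], 1500: [8]})
add_document(forward_barrels, 9, {12: [1], 2999: [5, 6]})
```

Each barrel is ordered by docID as documents arrive but not by wordID, which is what "partially sorted" amounts to in this sketch.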

Major Structures (continued 4)
  • Inverted Index
    • Same barrels, but processed by the sorter
    • Doclists not sorted by ranking of occurrence, for the sake of speed
    • Two sets of inverted barrels
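
The sorter's job can be sketched as re-sorting one forward barrel by wordID and grouping the entries into doclists. The tuple layout and the name `invert_barrel` are assumptions for illustration, and the barrel is held in memory here.

```python
from itertools import groupby
from operator import itemgetter


def invert_barrel(forward_entries):
    """Sort (docID, wordID, hitlist) entries by wordID, then docID,
    and group them into {wordID: [(docID, hitlist), ...]}."""
    ordered = sorted(forward_entries, key=itemgetter(1, 0))
    return {
        word_id: [(doc_id, hitlist) for doc_id, _, hitlist in group]
        for word_id, group in groupby(ordered, key=itemgetter(1))
    }


forward_barrel = [(9, 12, [1]), (7, 12, [3, 40]), (7, 15, [2])]
inverted = invert_barrel(forward_barrel)
```

Within each wordID the doclist comes out ordered by docID, which keeps merging the doclists of several query words cheap.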

Crawling the Web
  • We talked before…
  • Fragile, beyond our control
  • Implemented in Python
  • Internal DNS cache for each crawler
  • Social issues
    • Phone calls, support
    • Preventing indexing
  • Virtually unable to debug… just test!
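
A per-crawler DNS cache could look roughly like the sketch below. The class name `DnsCache`, the injectable `resolver` argument, and the lack of any expiry policy are assumptions made to keep the example small and testable without network access.

```python
import socket


class DnsCache:
    """Per-crawler cache of hostname -> IP address lookups."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver  # swapped out below to avoid real lookups
        self._cache = {}
        self.misses = 0

    def resolve(self, hostname):
        """Return the cached address, resolving only on the first request."""
        if hostname not in self._cache:
            self.misses += 1
            self._cache[hostname] = self._resolver(hostname)
        return self._cache[hostname]


# A fake resolver stands in for DNS; 192.0.2.1 is a documentation-only address.
fake_resolver = {"www.whitehouse.gov": "192.0.2.1"}.__getitem__
cache = DnsCache(resolver=fake_resolver)
first = cache.resolve("www.whitehouse.gov")
second = cache.resolve("www.whitehouse.gov")
```

Only the first request for a host reaches the resolver, so a crawler fetching many pages from one site pays for the lookup once.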

Indexing the Web
  • Parsing problems
    • Errors in HTML
    • Non-ASCII characters
    • Home-grown parser (not YACC)
  • Indexing documents into barrels
    • Shared lexicon – too much locking
    • Log file of new words… processed at end
  • Sorting
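
The "log file of new words" idea can be sketched as follows: indexers only read a fixed base lexicon, write unknown words to their own log, and a single final pass assigns the new wordIDs. Function names and the in-memory lists standing in for log files are illustrative assumptions.

```python
def index_document(words, base_lexicon, new_word_log):
    """Return wordIDs for known words; log unknown words instead of locking the lexicon."""
    word_ids = []
    for word in words:
        if word in base_lexicon:
            word_ids.append(base_lexicon[word])
        else:
            new_word_log.append(word)
    return word_ids


def merge_new_words(base_lexicon, logs):
    """Single final pass: assign a fresh wordID to every logged word."""
    next_id = max(base_lexicon.values(), default=-1) + 1
    for log in logs:
        for word in log:
            if word not in base_lexicon:
                base_lexicon[word] = next_id
                next_id += 1
    return base_lexicon


base = {"google": 0, "search": 1}
log_a, log_b = [], []
ids_a = index_document(["google", "pagerank"], base, log_a)
ids_b = index_document(["search", "pagerank", "barrel"], base, log_b)
lexicon = merge_new_words(base, [log_a, log_b])
```

Since the indexers never write to the shared lexicon, they need no locks; the cost is that hits for the logged words have to be handled in the final pass.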

Searching
  • Parse the query
  • Convert words to wordIDs
  • Seek to start of doclist in the short barrel for every word
  • Scan through until a document that matches all terms is encountered
  • Compute the rank of that document
  • Repeat the same thing for the full barrel
  • Sort the documents matched by rank and return the first few
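
The steps above can be sketched as one small function. Barrels are modeled as dictionaries from wordID to {docID: hitlist}, where a real engine would seek and scan sorted doclists on disk; `search`, `toy_rank`, and all the data are hypothetical.

```python
def search(query_words, lexicon, short_barrel, full_barrel, rank, limit=10):
    """wordIDs -> doclists -> documents matching all terms -> rank -> sort."""
    word_ids = [lexicon[w] for w in query_words if w in lexicon]
    if not query_words or len(word_ids) != len(query_words):
        return []  # empty query, or a term missing from the lexicon
    scored = {}
    for barrel in (short_barrel, full_barrel):
        doclists = [barrel.get(word_id, {}) for word_id in word_ids]
        # Keep only documents that appear in every term's doclist.
        matching = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in matching:
            hits = [doclist[doc_id] for doclist in doclists]
            scored[doc_id] = max(scored.get(doc_id, 0.0), rank(doc_id, hits))
    return sorted(scored, key=scored.get, reverse=True)[:limit]


def toy_rank(doc_id, hits):
    # Placeholder ranking: total number of hits across all query terms.
    return float(sum(len(h) for h in hits))


lexicon = {"bill": 1, "clinton": 2}
short_barrel = {1: {10: [0]}, 2: {10: [1]}}
full_barrel = {1: {10: [0, 5], 11: [3]}, 2: {10: [1], 11: [4, 9, 12]}}
results = search(["bill", "clinton"], lexicon, short_barrel, full_barrel, toy_rank)
```

The placeholder rank only counts hits; the ranking described in these slides also weighs PageRank, anchor text, and proximity.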


Query: bill clinton
  • http://www.whitehouse.gov/
    100.00% (no date) (0K) http://www.whitehouse.gov/
  • Office of the President
    99.67% (Dec 23 1996) (2K) http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
  • Welcome To The White House
    99.98% (Nov 09 1997) (5K) http://www.whitehouse.gov/WH/Welcome.html
  • Send Electronic Mail to the President
    99.86% (Jul 14 1997) (5K) http://www.whitehouse.gov/WH/Mail/html/Mail_President.html
  • mailto:president@whitehouse.gov
    99.98%
  • mailto:President@whitehouse.gov
    99.27%
  • The "Unofficial" Bill Clinton
    94.06% (Nov 11 1997) (14K) http://zpub.com/un/un-bc.html
  • Bill Clinton Meets The Shrinks
    86.27% (Jun 29 1997) (63K) http://zpub.com/un/un-bc9.html
  • President Bill Clinton - The Dark Side
    97.27% (Nov 10 1997) (15K) http://www.realchange.org/clinton.htm
  • $3 Bill Clinton
    94.73% (no date) (4K) http://www.gatewy.net/~tjohnson/clinton1.html

Results and Performance
  • Quality of results
  • Manual ranking
  • Sorting
    • PageRank
    • Anchor text
    • Proximity
  • Broken links

Performance
  • Storage
    • Scale with the size of the Web
    • Repository is comparatively small
    • Good/Fast compression/decompression
  • System
    • Crawling, Indexing, Sorting
    • Last two simultaneously
  • Searching
    • Bounded by disk I/O over LAN (NFS)

Conclusion
  • Google:
    • Scalable search engine
    • Complete architecture
  • Many research ideas arise
    • Always something to improve
    • Matter of time
  • High quality search is the dominant factor
