Alexa Internet is a wholly owned subsidiary of. A 100 Terabyte Database System for $500k ... President, Alexa Internet. Director, Internet Archive. May, 2001 ...


A 100 Terabyte Database System for $500k

Brewster Kahle

President, Alexa Internet

Director, Internet Archive

“Hillis’s Law”
  • Price follows Volume
  • Corollary: Reliability follows Volume
  • Corollary: Availability follows Volume
  • For Reliability and Availability … buy inexpensive components
“Database” defined
  • Queryable, updateable, persistent data collection
  • This example:
    • 10 billion WWW pages from 1996 until now
    • Metadata associated with those pages and websites
    • Various auxiliary tables and programs
  • Queryable: retrieval of records based on fuzzy matches of date, URL, and other attributes
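A fuzzy match on date can be read as "find the stored copy of this URL closest to a requested time". The following is a minimal sketch under that reading, not the speakers' implementation. The record layout (URL plus a `YYYYMMDDhhmmss` timestamp), the sample data, and the name `closest_capture` are all assumptions made for illustration.

```python
import bisect
from datetime import datetime

# Hypothetical index: (url, timestamp) records kept sorted by URL, then date.
# Timestamps are YYYYMMDDhhmmss strings; this layout is an assumption.
INDEX = [
    ("example.com/", "19961105120000"),
    ("example.com/", "19990214083000"),
    ("example.com/", "20010501000000"),
    ("example.org/", "19980101000000"),
]

def _parse(stamp):
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")

def closest_capture(index, url, wanted):
    """Return the record for `url` whose date is nearest to `wanted`, or None."""
    # Locate the contiguous run of records for this URL.
    lo = bisect.bisect_left(index, (url, ""))
    hi = bisect.bisect_right(index, (url, "\uffff"))
    if lo == hi:
        return None  # URL not present in the index
    # Binary search inside the run, then compare the two neighbours.
    pos = bisect.bisect_left(index, (url, wanted), lo, hi)
    candidates = index[max(lo, pos - 1):min(hi, pos + 1)]
    target = _parse(wanted)
    return min(candidates, key=lambda rec: abs(_parse(rec[1]) - target))
```

Because the records are sorted, the lookup needs only a few binary searches rather than a scan, which is why a sorted flat file can stand in for a database table here.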
Approach to Large Scale Computation
  • Need: Scale, Reliability, Flexibility, Evolution, Low Risk
  • Solution:
    • Commodity Hardware
    • Commodity Operating Systems
    • Commodity Software
    • Commodity Programmers
  • Homogeneous machines lead to quick response through reallocation
  • HP desktop machines: 320MB RAM, 3U high, four 100GB IDE drives
  • $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB
  • 3 weeks from ordering to operational
  • HP ProCurve 100BASE-T switches
    • About $40/port (street)
  • Load balancing by DNS round-robin, Cisco, Program
  • Network booting, so OS is re-installed on every boot
  • T3 to the Internet for $300/megabit/month
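The per-terabyte figures in the hardware bullets follow from the machine specification, and the arithmetic can be checked with a short sketch. It assumes decimal units (1TB = 1000GB) and one processor per machine; neither is stated on the slides. The implied per-machine price is derived here and is not a figure from the talk.

```python
# Figures taken from the slides.
DRIVES_PER_MACHINE = 4
GB_PER_DRIVE = 100
RAM_MB_PER_MACHINE = 320
DOLLARS_PER_TB = 4_000

# Assumptions made for this sketch (not stated in the talk).
GB_PER_TB = 1000
PROCESSORS_PER_MACHINE = 1

def per_terabyte():
    """Derive machines, processors, and RAM needed per terabyte of disk."""
    gb_per_machine = DRIVES_PER_MACHINE * GB_PER_DRIVE          # 400 GB
    machines_per_tb = GB_PER_TB / gb_per_machine                # 2.5
    processors_per_tb = machines_per_tb * PROCESSORS_PER_MACHINE
    ram_mb_per_tb = machines_per_tb * RAM_MB_PER_MACHINE        # 800 MB, about 1GB
    return machines_per_tb, processors_per_tb, ram_mb_per_tb

def system_cost(terabytes):
    """Hardware cost at the quoted street price per terabyte."""
    return terabytes * DOLLARS_PER_TB

machines, processors, ram_mb = per_terabyte()
implied_machine_price = DOLLARS_PER_TB / machines               # derived, not quoted
```

Under these assumptions, 2.5 machines per terabyte gives the quoted 2.5 processors/TB, and 800MB of RAM rounds to the quoted 1GB RAM/TB. At $4k/TB, 100TB comes to $400k of machines.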
Disk as Tape
  • Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
  • Using removable hard drives to replace tape’s function has been successful
  • When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
  • Portable, durable, fast, and dense; media cost equals that of raw tapes. Longevity is unknown but suspected to be good.

Think “HAL” rather than StorageTek

(Idea by Jim Gray of Microsoft)

Backup: 3 scenarios
  • Disaster Recovery: Preservation through Replication
  • Hardware Faults: different solutions for different situations
    • Clusters
    • Load balancing
    • Replication
    • Tolerating machine/disk outages
    • (Avoided RAID and expensive, low-volume solutions)
  • Programmer Error: slow replication, timestamped duplicates
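One way to read "timestamped duplicates" is that each replication pass writes a new dated copy instead of overwriting the previous one, so a bad update can be rolled back to the last copy made before the mistake. The sketch below illustrates that reading only; the slides do not describe the actual mechanism. The function names and the file-naming scheme are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path

STAMP = "%Y%m%d%H%M%S"  # assumed naming scheme: <file name>.<timestamp>

def snapshot(src: Path, backup_dir: Path, now: datetime) -> Path:
    """Copy `src` into `backup_dir` under a timestamped name, keeping older copies."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / f"{src.name}.{now.strftime(STAMP)}"
    shutil.copy2(src, dest)
    return dest

def latest_before(backup_dir: Path, name: str, cutoff: datetime):
    """Return the newest snapshot of `name` taken strictly before `cutoff`, or None."""
    stamp = cutoff.strftime(STAMP)
    older = sorted(
        p for p in backup_dir.glob(f"{name}.*")
        if p.name.rsplit(".", 1)[1] < stamp
    )
    return older[-1] if older else None
```

Running `snapshot` on a slow schedule gives a window in which an error can be noticed before every copy has been replaced by the damaged version.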
Operating System Choices
  • Need: supportable, clusterable, improving, good support
  • Commodity, remote operation of hundreds of nodes, free, source code available (for documentation and in-house fixes)
  • Reality of Evolution
    • Integrated Solaris/x86, FreeBSD, Linux
    • Solaris does not support IDE well,
    • FreeBSD does not thread well,
    • Linux does not NFS well, but has momentum
  • Linux is now our lead OS
Parallel Execution Model
  • Datamining with command line interface
  • Controlling machines with 2TB of free space dispatch commands and data to parallel machines
  • Use flat files
  • Build explicit indexes
  • Use “sort” in datamining; use binary search for random access
  • P2 "grep pdf *.cdx | cut -fDATE | sort" -c "sort -m | uniq -c" -p $ARCHIVE
  • Non-programmers become parallel dataminers in less than 2 weeks
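The P2 example appears to run one pipeline on every machine and then combine the sorted partial results with a second pipeline. The sketch below imitates that pattern in a single process. P2 itself is the speakers' tool and its real behaviour is not documented in the slides, so the reading of its arguments, the shard data, and the line layout ("url date mimetype") are all assumptions.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

# Hypothetical shards: each list stands in for the index lines held by one machine.
SHARDS = [
    ["a.com/x.pdf 1999 application/pdf", "a.com/ 1999 text/html"],
    ["b.org/y.pdf 2000 application/pdf", "b.org/z.pdf 1999 application/pdf"],
]

def per_machine(lines):
    """Stand-in for the per-machine step: grep pdf | cut (date field) | sort."""
    return sorted(line.split()[1] for line in lines if "pdf" in line)

def combine(sorted_streams):
    """Stand-in for the combining step: sort -m | uniq -c."""
    merged = heapq.merge(*sorted_streams)  # merges already-sorted inputs
    return [(sum(1 for _ in group), key) for key, group in groupby(merged)]

def run(shards):
    """Run the per-machine step on every shard in parallel, then combine."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(per_machine, shards))
    return combine(partials)
```

Sorting on each machine first is what keeps the combining step cheap: merging sorted streams needs only one pass, so the controlling machine never has to sort the full result itself.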
  • 500 queries/second on 100GB database
    • Queries on one key, uses about 10 tables
    • On 6 computers: 2 database machines, 4 front ends
    • $20,000 (would be less today, but they are older and have 4GB RAM)
  • 10 queries/second on 100TB database
    • Index is on 16 computers, data is on 200 computers
    • 2 query types
    • $16,000 for index machines, $400k for all machines
  • General queries vary in speed
  • The “unit” is the $500 PC for added speed or capacity
  • Reconsider purchases from:
    • Oracle
    • EMC
    • Sun, IBM, HP, Compaq, Dell
    • Veritas
    • Legato
    • Exodus
    • Your ISP
    • Cisco
  • Our systems scale up well, are reliable, and are flexible because…

They are inexpensive.