A 100 Terabyte Database System for $500k

Alexa Internet is a wholly owned subsidiary of. A 100 Terabyte Database System for $500k ... President, Alexa Internet. Director, Internet Archive. May, 2001 ...

A 100 Terabyte Database System for $500k

Brewster Kahle

President, Alexa Internet

Director, Internet Archive


“Hillis’s Law”

  • Price follows Volume

  • Corollary: Reliability follows Volume

  • Corollary: Availability follows Volume

  • For Reliability and Availability … buy inexpensive components


“Database” defined

  • Queryable, updateable, persistent data collection

  • This example:

    • 10 billion WWW pages from 1996 to the present

    • Metadata associated with those pages and websites

    • Various auxiliary tables and programs

  • Queryable: retrieval of records by fuzzy matches on date, URL, and other attributes
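
One way to read "fuzzy match on date" is: given a URL and a requested date, return the capture closest to that date. The sketch below illustrates that reading only; the function name `closest_capture` and the sample dates are hypothetical, and the slides do not say how the real system implements the lookup.

```python
from bisect import bisect_left
from datetime import date


def closest_capture(captures, wanted):
    """Return the capture date nearest to `wanted`.

    `captures` must be sorted ascending. Ties go to the earlier capture.
    """
    if not captures:
        raise ValueError("no captures recorded for this URL")
    i = bisect_left(captures, wanted)  # binary search for the insertion point
    if i == 0:
        return captures[0]
    if i == len(captures):
        return captures[-1]
    before, after = captures[i - 1], captures[i]
    return before if wanted - before <= after - wanted else after


# Hypothetical capture dates for a single URL (illustrative data only).
captures = [date(1996, 11, 20), date(1998, 3, 15), date(2000, 1, 1), date(2001, 5, 1)]
nearest = closest_capture(captures, date(1999, 1, 1))
```

Because the capture list is kept sorted, the lookup needs only a binary search, which fits the flat-file approach described later in the deck.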


Approach to Large Scale Computation

  • Need: Scale, Reliability, Flexibility, Evolution, Low Risk

  • Solution:

    • Commodity Hardware

    • Commodity Operating Systems

    • Commodity Software

    • Commodity Programmers


Hardware

  • Homogeneous machines lead to quick response through reallocation

  • HP desktop machines: 320 MB RAM, 3U high, four 100 GB IDE drives

  • $4k/TB (street), 2.5 processors/TB, 1 GB RAM/TB

  • 3 weeks from ordering to operational
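
The per-terabyte figures above can be rederived from the per-machine specification. The sketch below is only that arithmetic, assuming decimal units (1 TB = 1000 GB) and one processor per machine; the variable names and the implied per-machine price are not from the slides.

```python
# Per-machine figures from the slide.
DRIVES_PER_MACHINE = 4
DRIVE_GB = 100
RAM_MB_PER_MACHINE = 320
COST_PER_TB_USD = 4_000

# Target capacity from the talk title.
TARGET_TB = 100

# Assumption: decimal units, 1 TB = 1000 GB and 1 GB = 1000 MB.
machine_gb = DRIVES_PER_MACHINE * DRIVE_GB                   # 400 GB per machine
machines_per_tb = 1000 / machine_gb                          # 2.5 (one CPU each, assumed)
ram_gb_per_tb = machines_per_tb * RAM_MB_PER_MACHINE / 1000  # about 0.8, i.e. roughly 1 GB/TB

# Implied, not stated on the slide: street price of one machine.
implied_machine_cost_usd = COST_PER_TB_USD / machines_per_tb

# Scaling the same ratios to the target capacity.
machines_for_target = -(-TARGET_TB * 1000 // machine_gb)     # ceiling division
cost_for_target_usd = TARGET_TB * COST_PER_TB_USD
```

Under these assumptions the storage hardware for 100 TB comes to about $400k at the quoted street price.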


Networking

  • HP Procurve 100baseT switches

    • About $40/port (street)

  • Load balancing by DNS round-robin, Cisco, Program

  • Network booting, so OS is re-installed on every boot

  • T3 to the Internet for $300/megabit/month
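
DNS round-robin spreads load by returning the same set of addresses in a rotated order on each query, so successive clients tend to pick different machines. The sketch below simulates that rotation in-process; the class name and the addresses are hypothetical, and it does not speak the DNS protocol.

```python
from collections import deque


class RoundRobinResolver:
    """Toy model of a name server that rotates its answer list per query."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one address is required")
        self._addresses = deque(addresses)

    def resolve(self):
        """Return all addresses, then rotate so the next query starts one later."""
        answer = list(self._addresses)
        self._addresses.rotate(-1)
        return answer


# Hypothetical front-end addresses (illustrative only).
resolver = RoundRobinResolver(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
first_choices = [resolver.resolve()[0] for _ in range(4)]
```

Clients that use the first address in each answer end up cycling through the front ends, which is why homogeneous, interchangeable machines suit this scheme.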


Disk as Tape

  • Tape is unreliable, specialized, slow, low density, not improving fast, and expensive

  • Using removable hard drives to replace tape’s function has been successful

  • When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.

  • Portable, durable, fast, dense; media cost equals that of raw tapes. Longevity unknown, suspected good.

    Think “HAL” rather than StorageTek

    (Idea by Jim Gray of Microsoft)


Backup: 3 scenarios

  • Disaster Recovery: Preservation through Replication

  • Hardware Faults: different solutions for different situations

    • Clusters

    • Load balancing

    • Replication

    • Tolerate machine/disk outages

    • (Avoided RAID and expensive, low-volume solutions)

  • Programmer Error: slow replication, timestamped duplicates
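
The "timestamped duplicates" idea for programmer error can be sketched as: never overwrite a backup, and give every copy a name that records when it was taken. The helper below is an illustration under that assumption; the function name, the naming scheme, and the directory layout are hypothetical and not taken from the slides.

```python
import shutil
import time
from pathlib import Path


def timestamped_duplicate(path, backup_dir, now=None):
    """Copy `path` into `backup_dir` as <name>.<UTC timestamp> and return the new path.

    `now` is seconds since the epoch; None means the current time.
    """
    src = Path(path)
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(now))
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{src.name}.{stamp}"
    shutil.copy2(src, dest)  # copies contents and file metadata
    return dest


# Example call (paths are placeholders):
#   timestamped_duplicate("index/pages.cdx", "dups/")
```

Because earlier copies are kept under distinct names, a bad update can be rolled back to any earlier duplicate instead of being replicated over the only good copy.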


Operating System Choices

  • Need: supportable, clusterable, improving, good support

  • Commodity, remote operation of hundreds of nodes, free, source code available (for documentation and in-house fixes)

  • Reality of Evolution

    • Integrated Solaris/x86, FreeBSD, Linux

    • Solaris does not support IDE well

    • FreeBSD does not thread well

    • Linux does not NFS well, but has momentum

  • Linux is now our lead OS


Parallel Execution Model

  • Datamining with command line interface

  • Controlling machines with 2 TB of free space dispatch commands and data to parallel machines

  • Use flat files

  • Build explicit indexes

  • Use “sort” in datamining; use binary search for random access

  • P2 "grep pdf *.cdx | cut -fDATE | sort" -c "sort -m | uniq -c" -p $ARCHIVE

  • Non-programmers become parallel dataminers in less than 2 weeks
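
The P2 command above appears to pair a per-machine pipeline (filter, cut a field, sort) with a combining step (`sort -m | uniq -c`, meaning merge the already sorted outputs and count duplicates). The sketch below mimics that shape in a single process, with threads standing in for machines. The function names, the whitespace-separated record layout, and the sample data are assumptions; this is not P2 itself.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def map_shard(lines, needle, field):
    """Per-machine step, in the spirit of `grep needle | cut -f<field> | sort`."""
    return sorted(line.split()[field] for line in lines if needle in line)


def combine(sorted_outputs):
    """Combine step, in the spirit of `sort -m | uniq -c`."""
    merged = heapq.merge(*sorted_outputs)  # merge inputs that are already sorted
    return [(key, sum(1 for _ in group)) for key, group in groupby(merged)]


def run_parallel(shards, needle, field):
    """Run the per-shard step concurrently, then combine the sorted results."""
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda shard: map_shard(shard, needle, field), shards))
    return combine(outputs)


# Hypothetical index records, "<url> <year> <mime type>", one list per machine.
shards = [
    ["a.com/x.pdf 1999 application/pdf", "a.com/ 1999 text/html"],
    ["b.org/y.pdf 2000 application/pdf", "b.org/z.pdf 1999 application/pdf"],
]
counts = run_parallel(shards, "pdf", 1)
```

Sorting on each machine keeps the combining step cheap: the controller only merges sorted streams and counts runs, so it never has to hold or re-sort the full data set.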


Performance

  • 500 queries/second on 100GB database

    • Queries on one key, uses about 10 tables

    • On 6 computers: 2 database machines, 4 front ends

    • $20,000 (would be less today, but they are older and have 4GB RAM)

  • 10 queries/second on 100TB database

    • Index is on 16 computers, data is on 200 computers

    • 2 query types

    • $16,000 for index machines, $400k for all machines

  • General queries vary in speed

  • The “unit” is the $500 PC for added speed or capacity


Suggestions

  • Reconsider purchases from:

    • Oracle

    • EMC

    • Sun, IBM, HP, Compaq, Dell

    • Veritas

    • Legato

    • Exodus

    • Your ISP

    • Cisco

  • Our systems scale up well, are reliable, and are flexible because…

    They are inexpensive.

