

Alexa Internet is a wholly owned subsidiary of. A 100 Terabyte Database System for $500k ... President, Alexa Internet. Director, Internet Archive. May, 2001 ...


Presentation Transcript


  1. A 100 Terabyte Database System for $500k
     Brewster Kahle
     President, Alexa Internet
     Director, Internet Archive

  2. “Hillis’s Law”
     • Price follows Volume
     • Corollary: Reliability follows Volume
     • Corollary: Availability follows Volume
     • For Reliability and Availability … buy inexpensive components

  3. “Database” defined
     • Queryable, updateable, persistent data collection
     • This example:
       • 10 billion WWW pages from 1996 till now
       • Metadata associated with those pages and websites
       • Various auxiliary tables and programs
     • Queryable: retrieval of records based on fuzzy matches of date, URL, and other attributes
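One way to read "fuzzy matches of date, URL" is a nearest-date lookup: given a URL and a target date, return the capture closest to that date. The sketch below is illustrative only and is not taken from the talk. The record layout, the sample data, and the name `closest_capture` are assumptions. It relies on binary search over a sorted list, which is the random-access technique slide 10 names.

```python
import bisect
from datetime import date

# Hypothetical index: (url, capture date as YYYYMMDD) pairs kept in sorted order.
CAPTURES = [
    ("example.com/", "19961120"),
    ("example.com/", "19990501"),
    ("example.com/", "20010315"),
    ("example.org/", "19980101"),
]


def _ordinal(yyyymmdd):
    """Convert a YYYYMMDD string to a day number so dates can be subtracted."""
    return date(int(yyyymmdd[:4]), int(yyyymmdd[4:6]), int(yyyymmdd[6:8])).toordinal()


def closest_capture(index, url, target):
    """Return the (url, date) record nearest to `target`, or None if the URL is absent."""
    # Binary-search the contiguous block of records for this URL.
    lo = bisect.bisect_left(index, (url, ""))
    hi = bisect.bisect_left(index, (url, "\uffff"))
    if lo == hi:
        return None
    # Binary-search inside that block for the insertion point of the target date.
    pos = bisect.bisect_left(index, (url, target), lo, hi)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = index[max(lo, pos - 1):min(hi, pos + 1)]
    return min(candidates, key=lambda rec: abs(_ordinal(rec[1]) - _ordinal(target)))


if __name__ == "__main__":
    print(closest_capture(CAPTURES, "example.com/", "20000101"))
```

Each lookup costs a few binary searches regardless of how many records the sorted list holds, which is why a plain sorted file can stand in for an index here.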

  4. Approach to Large Scale Computation
     • Need: Scale, Reliability, Flexibility, Evolution, Low Risk
     • Solution:
       • Commodity Hardware
       • Commodity Operating Systems
       • Commodity Software
       • Commodity Programmers

  5. Hardware
     • Homogeneous machines lead to quick response through reallocation
     • HP desktop machines, 320MB RAM, 3U high, four 100GB IDE drives
     • $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB
     • 3 weeks from ordering to operational
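The per-terabyte figures on this slide can be cross-checked with a little arithmetic. The sketch below is a rough check, not the speaker's own calculation. It assumes decimal units (1 TB = 1000 GB) and one processor per machine. The per-machine cost is derived here and does not appear on the slide.

```python
# Figures taken from the slide.
DRIVES_PER_MACHINE = 4
DRIVE_GB = 100
RAM_MB_PER_MACHINE = 320
COST_PER_TB_USD = 4_000

# Assumption (not on the slide): decimal units.
GB_PER_TB = 1_000


def cluster_estimate(target_tb=100):
    """Derive per-TB and whole-cluster figures from the per-machine specs."""
    machine_gb = DRIVES_PER_MACHINE * DRIVE_GB           # 400 GB of raw disk per machine
    machines_per_tb = GB_PER_TB / machine_gb             # matches the slide's 2.5 processors/TB
    return {
        "machines_per_tb": machines_per_tb,
        "ram_mb_per_tb": machines_per_tb * RAM_MB_PER_MACHINE,
        # Derived, not stated on the slide: what one machine would cost at $4k/TB.
        "implied_machine_cost": COST_PER_TB_USD / machines_per_tb,
        "machines": target_tb * machines_per_tb,
        "total_cost": target_tb * COST_PER_TB_USD,
    }


if __name__ == "__main__":
    for name, value in cluster_estimate().items():
        print(f"{name}: {value:g}")
```

Under these assumptions the RAM works out to 800 MB per terabyte, so the slide's "1GB RAM/TB" looks like a rounded figure. The 100 TB total comes to $400k, which agrees with the "all machines" cost quoted on slide 11.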

  6. Networking
     • HP ProCurve 100BASE-T switches
     • About $40/port (street)
     • Load balancing by DNS round-robin, Cisco, Program
     • Network booting, so OS is re-installed on every boot
     • T3 to the Internet for $300/megabit/month
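DNS round-robin spreads load by rotating the order of the address records returned for a name, so successive clients tend to connect to different front ends. The toy resolver below only simulates that rotation in memory. The class name and the addresses (from the 192.0.2.0/24 documentation range) are hypothetical, and nothing here describes how the actual DNS setup was configured.

```python
from collections import Counter, deque


class RoundRobinResolver:
    """Toy stand-in for a DNS server that rotates its answers on each query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        """Return the current address order, then rotate it for the next caller."""
        answer = list(self._addresses)
        self._addresses.rotate(-1)  # the first address moves to the back
        return answer


if __name__ == "__main__":
    resolver = RoundRobinResolver(["192.0.2.10", "192.0.2.11", "192.0.2.12"])
    # Most clients simply use the first address in the answer.
    picks = Counter(resolver.resolve()[0] for _ in range(9))
    print(picks)
```

The appeal for a cluster of identical machines is that adding capacity means adding one more address to the list; the trade-off is that plain rotation knows nothing about which machines are busy or down.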

  7. Disk as Tape
     • Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
     • Using removable hard drives to replace tape’s function has been successful
     • When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
     • Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. Think “HAL” rather than StorageTek (Idea by Jim Gray of Microsoft)

  8. Backup: 3 scenarios
     • Disaster Recovery: Preservation through Replication
     • Hardware Faults: different solutions for different situations
       • Clusters
       • Load balancing
       • Replication
       • Tolerate machine/disk outages
       • (Avoided RAID and expensive, low-volume solutions)
     • Programmer Error: slow replication, timestamped duplicates
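The "timestamped duplicates" defence against programmer error can be pictured as keeping a dated copy of a file before overwriting it, so a bad update can be rolled back. The sketch below is one guess at the idea, not the system described in the talk. The function name `save_with_history` and the naming scheme are hypothetical, and the slow-replication half is not modelled.

```python
import shutil
import time
from pathlib import Path


def save_with_history(path, new_text, now=None):
    """Write `new_text` to `path`, first keeping a timestamped copy of the old file.

    Returns the path of the backup copy, or None if there was nothing to back up.
    `now` is a Unix timestamp; it defaults to the current time.
    """
    path = Path(path)
    backup = None
    if path.exists():
        stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(now))
        # Hypothetical naming scheme: original name plus a UTC timestamp suffix.
        backup = path.with_name(f"{path.name}.{stamp}")
        shutil.copy2(path, backup)  # copy contents and file metadata
    path.write_text(new_text)
    return backup
```

Two saves within the same second would reuse one backup name in this sketch; a real tool would need a finer timestamp or a counter.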

  9. Operating System Choices
     • Need: supportable, clusterable, improving, good support
     • Commodity, Remote operation of hundreds of nodes, free, source code (for documentation and in-house fixes)
     • Reality of Evolution
       • Integrated Solaris/x86, FreeBSD, Linux
       • Solaris does not support IDE well
       • FreeBSD does not thread well
       • Linux does not NFS well, but has momentum
     • Linux is now our lead OS

  10. Parallel Execution Model
     • Datamining with command line interface
     • Controlling machines with 2TB of free space dispatch commands and data to parallel machines
     • Use flat files
     • Build explicit indexes
     • Use “sort” in datamining; use binary searching for random access
     • P2 "grep pdf *.cdx | cut -fDATE | sort" -c "sort -m | uniq -c" -p $ARCHIVE
     • Non-programmers become parallel dataminers in less than 2 weeks
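The P2 command line reads like a scatter/merge job: each machine filters and sorts its own flat files, and the controlling machine merges the sorted outputs and counts duplicates. The sketch below imitates that pattern in a single process, as an interpretation only. The meaning of P2's `-c` and `-p` flags is inferred from the example, and the record layout ("url date mimetype") and function names are hypothetical.

```python
import heapq
from itertools import groupby


def map_shard(lines, needle="pdf", date_field=1):
    """Per-machine step, in the spirit of `grep pdf | cut | sort`.

    Keeps lines containing `needle`, extracts the date column, and sorts it.
    """
    return sorted(line.split()[date_field] for line in lines if needle in line)


def combine(sorted_runs):
    """Controller step, in the spirit of `sort -m | uniq -c`.

    Merges already-sorted runs without re-sorting, then counts each distinct key.
    """
    merged = heapq.merge(*sorted_runs)
    return [(key, sum(1 for _ in group)) for key, group in groupby(merged)]


if __name__ == "__main__":
    # Two hypothetical shards, standing in for index files on two machines.
    shard_a = [
        "example.com/a.pdf 19990501 application/pdf",
        "example.com/index.html 19990501 text/html",
        "example.com/b.pdf 20010315 application/pdf",
    ]
    shard_b = ["example.org/c.pdf 19990501 application/pdf"]
    print(combine([map_shard(shard_a), map_shard(shard_b)]))
```

Because every shard returns sorted output, the combining step is a streaming merge and never has to hold the whole result in memory, which is presumably why the slide stresses "sort" as the core datamining primitive.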

  11. Performance
     • 500 queries/second on 100GB database
       • Queries on one key, uses about 10 tables
       • On 6 computers: 2 database machines, 4 front ends
       • $20,000 (would be less today, but they are older and have 4GB RAM)
     • 10 queries/second on 100TB database
       • Index is on 16 computers, data is on 200 computers
       • 2 query types
       • $16,000 for index machines, $400k for all machines
     • General queries vary in speed
     • The “unit” is the $500 PC for added speed or capacity
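The two configurations on this slide can be compared by hardware cost per unit of query throughput. The sketch below is back-of-the-envelope arithmetic on the slide's own figures, not a benchmark. The dictionary layout and function names are assumptions for illustration.

```python
# Figures taken from the slide.
SYSTEMS = {
    "100GB": {"queries_per_second": 500, "cost_usd": 20_000},
    "100TB": {"queries_per_second": 10, "cost_usd": 400_000},
}
INDEX_MACHINES = 16
INDEX_COST_USD = 16_000


def cost_per_qps(system):
    """Hardware dollars spent per query/second of sustained throughput."""
    return system["cost_usd"] / system["queries_per_second"]


def cost_per_index_machine():
    """Average price of one index machine in the 100TB system."""
    return INDEX_COST_USD / INDEX_MACHINES


if __name__ == "__main__":
    for name, system in SYSTEMS.items():
        print(f"{name}: ${cost_per_qps(system):,.0f} per query/second")
    print(f"index machine: ${cost_per_index_machine():,.0f} each")
```

On these numbers the 100TB system costs about a thousand times more per query/second than the 100GB one, which suggests its money buys stored capacity rather than query rate.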

  12. Suggestions
     • Reconsider purchases from:
       • Oracle
       • EMC
       • Sun, IBM, HP, Compaq, Dell
       • Veritas
       • Legato
       • Exodus
       • Your ISP
       • Cisco
     • Our systems scale up well, are reliable, and are flexible because… they are inexpensive.
