High Performance Presentation: 5 slides/Minute? (65 slides / 15 minutes) IO and DB “stuff” for LSST

High Performance Presentation: 5 slides/Minute? (65 slides / 15 minutes) IO and DB “stuff” for LSST • A new world record? • Jim Gray, Microsoft Research

TerraServer Lessons Learned • Hardware is 5 9’s (with clustering) • Software is 5 9’s (with clustering) • Admin is 4 9’s (offline maintenance) • Network is 3 9’s (mistakes, environment) • Simple designs are best • 10 TB DB is management limit: 1 PB = 100 x 10 TB DB; this is 100x better than 5 years ago (Yahoo!, HotMail are 300 TB, Google is 2 PB) • Minimize use of tape • Backup to disk (snapshots) • Portable disk TBs

Serving BIG images • Break into tiles (compressed): • 10KB for modems • 1MB for LANs • Mosaic the tiles for pan, crop • Store image pyramid for zoom • 2x zoom only adds 33% overhead: 1 + ¼ + 1/16 + … = 4/3 • Use a spatial index to cluster & find objects • Figure: pyramid levels, a .2x.2 km² tile and .4x.4 km², .8x.8 km², 1.6x1.6 km² images
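The 33% figure is a geometric series: each 2x zoom-out level holds a quarter of the pixels of the level below it. A minimal sketch of that arithmetic, assuming every level compresses equally well (the function name is illustrative, not from the talk):

```python
def pyramid_overhead(levels: int, zoom: int = 2) -> float:
    """Storage for `levels` coarser layers, as a fraction of the full-resolution image."""
    shrink = zoom * zoom  # each zoom-out step divides the pixel count by zoom squared
    return sum(1.0 / shrink**k for k in range(1, levels + 1))


if __name__ == "__main__":
    for levels in (1, 2, 4, 8):
        print(f"{levels} extra levels: {pyramid_overhead(levels):.2%} overhead")
```

However many levels are stored, the overhead stays below one third of the base image.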

Economics • People are more than 50% of costs • Disks are more than 50% of capital • Networking is the other 50% • People • Phone bill • Routers • Cpus are free (they come with the disks)

SkyServer/ SkyQuery Lessons • DB is easy • Search • It is BEST to index • You can put objects and attributes in a row (SQL puts big blobs off-page) • If you can’t index, you can extract attributes and quickly compare • SQL can scan at 5M records/cpu/second • Sequential scans are embarrassingly parallel • Web services are easy • XML Data Sets : • a universal way to represent answers • minimize round trips: 1 request/response • Diffgrams allow disconnected update

How Will We Find Stuff? Put everything in the DB (and index it) • Need dbms features: Consistency, Indexing, Pivoting, Queries, Speed/scalability, Backup, replication. If you don’t use one, you’re creating one! • Simple logical structure: • Blob and link is all that is inherent • Additional properties (facets == extra tables) and methods on those tables (encapsulation) • More than a file system • Unifies data and meta-data • Simpler to manage • Easier to subset and reorganize • Set-oriented access • Allows online updates • Automatic indexing, replication • SQL
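The "blob and link plus facet tables" structure can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names (`object`, `position`), the column names and the sample rows are hypothetical, and the talk itself is about a server DBMS, not SQLite.

```python
import sqlite3


def build_catalog() -> sqlite3.Connection:
    """Create an in-memory catalog: one blob-and-link table plus one facet table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE object (
            id   INTEGER PRIMARY KEY,
            data BLOB,   -- the raw bytes
            link TEXT    -- where the bytes came from
        );
        CREATE TABLE position (   -- a facet: extra properties live in an extra table
            id      INTEGER PRIMARY KEY REFERENCES object(id),
            ra_deg  REAL,
            dec_deg REAL
        );
        CREATE INDEX position_ra_dec ON position (ra_deg, dec_deg);
        """
    )
    objects = [(1, b"pixels-1", "file://a"), (2, b"pixels-2", "file://b")]
    positions = [(1, 184.0289, -1.1259), (2, 184.0257, -1.2180)]
    conn.executemany("INSERT INTO object VALUES (?, ?, ?)", objects)
    conn.executemany("INSERT INTO position VALUES (?, ?, ?)", positions)
    return conn


def find_in_box(conn, ra_min, ra_max, dec_min, dec_max):
    """Set-oriented lookup: ids and links of all objects inside an (ra, dec) box."""
    rows = conn.execute(
        """
        SELECT o.id, o.link
        FROM position AS p JOIN object AS o ON o.id = p.id
        WHERE p.ra_deg BETWEEN ? AND ? AND p.dec_deg BETWEEN ? AND ?
        ORDER BY o.id
        """,
        (ra_min, ra_max, dec_min, dec_max),
    )
    return rows.fetchall()
```

Calling `find_in_box(build_catalog(), 184.02, 184.03, -1.15, -1.10)` returns the one object in that box; the data and its metadata are found with the same query.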

How Do We Represent Data To The Outside World?
<?xml version="1.0" encoding="utf-8" ?>
<DataSet xmlns="http://WWT.sdss.org/">
  <xs:schema id="radec" xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
    <xs:element name="radec" msdata:IsDataSet="true">
      <xs:element name="Table">
        <xs:element name="ra" type="xs:double" minOccurs="0" />
        <xs:element name="dec" type="xs:double" minOccurs="0" />
  …
  <diffgr:diffgram xmlns:msdata="urn:schemas-microsoft-com:xml-msdata" xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra>
        <dec>-1.12590950121524</dec>
      </Table>
      …
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra>
        <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>
(Figure labels: the first part is the schema, the second is the data or diffgram.)
• File metaphor too primitive: just a blob
• Table metaphor too primitive: just records
• Need Metadata describing data context
• Format
• Provenance (author/publisher/citations/…)
• Rights
• History
• Related documents
• In a standard format
• XML and XML schema
• DataSet is great example of this
• World is now defining standard schemas
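To make the DataSet format concrete, here is a minimal sketch of a client reading the rows back out with Python's standard xml.etree.ElementTree. The `SAMPLE` string is a cut-down, well-formed version of the slide's fragment (schema section omitted, only the two rows shown on the slide); `read_rows` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified from the slide: the diffgram part only, with the two rows it shows.
SAMPLE = """\
<DataSet xmlns="http://WWT.sdss.org/">
  <diffgr:diffgram xmlns:msdata="urn:schemas-microsoft-com:xml-msdata"
                   xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra>
        <dec>-1.12590950121524</dec>
      </Table>
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra>
        <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>
"""


def read_rows(xml_text: str) -> list[tuple[float, float]]:
    """Return (ra, dec) pairs from every <Table> row in a DataSet document."""
    root = ET.fromstring(xml_text)
    # <radec xmlns=""> resets the default namespace, so the row tags are unqualified.
    return [
        (float(row.findtext("ra")), float(row.findtext("dec")))
        for row in root.iter("Table")
    ]


if __name__ == "__main__":
    for ra, dec in read_rows(SAMPLE):
        print(ra, dec)
```

One request returns the whole answer set, which is the "minimize round trips" point made earlier in the talk.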

Emerging Concepts • Standardizing distributed data • Web Services, supported on all platforms • Custom configure remote data dynamically • XML: Extensible Markup Language • SOAP: Simple Object Access Protocol • WSDL: Web Services Description Language • DataSets: Standard representation of an answer • Standardizing distributed computing • Grid Services • Custom configure remote computing dynamically • Build your own remote computer, and discard • Virtual Data: new data sets on demand

Szalay’s Law: The utility of N comparable datasets is N² • Metcalfe’s law applies to telephones, fax, Internet. • Szalay argues as follows: each new dataset gives new information, and 2-way combinations give new information. • Example: Combine these 3 datasets • (ID, zip code) • (ID, birth day) • (ID, height) • Other example: quark star: Chandra X-ray + Hubble optical + 600 year old records. Drake, J. J. et al. Is RX J185635-375 a Quark Star? Preprint (2002). • Figure: Crab star, 1054 AD. X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
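The N² claim comes from counting two-way combinations. A small sketch that enumerates them for the three example datasets on the slide (the helper name is illustrative):

```python
from itertools import combinations

DATASETS = ["(ID, zip code)", "(ID, birth day)", "(ID, height)"]


def pairings(names):
    """All two-way combinations of datasets; there are N * (N - 1) / 2 of them."""
    return list(combinations(names, 2))


if __name__ == "__main__":
    for left, right in pairings(DATASETS):
        print(f"join {left} with {right} on ID")
```

The count N(N-1)/2 grows on the order of N², which is the sense in which utility scales quadratically.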

Science is hitting a wall: FTP and GREP are not adequate • You can GREP 1 MB in a second • You can GREP 1 GB in a minute • You can GREP 1 TB in 2 days • You can GREP 1 PB in 3 years • Oh!, and 1 PB ~ 10,000 disks • At some point you need indices to limit search, parallel data search and analysis tools • This is where databases can help • You can FTP 1 MB in 1 sec • You can FTP 1 GB / min (= 1 $/GB) • … 2 days and 1K$ • … 3 years and 1M$
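A rough sketch of the scan arithmetic behind these figures. The 10 MB/s single-stream rate is an assumption chosen to land near the slide's TB and PB numbers (the slide's own figures imply somewhat different rates at each size), and the `streams` parameter models searching many disks in parallel.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_seconds(n_bytes: float, bytes_per_second: float = 10e6, streams: int = 1) -> float:
    """Time to read n_bytes sequentially, split evenly over `streams` parallel readers."""
    return n_bytes / (bytes_per_second * streams)


if __name__ == "__main__":
    one_tb, one_pb = 1e12, 1e15
    print(f"1 TB, one stream:      {scan_seconds(one_tb) / SECONDS_PER_DAY:.1f} days")
    print(f"1 PB, one stream:      {scan_seconds(one_pb) / SECONDS_PER_YEAR:.1f} years")
    print(f"1 PB, 10,000 streams:  {scan_seconds(one_pb, streams=10_000) / 3600:.1f} hours")
```

Under these assumptions a petabyte spread over 10,000 disks scans in hours rather than years, which is the case for parallel search; an index avoids most of the scan altogether.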

Networking: Great hardware & Software • WANs @ 5 GBps (1 λ = 40 Gbps) • Gbps Ethernet common (~100 MBps) • Offload gives ~2 Hz/byte • Will improve with RDMA & zero-copy • 10 Gbps mainstream by 2004 • Faster I/O • 1 GB/s today (measured) • 10 GB/s under development • SATA (serial ATA) 150 MBps/device

Bandwidth: 3x bandwidth/year for 25 more years • Today: • 40 Gbps per channel (λ) • 12 channels per fiber (wdm): 500 Gbps • 32 fibers/bundle = 16 Tbps/bundle • In lab 3 Tbps/fiber (400 x WDM) • In theory 25 Tbps per fiber • 1 Tbps = USA 1996 WAN bisection bandwidth • Aggregate bandwidth doubles every 8 months!
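The slide's round numbers can be checked with a few lines of arithmetic. A minimal sketch (function names are illustrative): it converts a doubling period into a yearly growth factor and multiplies out the per-fiber and per-bundle capacity.

```python
def yearly_factor(doubling_months: float) -> float:
    """Growth per year for a quantity that doubles every `doubling_months` months."""
    return 2 ** (12 / doubling_months)


def fiber_gbps(channels: int, gbps_per_channel: float) -> float:
    """Capacity of one fiber carrying `channels` wavelengths."""
    return channels * gbps_per_channel


if __name__ == "__main__":
    per_fiber = fiber_gbps(12, 40)          # the slide rounds this to 500 Gbps
    per_bundle_tbps = 32 * per_fiber / 1000  # the slide rounds this to 16 Tbps
    print(f"doubling every 8 months = {yearly_factor(8):.2f}x per year")
    print(f"per fiber: {per_fiber:.0f} Gbps, per bundle: {per_bundle_tbps:.2f} Tbps")
```

Doubling every 8 months works out to a little under 3x per year, so the title and the last bullet describe the same trend.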

Hero/Guru Networking • Figure (route map) labels: Redmond/Seattle, WA; San Francisco, CA; New York; Arlington, VA; 5626 km, 10 hops • Participants: Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (high speed connectivity consortium), DARPA, Information Sciences Institute

Real Networking • Bandwidth for 1 Gbps “stunt” cost 400k$/month • ~ 200$/Mbps/month (at each end + hardware + admin) • Price not improving very fast • Doesn’t include operations / local hardware costs • Admin… costs more ~1$/GB to 10$/GB • Challenge: Go home and FTP from a “fast” server • The Guru Gap: FermiLab <-> JHU • Both “well connected” • vBNS, NGI, Internet2, Abilene, … • Actual desktop-to-desktop ~ 100KBps • 12 days/TB (but it crashes first). • The reality: to move 10GB, mail it! (TeraScale Sneakernet)

How Do You Move A Terabyte?
Context      Speed (Mbps)  Rent ($/month)  $/Mbps  $/TB sent  Time/TB
Home phone   0.04          40              1,000   3,086      6 years
Home DSL     0.6           70              117     360        5 months
T1           1.5           1,200           800     2,469      2 months
T3           43            28,000          651     2,010      2 days
OC3          155           49,000          316     976        14 hours
OC 192       9600          1,920,000       200     617        14 minutes
100 Mbps     100           -               -       -          1 day
Gbps         1000          -               -       -          2.2 hours
Source: TeraScale Sneakernet, Microsoft Research, Jim Gray et al.
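The Time/TB and $/TB columns follow from the link speed and the monthly rent. A minimal sketch of that calculation, assuming a decimal terabyte (8 × 10¹² bits) and 30-day months, which are the conventions that reproduce the table's figures; the function names are illustrative.

```python
TERABYTE_BITS = 8e12
SECONDS_PER_MONTH = 30 * 86_400  # 30-day months match the table's $/TB column


def seconds_per_tb(mbps: float) -> float:
    """Seconds to push one terabyte through a link running flat out at `mbps`."""
    return TERABYTE_BITS / (mbps * 1e6)


def dollars_per_tb(mbps: float, rent_per_month: float) -> float:
    """Rent paid while the terabyte is in flight."""
    return rent_per_month * seconds_per_tb(mbps) / SECONDS_PER_MONTH


if __name__ == "__main__":
    links = [("Home phone", 0.04, 40), ("Home DSL", 0.6, 70), ("T1", 1.5, 1_200),
             ("T3", 43, 28_000), ("OC3", 155, 49_000), ("OC 192", 9600, 1_920_000)]
    for name, mbps, rent in links:
        days = seconds_per_tb(mbps) / 86_400
        print(f"{name:<11} {days:10.2f} days/TB  {dollars_per_tb(mbps, rent):8.0f} $/TB")
```

The calculation assumes the link is fully used for the whole transfer, so real transfers cost more per terabyte than the table suggests.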

There Is A Problem Niklaus Wirth: Algorithms + Data Structures = Programs • GREAT!!!! • XML documents are portable objects • XML documents are complex objects • WSDL defines the methods on objects (the class) • But will all the implementations match? • Think of UNIX or SQL or C or… • This is a work in progress.

Changes To DBMS’s • Integration of Programs and Data • Put programs inside the database: allows OODB • Gives you parallel execution • Integration of Relational, Text, XML, Time • Scaleout (even more) • AutoAdmin (“no knobs”) • Manage Petascale databases (utilities, geoplex, online, incremental)

Publishing Data
Roles        Traditional  Emerging
Authors      Scientists   Collaborations
Publishers   Journals     Project web site
Curators     Libraries    Data+Doc Archives
Archives     Archives     Digital Archives
Consumers    Scientists   Scientists

The Core Problem: No Economic Model • The archive user has not yet been born. How can he pay you to curate the data? • The Scientist gathered data for his own purpose. Why should he pay (invest time) for your needs? • Answer to both: that’s the scientific method • Curating data (documenting the design, the acquisition and the processing) is very hard and there is no reward for doing it. The results are rewarded, not the process of getting them. • Storage/archive NOT the problem (it’s almost free) • Curating/Publishing is expensive.

SDSS Data Inflation – Data Pyramid • Level 1A: grows 5 TB pixels/year, growing to 25 TB; ~2 TB/y compressed, growing to 13 TB; ~4 TB today (level 1A in NASA terms) • Level 2: derived data products ~10x smaller, but there are many catalogs • Publish new edition each year • Fixes bugs in data • Must preserve old editions • Creates data pyramid • Store each edition: 1, 2, 3, 4… N ~ N² bytes • Net: Data Inflation: L2 ≥ L1 • Figure: 4 editions (E1 to E4, over time) of level 1A data (source data) and 4 editions of level 2 derived data products. Note that each derived product is small, but they are numerous. This proliferation combined with the data pyramid implies that level 2 data more than doubles the total storage volume.
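The "1, 2, 3, 4… N ~ N²" line is a triangular sum: if the survey adds the same amount of data each year and every yearly edition is kept, edition k holds k years of data. A minimal sketch, assuming constant yearly growth (the function name and the unit of "one year of data" are illustrative):

```python
def total_stored(editions: int, yearly_growth: float = 1.0) -> float:
    """Storage for all editions 1..N when edition k holds k years of data."""
    return sum(k * yearly_growth for k in range(1, editions + 1))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        latest = n  # size of the newest edition alone, in years of data
        print(f"{n} editions: latest = {latest}, all editions = {total_stored(n):.0f}")
```

The total is N(N+1)/2, so keeping every edition costs on the order of N² rather than N.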

What’s needed? (not drawn to scale) • Scientists: Science Data & Questions • Miners: Data Mining Algorithms • Plumbers: Database to store data, execute queries • Tools: Question & Answer, Visualization

CS Challenges For Astronomers • Objectify your field: • Precisely define what you are talking about. • Objects and Methods / Attributes • This is REALLY difficult. • UCDs are a great start, but there is a long way to go • “Software is like entropy, it always increases.” -- Norman Augustine, Augustine’s Laws • Beware of legacy software: cost can eat you alive • Share software where possible. • Use standard software where possible. • Expect it will cost you 25% to 40% of project. • Explain what you want to do with the VO • 20 queries or something like that.

Challenge to Data Miners: Linear and Sub-Linear Algorithms Techniques • Today most correlation / clustering algorithms are polynomial: N² or N³ or… • N² is VERY big when N is big (10¹⁸ is big) • Need sub-linear algorithms • Current approaches are near optimal given current assumptions. • So, need new assumptions, probably heuristic and approximate
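To put "10¹⁸ is big" in wall-clock terms, here is a rough sketch comparing N log N and N² work for N = 10⁹ objects. The rate of 10⁹ simple operations per second is an assumption for illustration, not a figure from the talk.

```python
import math

SECONDS_PER_YEAR = 365 * 86_400


def seconds_polynomial(n: float, exponent: int, ops_per_second: float = 1e9) -> float:
    """Run time of an algorithm doing n**exponent operations."""
    return n**exponent / ops_per_second


def seconds_n_log_n(n: float, ops_per_second: float = 1e9) -> float:
    """Run time of an algorithm doing n * log2(n) operations."""
    return n * math.log2(n) / ops_per_second


if __name__ == "__main__":
    n = 1e9
    print(f"N log N: {seconds_n_log_n(n):.0f} seconds")
    print(f"N^2:     {seconds_polynomial(n, 2) / SECONDS_PER_YEAR:.1f} years")
```

Under that assumption the quadratic algorithm runs for decades on data that an N log N algorithm handles in well under a minute.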

Challenge to Data Miners: Rediscover Astronomy • Astronomy needs deep understanding of physics. • But, some was discovered as variable correlations then “explained” with physics. • Famous example: Hertzsprung-Russell Diagram: star luminosity vs color (= temperature) • Challenge 1 (the student test): How much of astronomy can data mining discover? • Challenge 2 (the Turing test): Can data mining discover NEW correlations?

Plumbers: Organize and Search Petabytes • Automate • instrument-to-archive pipelines: it is a messy business, very labor intensive. Most current designs do not scale (too many manual steps). BaBar (1 TB/day) and ESO pipeline seem promising. • A job-scheduling or workflow system • Physical Database design & access • Data access patterns are difficult to anticipate • Aggressively and automatically use indexing, sub-setting. • Search in parallel • Goals • Answer easy queries in 10 seconds. • Answer hard queries (correlations) in 10 minutes.

Scaleable Systems • Scale UP: grow by adding components to a single system. • Scale OUT: grow by adding more systems.

What’s New – Scale Up • 64 bit & TB size main memory • SMP on chip: everything’s smp • 32… 256 SMP: locality/affinity matters • TB size disks • High-speed LANs

Who needs 64-bit addressing? You! Need 64-bit addressing! • “640K ought to be enough for anybody.” Bill Gates, 1981 • But that was 21 years ago == 2 × 21/3 = 14 bits ago. • 20 bits + 14 bits = 34 bits, so… “16 GB ought to be enough for anybody.” Jim Gray, 2002 • 34 bits > 31 bits, so… 34 bits == 64 bits • YOU need 64 bit addressing!

64 bit – Why bother? • 1966 Moore’s law: 4x more RAM every 3 years, i.e. 1 bit of addressing every 18 months • 36 years later: 2 × 36/3 = 24 more bits • Not exactly right, but… 32 bits not enough for servers • 32 bits gives no headroom for clients • So, time is running out (has run out) • Good news: Itanium™ and Hammer™ are maturing, and so is the base software (OS, drivers, DB, Web, ...) • Windows & SQL @ 256GB today!
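The "bits ago" arithmetic on this slide and the previous one follows from 4x more RAM every 3 years, which is 2 address bits per 3 years. A minimal sketch (function names are illustrative):

```python
def address_bits_gained(years: float) -> float:
    """Address bits added in `years` when RAM grows 4x (2 bits) every 3 years."""
    return 2 * years / 3


def addressable_bytes(bits: int) -> int:
    """Bytes reachable with a `bits`-wide address."""
    return 2**bits


if __name__ == "__main__":
    since_1981 = address_bits_gained(21)  # 1981 to 2002
    since_1966 = address_bits_gained(36)  # 1966 to 2002
    print(f"21 years: {since_1981:.0f} bits, 36 years: {since_1966:.0f} bits")
    print(f"20 + 14 bits addresses {addressable_bytes(34) // 2**30} GB")
```

Starting from the roughly 20 bits needed for 640K, 14 more bits gives 34 bits, which is past what a 32-bit address can reach.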

64 bit – why bother? • Memory intensive calculations: • You can trade memory for IO and processing • Example: Data Analysis & Clustering at JHU • in memory CPU time is ~N log N, N ~ 100M • Disk M chunks → time ~ M² • must run many times • Now running on HP Itanium, Windows .NET Server 2003, SQL Server • Graph (time axis: day, week, month, year, decade) courtesy of Alex Szalay & Adrian Pope of Johns Hopkins University

Amdahl’s balanced System Laws • 1 mips needs 4 MB ram and needs 20 IO/s • At 1 billion instructions per second: need 4 GB/cpu, need 50 disks/cpu! • 64 cpus … 3,000 disks • Figure: 1 bips cpu, 4 GB RAM, 50 disks (10,000 IOps, 7.5 TB)

The 5 Minute Rule – Trade RAM for Disk Arms • If data is re-referenced every 5 minutes, it is cheaper to cache it in ram than to get it from disk • A disk access/second ~ 50$, or ~ 50MB for 1 second, or ~ 50KB for 1,000 seconds. • Each app has a memory “knee”: up to the knee, more memory helps a lot.
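The trade-off can be written as a break-even interval: keep an object in RAM if it is re-referenced more often than the interval at which the RAM it occupies costs as much as the disk arm time it saves. A minimal sketch using the slide's own prices, where 1 $/MB of RAM is inferred from "50$ ~ 50MB" and is an assumption; the function name is illustrative.

```python
def break_even_seconds(
    object_bytes: float,
    dollars_per_access_per_second: float = 50.0,  # slide: a disk access/second ~ 50$
    dollars_per_mb_ram: float = 1.0,              # assumed from "50$ ~ 50MB"
) -> float:
    """Re-reference interval at which caching in RAM and re-reading from disk cost the same."""
    ram_cost = object_bytes / 1e6 * dollars_per_mb_ram
    return dollars_per_access_per_second / ram_cost


if __name__ == "__main__":
    for size in (50e3, 1e6, 50e6):
        print(f"{size / 1e3:>8.0f} KB: cache if re-read within {break_even_seconds(size):,.0f} s")
```

With these prices a 50 KB object breaks even at 1,000 seconds and a 50 MB object at 1 second, matching the two figures on the slide.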

64 bit Reduces IO, saves disks • Large memory reduces IO • 64-bit simplifies code • Processors can be faster (wider word) • Ram is cheap (4 GB ~ 1k$ to 20k$) • Can trade ram for disk IO • Better response time. • Example • tpcC • 4x1Ghz Itanium2 vs • 4x1.6Ghz IA32 • 40 extra GB → 60% extra throughput • Chart configurations: 4x1.6Ghz IA32 8GB; 4x1.6Ghz IA32 32GB; 4x1 Ghz IA64 48GB

AMD Hammer™ Coming Soon • AMD Hammer™ is 64bit capable • 2003: millions of Hammer™ CPUs will ship • 2004: most AMD CPUs will be 64bit • 4GB ram is less than 1,000$ today, less than 500$ in 2004 • Desktops (Hammer™) and servers (Opteron™). • You do the math… Who will demand 64bit capable software?

A 1TB Main Memory • Amdahl’s law: 1 mips/MB, now 1:5, so ~20 x 10 Ghz cpus need 1TB ram • 1TB ram ~ 250k$ … 2m$ today, ~ 25k$ … 200k$ in 5 years • 128 million pages • Takes a LONG time to fill • Takes a LONG time to refill • Needs new algorithms • Needs parallel processing • Which leads us to… • The memory hierarchy • smp • numa
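A sketch of the sizing arithmetic, under stated assumptions: "1:5" is read as 5 MB of RAM per mips, each cpu retires one instruction per clock, pages are 8 KB, and memory is filled at the 1 GB/s I/O rate quoted earlier in the talk. The function names are illustrative.

```python
def ram_needed_mb(cpus: int, ghz_per_cpu: float, mb_per_mips: float = 5.0) -> float:
    """RAM for a balanced system, assuming one instruction per clock cycle."""
    mips = cpus * ghz_per_cpu * 1000
    return mips * mb_per_mips


def page_count(memory_bytes: int, page_bytes: int = 8 * 1024) -> int:
    """Number of pages in a memory of the given size (8 KB pages assumed)."""
    return memory_bytes // page_bytes


def fill_seconds(memory_bytes: float, bytes_per_second: float = 1e9) -> float:
    """Time to load the whole memory at a sustained I/O rate."""
    return memory_bytes / bytes_per_second


if __name__ == "__main__":
    print(f"RAM needed: {ram_needed_mb(20, 10) / 1e6:.0f} TB")
    print(f"pages:      {page_count(2**40) // 2**20} million")
    print(f"fill time:  {fill_seconds(1e12) / 60:.0f} minutes at 1 GB/s")
```

Even at 1 GB/s, loading a terabyte takes many minutes, which is why filling and refilling such a memory calls for parallel I/O.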

Hyper-Threading: SMP on chip • If cpu is always waiting for memory: predict memory requests and prefetch • done • If cpu still always waiting for memory: multi-program it (multiple hardware threads per cpu) • Hyper Threading: Everything is SMP • 2 now, more later • Also multiple cpus/chip • If your program is single threaded • You waste ½ the cpu and memory bandwidth • Eventually waste 80% • App builders need to plan for threads.

The Memory Hierarchy • Locality REALLY matters • CPU 2 GHz, RAM at 5 MHz: RAM is no longer random access. • Organizing the code gives 3x (or more) • Organizing the data gives 3x (or more)
Level      Latency (clocks)  Size
Registers  1                 1 KB
L1         2                 32 KB
L2         10                256 KB
L3         30                4 MB
Near RAM   100               16 GB
Far RAM    300               64 GB

Figure labels (memory hierarchy diagram): Disk; Network; Other Cpus; Remote RAM; RAM; Remote cache; The Bus; L2 cache (off chip); L1 cache (Dcache, Icache); registers; Arithmetic Logical Unit

Scaleup Systems: Non-Uniform Memory Architecture (NUMA) • Coherent but… remote memory is even slower • All cells see a common memory • Slow local main memory • Slower remote main memory • Scaleup by adding cells • Planning for 64 cpu, 1TB ram • Interconnect, Service Processor, Partition management are vendor specific • Several vendors doing this: Itanium and Hammer • Figure labels: cells of CPUs, Mem, Chipset and I/O; System interconnect (Crossbar/Switch); Partition manager; Service Processor; Config DB

Changed Ratios Matter • If everything changes by 2x, then nothing changes. • So, it is the different rates that matter. • Slowly changing: Speed of light, People costs, Memory bandwidth, WAN prices • Improving FAST: CPU speed, Memory & disk size, Network Bandwidth

Disks are becoming tapes • Capacity: • 150 GB now, 300 GB this year, 1 TB by 2007 • Bandwidth: • 40 MBps now, 150 MBps by 2007 • Read time • 2 hours sequential, 2 days random now; 4 hours sequential, 12 days random by 2007 • Figure: 150 GB disk (150 IO/s, 40 MBps) vs. 1 TB disk (200 IO/s, 150 MBps)

Disks are becoming tapes: Consequences • Use most disk capacity for archiving: Copy on Write (COW) file system in Windows and other OSs. • RAID10 saves arms, costs space (OK!). • Backup to disk: pretend it is a 100GB disk + 1 TB disk • Keep hot 10% of data on fastest part of disk. • Keep cold 90% on colder part of disk • Organize computations to read/write disks sequentially in large blocks.

Wiring is going serial and getting FAST! • Gbps Ethernet and SATA built into chips • Raid Controllers: inexpensive and fast. • 1U storage bricks @ 2-10 TB • SAN or NAS (iSCSI or CIFS/DAFS) • Figure: 8 x SATA at 150 MBps/link; Enet at 100 MBps/link

NAS – SAN Horse Race • Storage Hardware 1k$/TB/y; Storage Management 10k$...300k$/TB/y • So as with Server Consolidation: Storage Consolidation • Two styles: NAS (Network Attached Storage): File Server; SAN (System Area Network): Disk Server • I believe NAS is more manageable.

SAN/NAS Evolution • Monolithic • Modular • Sealed

IO Throughput: K Access Per Second (Kaps) vs. RPM • Chart: Kaps (axis 0 to 200) against disk RPM (axis 0 to 20,000)

Comparison Of Disk Cost: $’s for similar performance • Seagate Disk Prices*
Model #    Size     Speed     Connect.  Cost  $/K Rev
ATA 100    40 GB    5400 RPM  ATA       $86   $15.9
ATA 1000   40 GB    7200 RPM  ATA       $101  $14.0
36 ES 2    36.7 GB  10K RPM   SCSI      $325  $32.5
X15 36LP   36.7 GB  15K RPM   SCSI      $455  $29.7
X15 36LP   36.7 GB  15K RPM   Fibre     $455  $29.7
*Source: Seagate online store, quantity one prices

Comparison Of Disk Costs: ¢/MB for different systems
Mfg.     Size    Type      Cost   Cost/MB
Dell     80 GB   Int. ATA  $115   1.4¢
WD       120 GB  Ext. ATA  $276   2.3¢
EMC      XX GB   SAN       -      xx¢
Seagate  181 GB  Int SCSI  $1155  6.4¢
Source: Dell

Why Serial ATA Matters • Modern interconnect • Point-to-point drive connection • 150 MBps → 300 MBps • Facilitates ATA disk arrays • Enables inexpensive “cool” storage

Performance (on Y2k SDSS data) • Run times: on 15k$ HP Server (2 cpu, 1 GB, 8 disk) • Some take 10 minutes • Some take 1 minute • Median ~ 22 sec. • GHz processors are fast! • (10 mips/IO, 200 ins/byte) • 2.5 M rec/s/cpu • ~1,000 IO/cpu sec • ~ 64 MB IO/cpu sec
