
CS597A: Managing and Exploring Large Datasets

CS597A: Managing and Exploring Large Datasets. Kai Li. About This Seminar. Goal: Identify research directions and issues in managing and exploring large datasets. Plan: Overview of a few state-of-the-art storage systems.


Presentation Transcript


  1. CS597A: Managing and Exploring Large Datasets Kai Li

  2. About This Seminar
     • Goal: identify research directions and issues in managing and exploring large datasets
     • Plan:
       • Overview of a few state-of-the-art storage systems
       • Read papers on a few research systems in storage, data management, and data exploration
       • Discuss wild ideas
       • Define, work on, and present course projects

  3. Why Is This Area Interesting? (Where Are the Bottlenecks?)
     [Diagram labels: Network; Create; Transform; Transmit; Store and Retrieve]

  4. Computer Food Chains
     • Computer systems in the 1980s: Supercomputer (Cray, etc.), Mini-super (Convex, etc.), Mainframe (IBM 370), Minicomputer (VAX), Workstation (SUN), PC
     • Computer systems in the 1990s and 2000s: Supercomputer (Cray, etc.), Servers (IBM, SUN), PC, Laptop, PDA

  5. Storage Arrays of Food Chains?
     • Direct Attached Storage (DAS)
       • USB, Microdrive, Flash
       • ATA disks
       • “Super” SCSI RAID
       • ATA RAID
     • Storage Area Network (SAN)
       • “Super” SAN storage (EMC, Hitachi, IBM)
       • “MiniSuper” SAN storage (HPQ, Startups)
       • iSCSI (Startups)
     • Network Attached Storage (NAS)
       • PC storage (Dell, Snap!, MSFT SAK boxes)
       • “Super” NAS (NetApp, SUN)
       • “MiniSuper” NAS (Startups)

  6. Typical General Infrastructures
     [Diagram 1 labels: Clients; Network; File servers w/o disks; Storage Area Network; Mirrored storage (e.g., EMC); BCV or 3rd copy (e.g., EMC); Backup tape library]
     [Diagram 2 labels: Clients; Network; File servers w/ disks; Storage Area Network; Backup tape library]

  7. Exponential Growth (Courtesy Jim Gray, Turing Lecture 99)
     • Performance/Price doubles every 18 months
     • 100x per decade
     • Progress in next 18 months = ALL previous progress
       • New storage = sum of all old storage (ever)
       • New processing = sum of all old processing
     [Figure label: 15 years ago]
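The arithmetic behind these bullets can be checked with a short sketch. It assumes perfectly steady doubling every 18 months, which is an idealization of the slide's claim; the function names are illustrative and not from the lecture.

```python
DOUBLING_MONTHS = 18  # from the slide: performance/price doubles every 18 months


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Improvement factor after `months` of steady doubling (idealized model)."""
    return 2.0 ** (months / doubling_months)


def total_before(period):
    """Units produced in periods 0 .. period-1, if period k produces 2**k units."""
    return sum(2 ** k for k in range(period))


if __name__ == "__main__":
    # Ten years is 120 months, i.e. 120 / 18 doublings.
    print(f"10-year factor: {growth_factor(120):.1f}x")

    # Each new period produces one unit more than all earlier periods combined.
    for n in (1, 5, 10):
        print(f"period {n}: new = {2 ** n}, all previous = {total_before(n)}")
```

Under this model a decade gives roughly 100x, and each new 18-month period produces slightly more than everything before it, which is the sense in which "new storage = sum of all old storage".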

  8. Disk Density vs. Moore’s Law

  9. Storage Capacity Grows Fast

  10. Raw Storage Is Cheap (Source: Gartner and IDC)
      • Disk drives beat tapes in 2002 in $/TB (IDC)
        • Disk $/TB declines 50%/year
        • Tape $/TB declines 29%/year
      • But ATA arrays ($/TB) beat tape libraries in 2006 (Gartner)
        • Disk system $/TB declines 40%/year
        • Tape library $/TB declines 29%/year
      [Chart: $/TB, with 2002 and 2006 labeled]
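To see how different annual decline rates produce a crossover, here is a minimal sketch. The decline rates come from the slide; the starting cost ratio (disk at 4x tape) is a hypothetical value chosen only for illustration, and the helper name is not from the lecture.

```python
import math


def years_to_crossover(disk_cost, tape_cost, disk_decline, tape_decline):
    """Years until disk $/TB drops to tape $/TB, given constant annual declines.

    Costs follow cost * (1 - decline) ** years, so the crossover solves
    disk_cost * d**t == tape_cost * p**t for t.
    """
    if disk_cost <= tape_cost:
        return 0.0  # disk is already cheaper
    disk_keep = 1.0 - disk_decline
    tape_keep = 1.0 - tape_decline
    if disk_keep >= tape_keep:
        raise ValueError("disk must decline faster than tape for a crossover")
    return math.log(disk_cost / tape_cost) / math.log(tape_keep / disk_keep)


if __name__ == "__main__":
    START_RATIO = 4.0  # hypothetical: disk starts at 4x the tape $/TB
    raw = years_to_crossover(START_RATIO, 1.0, 0.50, 0.29)      # raw disk vs. tape
    systems = years_to_crossover(START_RATIO, 1.0, 0.40, 0.29)  # disk system vs. tape library
    print(f"raw disks: {raw:.1f} years, disk systems: {systems:.1f} years")
```

For the same starting ratio, the 40%/year system-level decline takes about twice as long to cross over as the 50%/year raw-disk decline, which is consistent with the slide placing the system-level crossover later than the raw-drive one.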

  11. Summary of Storage Trends
      • Disk density beats Moore’s Law
      • Data growth rate follows Moore’s Law
      • Raw disks are cheap while storage systems are very expensive
      • Crossover from tapes to disks

  12. How Much Information Is There? (Courtesy Jim Gray, Turing Lecture 99)
      • Soon everything can be recorded and indexed
      • Most data will never be seen by humans
      • Precious resource: human attention
      • Auto-summarization and auto-search are key technologies
      • Source: www.lesk.com/mlesk/ksg97/ksg.html
      [Figure scale, largest to smallest: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo]
      [Figure markers: Everything Recorded; All Books MultiMedia; All LoC books (words); A Movie; A Photo; A Book]
      [Small prefixes by exponent: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli]

  13. How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)
      • The web has a lot of documents
        • “Surface” web had 2.5B docs, adding 7.5M pages/day
        • “Deep” web had 550B docs, 95% publicly accessible
      • Most websites are in English
        • 78% of all websites and 96% of e-commerce sites
      • E-mail generates a large amount of information
        • A “white-collar” worker receives ~40 messages/day
        • E-mail information is 500x that of the web every year
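A quick back-of-the-envelope check puts the surface-web figures in perspective. This sketch treats "docs" and "pages" as the same unit and assumes the daily rate stays constant, both simplifications not stated on the slide.

```python
SURFACE_WEB_DOCS = 2.5e9      # from the slide: size of the "surface" web
PAGES_ADDED_PER_DAY = 7.5e6   # from the slide: daily additions


def days_to_add(docs, per_day):
    """Days needed to add `docs` documents at a constant daily rate."""
    return docs / per_day


if __name__ == "__main__":
    days = days_to_add(SURFACE_WEB_DOCS, PAGES_ADDED_PER_DAY)
    yearly = PAGES_ADDED_PER_DAY * 365
    print(f"days to add another 2.5B docs: {days:.0f}")
    print(f"docs added per year: {yearly / 1e9:.2f}B")
```

At the quoted rate, the surface web adds as many pages as its entire 2001 size in under a year.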

  14. How Much Information Is There?(Hal Varian, Peter Lyman et al. 2001)

  15. Challenges in Managing and Exploring Datasets
      • A disk behaves like a big tape
        • Storage is indeed “infinitely” large
        • Ability to get information is slow
      • Reliability is far from what we need
        • Disks do fail
        • Software and humans corrupt data
      • Managing storage is difficult
        • Storage and data are both growing
      • Retrieving data is difficult
        • Get what you want
        • See what you get

  16. Properties of a Research Goal (Jim Gray, 1999)
      • Simple to state
      • Not obvious how to do it
      • Clear benefit
      • Progress and solution are testable
      • Can be broken into smaller steps
        • So that you can see intermediate progress

  17. Systems Challenges (Lampson, SOSP Keynote 99)
      • Systems that work
        • Meeting their specs
        • Always available
        • Adapting to changing environment
        • Evolving while they run
        • Made from unreliable components
        • Growing without practical limit
      • Credible simulations or analysis
      • Writing good specs
      • Testing
      • Performance
        • Understanding when it doesn’t matter

  18. What Should the “New World” Focus Be? (Hennessy, FCRC keynote 99)
      • Availability
        • Both appliance & service
      • Maintainability
        • Two functions:
          • Enhancing availability by preventing failure
          • Ease of SW and HW upgrades
      • Scalability
        • Especially of service
      • Cost
        • Per device and per service transaction
      • Performance
        • Remains important, but it’s not SPECint

  19. Tentative Syllabus
      • Today: About the course
      • Week 2: Read several vision papers
      • Week 3: Guest lecture on archival storage
      • Week 4: Commercial storage systems (EMC, Veritas, NetApp)
      • Week 5: Global-scale storage (OceanStore and the like)
      • Week 6: Managing personal data (Coda, Bayou, Personal RAID)
      • Week 7: Managing geographical data (TerraServer)
      • Week 8: Guest lecture on managing astrophysical data (SkyServer)
      • Week 9: Managing and exploring large scientific data
      • Week 10: Managing medical data
      • Week 11: Managing genomic data
      • Week 12: Project reports and presentations
      • A detailed, tentative reading list will be available this weekend
