
Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006



Presentation Transcript


  1. Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006 Anke Kamrath, Division Director, San Diego Supercomputer Center kamratha@sdsc.edu

  2. The Digital World Entertainment Shopping Information

  3. Science is a Team Sport • GAMESS • Geosciences • Data Management and Mining • Astronomy • Physics • QCD • Modeling and Simulation • Life Sciences

  4. Cyberinfrastructure – A Unifying Concept Cyberinfrastructure = resources (computers, data storage, networks, scientific instruments, experts, etc.) + “glue” (integrating software, systems, and organizations). NSF’s “Atkins Report” provided a compelling vision for integrated Cyberinfrastructure.

  5. A Deluge of Data • Today data comes from everywhere: “volunteer” data, scientific instruments, experiments, sensors and sensornets, computer simulations, new devices (personal digital devices, computer-enabled clothing, cars, …) • And is used by everyone: researchers, educators, consumers, practitioners, the general public • Turning the deluge of data into usable information for the research and education community requires an unprecedented level of integration, globalization, scale, and access

  6. Using Data as a Driver: SDSC Data Cyberinfrastructure • Community Databases and Data Collections; data management, mining and preservation • Data-oriented HPC resources: high-end storage, large-scale data analysis, simulation, modeling • Data-oriented Tools, SW Applications, and Community Codes • Data- and Computational Science Education and Training • Collaboration, Service and Community Leadership for Data-oriented Projects (Diagram labels: SRB, Summer Institute, IT, Biology Workbench)

  7. Impact on Technology: Data and Storage are Integral to Today’s Information Infrastructure • Today’s “computer” is a coordinated set of hardware, software, and services providing an “end-to-end” resource • Cyberinfrastructure captures how the research and education community has redefined “computer” • Data and storage are an integral part of today’s “computer” (Diagram: wireless sensors, field instruments, field computers, networks, data storage, computers, visualization)

  8. Building a National Data Cyberinfrastructure Center Goal: SDSC’s Data Cyberinfrastructure should “extend the reach” of the local research and education environment • Access to community and reference data collections • More capable and/or higher capacity computational resources • Community codes, middleware, software tools and toolkits • Multi-disciplinary expertise • Long-term scientific data preservation

  9. Impact on Applications: Data-oriented Research Driving the Next Generation of Technology Challenges (Chart: axes Data (more BYTES) vs. Compute (more FLOPS); regions: Data-oriented Research Applications; Home, Lab, Campus, Desktop Applications; Traditional HPC Applications)

  10. Today’s Research Applications Span the Spectrum (Chart: axes Data (more BYTES) vs. Compute (more FLOPS); regions include Data Mgt. Envt., Extreme I/O Environment, Data-oriented Environment, “Lends itself to Grid”, “Could be targeted efficiently on Grid”, “Difficult to target efficiently on Grid”, Home, Lab, Campus, Desktop, and Traditional HPC environment; example applications: Climate, SCEC Simulation, SCEC Visualization, ENZO simulation, ENZO Visualization, EOL, NVO, Turbulence field, GridSAT, CFD, CiPres, MCell, Seti@Home, Protein Folding/MD, CPMD, QCD, GAMESS, Turbulence, Reattachment length, EverQuest)

  11. Working with Compute and Data – Simulation, Analysis, Modeling Simulation of a magnitude 7.7 earthquake on the lower San Andreas Fault • Physics-based dynamic source model – simulation of a mesh of 1.8 billion cubes with spatial resolution of 200 m • Builds on 10 years of data and models from the Southern California Earthquake Center • Simulated first 3 minutes of a magnitude 7.7 earthquake: 22,728 time steps of 0.011 second each • Simulation generates 45+ TB of data Resources Required – Computers and Systems • 80,000 hours on DataStar • 256 GB memory p690 used for testing, p655s used for production run, TG used for porting • 30 TB Global Parallel File System (GPFS) • Run-time 100 MB/s data transfer from GPFS to SAM-QFS • 27,000 hours post-processing for high resolution rendering Storage • SAM-QFS archival storage • HPSS backup • SRB Collection with 1,000,000 files People • 20+ people for IT support • 20+ people in domain research
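The I/O figures above imply substantial transfer times; a rough back-of-envelope check (a sketch assuming decimal units and perfectly sustained throughput, both optimistic simplifications) looks like:

```python
# Back-of-envelope check of the TeraShake I/O figures above.
# Assumes 1 TB = 1e12 bytes and sustained (never-degraded) throughput.

output_bytes = 45e12     # 45+ TB of simulation output
gpfs_to_samqfs = 100e6   # 100 MB/s run-time transfer from GPFS to SAM-QFS

hours = output_bytes / gpfs_to_samqfs / 3600
print(f"Moving 45 TB at 100 MB/s takes ~{hours:.0f} hours (~{hours/24:.1f} days)")

# Average output per time step: 45 TB spread over 22,728 steps
per_step_gb = output_bytes / 22_728 / 1e9
print(f"~{per_step_gb:.1f} GB written per time step on average")
```

So even at the stated 100 MB/s, archiving the output is a multi-day job in itself, which is why the run-time GPFS-to-SAM-QFS pipeline matters.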

  12. Big Data & Big Compute: Simulating an earthquake 1 • Divide up Southern California into “blocks” • For each block, get all the data on ground surface composition, geological structures, fault information, etc. (Map: the Southern San Andreas Fault)

  13. Big Data & Big Compute: Simulating an earthquake 2 • Map the blocks onto processors (brains) of the computer – SDSC’s DataStar, one of the 25 fastest computers in the world

  14. Big Data & Big Compute: Simulating an earthquake 3: • Run the simulation using current information on fault activity and the physics of earthquakes

  15. Simulating an earthquake 4: Managing the data • The simulation outputs data on seismic wave velocity, earthquake magnitude, and other characteristics • How much data was output? 47 TeraBytes, which is • 2+ times the printed materials in the Library of Congress! or • The amount of music in 2000+ iPods! or • 10,000 copies of a typical (4.7 GB) DVD movie! • Where to store the data? In HPSS, a tape storage library that can hold 10 PetaBytes (10,000 Terabytes) – 500 times the printed materials in the Library of Congress
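The comparisons above are plain byte arithmetic and can be sanity-checked; in this sketch the iPod capacity (20 GB) and the Library of Congress print estimate (20 TB) are assumptions, since published estimates vary:

```python
# Sanity-checking the 47 TB comparisons (decimal units: 1 TB = 1e12 bytes).
output = 47e12            # 47 TB of TeraShake output

loc_print = 20e12         # printed Library of Congress, an assumed ~20 TB estimate
print(f"{output / loc_print:.1f}x the printed Library of Congress")

ipod = 20e9               # a 20 GB iPod, a 2006-era capacity assumption
print(f"{output / ipod:,.0f} iPods of music")   # ~2,350

dvd = 4.7e9               # single-layer DVD capacity
print(f"{output / dvd:,.0f} typical DVD movies")  # ~10,000
```

The DVD comparison is the easiest to verify: 47 TB divided by a 4.7 GB disc is almost exactly 10,000 copies.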

  16. How long will TeraShake take on your desktop computer? 72 centuries! (approximate)

  17. Better Neurosurgery Through Cyberinfrastructure • PROBLEM: Neurosurgeons seek to remove as much tumor tissue as possible while minimizing removal of healthy brain tissue • The brain deforms during surgery • Surgeons must align the preoperative brain image with intra-operative images to provide the best opportunity for intra-surgical navigation • Radiologists and neurosurgeons at Brigham and Women’s Hospital, Harvard Medical School are exploring transmission of 30-40 MB brain images (generated during surgery) to SDSC for analysis and alignment • Transmission is repeated every hour during a 6-8 hour surgery; transmission and output must take on the order of minutes • A finite element simulation on a biomechanical model for volumetric deformation is performed at SDSC; output results are sent to BWH, where updated images are shown to surgeons
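The timing constraint above translates into a concrete bandwidth requirement; a rough estimate (a sketch assuming the upper end of the stated 30-40 MB image size and a 2-minute transfer budget, the budget being an assumption consistent with “on the order of minutes”) is:

```python
# Rough bandwidth requirement for the intra-operative image transfers.
image_bytes = 40e6   # upper end of the 30-40 MB image size from the slide
budget_s = 120       # assumed 2-minute transfer budget ("order of minutes")

required_mbps = image_bytes * 8 / budget_s / 1e6
print(f"Sustained throughput needed: ~{required_mbps:.1f} Mbit/s")  # ~2.7 Mbit/s
```

A few Mbit/s sustained is modest by research-network standards, but it must hold reliably every hour for the duration of a 6-8 hour surgery, in both directions.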

  18. Community Data Repository: SDSC DataCentral • Provides “data allocations” on SDSC resources to the national science and engineering community • Data collection and database hosting • Batch-oriented access • Collection management services • First broad program of its kind to support research and community data collections and databases Comprehensive resources • Disk: 400 TB accessible via HPC systems, Web, SRB, GridFTP • Databases: DB2, Oracle, MySQL • SRB: Collection management • Tape: 6 PB, accessible via file system, HPSS, Web, SRB, GridFTP • 24/7 operations, collection specialists • DataCentral infrastructure includes: Web-based portal, security, networking, UPS systems, web services and software tools Example allocated data collections include • Bee Behavior (Behavioral Science) • C5 Landscape DB (Art) • Molecular Recognition Database (Pharmaceutical Sciences) • LIDAR (Geoscience) • AMANDA (Physics) • SIO_Explorer (Oceanography) • Tsunami and Landsat Data (Earthquake Engineering) • Terabridge (Structural Engineering)

  19. Public Data Collections Hosted in SDSC’s DataCentral

  20. Data Cyberinfrastructure Requires a Coordinated Approach (layered, with interoperability and integration across layers) • Applications (medical informatics, biosciences, ecoinformatics, …): How do we combine data, knowledge and information management with simulation and modeling? • Visualization: How do we represent data, information and knowledge to the user? • Data Mining, Simulation Modeling, Analysis, Data Fusion: How do we detect trends and relationships in data? • Knowledge-Based Integration, Advanced Query Processing: How do we obtain usable information from data? • Grid Storage, Filesystems, Database Systems: How do we collect, access and organize data? • Hardware (high speed networking, networked storage (SAN), sensornets, instruments, storage hardware, HPC): How do we configure computer architectures to optimally support data-oriented computing?

  21. Working with Data: Data Integration for New Discovery • Data Integration in the Biosciences: software to access and federate disciplinary databases for users, spanning scales from atoms and bio-polymers to organelles, cells, organs, and organisms, across medicinal chemistry, genomics, proteomics, cell biology, physiology, and anatomy • Data Integration in the Geosciences: complex “multiple-worlds” mediation across geo-physical, geo-chronologic, and geo-chemical data, foliation maps, and geologic maps • Example questions: Where can we most safely build a nuclear waste dump? Where should we drill for oil? What is the distribution and U/Pb zircon ages of A-type plutons in VA? How does it relate to host rock structures?

  22. Preserving Data over the Long-Term

  23. Data Preservation • Many science, cultural, and official collections must be sustained for the foreseeable future • Critical collections must be preserved: • community reference data collections (e.g. Protein Data Bank) • irreplaceable collections (e.g. field data – tsunami recon) • longitudinal data (e.g. PSID – Panel Study of Income Dynamics) • No plan for preservation often means that data is lost or damaged • “…the progress of science and useful arts … depends on the reliable preservation of knowledge and information for generations to come.” – “Preserving Our Digital Heritage”, Library of Congress

  24. How much Digital Data*? • 1 low-resolution photo = 100 KiloBytes • 1 novel = 1 MegaByte • iPod Shuffle (up to 120 songs) = 512 MegaBytes • Printed materials in the Library of Congress = 10 TeraBytes • 1 human brain at the micron level = 1 PetaByte • SDSC HPSS tape archive = 6 PetaBytes • All worldwide information in one year = 2 ExaBytes (* rough/average estimates)
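The scale ladder above spans thirteen orders of magnitude; a couple of ratios computed directly from the listed figures (decimal units assumed) make the jumps concrete:

```python
# Ratios among the sizes listed above (decimal units: 1 PB = 1e15 bytes).
sizes = {
    "low-res photo":        100e3,   # 100 KB
    "novel":                1e6,     # 1 MB
    "iPod Shuffle":         512e6,   # 512 MB
    "LoC printed":          10e12,   # 10 TB
    "human brain (micron)": 1e15,    # 1 PB
    "SDSC HPSS archive":    6e15,    # 6 PB
    "world info per year":  2e18,    # 2 EB
}

loc_per_hpss = sizes["SDSC HPSS archive"] / sizes["LoC printed"]
print(f"HPSS holds {loc_per_hpss:,.0f} printed Libraries of Congress")  # 600

hpss_per_year = sizes["world info per year"] / sizes["SDSC HPSS archive"]
print(f"One year of worldwide information is ~{hpss_per_year:,.0f} HPSS archives")
```

Even the 6 PB tape archive, then, holds well under one percent of a single year's worldwide information by these estimates.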

  25. Key Challenges for Digital Preservation • What should we preserve? • What materials must be “rescued”? • How do we plan for preservation of materials by design? • How should we preserve it? (formats, storage media) • Stewardship – who is responsible? • Who should pay for preservation? The content generators? The government? The users? • Who should have access? • Print media provides easy access for long periods of time but is hard to data-mine; digital media is easier to data-mine but requires management of the evolution of media and resource planning over time

  26. What can go wrong

  27. SDSC Cyberinfrastructure Community Resources DATA ENVIRONMENT • 1 PB Storage-area Network (SAN) • 10 PB StorageTek tape library • DB2, Oracle, MySQL • Storage Resource Broker • HPSS • 72-CPU Sun Fire 15K • 96-CPU IBM p690s • http://datacentral.sdsc.edu/ Support for 60+ community data collections and databases Data management, mining, analysis, and preservation COMPUTE SYSTEMS • DataStar • 2396 Power4+ processors, IBM p655 and p690 nodes • 10 TB total memory • Up to 2 GBps I/O to disk • TeraGrid Cluster • 512 Itanium2 IA-64 processors • 1 TB total memory • Intimidata • Only academic IBM Blue Gene system • 2,048 PowerPC processors • 128 I/O nodes http://www.sdsc.edu/user_services/ SCIENCE and TECHNOLOGY STAFF, SOFTWARE, SERVICES • User Services • Application/Community Collaborations • Education and Training • SDSC Synthesis Center • Community SW, toolkits, portals, codes • http://www.sdsc.edu/

  28. Thank You kamratha@sdsc.edu www.sdsc.edu
