
Developing Cyberinfrastructure for Data-Oriented Science and Engineering

Dr. Francine Berman

Director, San Diego Supercomputer Center

Professor and High Performance Computing Endowed Chair, UC San Diego

The Digital World

[Slide graphic: the digital world spans Education, Entertainment, Commerce, and Information]
Research, Education, and Data

Data collections span the disciplines, from the Life Sciences to the Arts and Humanities:
  • Japanese Art Images – 70.6 GB
  • TeraBridge – 800 GB
  • NVO – 100+ TB
  • SCEC – 153 TB
  • Projected LHC Data – 10 PB/year
Today’s Research and Education Applications Cover the Spectrum

  • Large-scale data is required as input, intermediate, and output for many modern HPC applications
  • Applications vary in how well they can perform in distributed mode (grid computing)
  • Researchers are increasingly dependent on both High Performance Computing (HPC) and highly reliable data

[Slide graphic: data-oriented applications plotted on axes of Data (more BYTES) vs. Compute (more FLOPS), ranging from Home, Lab, Campus, Desktop and World of Warcraft to PDB applications and Medium, Large, and Leadership HPC]

SDSC Mission

The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of cyberinfrastructure.

Cyberinfrastructure = resources (computers, data storage, networks, scientific instruments, experts, etc.) + “glue” (integrating software, systems, and organizations).

SDSC Cyberinfrastructure
  • 2.4 PB Storage-area Network (SAN)
  • 25 PB StorageTek/IBM tape library
  • HPSS and SAM-QFS archival systems
  • DB2, Oracle, MySQL
  • Storage Resource Broker
  • Supporting servers: IBM 32-way p690s, 72-CPU SunFire 15K, etc.

Support for community data collections and databases

Data management, mining, analysis, and preservation


  • DataStar
    • 15.6 TFLOPS Power 4+ system
    • 7.125 TB total memory
    • Up to 4 GBps I/O to disk
    • 115 TB GPFS filesystem
  • Blue Gene Data
    • First academic IBM Blue Gene system
    • 17.1 TF
    • 1.5 TB total memory
    • 3 racks, each with 2,048 PowerPC processors and 128 I/O nodes
  • TeraGrid Cluster
    • 524 Itanium2 IA-64 processors
    • 2 TB total memory
    • Also 16 2-way data I/O nodes


  • User Services
  • Application/Community Collaborations
  • Education and Training
  • SDSC Synthesis Center
  • Data-oriented Community SW, toolkits, portals, codes
Data and Simulations – TeraShake

Major Earthquakes on the San Andreas Fault, 1680–present
  • Researchers use geological, historical, and environmental data to simulate massive earthquakes
  • These simulations are critical for understanding seismic movement and assessing potential impact
  • Simulation results provide new scientific information enabling better
    • Estimation of seismic risk
    • Emergency preparation, response, and planning
    • Design of the next generation of earthquake-resistant structures
  • Results provide information that can help save many lives and billions in economic losses

[Slide graphic: map of historical ruptures on the fault, including events of M 7.8, M 7.8, and M 7.7]

How dangerous is the San Andreas Fault?

TeraShake Visualization

  • Simulation of a magnitude 7.7 earthquake on the southern San Andreas Fault
    • Physics-based dynamic source model – simulation of a mesh of 1.8 billion cubes with spatial resolution of 200 m
    • Simulated the first minutes of a magnitude 7.7 earthquake: 22,728 time steps of 0.011 second each
    • Simulation generates 45+ TB of data
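The mesh figures above can be sanity-checked with quick arithmetic. A minimal sketch, assuming the published TeraShake simulation box of 600 km × 300 km × 80 km (the domain size is an assumption here; it is not stated on the slide):

```python
# Back-of-envelope check of the TeraShake mesh figures.
# The 600 km x 300 km x 80 km domain is an assumed value (the published
# TeraShake box), not stated on the slide itself.
dx = 200                      # spatial resolution, meters
nx = 600_000 // dx            # 3000 cells along strike
ny = 300_000 // dx            # 1500 cells across
nz = 80_000 // dx             # 400 cells in depth
cells = nx * ny * nz
print(cells)                  # 1800000000 -- the 1.8 billion cubes

simulated = 22_728 * 0.011    # time steps x step size
print(round(simulated))       # ~250 seconds of simulated shaking
```

Note that 22,728 steps × 0.011 s works out to roughly 250 s, a bit over four minutes of simulated motion.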

Project leadership:

Tom Jordan (SCEC), Bernard Minster (SIO), Reagan Moore (SDSC), Carl Kesselman (ISI)

TeraShake Data Choreography

Resources must support a complicated orchestration of computation and data movement:
  • 240 processors on SDSC DataStar for 5 days, using 1 TB of main memory
  • Continuous I/O at 2 GB/sec to the parallel file system
  • 47 TB of output data for 1.8 billion grid points
  • 10–20 TB of data archived per day
  • “Fat nodes” with 256 GB of memory for pre-processing and post-run visualization
  • Data parking of 100s of TBs for many months

Finer-resolution simulations require even more resources; TeraShake is being scaled to run on petascale architectures.
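The I/O figures above imply concrete time and volume budgets. A rough sketch using only the slide's numbers (decimal units assumed):

```python
# Rough I/O timing implied by the slide's numbers (decimal TB/GB assumed).
output_tb = 47                      # total output, TB
io_rate_gb_s = 2                    # sustained I/O, GB/sec
write_hours = output_tb * 1000 / io_rate_gb_s / 3600
print(round(write_hours, 1))        # ~6.5 hours of pure writing for 47 TB

# Archiving at 10-20 TB/day over the 5-day run moves 50-100 TB per run,
# consistent with "data parking of 100s of TBs" as runs accumulate.
archived_tb = (10 * 5, 20 * 5)
print(archived_tb)                  # (50, 100)
```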

“I have desired to see a large earthquake simulation for over a decade. This dream has been accomplished.” 

Bernard Minster, Scripps Institution of Oceanography

TeraShake at Petascale – A qualitative difference in prediction accuracy creates even greater data infrastructure demands
  • Petascale platform will allow much higher resolution for very accurate prediction of ground motion

Estimates courtesy of the Southern California Earthquake Center

Data as a Driver – the Protein Data Bank: A Resource for the Global Biology Community

The Protein Data Bank
  • Largest repository on the planet for structural information about proteins
  • Provides free worldwide public access 24/7 to accurate protein data
  • PDB maintained by the Worldwide PDB, administered by the Research Collaboratory for Structural Bioinformatics (RCSB), directed by Helen Berman

Each structure costs roughly $200K to generate; the 2006 holdings will have cost roughly $80B in research investment.

January 2007 Molecule of the Month: Importins – a complex of 3 proteins which aids in protein synthesis by ferrying molecules back and forth between the inside and the outside of the nucleus through tube-shaped nuclear pores.

[Slide chart: Growth of Yearly/Total Structures in PDB – roughly 500 structures or fewer per year from 1976–1990; in 2006, >5,000 structures in one year and >36,000 total structures]
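The per-structure cost squares with the annual research-investment estimate cited later in this talk; a quick consistency check:

```python
# Consistency check: ~$200K per structure times the >5,000 structures
# deposited in a single recent year is about $1B of research investment,
# matching H. Berman's 2005 estimate quoted elsewhere in this talk.
cost_per_structure = 200_000       # dollars (slide figure)
structures_per_year = 5_000        # slide figure for 2006
print(cost_per_structure * structures_per_year)  # 1000000000 -> $1B/year
```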

How Does the PDB Work?
  • PDB is accessible over the Internet and serves 10,000 users a day (> 200,000 hits)
  • H. Berman estimated that in 2005, more than $1B of research funding was spent to generate the data that were collected, curated, and distributed by the PDB
  • Data are collected, annotated, and validated at one of 3 worldwide PDB sites (Rutgers in the US)
  • Infrastructure required: 20 highly trained personnel and significant computational, storage, and networking capabilities
  • Infrastructure: PDB portal served by a cluster at SDSC. The PDB system is designed with multiple failover capabilities to ensure 24/7 access and 99.99% uptime. PDB infrastructure requires 20 TB of storage at SDSC

[Slide diagram: new queries and new tools enter through the WWW user interface, FTP tree, and CORBA interface; query results are served to remote applications from the SDSC machine room]
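The quoted service levels translate into concrete operational numbers; a minimal sketch using only the figures on this slide:

```python
# What the quoted service levels mean in practice.
uptime = 0.9999
downtime_min_per_year = (1 - uptime) * 365.25 * 24 * 60
print(round(downtime_min_per_year, 1))   # ~52.6 minutes/year of allowed downtime

hits_per_day = 200_000
print(round(hits_per_day / 86_400, 1))   # ~2.3 hits/second average load
```

A 99.99% ("four nines") target leaves under an hour of downtime per year, which is why the slide emphasizes multiple failover capabilities.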
Using the PDB
  • PDB provides a critical building block for research, education, and practice in the Biosciences
  • PDB tools include
    • Data Extraction and Preparation
    • Data Format Conversion
    • Data Validation
    • Dictionary and Data Management
    • Tools supporting the OMG CORBA Standard for Macromolecular Structure Data, etc.

[Slide diagram: tools to access and to federate data, spanning Cell Biology (PDB level) to Medicinal Chemistry]
Storage of research data in SDSC’s archives shows a consistent increase in the need for capacity
  • Most of the data is supercomputer simulation output, but digital library collections and experimental data are contributing to growth rates
  • Consistent exponential growth with a ~15-month doubling time drives planning and cost projections
  • Technology advancements help, but media cost per byte is not decreasing as quickly as storage demand is increasing

Information courtesy of Richard Moore
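The ~15-month doubling time noted above translates directly into planning numbers; a minimal sketch:

```python
# Growth implied by a ~15-month doubling time in archived bytes.
doubling_months = 15
per_year = 2 ** (12 / doubling_months)
print(round(per_year, 2))            # ~1.74x growth per year

five_year = 2 ** (60 / doubling_months)
print(five_year)                     # 16.0 -- capacity needed in 5 years
```

A 16× capacity requirement every five years is why the slide stresses that falling media costs alone cannot absorb the growth.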

National Data Repository: SDSC DataCentral
  • First broad program of its kind to support national research and community data collections and databases
    • “Data allocations” provided on SDSC resources
  • Data collection and database hosting
    • Batch oriented access, collection management services
  • Comprehensive data resources: disk, tape, databases, SRB, web services, tools, 24/7 operations, collection specialists, etc.

Web-based portal access

Services, Tools, and Technologies Key for Data-related Capability

Data Systems

Data Services
  • Data migration/upload, usage and support (SRB)
  • Database selection and schema design (Oracle, DB2, MySQL)
  • Database application tuning and optimization
  • Portal creation and collection publication
  • Data analysis (e.g. Matlab) and mining (e.g. WEKA)

Data-oriented Toolkits and Tools
  • Biology Workbench
  • Montage (astronomy mosaicking)
  • Kepler (workflow management)
  • Vista Volume renderer (visualization), etc.
Increasing Need to Sustain Digital Data for the Foreseeable Future

[Slide graphic: stakeholders include the Public Sector (digital State and Federal records), the Private Sector (the Entertainment Industry), UCSD Libraries, and Researchers and Educators]

What data is the most valuable?
  • Key criteria
    • Irreplaceable
    • Longitudinal
    • Used by many
    • Expensive
    • Needed in the future
    • Culturally or scientifically meaningful

Examples: reference collections, Federal records, data needing rescue, and other irreplaceable data.

Key Challenges for Digital Preservation
  • What should we preserve?
    • What materials must be “rescued”?
    • How to plan for preservation of materials by design?
  • How should we preserve it?
    • Formats
    • Storage media
    • Stewardship – who is responsible, and for how long?
  • Who should pay for preservation?
    • The content generators?
    • The government?
    • The users?
  • Who should have access?

Print media provides easy access for long periods of time but is hard to data-mine

Digital media is easier to data-mine but requires managing the evolution of media and planning resources over time

Preservation and Risk

Less risk means more replicas, more resources, more people
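The replica/risk tradeoff can be made concrete with a simple independence model. The 1% annual loss probability below is an illustrative assumption, not an SDSC figure:

```python
# Illustrative model of why more replicas mean less risk: if each copy has
# an independent annual loss probability p (the 1% here is a made-up
# illustration, not an SDSC figure), losing ALL copies requires every one
# to fail in the same year.
p = 0.01
for copies in (1, 2, 3):
    print(copies, f"{p ** copies:.0e}")   # 1 1e-02 / 2 1e-04 / 3 1e-06
```

Under this (optimistic) independence assumption, three copies turn a 1-in-100 annual risk into 1-in-a-million, which is the rationale for the three-copy scheme on the next slides.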

Chronopolis™: An Integrated Approach to Long-term Digital Preservation

SDSC, the UCSD Libraries, NCAR, UMd, NARA, the Library of Congress, and NSF are working together on long-term preservation of digital collections

  • Chronopolis™ provides a comprehensive approach to infrastructure for long-term preservation integrating
    • Collection ingestion
    • Access and Services
    • Research and development for new functionality and adaptation to evolving technologies
    • Business model, data policies, and management issues critical to success of the infrastructure
[Slide diagram: Chronopolis™ Federation architecture – consortium members (e.g. SDSC, UMd, NCAR) federated as Chronopolis sites]
Chronopolis™ – Replication and Distribution
  • 3 replicas of valuable collections considered reasonable mitigation for risk of data loss
  • Chronopolis™ Consortium will store 3 copies of preservation collections:
    • “Bright copy”– Chronopolis ™ site supports ingestion, collection management, user access
    • “Dim copy”– Chronopolis ™ site supports remote replica of bright copy and supports user access
    • “Dark copy”– Chronopolis ™ site supports reference copy that may be used for disaster recovery but no user access
  • Each site may play different roles for different collections
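The bright/dim/dark scheme above can be sketched as a placement table. Site and collection names here are hypothetical placeholders, not actual Chronopolis assignments:

```python
# Sketch of the bright/dim/dark replication scheme described above.
# Site names and collection IDs are hypothetical placeholders.
ROLES = {"bright", "dim", "dark"}   # bright: ingest + manage + access;
                                    # dim: remote replica + access;
                                    # dark: disaster-recovery copy, no user access

# collection -> {site: role}; each site plays different roles per collection
placement = {
    "C1": {"site_a": "bright", "site_b": "dim", "site_c": "dark"},
    "C2": {"site_b": "bright", "site_c": "dim", "site_a": "dark"},
}

def valid(placement):
    """Each collection must hold exactly one bright, one dim, and one dark copy."""
    return all(set(sites.values()) == ROLES for sites in placement.values())

print(valid(placement))   # True
```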

[Slide diagram: each site holds a mix of copies – e.g. the bright copy of C1 and dim copy of C2 at one site, the dim copy of C1 and bright copy of C2 at another, and dark copies of C1 and C2 at a third]

SDSC Playing a Leadership Role in Development of a National Digital Data Framework
  • SDSC working with the Library of Congress on Distributed Data Stewardship (e.g. the Prokudin-Gorskii Photographs)
  • SDSC storing national collections for the National Archives and Records Administration
  • SDSC storing genetic research data for the City of Hope
  • SDSC developing data visualizations for UCSD Moores Cancer Center

[Slide chart: collections plotted by cost to store vs. value of content]

Community Cyberinfrastructure at SDSC
  • SDSC Summer Institutes, Training, Outreach
  • DataCentral data repository
  • Allocated HPC resources (via TeraGrid)
  • Community CI-oriented R&D projects
  • SW, visualization and other services

Thank You