High Performance Computing and the Challenges of Data-intensive Research

This presentation provides an overview of the challenges faced in data-intensive research in the context of high-performance computing. It examines the technical, political, personal, financial, legal, and ethical challenges, and presents case studies to illustrate these challenges. The presentation also discusses the research being done at ARCCA and the role of supercomputer centers in pioneering data-intensive technologies.


Presentation Transcript


  1. Institutional Challenges in the Data Decade: 14-16 December 2011, Day 1 – The Research Data Landscape. High Performance Computing and the Challenges of Data-intensive Research. H. Beedie, C.A. Kitchen and M.F. Guest

  2. Overview • Background to High Performance Computing (ARCCA) and the growth of Data-intensive research computing • Nature of the Challenges facing Data-intensive research • Technical • Political & Personal • Financial • Legal & Ethical • How well do we understand these ? • Four data-intensive Case Studies • PET (Positron Emission Tomography) Scanner • Next Generation Genome Sequencing • Gravitational Waves • HPC Wales • Summary and Conclusions HPC and the Challenges of Data-intensive Research

  3. What sort of research is being done at ARCCA? August 2011 Users – 360 Projects – 108 HPC and the Challenges of Data-intensive Research

  4. The Scientific Paradigm has grown • Theory and Experiment • Augmented by Modelling and Simulation, Data-Intensive Research and the Digital Humanities • The explosion in scientific data has created a major challenge. With datasets growing beyond a few tens of terabytes, scientists have no off-the-shelf solutions that they can readily use to manage and analyze the data. • Successful projects to date have deployed various combinations of flat files and databases, typically tailored to specific projects and not easy to generalize or scale to the next generation of experiments. • Today’s computer architectures are increasingly unbalanced; the latency gap between multi-core CPUs and mechanical hard disks is growing every year, making the challenges of data-intensive computing harder to overcome. • What is needed is a systematic and general approach to these problems, with an architecture that can scale into the future. HPC and the Challenges of Data-intensive Research

  5. The Challenges 1. Technical • Quantity • how big before it is ‘difficult’ • Performance • how fast do we need it ? • Metadata • how do we interpret ? • Security • how well protected ? Large Data <=> Large Computers ? • Ownership • longer term identity issues • Access • who has the right to see ? • Location • data in many places • Longevity • all the above changes over time HPC and the Challenges of Data-intensive Research

  6. The Challenges 2. Political & Personal • Central vs Departmental • Governance ? • Ownership & Responsibility - who does what ? • One size fits all vs One design per project 3. Financial 4. Legal & Ethical • Sustainability • Grant runs out - data still needed • Energy costs – disks keep spinning • Data migration as technology is updated • Real costs vs. Hidden costs • PhD student as Sysadmin • Electricity not re-charged • Access, Control & identity management • Data Protection • Patient Identifiable Data • Research ‘advantage’ • FoI • Medic-researcher – patient impact ? HPC and the Challenges of Data-intensive Research

  7. Supercomputers – Pioneering data-intensive technologies • Role of supercomputer centres in pioneering data-intensive technologies before they make their way into the wider enterprise market. • These have special relevance to the HPC community, as “big data” is viewed as the new frontier in HPC. • “Gordon” is the first really big purpose-built supercomputer for data-intensive applications, recently installed at the San Diego Supercomputer Center (SDSC). • The aim is to attract data-intensive science codes that have never had access to a platform of this size. This is particularly true in genomics – the classic “big data” science problem and one of the most frequently cited in HPC circles as suffering from the data deluge crisis. Other important domains include graph problems, geophysics, financial market analytics, and data mining. HPC and the Challenges of Data-intensive Research

  8. Data-Intensive Supercomputers – “Gordon” • The flash memory set-up makes Gordon a unique system. The system is fitted with over 300 TByte of Intel solid state disks, spread over 64 "I/O nodes" – enough flash capacity to store the entire Netflix movie catalogue three times over, or 100,000 human genomes. Inserting this much flash memory into a supercomputer has never been attempted before. • The flash device employed is Intel's new Solid-State Drive 710. The 710 uses Intel's High Endurance Technology (HET), the chipmaker's version of the enterprise multi-level cell (eMLC) flash memory that other flash vendors are now offering. Like eMLC, HET flash offers the performance and resiliency of single-level cell (SLC) flash, but at a much lower cost. • Truly impressive is the aggregate IOPS (Input/Output Operations Per Second) performance – a peak of 36 million IOPS, i.e. enough to download roughly 220 movies per second (see the sketch below). HPC and the Challenges of Data-intensive Research
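
To see where the "220 movies per second" figure comes from, here is a minimal sketch converting the quoted aggregate IOPS into a rough data rate. The 4 KiB I/O size and ~650 MB movie size are illustrative assumptions, not figures from the slides.

```python
# Rough conversion of aggregate IOPS into effective bandwidth.
# Assumptions (not from the slides): 4 KiB per I/O operation and
# roughly 650 MB per movie.
IOPS = 36_000_000           # peak aggregate IOPS quoted for Gordon
IO_SIZE_BYTES = 4 * 1024    # assumed size of each I/O operation
MOVIE_BYTES = 650 * 10**6   # assumed size of one movie

bandwidth = IOPS * IO_SIZE_BYTES
print(f"Effective bandwidth: {bandwidth / 1e9:.0f} GB/s")
print(f"Movies per second:   {bandwidth / MOVIE_BYTES:.0f}")
```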

  9. Data-Intensive Supercomputers – “Gordon” • The other unique aspect of Gordon is its use of ScaleMP's "Versatile SMP" (vSMP) technology, allowing users to run large-memory applications on a so-called "supernode" – an aggregation of 32 Gordon servers and two I/O servers, providing access to 512 cores, 2 TB of RAM and 9.6 TB of flash. To a program running on a supernode, the hardware behaves as one big cache-coherent server. As many as 32 of these supernodes can be carved from the machine at one time. • “GrayWulf: Scalable Clustered Architecture for Data Intensive Computing”, Alexander S. Szalay, Gordon Bell et al., MSR-TR-2008-187 • an order of magnitude better balance than existing systems (see the balance sketch below) • However … the traditional approach of bringing the data to where there is an analysis facility is inherently not scalable once data sizes exceed a terabyte, due to network bandwidth, latency, and cost. • Is the best approach to bring the analysis to the data? HPC and the Challenges of Data-intensive Research
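
The GrayWulf paper frames "balance" using Amdahl's rule of thumb that a balanced system delivers about one bit of I/O for every instruction executed. The sketch below computes that ratio for a hypothetical node; the node figures are invented for illustration and are not taken from the paper or the slides.

```python
# Amdahl number: I/O bandwidth (bits/s) divided by instruction rate (instr/s).
# Amdahl's rule of thumb puts a "balanced" system near 1.0; typical
# disk-backed HPC nodes sit orders of magnitude below that.
def amdahl_number(io_bytes_per_s: float, instr_per_s: float) -> float:
    return (io_bytes_per_s * 8) / instr_per_s

# Hypothetical node: ~2e11 instructions/s, one disk sustaining ~100 MB/s.
print(f"Disk-backed node:  {amdahl_number(100e6, 2e11):.4f}")
# The same node with a flash array sustaining ~4 GB/s.
print(f"Flash-backed node: {amdahl_number(4e9, 2e11):.4f}")
```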

  10. PET (Positron Emission Tomography) Scanner Next Generation Genome Sequencing Gravitational Waves HPC Wales Case Studies HPC and the Challenges of Data-intensive Research

  11. PET Project I • The development of the Wales Research and Diagnostic Positron Emission Tomography Imaging Centre (PETIC) on the Heath Park Campus is now taking Cardiff to the forefront of medical imaging technology. • The project is led by the University in partnership with the Cardiff and Vale NHS Trust, funded by a £16.5M investment from WAG. Housed in a purpose-built building on the Heath site, this is a state-of-the-art research facility which also now offers NHS patients scanning facilities not previously available in Wales. The development also includes a pre-clinical scanner that will enable cutting-edge research in the medical, biological and life sciences. • PETIC will lead to a better understanding of biological processes and structures; new clinical, diagnostic and therapeutic strategies; earlier detection of disease; and, ultimately, more successful results for patients. It will offer three different types of imaging: • Positron Emission Tomography; • Single Photon Emission Computed Tomography; and • X-Ray Computed Tomography. HPC and the Challenges of Data-intensive Research

  12. PET Project II • Clinical and Research patients • Patient Identifiable Data (pseudonymised) • NHS Academic LAN data transfer • Firewalls and politics • Legal ‘data processor’ contract with NHS • Potential 40TB/year (RAW + Image data) • Large quantity, but not high performance • Proprietary metadata • Data Location • Researchers want Raw Data + Image, no PID • Clinicians only want Image, need PID • Copy Raw Data + Image to Academic storage • Leave Image only on NHS storage • Longevity – TBD • Political issues – ownership and control of project. HPC and the Challenges of Data-intensive Research
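
As a back-of-the-envelope check on the quantity and longevity points, the sketch below projects the cumulative archive if the scanner really does produce 40 TB/year; the retention period and replication factor are assumptions added purely for illustration.

```python
# Cumulative storage projection for the PET archive.
# Only the 40 TB/year figure comes from the slide; the 10-year
# retention period and 2 copies (primary + backup) are assumptions.
ANNUAL_TB = 40
YEARS = 10
COPIES = 2

for year in range(1, YEARS + 1):
    print(f"Year {year:2d}: {ANNUAL_TB * year * COPIES:4d} TB retained")
```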

  13. PET (Positron Emission Tomography) Scanner Next Generation Genome Sequencing Gravitational Waves HPC Wales Case Studies HPC and the Challenges of Data-intensive Research

  14. DNA Sequencing Caught in Deluge of Data Business Day November 30, 2011 • The leap forward in genomics technology promises to change health care as we know it. Sequencing a human genome, which cost millions of pounds just a few years ago, now costs thousands. The prospect of mapping a genome for £1,000 is on the horizon. • But cheap gene sequencing, by itself, won't usher in a health care revolution – turning these data into something useful is the true bottleneck, e.g. using a patient's genome to determine their susceptibility to specific diseases or to devise personalized treatments for conditions they already have. • Sequencing all the DNA base pairs is the easy part of the problem; it just reflects the ordering of the bases – adenine (A), thymine (T), guanine (G), cytosine (C) – in the chromosomes. The bioinformatics software necessary to extract useful information from this biomolecular alphabet is far more complex and costly, and demands substantial computing power. HPC and the Challenges of Data-intensive Research

  15. DNA Sequencing Caught in Deluge of Data • It costs more to analyze a genome than to sequence it, and that discrepancy is expected to grow. The cost of sequencing a human genome has decreased by a factor > 800 since 2007, while computing costs have only decreased by a factor of four. That has resulted in an enormous accumulation of unanalyzed data that is being generated by all the “cheap” sequencing equipment. • Not only is there too much data to analyze in aggregate, it's also too difficult to share that volume of data between researchers. Even the fastest commercial networks are too slow to send multiple TBytes of information in anything less than a few weeks. • That's why BGI (Beijing Genomics Institute), the largest genomics research institute in the world with 167 DNA sequencers, has resorted to sending computer disks of sequenced data via FedEx. HPC and the Challenges of Data-intensive Research
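
The sketch below makes the "few weeks" claim concrete by estimating how long a multi-terabyte dataset takes to move over a network link. The 20 TB dataset size and the 50% sustained utilisation are assumptions for illustration only.

```python
# Time to move a multi-terabyte dataset over a network link.
# Assumptions (illustrative only): 20 TB dataset, 50% sustained link
# utilisation (protocol overhead, sharing, retransmits).
DATASET_TB = 20
UTILISATION = 0.5

for label, gbit_per_s in [("100 Mb/s", 0.1), ("1 Gb/s", 1.0), ("10 Gb/s", 10.0)]:
    seconds = DATASET_TB * 8e12 / (gbit_per_s * 1e9 * UTILISATION)
    print(f"{label:>8}: {seconds / 86400:5.1f} days")
```

At the slower end the transfer runs into weeks, which is why shipping disks by courier can still beat the network.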

  16. DNA Sequencing Caught in Deluge of Data • Cloud computing may help alleviate these problems. • Some believe that Google alone has enough compute and storage capacity to handle the global genomics workload. • Others believe there is just too much raw data – researchers will have to pre-process it to reduce the volume or just hold onto the unique bits. • But there are even more challenging problems ahead. Metagenomics – aggregating the DNA sequences of a whole population of organisms – is even more data-intensive, e.g. the microbial species in the human digestive tract represent about 106 times as much sequenced data as the human genome. Since that population can have a profound effect on its host, that data becomes a pseudo-extension of the person's genetic profile. • In addition there is the data associated with the RNA, proteins and the various other biochemicals in the body. A complete picture of human health needs all of this data integrated as well. HPC and the Challenges of Data-intensive Research

  17. Requirements for Psychological Medicine at Cardiff • Capacity and Performance, but also… • Long-term storage, including archival • Parallel, 4 servers, 16 cores per server • Metadata (solution to be determined) • Problem – design to meet the conflicting requirements • Solution – Fibre Channel (SSD and SAS disks) and Ethernet-connected (2TB SATA disks) storage array • Technical solution – involve IT personnel • Legal and Ethical – minimal concerns • Financial – MRC funded • cost important; Energy costs – use 2TB SATA where possible • Personal and Political • The department had almost bought something that would have delivered slower performance than the current solution • VIP not wanting project delay HPC and the Challenges of Data-intensive Research

  18. PET (Positron Emission Tomography) Scanner Next Generation Genome Sequencing Gravitational Waves HPC Wales Case Studies HPC and the Challenges of Data-intensive Research

  19. Gravitational Astronomy – The New Frontier Prof. B.S. Sathyaprakash • Computational cost of searches - current searches are limited by computational resources • Can only search for non-spinning black hole binaries • A month of search takes 60,000 CPU hours • Bigger computers afford better searches • Future Searches • Search for black holes with spins - Would require at least 10-100 times more computational cost • Large data, serial processing, not fast. • No security, politics, legal, ethical issues • Financial – low budget • Energy – move to tape after processing Gravitational Waves: Ripples on space-time curvature travelling at the speed of light HPC and the Challenges of Data-intensive Research
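
To put the 60,000 CPU-hour figure in context, the sketch below converts it into the number of cores that would have to run continuously for a month, and scales it by the 10-100x factor quoted for searches over spinning black holes. Only the 60,000 CPU hours and the 10-100x range come from the slide.

```python
# Convert a monthly CPU-hour budget into a sustained core count.
CPU_HOURS_PER_MONTH = 60_000   # cost of one month of searching (from the slide)
HOURS_PER_MONTH = 30 * 24

cores_now = CPU_HOURS_PER_MONTH / HOURS_PER_MONTH
print(f"Non-spinning search: ~{cores_now:.0f} cores running continuously")

# Searches including black-hole spins are quoted at 10-100x the cost.
for factor in (10, 100):
    print(f"With spins (x{factor:3d}): ~{cores_now * factor:.0f} cores")
```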

  20. Targeting the biggest discovery of our times through ARCCA • [Figure: waveform amplitude vs. time for increasing black-hole spin] • Here are some signals from colliding black holes as predicted by Einstein’s theory • Black hole spins modulate the waveform • We use matched filtering to search for signals buried in noise • Pattern matching algorithm • But matched filters, i.e. templates used in the search, depend on many parameters • A search in 17-dimensional space involving the masses, spins of the stars, position on the sky, etc. • About 100 million shapes must be searched for in each piece of data HPC and the Challenges of Data-intensive Research
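
As a concrete illustration of the matched-filtering step, here is a toy sketch in NumPy: a known template is correlated against noisy data and the offset where the correlation peaks is reported. This is not the collaboration's pipeline – the real search works in the frequency domain over roughly 100 million templates – and the signal shape, sample rate and noise level below are invented for the example.

```python
import numpy as np

# Toy matched filter: slide a known template over noisy data and
# report the offset at which the correlation peaks.
rng = np.random.default_rng(0)
fs = 1024                                    # assumed sample rate (Hz)
t = np.arange(0, 1, 1 / fs)

# Hypothetical chirp-like template (stand-in for an inspiral waveform).
template = np.sin(2 * np.pi * (50 + 40 * t) * t) * np.exp(-4 * (1 - t))

# Bury the template in noise at a known offset.
data = rng.normal(0.0, 1.0, 4 * fs)
true_offset = 1500
data[true_offset:true_offset + len(template)] += template

# Correlate (matched filtering under a white-noise assumption).
corr = np.correlate(data, template, mode="valid") / np.sqrt(np.sum(template**2))
print("Recovered offset:", int(np.argmax(corr)), "| true offset:", true_offset)
```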

  21. PET (Positron Emission Tomography) Scanner Next Generation Genome Sequencing Gravitational Waves HPC Wales Case Studies HPC and the Challenges of Data-intensive Research

  22. HPC Wales ‘Connecting Researchers, Businesses and People’ – Three Elements • World Class HPC Capacity • Establish an HPC infrastructure and capability that meets the needs of key innovation projects • Provide HPC support and access to business & academic innovation networks • Connect and enable research across Wales in priority sectors • HPC Skills Academy • Deliver significant improvements in advanced computing skills via outreach support • Supply the qualifications needed to upskill Convergence Wales to harness the opportunities of HPC • Provide and facilitate innovative multimode and multi-model methods of blended and credited HPC learning • Provide outreach support to train individuals in the use of HPC and to address specific requirements • Research and Innovation Institute • Deliver research and innovation outputs that have economic impact in Wales • Work with and stimulate ICT industry growth and enhance inward investment opportunities • Engage with priority sectors & individual enterprises through collaborative projects to support job creation and innovation • Support collaborative R&D projects to promote innovation across the priority sectors • [Map: HPC Hub (Tier 1 and Tier 2) linked by high-speed links to sites at Technium Bangor, Technium Glyndwr, Aberystwyth, Technium Lampeter Trinity, Technium Swansea Met, Swansea, Glamorgan, Technium Newport, Cardiff and UWIC] HPC and the Challenges of Data-intensive Research

  23. General context • [Diagram – labels: web interface, workflow automation, access from anywhere, external access, global scheduling, global virtualisation, hubs, tiers, SMEs, partners, HPC Wales] HPC and the Challenges of Data-intensive Research

  24. Global virtualisation of HPC Wales sub-system • SynfiniWay is a service-based IT framework for distributed computing • SynfiniWay abstracts real resources to present a service view of tasks to the business process layer – system & location independent • Users never login to any sub-system • All work is handled through one of the SynfiniWay Directors • Services are published globally across the framework • Any user, anywhere can access any service (where permitted) • Access - web browser, Java client (in desktop) or CLI (login nodes) • I/O can be transferred anywhere in the framework, not just areas within direct reach of execution system • Workflow approach to HPC that avoids the inefficiencies and waste normally produced by increasing scale • Re-use of services in technical workflows • Independence of hardware and system • Transparent access anywhere in the global organisation HPC and the Challenges of Data-intensive Research

  25. Software solution: high-level architecture • [Architecture diagram – labels: HPC Wales System Portal, gateways, SynfiniWay, data, applications, LSF schedulers, Linux and Windows nodes, RTM, ServerView, SynfiniWay PCM] HPC and the Challenges of Data-intensive Research

  26. SynfiniWay functions • Application platforms, business processes and policies • SynfiniWay Portal, user client, admin client, API • Process and Data Engine • Meta-Scheduler – dynamic global orchestration, business-based policies, automated technical processes • Global data access, transfer & view – re-scheduling between servers, heterogeneous resource manager, implicit, optimised, resilient data transfer, bulk & large-file movement • Globalised Dynamic Services – Applications-as-a-Service, service publication & discovery • Security and Governance – unique user identity with RBAC, audit & accounting, traceability & data security • Resilience and robustness – global admin desktop, sensors & monitoring • Local, global, externalised or collaborative infrastructure HPC and the Challenges of Data-intensive Research

  27. Workflow approach to HPC and Data • Ability to build business logic which can be run on a global scope from any location • Easy way to publish defined business intelligence (workflow) • Provide a rich set of features enabling creation of intricate workflow structure • Manage implicit data movement on behalf of users • Remove IT dependencies from end-user business logic HPC and the Challenges of Data-intensive Research

  28. Sample Workflow: SatsumaSynteny • SPINES – three sequence alignment packages; Satsuma is a highly parallelized program for high-sensitivity, genome-wide synteny • [Workflow diagram spanning Cardiff and Bangor: parametrise and submit the workflow; split files are read separately into each task; tasks are scheduled on single cores and concurrent tasks on multiple cores across HTC and medium cluster nodes; result data flow back to the filestore; interpret results, store results and archive output; the scheduler might select any site for execution, services are published from application hosts, and service and server tags allow user targeting] HPC and the Challenges of Data-intensive Research
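
To show the split/scatter/gather shape of this workflow in plain code, here is a minimal sketch using Python's standard library rather than SynfiniWay: the input is split into chunks, the chunks are aligned concurrently, and the results are gathered. The function and file names are hypothetical stand-ins; in the real workflow the alignment tasks are Satsuma jobs published as services across the Cardiff and Bangor clusters.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def split_input(fasta: Path, chunks: int) -> list[Path]:
    """Split the input sequences into per-task files (placeholder)."""
    return [fasta.with_suffix(f".part{i}") for i in range(chunks)]

def align_chunk(chunk: Path) -> Path:
    """Run one alignment task on a chunk (stand-in for a Satsuma job)."""
    # ... invoke the aligner on `chunk` here ...
    return chunk.with_name(chunk.name + ".aln")

def gather(results: list[Path]) -> None:
    """Interpret, store and archive the combined results (placeholder)."""
    print(f"Archiving {len(results)} result files")

if __name__ == "__main__":
    parts = split_input(Path("genome.fasta"), chunks=8)
    with ProcessPoolExecutor() as pool:   # concurrent tasks on multiple cores
        results = list(pool.map(align_chunk, parts))
    gather(results)
```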

  29. Summary and Conclusions • Background to ARCCA and the growth of Data-intensive research • Outlined the nature of the Challenges facing Data-intensive research • Technical, Political & Personal, Financial, Legal & Ethical • How well do we understand these ? • Presented an outline of four Case Studies • PET (Positron Emission Tomography) Scanner • Next Generation Genome Sequencing • Gravitational Waves • HPC Wales HPC and the Challenges of Data-intensive Research
