Life Science and Healthcare Computation on High Performance IT Infrastructures

Jeff Augen, CEO
Andrew Sherman, CTO
Shelton, Connecticut, USA

Outline
• Presymptomatic disease prediction (Augen)
• IT Infrastructure for Computational Biology (Augen)
• The Grid as a Virtual HPC Platform (Sherman)
• High Performance Computational Workflows (Sherman)
• Concluding Thoughts: The Future of Diagnostic Imaging (Augen)

Modern medicine is built on an evolving set of tools.
Timeline:
• pre 1930: History & Physical
• 1930-1950: stethoscope
• 1950-2000: x-ray, Lab Tests, CAT Scan, MRI
• 2000 onward: Applied Genomics
Levels of focus: Whole Body; Organs / Tissues; Cells / Molecules
[Diagram labels:]
• Medical devices, diagnostic tests
• New targets, lead compounds, chemical reactions
• Molecular level understanding of disease
• Genomic information (content databases)
• Complex database infrastructure: clinical, genomic, demographic, tissue specific gene expression

The up-and-down regulation view of gene expression profiling is overly simplistic.
[Figure: Graphical representation of 10 microarray experiments. Y-axis: expression ratio (Cy5/Cy3), ticks at 0.05, 0.50, 5.0, 50.0. X-axis: Experiment # (1-10). Three gene groups (1, 2, 3) are plotted.]
• Groups 1 and 3 are sharply upregulated in experiment #8
• Groups 1 and 2 are sharply downregulated in experiment #5
• Groups 2 and 3 are sharply downregulated in experiment #7

The relationships between genetic makeup and disease are difficult to identify. The solution involves millions of clinical records, extensive genetic analysis, and tissue specific gene expression databases.
Four factors determine clinical status:
• Basic gene sequence
• “Epigenetic” factors
• Random events that affect gene expression and regulation
• Environmental effects on gene expression

Evolving Views Continue to Shape Molecular Biology and Medicine
Mid to late 90’s:
• Monogenic diseases (searching for the gene for…)
2000:
• Death of the one-gene-one-protein view
• Realization that a small number of genes code for millions of proteins
• Complex patterns of up-and-down regulation of individual coding regions
• Focus on the genome
2002 - 2003:
• Minor messages
• RNA interference
• Complex regulatory roles for splicing
• Focus on the transcriptome
2004:
• Focus on the proteome
• Systems biology takes center stage
• Metabolic regulatory networks

The transcriptome contains a variety of RNA species: mRNA, miRNA, siRNA, pre-spliced mRNA, and double-stranded RNA. Many processes contribute to the variety of RNA species present in a cell, and the production of a single protein involves expression of several different genes, including some that code for enzymes involved in post-translational modifications.
[Figure: RNA species and processing pathways in the cell. Labels:]
• Pre-spliced mRNA
• Spliced transcripts
• microRNA (miRNA) precursors
• siRNA precursors
• siRNA duplexes
• siRNA digestion and RISC complex formation
• Single stranded miRNA becomes a template for target mRNA binding (3’ UTR)
• Complementary mRNA binds to the RISC complex
• Endonucleases cleave the target mRNA
• Double stranded RNA synthesized by RNA directed RNA polymerase (RdRP)

The vast majority of transcripts (more than 75%) are present in fewer than 100 copies; half are present in fewer than 10 copies. Since a typical human cell contains approximately 1 million mRNA transcripts at any point in time, these low copy count species can be thought of as existing in the parts-per-million range.
Source: Lynx Therapeutics
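The parts-per-million framing is simple arithmetic. The sketch below assumes the slide's round figure of 1 million transcripts per cell; the function name is illustrative only.

```python
def copies_to_ppm(copies: int, total_transcripts: int = 1_000_000) -> float:
    """Express a transcript copy count as parts per million of the cell's mRNA pool."""
    return copies * 1_000_000 / total_transcripts


# Copy-count thresholds quoted on the slide
for copies in (10, 100):
    print(f"{copies} copies per cell is about {copies_to_ppm(copies):g} ppm")
```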

The Splice Variant Problem is Especially Significant in Transcriptional Profiling
Unspliced mRNA:
5'...AUGUGUUGGAUUACGGCCGAAUGGUAGACCCAAGAACUCAGAUUAUAUAGG...3'
Splice Variant #1 (first reading frame):
5'...AU GUG UUG GAU UAC GGC CGA AUG GUA GAC CCA AGA ACU CAG AUU AUA UAG G...3'
START VAL ASP PRO ARG THR GLN ILE ILE STOP
Splice Variant #2 (second reading frame):
5'...AUG UGU UGG AUU ACG GCC GAA UGG UAG ACC CAA GAA CUC AGA UUA UAU AGG...3'
START CYS TRP ILE THR ALA GLU TRP STOP
Splice Variant #3 (splice site: remove AGACCCAAGA):
5'...AUG UGU UGG AUU ACG GCC GAA UGG UAC CAG AUU AUA UAG G...3'
START CYS TRP ILE THR ALA GLU TRP TYR GLN ILE ILE STOP
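The reading-frame effect on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not a production translator: it uses the standard genetic code, starts at the first in-frame AUG, and stops at the first stop codon. The function and constant names are illustrative. Note that in the standard genetic code UGG encodes tryptophan (W).

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Unspliced mRNA sequence from the slide
UNSPLICED = "AUGUGUUGGAUUACGGCCGAAUGGUAGACCCAAGAACUCAGAUUAUAUAGG"


def translate_orf(rna: str, frame: int) -> str:
    """Translate from the first in-frame AUG to the first stop codon.

    Returns one-letter amino acid codes, or "" if the frame has no AUG.
    """
    peptide = []
    started = False
    for i in range(frame, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if not started:
            if codon != "AUG":
                continue
            started = True
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends translation
            break
        peptide.append(amino_acid)
    return "".join(peptide)


for frame in range(3):
    print(frame, translate_orf(UNSPLICED, frame) or "(no start codon)")
```

Offset 0 corresponds to the slide's "second reading frame" and offset 2 to its "first reading frame"; the same nucleotides yield entirely different peptides depending on where reading begins.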

The human thrombopoietin (TPO) gene produces a mixture of transcripts, all of which translate inefficiently because of an overlapping upstream open reading frame (ORF-7). This low level of translation is normal; increasing the production of TPO causes a disease known as thrombocythaemia.
[Diagram: ORF-7 and the TPO gene. The overlap region normally prevents efficient translation of TPO.]
GCCGCCUCCAUGG ………………(STOP)…AUG………………………(STOP)
Point mutations can restructure ORF-7, increasing the chance that TPO will be translated:
1. Frame shift moves ORF-7 into the TPO reading frame
2. Point mutation in start codon
3. Insertion of premature termination codon exposing the TPO start site
4. Mutations that entirely obliterate the ORF-7 start site
5. Mutations in the ribosomal binding sequence which increase the probability of leaky scanning to the downstream TPO gene

mRNA profiling presents several significant IT challenges. Some are computationally intensive and lend themselves to parallel computing in clustered environments. Others require large multiprocessing machines with a single large memory space. Still others require access to heterogeneous data sources.
• Universal query tools used to search protein and nucleotide sequence databases for mRNA sequences obtained from microarray and signature sequencing
• Sequence alignment algorithms for assembly of complete messages from short oligos on microarrays
• Search algorithms that align pre-spliced messages with known proteins by examining all exon-intron combinations
• Unbounded pattern discovery algorithms which can spot repeating sequences across large gaps
• Algorithms for predicting translational events (e.g. reinitiation and leaky scanning) from base RNA sequences
• Clustering algorithms and neural networks for building and comparing expression profiles

Infrastructure for computational biology can be visualized in layers. Robust parallel computing solutions involve key components at each layer. Life sciences problems are heterogeneous and geographically dispersed as well as computationally intensive; the grid is a perfect infrastructure.
[Diagram layers:]
• Applications & Systems
• Execution Management
• Application-Aware Scheduling System
[Diagram components:]
• Parallel Applications
• High-performance automated workflows
• High-Performance Distributed/Grid Computing Infrastructure
• Runtime accelerator for applications & workflows, including high performance program development APIs
• Batch queuing systems

EMBOSS - A comprehensive set of ~100 analysis tools plus ~50 display and utility applications
• NUCLEIC ACID: 2D STRUCTURE, CODON USAGE, COMPOSITION, CPG ISLANDS, GENE FINDING, MOTIFS, MUTATION, PRIMERS, PROFILES, REPEATS, RESTRICTION, TRANSCRIPTION, TRANSLATION
• PROTEIN: 2D STRUCTURE, 3D STRUCTURE, COMPOSITION, MOTIFS, MUTATION, PROFILES
• ALIGNMENT: CONSENSUS, DIFFERENCES, DOT PLOTS, GLOBAL, LOCAL, MULTIPLE
• PHYLOGENY
• ENZYME KINETICS
• FEATURE TABLES
• EDIT
• DISPLAY
• UTILS: DATABASE CREATION, DATABASE INDEXING, MISC
• INFORMATION

Optimization and parallelization of complex workflows is the key to building high performance computing solutions for bioinformatics.
Step #1: Building a model and finding its additional matches
Workflow: clustalw → hmmbuild → hmmcalibrate → hmmsearch → seqstat
• Clustalw: Builds a basic model (multiple alignment) for a (sequence) database.
• HMMBuild: Builds an advanced model (HMM) based on the basic model.
• HMMCalibrate: Calibrates the search statistics for the model (HMM).
• HMMSearch: Searches for matches of the model (HMM) in a different (sequence) database.
• Seqstat: Computes basic statistics for a (sequence) database.

Workflow Complexity: Parallelism
Step #2: Accelerating the matching step via parallelization
Workflow: clustalw → hmmbuild → hmmcalibrate → Phmmsearch → seqstat
Phmmsearch expands to: FastaSplitter → hmmsearch, hmmsearch, . . . , hmmsearch (in parallel) → PfamJoiner
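The split, parallel search, and join pattern behind Phmmsearch can be sketched in plain Python. This is an illustration under stated assumptions, not the TurboWorx implementation: `fasta_split`, `search_chunk`, and `join_hits` are hypothetical stand-ins for FastaSplitter, hmmsearch, and PfamJoiner, and the "search" is a simple substring match rather than an HMM search.

```python
from concurrent.futures import ThreadPoolExecutor


def fasta_split(fasta_text, n_chunks):
    """Split FASTA text into up to n_chunks lists of (header, sequence) records."""
    records = []
    for block in fasta_text.strip().split(">")[1:]:
        lines = block.splitlines()
        records.append((lines[0], "".join(lines[1:])))
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def search_chunk(chunk, motif):
    """Stand-in for hmmsearch: return headers whose sequence contains the motif."""
    return [header for header, sequence in chunk if motif in sequence]


def join_hits(per_chunk_hits):
    """Stand-in for PfamJoiner: merge per-chunk results into one sorted list."""
    return sorted(hit for hits in per_chunk_hits for hit in hits)


def parallel_search(fasta_text, motif, n_chunks=4):
    """Split the database, search the chunks concurrently, then join the results."""
    chunks = fasta_split(fasta_text, n_chunks)
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        per_chunk = list(pool.map(lambda chunk: search_chunk(chunk, motif), chunks))
    return join_hits(per_chunk)


example_db = ">s1\nMKVLA\n>s2\nGGGG\n>s3\nAAMKV\nLLL\n>s4\nTTTT\n>s5\nMKV\n"
print(parallel_search(example_db, "MKV"))
```

Because each chunk is searched independently, the result does not depend on how many chunks are used; only the elapsed time changes. A real deployment would dispatch external processes to separate machines rather than threads in one interpreter.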

Workflow Complexity: Loops and Conditionals
Step #3: Adding loop structures to the workflow
While (model can be refined) {
    clustal_hmmer (clustalw, hmmbuild, hmmcalibrate, hmmsearch)
    create_alignment_input
}
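One way to read "while the model can be refined" is as a fixed-point loop: rebuild the model from the current members, search again, and stop when no new matches appear. The sketch below makes that assumption explicit. `build_model` and `search` are hypothetical callables standing in for the clustalw/hmmbuild/hmmcalibrate and hmmsearch steps, and the toy k-mer model is only there to make the example runnable.

```python
def refine_model(seed, database, build_model, search, max_rounds=10):
    """Iterate build -> search until the member set stops growing.

    Returns the final member set and the number of rounds executed.
    """
    members = set(seed)
    for round_no in range(1, max_rounds + 1):
        model = build_model(members)           # stand-in for align + build + calibrate
        hits = set(search(model, database))    # stand-in for the search step
        updated = members | hits               # stand-in for create_alignment_input
        if updated == members:                 # nothing new: model cannot be refined
            return members, round_no
        members = updated
    return members, max_rounds


def kmer_model(sequences, k=3):
    """Toy model: the set of all k-mers seen in the member sequences."""
    return {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}


def kmer_search(model, database, k=3):
    """Toy search: sequences that share at least one k-mer with the model."""
    return [seq for seq in database
            if any(seq[i:i + k] in model for i in range(len(seq) - k + 1))]


database = ["AAACCC", "CCCGGG", "GGGTTT", "TATATA"]
members, rounds = refine_model(["AAACCC"], database, kmer_model, kmer_search)
print(sorted(members), rounds)
```

The `max_rounds` cap is a safety assumption; the slide does not say how the real workflow bounds the loop.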

Computational biology involves complex links to heterogeneous systems. Various links provide acceleration opportunities if the workflow is well understood.
[Diagram: an Application / Query Tool reaches six data sources through an Abstraction Layer.]
• Linked / Indexed Sources (Data source #1, Data source #2): Linked Table System, HTML-based tables containing links to multiple data sources
• Federated Sources (Data source #3, Data source #4): Federated Database Server with wrappers for each data source; a cost-based optimizer determines the best query plan, and query fragments are constructed for each data source
• Ontological Links (Data source #5, Data source #6): Terminology Server built on structured life sciences ontology

Optimized workflow involves load balancing, resource management, and data / application parallelism.
• Most life sciences applications do not scale well through multithreading.
• Most life sciences applications have multiple components that can be run in parallel (e.g. protein folding, sequence matching, genome assembly, protein mass spec).
• Understanding and mapping the workflow allows the creation of parallel workflows.
• Data parallelism is as important as application parallelism. Both are part of a typical workflow.
• Saving time on the “wall clock” is as important as saving time on the system clock.

The Grid: A Virtual HPC Facility
[Diagram: multiple Users connected to a Linux Cluster and an SMP.]
Evolution of the Grid
• Data: Many diverse sources and repositories for data (the “Data Grid”)
• Computing:
  • Global: Relatively few powerful managed computing facilities
  • Local: Many heterogeneous machines (often unmanaged)

Virtual High Performance Computing Platforms
An Ideal Virtual HPC Platform Requires (at least) . . .
• Scalable Architecture (SmartGrid on top of the “Compute Grid”)
• Resource Management (Cooperating Local Managers)
• Data Management & Delivery (The “Data Grid”)
• Application Management & Acceleration (Workflows & Parallelism)

SmartGrid Architecture: Building Blocks
• Application Data (Data Grid)
• Compute Grid Sites: Shared Memory Server; ClusterManager Cluster; LSF/SGE/PBS Cluster; Windows, Mac, Linux, UNIX

SmartGrid Architecture: Local Agents
• Application Data (Data Grid)
• Agents: Agent, Agent, Agent, Agent
• Compute Grid Sites: Shared Memory Server; ClusterManager Cluster; LSF/SGE/PBS Cluster; Windows, Mac, Linux, UNIX

SmartGrid Architecture: Global Services
• Hub (SmartGrid Management & Services)
• Application Data (Data Grid)
• Agents: DRM Agent, DRM Agent, SMP Agent, Workstation Agents
• Compute Grid Sites: Shared Memory Server; ClusterManager Cluster; LSF/SGE/PBS Cluster; Windows, Mac, Linux, UNIX

SmartGrid Architecture: Shared System State
• Hub (SmartGrid Management & Services)
• Shared System State
• Application Data (Data Grid)
• Agents: DRM Agent, DRM Agent, SMP Agent, Workstation Agents
• Compute Grid Sites: Shared Memory Server; ClusterManager Cluster; LSF/SGE/PBS Cluster; Windows, Mac, Linux, UNIX

SmartGrid Architecture: External Access
• User & Administration Interfaces (Builder, Web, CLI, Web/Grid Services)
• Hub (SmartGrid Management & Services)
• Shared System State
• Application Data (Data Grid)
• Agents: DRM Agent, DRM Agent, SMP Agent, Workstation Agents
• Compute Grid Sites: Shared Memory Server; ClusterManager Cluster; LSF/SGE/PBS Cluster; Windows, Mac, Linux, UNIX

Other Components of a Virtual HPC Platform
Application Management:
• Applications & Workflows
• Application Integration & Interoperability
• Automation, Repeatability & Compliance
• Performance: Dynamic Scheduling, Concurrency & Parallelism
• Fault Tolerance
Data Management:
• Databases & Files
• Global Accessibility
• Data Delivery
• Caching / Local Storage Mgmt
• Data Security
• Reliability
Resource Management:
• Facility Allocation
• Accessibility
• Authentication & Security
• Local Resource Management & Scheduling

Computational Workflows
A natural, visual way to describe complex analytical processes that involve many computational steps and significant logic.

Characterizing Computational Workflows
[Figure: workflow categories plotted by Business Value and Repetition, with Computational Workflows marked.]
Ref: Production Workflows (Leymann, Roller)

Traditional Life Science Workflows
[Diagram: Access Data → A (Fast) → B (Slow) → C (Fast) → Store Data]
Typical “Human-in-the-Loop” Workflow:
• Manual component startup
• “Cut and paste” data movement
• Sequential execution
• Limited throughput due to “bottleneck components”

A Better Way: Automation & Parallelism
[Diagram: Access Data → A (Fast) → B, B, B running in parallel (Fast) → C (Fast) → Store Data]
TurboWorx High-Performance Workflow:
• Automated component startup & data conversion
• Pipeline acceleration via dynamically distributed execution
• Transparent data-driven parallelism to eliminate bottlenecks
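The benefit of replicating the slow component can be estimated with simple pipeline arithmetic: in steady state, throughput is set by the slowest stage, and running k copies of a stage divides its effective time per item by k. The sketch below is an idealized model with hypothetical stage timings. It ignores data transfer and scheduling overhead and is not the scheduling policy of any particular product.

```python
import math


def replicas_needed(stage_seconds, bottleneck):
    """Copies of the bottleneck stage needed to keep pace with the next-slowest stage."""
    others = [t for name, t in stage_seconds.items() if name != bottleneck]
    if not others:
        return 1
    return max(1, math.ceil(stage_seconds[bottleneck] / max(others)))


def pipeline_throughput(stage_seconds, replicas):
    """Items per second in steady state, limited by the slowest effective stage."""
    effective = [t / replicas.get(name, 1) for name, t in stage_seconds.items()]
    return 1.0 / max(effective)


# Hypothetical per-item timings: A and C are fast, B is the bottleneck
stages = {"A": 1.0, "B": 4.0, "C": 1.0}
k = replicas_needed(stages, "B")
print("replicas of B:", k)
print("throughput before:", pipeline_throughput(stages, {}))
print("throughput after: ", pipeline_throughput(stages, {"B": k}))
```

With these assumed timings, four copies of B bring the pipeline up to the rate of A and C; adding more copies of B beyond that point would not help, because A and C become the limit.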

The Role of Workflows in HPC Grid Computing
Workflows address some critical HPC computing challenges. Workflows:
• Integrate, manage, & accelerate heterogeneous collections of applications using object-oriented design principles
• Facilitate collaboration and reuse to save time in the design, testing, and deployment of new computing solutions
• Enable massive amounts of data processing by exploiting data-parallel concurrent computing without application modification
• Maximize resource and data utilization through predictive dynamic data routing and component scheduling
• Enhance quality through repeatability and monitoring for reliability and compliance

TurboWorx Builder
[Diagram labels:]
• Wizard
• Atomic Component Creation: Application (e.g. ClustalW), Java Method, Jython Script
• TurboWorx Component
• Component Library
• Workflow Component Creation

Data Parallelism and Workflow Concurrency
TurboWorx High-Performance Workflow (as created):
[Diagram: Access Data → A (Fast) → SPLIT → B (Slow) → JOIN → C (Fast) → Store Data]
Splitting enables data and pipeline parallelism: the components can run concurrently on different machines using independent data.

Data Parallelism and Workflow Concurrency
TurboWorx High-Performance Workflow (as executed):
[Diagram: Access Data → A (Fast) → SPLIT → B, B, B running in parallel (Fast) → JOIN → C (Fast) → Store Data]
The Hub determines the degree of concurrency dynamically at run time.

Traditional Data Access Does Not Scale
[Figure: "Traditional" NFS Data Access. Y-axis: Speedup (0 to 60). X-axis: Number of Nodes (0 to 60).]
To enhance scalability, combine:
• Dynamic predictive scheduling
• “Peer-to-peer” data provisioning

Accelerating Data Access: Peer-to-Peer Data Provisioning
Workflow Data Staging
Workflows enable systems to pre-stage data and make use of local caches to avoid excessive load on servers and network infrastructure.

Peer-to-Peer Data Provisioning
[Diagram: a Database is registered and passed through a Splitter to form a Managed Dataset of Segments 1 to 4 on the Data Management Server; Apps on the Compute Nodes each hold Data Segments.]
The first time the dataset is required by an application, segments are moved from the Data Management Server to compute nodes. The nodes cache the data for future use.

Peer-to-Peer Data Provisioning
[Diagram: the Data Management Server holds the Managed Dataset; Compute Nodes, some marked Busy, run Apps and cache Segments 1 to 4.]
When the same dataset is required again, tasks can be scheduled preferentially on nodes where the data segments are already cached. If that’s not possible, and one or more nodes need the cached data but don’t have it, segments can be moved from peer to peer without going back to the main data server, eliminating bottlenecks:
• I/O bottleneck at the main data server
• Communications bottleneck for remote data
The nodes monitor versioning information for cached data, so they know when to download new copies of the segments.
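The provisioning rules described on these two slides can be sketched as a small decision procedure: prefer an idle node that already caches a current copy of the segment; otherwise pick an idle node and fetch the segment from a peer that has a current copy; fall back to the main data server only when no peer can supply it. All names and data structures below are illustrative assumptions, not the actual TurboWorx design.

```python
def has_current(caches, node, segment, current_versions):
    """True if the node caches the segment at the current version."""
    return caches.get(node, {}).get(segment) == current_versions[segment]


def choose_node(segment, idle_nodes, caches, current_versions):
    """Prefer an idle node that already holds a current copy of the segment."""
    for node in idle_nodes:
        if has_current(caches, node, segment, current_versions):
            return node
    return idle_nodes[0] if idle_nodes else None


def choose_source(segment, node, caches, current_versions, server="data-server"):
    """Local cache first, then any peer with a current copy, then the main server."""
    if has_current(caches, node, segment, current_versions):
        return "local-cache"
    for peer in sorted(caches):
        if peer != node and has_current(caches, peer, segment, current_versions):
            return peer
    return server


def provision(segment, idle_nodes, caches, current_versions):
    """Pick a node and a data source, then record the segment in the node's cache."""
    node = choose_node(segment, idle_nodes, caches, current_versions)
    if node is None:
        return None  # every node is busy; the task has to wait
    source = choose_source(segment, node, caches, current_versions)
    caches.setdefault(node, {})[segment] = current_versions[segment]
    return node, source


versions = {"seg1": 1, "seg2": 1, "seg3": 1, "seg4": 1}
caches = {"n1": {"seg1": 1, "seg2": 1}, "n2": {"seg3": 1}, "n3": {}}
print(provision("seg3", ["n3", "n2"], caches, versions))  # cached on an idle node
print(provision("seg1", ["n3"], caches, versions))        # fetched from a peer
print(provision("seg4", ["n2"], caches, versions))        # only the server has it
```

Tracking a version number per cached segment is one simple way to model the versioning check mentioned on the slide: a stale copy is treated the same as a missing one.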

The Result: Scalable Performance
[Figure: Application Scaling. Y-axis: Speedup (0 to 60). X-axis: Number of Nodes (0 to 60).]

Diagnostic Imaging: Business Model
Diagnostic imaging represents one of the largest opportunities in Life Sciences: embedded processors, application servers, storage systems, and integration services. A new business model based on a centralized storage and compute utility is emerging.
• 33,000 radiologists in the US, each reading 15,000 images/year = 495 million images/year
• More than 150 petabytes/year worldwide
• 150 petabytes = $45B and requires 150,000 people to manage
• 3 major manufacturers: GE, Philips, Siemens
Current Versus Proposed Model for Data Distribution
[Diagram labels: Doctor's Office, Clinic, Hospital, Hospital Network, Imaging Center, Network]
Current model (customer managed):
• Customer managed disaster recovery
• "Locked in" technology
• Restricted infrastructure
Proposed model (E-archive/ASP, outsourced):
• Robust disaster recovery
• Flexible storage technology
• Broad infrastructure
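The figures on this slide can be checked, and reduced to per-unit terms, with a few lines of arithmetic. The per-terabyte values below are derived here rather than taken from the slide, and they assume decimal units (1 PB = 1,000 TB).

```python
# Figures quoted on the slide
radiologists = 33_000
images_per_radiologist_per_year = 15_000
petabytes_per_year = 150
storage_cost_usd = 45e9
staff_required = 150_000

TB_PER_PB = 1_000  # assumption: decimal units

images_per_year = radiologists * images_per_radiologist_per_year
terabytes_per_year = petabytes_per_year * TB_PER_PB
cost_per_terabyte_usd = storage_cost_usd / terabytes_per_year
terabytes_per_person = terabytes_per_year / staff_required

print(f"images per year:      {images_per_year:,}")
print(f"cost per terabyte:    ${cost_per_terabyte_usd:,.0f}")
print(f"terabytes per person: {terabytes_per_person:g}")
```

Read this way, the slide's totals imply roughly one person managing each terabyte, which is the inefficiency the centralized utility model is meant to address.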

Diagnostic Imaging: Drivers
• Diagnostic imaging is driving the integration of health care systems globally (much as the Internet drove the integration of business systems)
• Customer pains driving diagnostic imaging adoption are more financially focused than technology driven
• Technology choices, integration and maintenance requirements are becoming too complex for hospital / clinic IT departments
• Regulatory, liability and security issues are driving factors
• Increased adoption of molecular imaging, important for presymptomatic testing, drug delivery and drug efficacy
Components needed for the success of diagnostic imaging systems:
• Terascale-size storage systems (50M radiology studies in the US generate >20 PBytes/year)
• Flexible server clusters
• Integration services, systems support, data centers, architectural systems planning, regulatory compliance

Diagnostic Imaging: Changing the Rules
The evolution of medical informatics and the grid presents an opportunity for a large vendor like Philips to deploy a “game-changing” infrastructure for diagnostic imaging.
• Enable ‘Multi-Modality’ Image Capture, Comparison and Diagnostics
• Build a platform which includes middleware, infrastructure and a ‘Pervasive Imaging’ capability to support multi-modality, *-ology, Computer Assisted Diagnosis (CAD), and 3D Image Reconstruction
• Leverage business partnerships to become the leading IT infrastructure provider for integrating electronic clinical records and diagnostic images
• Provide a clinical platform that can be extended with data mining tools
• Drive the development of e-Utility/Grid services that enable ‘image and diagnostic data management’ for the provider and payor communities
