Bioinformatics

Cindy Burklow, Kyle Eli, Clay Harris

What is Bioinformatics?

  • “Any use of computers to handle biological information.”

    • Or, more specifically:

  • “The use of computers to characterize the molecular components of living things.”

What is Bioinformatics?

  • Biomolecules

  • “Doing Bioinformatics”

    • And simulate!

  • Classical bioinformatics deals primarily with sequence analysis

    • Polymers

    • Monomers

    • Macromolecules

    • Sequences

What is Bioinformatics?

  • “Post-genomic” era

    • Comparative genomics

    • New technologies to measure gene expression

    • Large-scale methods for identifying gene function

    • A shift to finding gene products

      • Proteomics

      • Structural Genomics

Bioinformatic Fields



Computational Biology


Mathematical Biology

Medical informatics/Medinformatics





  • Basic Local Alignment Search Tool (BLAST)

  • Collection of Software Program Tools

  • Software version 2.1.13 offered by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health

  • Compares nucleotide or protein sequences to sequence databases

  • Finds regions of local similarity between sequences

  • Calculates the statistical significance of matches

  • Helps infer functional relationships between sequences and identify members of gene families


  • Offers different program tools & databases

  • Provides a guide to help users decide which BLAST tool to use, based on:

    • Nature & size of the input query

    • Primary goal of the search

  • A BLAST search comprises four components:

    • Query

    • Database

    • Program

    • Search purpose/goal

Ways to interface with BLAST

  • Uses a standardized application program interface (API) for accessing the NCBI QBlast system

  • Uses direct HTTP-encoded requests to NCBI web server

  • Blast utilities allow you to run searches on your own computer

  • NetBlast has command-line network clients that allow you to submit searches to NCBI
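
As a rough illustration of the "direct HTTP-encoded requests" route, the sketch below only builds the encoded request bodies and does not contact any server. The endpoint and the parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) reflect the NCBI BLAST URL API as commonly documented today, which may differ from the version the slides describe; the function names and the example request ID are hypothetical.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the NCBI BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_request(query, program="blastn", database="nt"):
    """Encode a search submission: query, program and database in one request."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,    # which BLAST program to run
        "DATABASE": database,  # which sequence database to search
        "QUERY": query,        # the input sequence
    }
    return urlencode(params)


def build_result_request(request_id, format_type="Text"):
    """Encode the follow-up request that fetches results for a request ID."""
    return urlencode({"CMD": "Get", "RID": request_id, "FORMAT_TYPE": format_type})


submit_body = build_submit_request("ACGT")
result_body = build_result_request("ABC123")  # hypothetical request ID
```

The two bodies would be sent to BLAST_URL with an HTTP client; sending is left out so the sketch stays runnable offline.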

A Case Study of High-Throughput Biological Data Processing on Parallel Platforms

San Diego Supercomputer Center and Department of Pharmacology, University of California

History

  • Structure comparison algorithms for protein structures have been developed over the past 20 years

  • Traditionally uses conventional functionally-driven structure determination

  • Algorithms are classified by the unit used to build alignments:

    • Single residues

    • Fragments of multiple residues

    • Secondary structure elements

  • CHALLENGE: Highly redundant datasets require very large computations to gain insight into the meaning of the data

Protein Structures

  • What is important about protein structures?

    • They are used for protein classification, a better understanding of function, and a clear explanation of distant homologous relationships that is not possible from sequence alone, since sequence is more variable than structure

  • Comparing a single data sequence string against a very large database, the Protein Data Bank (PDB)

  • Types of comparisons:

  • Sequence-Sequence

  • Sequence-Structure

  • Structure-Structure

Scale of Problem

  • Protein Data Bank of 35,000 chains

  • Pairwise comparison = average ~3 seconds.

  • Without considering redundancy or chain size, a complete computation would take on average ((35,000 × 35,000)/2) × 3 seconds ≈ 21,000 processor-days, or 58 YEARS! TIME IS A BIG PROBLEM!
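
The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the slide's own figures; the `wall_clock_days` helper and its assumption of perfect scaling are illustrative additions, not a claim about CEPAR's measured performance.

```python
CHAINS = 35_000
SECONDS_PER_COMPARISON = 3
SECONDS_PER_DAY = 86_400

# Number of pairwise comparisons, as approximated on the slide.
comparisons = CHAINS * CHAINS / 2
total_seconds = comparisons * SECONDS_PER_COMPARISON
processor_days = total_seconds / SECONDS_PER_DAY
processor_years = processor_days / 365


def wall_clock_days(processors):
    """Idealized run time if the work divided perfectly across processors."""
    return processor_days / processors


print(f"{processor_days:,.0f} processor-days, about {processor_years:.0f} years")
```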

Problems

  • Determination & Comparison of 3-D protein structures

  • Massively parallel computations are needed

Background

  • Looking for more efficient way to analyze large data sets

  • Taking advantage of redundancy present in data sets

  • KEY: Data preprocessing step & organization of the data being searched BEFORE being passed to PARALLEL COMPUTERS

Other Issues to Consider

  • Algorithm should give optimal performance

  • Scale with the number of processors involved.

Optimization Procedures

  • Dynamic Programming

  • Monte-Carlo

  • Graph Theory

  • Combinatorial Search

What does CEPAR stand for?


Combinatorial Extension Algorithm

Parallel Mode

What is Combinatorial Extension Algorithm?

  • Method of automatically aligning pairs of structures

  • Compiles an alignment of a given pair of protein chains by considering the chains sectioned into all possible octapeptide fragments, as defined by the backbone α-carbons

  • Those octapeptide pairs that have a high distance-based similarity score are deemed “aligned fragment pairs” & used in the next step of analysis

  • Then the CE algorithm tries to join each aligned fragment pair (AFP) to a maximal number of other AFPs in order to create the longest possible alignment path through the two proteins in consideration (w/ allowance for gaps of up to 30 residues in either protein chain). This stitches together a set of AFPs covering contiguous regions.

What is Combinatorial Extension Algorithm?

  • After possible paths through two proteins are determined, CE uses additional heuristics to try to improve the final alignment

  • The 20 best scoring paths are compiled & proteins are directly compared based upon the super-imposition of the aligned residues.

  • The path that yields the lowest Root Mean Square Deviation (RMSD) is retained as the “optimal path”.

  • Then this path is subjected to dynamic programming on the structural alignment directly between the two structures, which tests all possible residue equivalences & the resulting RMSD from their superposition.
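
The fragment-pairing step can be sketched as follows. This is a simplified illustration, not CE's actual scoring function: it assumes each chain is a list of α-carbon (x, y, z) coordinates, scores a fragment pair by the mean absolute difference of their internal distance matrices, and uses an arbitrary threshold. Path extension, the 30-residue gap allowance and the RMSD-based refinement are not shown.

```python
import math

FRAGMENT_LENGTH = 8  # octapeptide fragments, as on the slides


def distance_matrix(coords):
    """All pairwise distances between the alpha-carbons of one fragment."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def fragment_score(frag_a, frag_b):
    """Mean absolute difference of internal distances (0 means same shape)."""
    da, db = distance_matrix(frag_a), distance_matrix(frag_b)
    n = len(frag_a)
    total = sum(abs(da[i][j] - db[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


def aligned_fragment_pairs(chain_a, chain_b, threshold=1.0):
    """Return (i, j, score) for every fragment pair scoring below threshold."""
    pairs = []
    for i in range(len(chain_a) - FRAGMENT_LENGTH + 1):
        for j in range(len(chain_b) - FRAGMENT_LENGTH + 1):
            score = fragment_score(chain_a[i:i + FRAGMENT_LENGTH],
                                   chain_b[j:j + FRAGMENT_LENGTH])
            if score < threshold:
                pairs.append((i, j, score))
    return pairs


# Hypothetical input: a 10-residue helix-like trace and a translated copy.
helix = [(math.cos(t), math.sin(t), 0.5 * t) for t in range(10)]
shifted = [(x + 5.0, y - 2.0, z + 3.0) for x, y, z in helix]
pairs = aligned_fragment_pairs(helix, shifted)
```

Because the score depends only on internal distances, it is unaffected by how the two chains are positioned in space, which is why the translated copy matches.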

Parallel Algorithm

  • CEPAR uses coarse-grain parallel implementation involving a master/worker strategy suitable for a massively parallel computer architecture.

  • A parallel algorithm, as opposed to a traditional serial algorithm, is one which can be executed a piece at a time on many different processing devices, and then put back together again at the end to get the correct result.

What does CEPAR do?

  • Finds pairwise protein structure similarities

  • Pairwise 3D protein structure comparison

  • Aligns protein structure from Protein Data Bank

  • Matches protein structure-to-structure

  • Runs on a large number of processors

How does CEPAR work?

  • Optimizes the use of Combinatorial Extension algorithm for the pairwise alignment of polypeptide chains to manage comparative structural information

  • Builds a structurally representative set of protein chains & reveals structure similarities in the Protein Data Bank that scale with fast growing source of data

How does CEPAR work?

  • Only one master processor was used. It was not advantageous to use more than one master processor because of communication issues.

  • Each worker receives a work assignment from the master, compares the 2 entities contained in the assignment using the CE algorithm, returns the results of the comparison to the master, and is then ready to receive another assignment

  • Workers only need to communicate with the Master processor and not each other

  • Program written in C++ and uses MPI for communication between master & workers
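
The master/worker pattern described above can be sketched in a few lines. This is not the CEPAR code: standard-library threads and queues stand in for C++ and MPI, and `compare` is a hypothetical placeholder for a CE comparison. As on the slides, workers talk only to the master, never to each other.

```python
import queue
import threading


def compare(pair):
    """Placeholder for a CE pairwise comparison; returns a dummy score."""
    a, b = pair
    return abs(len(a) - len(b))


def worker(assignments, results):
    """Take assignments from the master until a None sentinel arrives."""
    while True:
        pair = assignments.get()
        if pair is None:
            break
        results.put((pair, compare(pair)))


def run_master(pairs, n_workers=4):
    """Master: hand out assignments, then collect one result per assignment."""
    assignments, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(assignments, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for pair in pairs:
        assignments.put(pair)
    for _ in threads:
        assignments.put(None)  # one stop signal per worker
    collected = dict(results.get() for _ in pairs)
    for t in threads:
        t.join()
    return collected


scores = run_master([("AAAA", "AA"), ("ACD", "ACDE"), ("G", "G")])
```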

Computer

  • “Blue Horizon” – IBM SP parallel computer at the San Diego Supercomputer Center

  • 1152 Power3+ processors each running at 375MHz

  • Sun Enterprise 10,000 server & Linux PC cluster

  • Software can work on any parallel machine or PC cluster with Message Passing Interface (MPI)

Assignments & Problem Formulation

  • Entity list of N entities, where each entity is a protein polypeptide chain characterized by an amino acid sequence & a set of 3D coordinates

  • Algorithm for pairwise comparison of entities (CE)

  • Select Representative Protein Structure

  • Order of Operations

Representation Criteria Notes

  • Looking for similarity criterion between representatives

  • Alignments not satisfying this criterion are not recorded

  • Output: List of representatives as well as entities represented by them & detailed information on alignment satisfying either representative or similarity criterion

  • It is not vector quantization (so as to minimize computer time)

  • Representatives are randomly chosen instead of calculating the centroid of a cluster

  • The applied criteria are believed to adequately describe the structural space of the Protein Data Bank

Representation Criteria

Sequence Lengths of two entities: L1 & L2

Length difference threshold parameter: Lthr

Number of aligned positions: Lali

Alignment length threshold parameter: Athr

Representation Criteria

Gap threshold parameter: Gthr

Number of residues in gaps: Lgap

Final RMSD of the alignment: RMSD < Rthr, where Rthr is the RMSD threshold parameter

Order of Operation

  • Entity-first (2-step)

  • Family-first (2-step)

  • Family-first (1-step)

New problems uncovered….

  • Running CEPAR in one step results in limited scalability

  • WHY? At high processor counts, two factors grow:

    • 1. Number of idle workers

    • 2. Time taken for communication operations

  • This is the result of load imbalance at the end of the run, because at this point most of the worker processors have run out of tasks while only a few are finishing their last assignment.

  • Resource reservation systems on most public supercomputers reserve a block of processors, making it impossible to release them one by one.

How to deal with Limited Scalability Issue

  • Ideal production mode: the number of processors assigned should stay below a threshold (processor number < threshold number)

  • Use an alternative: two steps instead of one

  • Utilizes an early stopping condition, which causes the 1st of the two runs to abort when the accumulated avg. idle time of workers exceeds a predefined amount (such as 20% of the total run time).

  • The remaining part of the calculation is then completed on a smaller number of processors.
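
The early stopping condition might be expressed roughly as below. The function name, its arguments and the way idle time is tracked are assumptions made for illustration; only the 20% figure comes from the slide.

```python
def should_abort(idle_seconds_per_worker, elapsed_seconds, max_idle_fraction=0.20):
    """True when the average accumulated worker idle time exceeds the limit.

    idle_seconds_per_worker: accumulated idle time of each worker so far.
    elapsed_seconds: total run time so far.
    """
    if not idle_seconds_per_worker or elapsed_seconds <= 0:
        return False
    average_idle = sum(idle_seconds_per_worker) / len(idle_seconds_per_worker)
    return average_idle > max_idle_fraction * elapsed_seconds


# Early in the run workers are busy; near the end most sit idle.
keep_going = should_abort([10, 20, 30], elapsed_seconds=200)
time_to_stop = should_abort([50, 60, 70], elapsed_seconds=200)
```

When the check fires, the first run would stop and the unfinished assignments would be carried over to a second run on fewer processors.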

Two other problems….

  • Master processor congestion

  • Redundancy in assignments

  • How to avoid congestion….

    • Improve communications between processors

    • Implement advance buffering of assignments

    • Decrease amount of disk I/O

    • Implement single-CPU optimization techniques

Keys to success

  • Detecting a match between rep & entity to avoid redundancy.

  • Important to sort reps in decreasing order of chance of being similar to the given entity.

  • Estimate the chance by giving priority to those reps whose number of residues is within 10% of that of the current entity AND by using similarity in amino acid content based on frequency profiles.

  • The approach is approximate but provides performance gains over a random or sequential choice of reps.
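
One way to read the ordering heuristic is sketched below. The distance measure between frequency profiles (sum of absolute differences) and the way the two criteria are combined into a sort key are assumptions; the slides only say that length within 10% and amino acid composition are both used.

```python
from collections import Counter


def composition(sequence):
    """Amino acid frequency profile of a sequence."""
    counts = Counter(sequence)
    return {aa: n / len(sequence) for aa, n in counts.items()}


def profile_distance(seq_a, seq_b):
    """Sum of absolute frequency differences (0 means same composition)."""
    pa, pb = composition(seq_a), composition(seq_b)
    return sum(abs(pa.get(aa, 0.0) - pb.get(aa, 0.0)) for aa in set(pa) | set(pb))


def rank_representatives(entity, representatives, length_tolerance=0.10):
    """Order representatives so the likeliest matches are tried first."""
    def key(rep):
        similar_length = abs(len(rep) - len(entity)) <= length_tolerance * len(entity)
        # False sorts before True, so similar-length reps come first.
        return (not similar_length, profile_distance(entity, rep))
    return sorted(representatives, key=key)


# Hypothetical sequences for illustration.
entity = "AAAAGGGGCC"
reps = ["WWWWWWWWWW", "AAAAGGGGCC" * 3, "AAAAGGGCCC"]
ordered = rank_representatives(entity, reps)
```

Trying the likeliest representative first means a match is usually found after fewer CE comparisons, which is where the time saving comes from.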

MPI Communication

  • At first, the efficiency of MPI communication appears to play an insignificant role in overall performance, since communication time is a small fraction of the overall CEPAR computation time. However, the time does add up, and efficient MPI communication does help.

  • Key: Select the appropriate MPI send function for the hardware/software at hand.

  • Example: IBM’s implementation of MPI’s blocking send function MPI_Send() is not appropriate because this implementation does not buffer the message for large message sizes.

  • MPI implementations that avoid buffering messages can cause deadlock in some cases.

  • In CEPAR no deadlocks occur. However, the master processor can be blocked while waiting for some worker processors to finish. The MPI_Bsend() function for buffered sends solves this problem.

Results

  • Family-First approach outperformed the Entity-first approach.

  • End-of-run load imbalance and allocation of processors were addressed with the two-step approach

  • Careful Selection of MPI implementation

  • Overall CEPAR performance….

Advantages of CEPAR

  • Ensures optimal use of high-performance computing resources

  • Analysis of large amounts of data

  • Can be used on any distributed-memory platform

  • Can scale with the number of processors involved

  • Saves time & computational resources

Summary

  • Efficient use of resources depends on meticulous design of the algorithm and software, with performance & scalability given a high priority.

  • Organization of data being fed to processors

  • Optimization of algorithm for distribution of assignments


Proteomics

What is Proteomics?

  • The study of the proteome.

    • A proteome is “the set of proteins that can be expressed by the genetic material of an organism.”

    • In other words, the study of all proteins, the interactions between them, and “their role in physiological and pathophysiological functions”.

    • Hopefully will directly contribute to a full description of cellular function.

Challenges in Proteomics Research

  • Limited and variable sample material.

  • Sample Degradation.

  • Vast dynamic range.

    • For example, in human serum the concentration of albumin is 10 billion times greater than the concentration of the signaling protein interleukin-6.

Challenges in Proteomics Research (cont’d)

  • Plethora of post-translational modifications.

  • Nearly boundless tissue, developmental and temporal specificity.

  • Disease and drug perturbations.

  • “…these difficulties render any comprehensive proteomics project an inherently intimidating and often humbling exercise.”

Five Pillars of Proteomics Research

  • Mass spectrometry-based.

  • Proteome-wide biochemical arrays.

  • Systematic structural biology and imaging techniques.

  • Proteome informatics.

  • Clinical applications.

Mass spectrometry-based Proteomics

  • A primary driving force in proteomics.

  • Advancements allow the identification of smaller proteins in more complex mixtures.

  • Initially, research required separation of protein by two-dimensional gel electrophoresis before using mass spectrometry.

    • Limited to the most abundant proteins.

Mass spectrometry-based Proteomics (cont’d)

  • Now, mass spectrometric analysis is used directly.

    • Advancements are increasing sensitivity, robustness and data handling.

    • Plenty of work to do…

      • Much higher throughput and sensitivity is needed for observing proteome dynamics and cellular response.

      • More complete sequence coverage.

      • Process and workflow refinement.

      • Automated protein identification.

      • Detection of post-translational modification.

Array-based Proteomics

  • Array of immobilized proteins on a support surface.

  • One of the most active areas in biotechnology.

    • Sensitive, high-throughput.

  • Wide range of applications.

    • Diagnostics.

    • Protein-protein interaction.

    • Protein expression profiling on a small or large scale.

    • Target identification and validation in the pharmaceutical industry.

Array-based Proteomics (cont’d)

  • Arrays give an abundance of data for a single experiment.

  • Data handling demands sophisticated software and data comparison analysis.

    • Some of the software used for DNA arrays is applicable, along with much of the hardware and detection systems.

Structural Proteomics

  • Systematically understanding the structural basis for protein interactions and function.

  • Full description of cell behavior requires structural information for all salient protein complexes and their organization at a cellular level.

  • Requires a wide scale of measurements…

    • From X-ray crystallography and nuclear magnetic resonance at the protein level…

    • …to electron microscopy of mega-complexes and electron tomography for high-resolution visualization of the entire cellular environment.

  • Modeling of dynamics and interaction through computer simulation.

Informatics

  • Proteomics research generates an enormous amount of data.

    • A “simple” experiment for a single microbe involving 90 biological samples could generate 18TB of proteomics data.

    • Sample documentation, rigorous process standards, and proper annotation are necessary.

    • Software development requires a collaborative and documented design process.

      • Data stored as XML with an agreed-upon schema.

      • HUPO (Human Proteome Organization) defines community standards for data representation:

Informatics (cont’d)

  • Considerable effort has been applied to interaction databases and systems biology software infrastructure.

  • A system for automating protein identification from mass spectral data is needed for generating databases.

    • Currently a manual and error-prone process.

  • Much was learned from DNA array analysis.

Informatics (cont’d)

  • Current equipment is far from optimal.

    • Manufacturers need time to build platforms tailored specifically for proteomics.

    • Mass spectrometry should improve significantly.

      • Large market for sensitive, affordable mass spectrometers.

    • Robotics for sample preparation.

  • Availability of large datasets will drive research.

    • Modeling cellular behavior.

Informatics (cont’d)

  • Open access for proteomics researchers is needed.

    • Academic institutions typically have the basic necessary tools.

      • Mismanagement of data.

      • Poor throughput.

      • Equipment is extremely expensive.

    • National proteome centers have been proposed to make expertise and equipment more available.

Informatics (cont’d)

  • Lessons learned from genome sequencing.

    • Raw data must be publicly accessible on-line to foster a sense of participation.

    • Agreements that mandate public accessibility and non-patenting of basic data

    • Large-scale efforts must be coordinated to avoid duplication.

      • Also, funding.

Clinical Proteomics

  • Proteomics impacts diagnostics as well as drug discovery.

    • Most drug targets are proteins.

  • Currently a variety of technological platforms in development.

    • Still undecided as to which methods will work best.

  • The robust and high-throughput nature of mass spectrometric instrumentation is eminently suited to clinical applications.

Clinical Proteomics (cont’d)

  • Protein- and antibody-based arrays with validated diagnostic readouts may also become amenable to the clinical setting.

  • Proteomics accelerates drug discovery.

    • Understanding biological networks within a cell will provide a basis for identifying suitable targets.

Computational Proteomics Examples

  • Protein Docking

    • In cellular biology, function is accomplished by proteins interacting with themselves and other molecular components.

    • Helps verify our understanding of the energetics of macromolecular interactions.

    • Characterization of the structures of protein-protein complexes.

RosettaDock

TreeDock

  • TreeDock uses a deterministic search

  • Can explore all orientations at a very fine resolution in a reasonable amount of time.

TreeDock

  • Searching for docking configurations…

    • Provide models of each molecule

    • Provide anchors for each molecule

      • Not necessary for small molecules, all atoms will be tried

TreeDock

  • One molecule has a fixed position, other is movable

  • Movable molecule is translated, rotated while maintaining contact between anchors

  • All positions are tried within a specified resolution

The Problem

  • Abdominal aortic aneurysms (AAAs) occur in 5-7% of people over age 60 in the US

  • Some individuals have aorta thickening but never have an AAA

  • Chemical precursors to AAA are unknown

  • Current drugs treat the symptoms not the cause

Purpose

  • Analysis of large 2D FTIR microspectroscopic data sets for anomalies to …

  • Determine why infusion of Angiotensin II (AngII) into Apolipoprotein E (apoE) -/- knockout mice causes aorta thickening in some mice and aneurysm in other mice…

  • Identify chemical precursors to AAA and ultimately…

  • Save Lives!

Data Analysis Issues with 2D FTIR Microspectroscopy

  • Spectral features are a blend of what is in each sample

  • Datasets are very continuous in nature (Principal Component Analysis (PCA) is often not sufficient to identify chemically similar clusters)

  • Subclusters within each PC may be overlooked

  • Large datasets (10s of GBs) require substantial computational resources for typical statistical analysis

Large Dataset Example

Scores Analysis with Quantile-Quantile Plots (SAQQ) – The Concept

  • Principal Component Analysis (PCA)

  • Quantile-Quantile (QQ) Plotting of a single PC

  • Linear regression to find “normal” distributions

  • Average the original data to find multidimensional centers

  • Calculate loadings with inverse principal axis transformation

SAQQ – The Concept

  • Calculate QBEAST distances to all points from each cluster center

  • Reorganize distances into the original map configuration

  • Create “digitally stained” images based upon distance (highlight spectral deviations from the normal distribution)

Principal Component Analysis

  • Linear dimension-reduction technique

  • Points in multidimensional space are projected onto a space of fewer dimensions

  • Creates a new coordinate system based upon variance

  • The first axis (PC) has the greatest variance of any projection, the second has the second greatest orthogonal variance, and so on…
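
To make the idea of a variance-based axis concrete, here is a minimal pure-Python sketch that finds only the first principal component by power iteration on the covariance matrix. Power iteration is a stand-in chosen to avoid external libraries; the slides do not say how the PCA was computed, and a real analysis of data this size would use a numerical library.

```python
def first_principal_component(points, iterations=200):
    """Return the unit vector along which the points vary the most."""
    n, d = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - means[k] for k in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Assumes this start vector is not orthogonal to the leading axis.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec


def pc_scores(points, axis):
    """Project the mean-centered points onto the axis (the PC scores)."""
    d = len(axis)
    means = [sum(p[k] for p in points) / len(points) for k in range(d)]
    return [sum((p[k] - means[k]) * axis[k] for k in range(d)) for p in points]


# Points lying on the line y = 2x, so the first PC points along (1, 2).
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
pc1 = first_principal_component(data)
scores_on_pc1 = pc_scores(data, pc1)
```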

SAQQ – The Quantile-Quantile Plot

  • Plot order statistics vs. normal cumulative distribution function

SAQQ – Linear Regression of the QQ plot

  • Step 1: Take the first (next) 10% of the data

  • Step 2: Calculate r² and compare to 0.9

  • Step 3: If r² > 0.9, add the next point and go to step 2

  • Step 4: If r² < 0.9, consider the data a cluster and go to step 1
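
The loop above can be sketched for the scores of a single PC as follows. Several details are assumptions, since the slides do not give them: the plotting positions (i + 0.5)/n for the normal quantiles, a minimum seed of three points, and treating a flat run of identical scores as perfectly linear.

```python
from statistics import NormalDist


def r_squared(xs, ys):
    """Squared Pearson correlation of a straight-line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if syy == 0:
        return 1.0  # identical scores: treated as perfectly linear
    return sxy * sxy / (sxx * syy)


def qq_segments(scores, r2_threshold=0.9, start_fraction=0.10):
    """Split sorted scores into runs whose QQ plot stays close to a line."""
    ordered = sorted(scores)
    n = len(ordered)
    # Theoretical normal quantile for each order statistic.
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    seed = max(3, round(start_fraction * n))
    segments, start = [], 0
    while start < n:
        end = min(start + seed, n)  # step 1: take the next 10%
        # Steps 2-3: while the fit is good, add the next point and re-test.
        while end < n and r_squared(quantiles[start:end], ordered[start:end]) > r2_threshold:
            end += 1
        segments.append(ordered[start:end])  # step 4: close this cluster
        start = end
    return segments


# Hypothetical scores with two well-separated groups.
scores = [0.1 * i for i in range(20)] + [50 + 0.1 * i for i in range(20)]
segments = qq_segments(scores)
```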

SAQQ – The Quantile-Quantile Plot

  • SAQQ must be applied to all PCs

SAQQ Continued

  • Average the original data to find multidimensional centers

  • Calculate loadings with inverse principal axis transformation

  • Calculate QBEAST distances to all points from each cluster center

SAQQ – QBEAST Distances

  • QBEAST takes into account skew as well as dispersion

  • QBEAST is faster than Mahalanobis as n samples approach d dimensions

  • QQ plot parameterizes non-normal distributions

QBEAST Distances

Mahalanobis Distances

Euclidean Distances

SAQQ Continued

  • Reorganize distances into the original map configuration

  • Create “stained” images based upon distance (highlight spectral deviations from the normal distribution)

Cluster Analysis Using SAQQ

6.25 x 6.25 μm pixel size (113 pixels x 102 pixels x 410 spectral data points)

Separation of two Identical Gaussian Clusters

  • 3 SDs (cluster displacement)

  • 3 SDs (size increase)

  • 4 SDs (size decrease)

The Problem – FTIR Microspectroscopic Data Overload

  • Approximately 1 GB of raw data per hour collected

  • 100s of GB of data waiting to be analyzed

  • Massive array size (250,000 x 1000 double-precision)

  • Massive file sizes (~ 1 GB compressed binary)
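
For a sense of scale, the in-memory size of the array mentioned above follows from simple arithmetic, assuming 8 bytes per double-precision value and no storage overhead.

```python
ROWS, COLUMNS = 250_000, 1_000
BYTES_PER_DOUBLE = 8  # IEEE 754 double precision

array_bytes = ROWS * COLUMNS * BYTES_PER_DOUBLE
array_gib = array_bytes / 2 ** 30

print(f"{array_bytes:,} bytes, about {array_gib:.2f} GiB")
```

A single such array therefore needs roughly 2 GB of memory before any intermediate results are stored.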

Specific Aims

Identify precursors to AAA by using SAQQ to rapidly reduce data obtained from FTIR microspectrometry, producing digitally stained images corresponding to those clusters.

Identify overlapping clusters of collagen I, collagen III, elastin, macrophages, and necrotic debris

SAQQ analysis of PC1 of x-bk-1

SAQQ analysis of PC2 of x-bk-1

Proposed Research on Abdominal Aortic Aneurysm

  • Process data with SAQQ

    • Understand vessel wall thickening

    • Identify biochemical pathways to aneurysm

  • Develop iterative SAQQ

    • Apply to reduce 60 “stained” images down to 1

  • Develop better linear fitting algorithms

Conclusions

  • SAQQ is a useful method as a digital staining technique

  • SAQQ “stains” based upon chemical significance

  • SAQQ allows progress in determining the chemical process behind AAA formation

References

  • BLAST -

  • CEPAR -

  • Protein Data Bank -

  • Bioinformatics Fields -


References



  • From Genomics to Proteomics. M. Tyers, M. Mann. Nature 2003 Mar;422(6928);193-7.




  • Treedock: A Tool for Protein Docking Based on Minimizing van der Waals Energies. A. Fahmy, G. Wagner. JACS 2002; Vol 124, No. 7

  • Protein-Protein Docking with Simultaneous Optimization of Rigid-body Displacement and Side-chain Conformations. J. Gray, S. Moughon, C. Wang, O. Schueler-Furman, B. Kuhlman, C. Rohl, D. Baker. JMB 2003; Vol 331;281-299