Bioinformatics



Bioinformatics

Cindy Burklow, Kyle Eli, Clay Harris



What is Bioinformatics?

  • “Any use of computers to handle biological information.”

    • Or, more specifically:

  • “The use of computers to characterize the molecular components of living things.”



What is Bioinformatics?

  • Biomolecules

  • “Doing Bioinformatics”

    • And simulate!

  • Classical bioinformatics deals primarily with sequence analysis

    • Polymers

    • Monomers

    • Macromolecules

    • Sequences



What is Bioinformatics?

  • “Post-genomic” era

    • Comparative genomics

    • New technologies to measure gene expression

    • Large-scale methods for identifying gene function

    • A shift to finding gene products

      • Proteomics

      • Structural Genomics



Bioinformatic Fields

Biophysics

Cheminformatics

Computational Biology

Genomics

Mathematical Biology

Medical informatics/Medinformatics

Pharmacogenomics

Pharmacogenetics

Proteomics



BLAST

  • Basic Local Alignment Search Tool (BLAST)

  • A collection of software program tools

  • Software (version 2.1.13) is offered by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health

  • Compares nucleotide or protein sequences to sequence databases

  • Finds regions of local similarity between sequences

  • Calculates the statistical significance of matches

  • Helps infer functional relationships between sequences and identify members of gene families



BLAST

  • Offers different program tools & databases

  • Provides a guide to help users decide which BLAST tool to use, based on the nature and size of the input query and the primary goal of the search

  • A BLAST search comprises four components: query, database, program, and search purpose/goal; a minimal usage sketch follows below
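
A minimal usage sketch in Python, using Biopython's NCBIWWW.qblast wrapper; the program, database, and query sequence below are arbitrary illustrative choices, not taken from the slides:

    # Minimal BLAST search sketch with Biopython (illustrative; needs network access).
    # Requires: pip install biopython
    from Bio.Blast import NCBIWWW, NCBIXML

    # The four components of a search: query, database, program, and purpose
    # (here: a nucleotide search against the nt database).
    result_handle = NCBIWWW.qblast(program="blastn", database="nt",
                                   sequence="ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

    record = NCBIXML.read(result_handle)        # parse the XML result
    for alignment in record.alignments[:3]:     # report the top few hits
        hsp = alignment.hsps[0]
        print(alignment.title[:60], "E-value:", hsp.expect)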



Ways to interface with BLAST

  • A standardized application programming interface (API) for accessing the NCBI QBlast system

  • Direct HTTP-encoded requests to the NCBI web server (see the hedged sketch after this list)

  • Standalone BLAST utilities that let you run searches on your own computer

  • NetBlast, a command-line network client that submits searches to NCBI
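
As a hedged sketch of the direct-HTTP route mentioned above, the request below targets NCBI's Blast.cgi URL API from Python; the parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) follow NCBI's URL API as best recalled and should be checked against current NCBI documentation before use:

    # Hedged sketch: submitting a BLAST search via direct HTTP requests.
    # Requires: pip install requests
    import re
    import time
    import requests

    URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"
    seq = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"

    # Submit the query; the response embeds a request ID (RID) to poll later.
    put = requests.post(URL, data={"CMD": "Put", "PROGRAM": "blastn",
                                   "DATABASE": "nt", "QUERY": seq})
    rid = re.search(r"RID = (\S+)", put.text).group(1)

    # Poll until the search finishes, then fetch the report as plain text.
    while True:
        time.sleep(30)
        get = requests.get(URL, params={"CMD": "Get", "RID": rid,
                                        "FORMAT_TYPE": "Text"})
        if "Status=WAITING" not in get.text:
            break
    print(get.text[:500])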



A Case Study of High-Throughput Biological Data Processing on Parallel Platforms

San Diego Supercomputer Center and Department of Pharmacology, University of California



History

  • Work developing structure-comparison algorithms for protein structures has been going on for over 20 years

  • Traditionally relies on conventional, functionally driven structure determination

  • Algorithms are classified by the units used to build alignments: single residues, fragments of multiple residues, or secondary-structure elements

    • CHALLENGE: highly redundant datasets require very large computations to extract meaning from the data



Protein Structures

  • What is important about protein structures?

    • They are used for protein classification, a better understanding of function, and explanation of distant homologous relationships that cannot be detected from sequence alone, since sequence is more variable than structure

  • Comparing a single query against a very large database, the Protein Data Bank (PDB)

  • Types of comparisons:

  • Sequence-Sequence

  • Sequence-Structure

  • Structure-Structure



Scale of Problem

  • Protein Data Bank of 35,000 chains

  • A pairwise comparison takes ~3 seconds on average

  • Without considering redundancy or chain size, a complete all-against-all computation would take ((35,000 × 35,000) / 2) × 3 seconds ≈ 21,000 processor-days, or about 58 years on a single processor. TIME IS A BIG PROBLEM!
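
A quick sanity check of the arithmetic quoted above:

    # Back-of-the-envelope check of the all-against-all cost on one processor.
    chains = 35_000
    seconds_per_comparison = 3
    comparisons = chains * chains / 2                 # ~6.1e8 pairwise comparisons
    total_seconds = comparisons * seconds_per_comparison
    print(total_seconds / 86_400)                     # ~21,000 processor-days
    print(total_seconds / (86_400 * 365))             # ~58 years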



Problems

  • Determination & Comparison of 3-D protein structures

  • Massively parallel computations are needed



Background

  • Looking for a more efficient way to analyze large data sets

  • Taking advantage of redundancy present in the data sets

  • KEY: a data preprocessing step and organization of the data being searched BEFORE it is passed to PARALLEL COMPUTERS



Other Issues to Consider

  • Algorithm should give optimal performance

  • Scale with the number of processors involved.



Optimization Procedures

  • Dynamic Programming

  • Monte-Carlo

  • Graph Theory

  • Combinatorial Search



What does CEPAR stand for?

CE PAR

Combinatorial Extension Algorithm

Parallel Mode



What is Combinatorial Extension Algorithm?

  • Method of automatically aligning pairs of structures

  • Compiles an alignment of a given pair of protein chains by considering the chains sectioned into all possible octapeptide fragments, as defined by the backbone α-carbons

  • Octapeptide pairs with a high distance-based similarity score are deemed "aligned fragment pairs" (AFPs) and are used in the next step of analysis

  • The CE algorithm then tries to join each AFP to a maximal number of other AFPs to create the longest possible alignment path through the two proteins under consideration (with allowance for gaps of up to 30 residues in either chain), stitching together a set of AFPs that covers contiguous regions



What is Combinatorial Extension Algorithm?

  • After possible paths through two proteins are determined, CE uses additional heuristics to try to improve the final alignment

  • The 20 best-scoring paths are compiled, and the proteins are compared directly based on the superimposition of the aligned residues

  • The path that yields the lowest root mean square deviation (RMSD) is retained as the "optimal path"

  • This path is then refined by dynamic programming on the structural alignment between the two structures, which tests all possible residue equivalences and the resulting RMSD from their superposition
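
The slides do not give CE's internal formulas; the superposition step they describe is conventionally done with a Kabsch-style least-squares fit, so the following numpy sketch of optimal-superposition RMSD is a generic illustration rather than CE's actual code:

    # Generic optimal-superposition RMSD (Kabsch algorithm), illustrating the kind
    # of calculation used to pick the lowest-RMSD path; not taken from CE itself.
    import numpy as np

    def superposed_rmsd(P, Q):
        """RMSD between two (N, 3) coordinate arrays after optimal superposition."""
        P = P - P.mean(axis=0)                    # center both coordinate sets
        Q = Q - Q.mean(axis=0)
        H = P.T @ Q                               # 3x3 covariance matrix
        U, S, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against improper rotation
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # optimal rotation of P onto Q
        diff = P @ R.T - Q
        return np.sqrt((diff ** 2).sum() / len(P))

    # Example: two slightly perturbed octapeptide-sized fragments.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(8, 3))
    B = A + rng.normal(scale=0.1, size=(8, 3))
    print(superposed_rmsd(A, B))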



Parallel Algorithm

  • CEPAR uses a coarse-grained parallel implementation involving a master/worker strategy suitable for a massively parallel computer architecture.

  • A parallel algorithm, as opposed to a traditional serial algorithm, is one which can be executed a piece at a time on many different processing devices, and then put back together again at the end to get the correct result.



What does CEPAR do?

  • Finds pairwise protein structure similarities

  • Pairwise 3D protein structure comparison

  • Aligns protein structure from Protein Data Bank

  • Matches protein structure-to-structure

  • Runs on a large number of processors



How does CEPAR work?

  • Optimizes the use of Combinatorial Extension algorithm for the pairwise alignment of polypeptide chains to manage comparative structural information

  • Builds a structurally representative set of protein chains and reveals structure similarities in the Protein Data Bank in a way that scales with this fast-growing source of data



How does CEPAR work?

  • Only one master processor is used; it was not advantageous to use more than one master processor because of communication issues

  • Each worker receives a work assignment from the master, compares the two entities contained in the assignment using the CE algorithm, returns the result of the comparison to the master, and is then ready to receive another assignment

  • Workers only need to communicate with the master processor, not with each other

  • The program is written in C++ and uses MPI for communication between the master and workers; a minimal Python sketch of this master/worker pattern follows below
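
CEPAR itself is C++ with MPI; purely to illustrate the master/worker pattern described above, here is a minimal mpi4py sketch in which the pairwise CE comparison is stubbed out with a placeholder function:

    # Minimal master/worker sketch with mpi4py (an illustration of the pattern,
    # not the authors' C++ CEPAR code). Run with: mpiexec -n 4 python mw.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    def compare(pair):
        """Stand-in for the CE pairwise structure comparison."""
        i, j = pair
        return (i, j, abs(i - j))                     # placeholder "score"

    if rank == 0:                                     # master: hand out assignments
        tasks = [(i, j) for i in range(10) for j in range(i + 1, 10)]
        status = MPI.Status()
        active = size - 1
        for w in range(1, size):                      # seed every worker once
            if tasks:
                comm.send(tasks.pop(), dest=w)
            else:
                comm.send(None, dest=w)               # more workers than tasks
                active -= 1
        while active > 0:
            result = comm.recv(source=MPI.ANY_SOURCE, status=status)
            worker = status.Get_source()              # `result` would be recorded here
            if tasks:
                comm.send(tasks.pop(), dest=worker)   # keep the worker busy
            else:
                comm.send(None, dest=worker)          # no work left: stop signal
                active -= 1
    else:                                             # worker: compare, report, repeat
        while True:
            task = comm.recv(source=0)
            if task is None:
                break
            comm.send(compare(task), dest=0)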



Computer

  • “Blue Horizon” – IBM SP parallel computer at the San Diego Supercomputer Center

  • 1,152 Power3+ processors, each running at 375 MHz

  • Sun Enterprise 10,000 server & Linux PC cluster

  • Software can work on any parallel machine or PC cluster with Message Passing Interface (MPI)



Assignments & Problem Formulation

  • An entity list of N entities, where each entity is a protein polypeptide chain characterized by its amino acid sequence and a set of 3D coordinates

  • Algorithm for pairwise comparison of entities (CE)

  • Select Representative Protein Structure

  • Order of Operations



Representation Criteria Notes

  • A similarity criterion is applied between representatives

  • Alignments not satisfying this criterion are not recorded

  • Output: a list of representatives, the entities represented by them, and detailed information on alignments satisfying either the representative or the similarity criterion

  • It is not vector quantization (so as to minimize compute time)

  • Representatives are randomly chosen instead of calculating the centroid of a cluster

  • The applied criteria are believed to adequately describe the structural space of the Protein Data Bank



Representation Criteria

Sequence Lengths of two entities: L1 & L2

Length difference threshold parameter: Lthr

Number of aligned positions: Lali

Alignment length threshold parameter: Athr



Representation Criteria

Gap threshold parameter: Gthr

Number of residues in gaps: Lgap

Final RMSD of the alignment: RMSD < Rthr, where Rthr is the RMSD threshold parameter
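
The slides name the threshold parameters but spell out only the RMSD inequality, so the check below is just one plausible reading of the criteria; the comparisons against Lthr, Athr, and Gthr, and all default values, are assumptions:

    # One plausible reading of the representation criteria (assumed inequalities
    # and placeholder thresholds; only RMSD < Rthr is stated in the slides).
    def is_represented(L1, L2, Lali, Lgap, rmsd,
                       Lthr=0.1, Athr=0.7, Gthr=0.3, Rthr=4.0):
        """Return True if an entity could be covered by a representative."""
        length_ok = abs(L1 - L2) <= Lthr * min(L1, L2)   # similar chain lengths
        align_ok = Lali >= Athr * min(L1, L2)            # enough aligned positions
        gap_ok = Lgap <= Gthr * Lali                     # not too many gap residues
        rmsd_ok = rmsd < Rthr                            # the stated criterion
        return length_ok and align_ok and gap_ok and rmsd_ok

    # Example: a 200- vs 210-residue pair aligned over 180 positions.
    print(is_represented(L1=200, L2=210, Lali=180, Lgap=20, rmsd=3.2))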



Order of Operation

  • Entity-first (2-step)

  • Family-first (2-step)

  • Family-first (1-step)



New problems uncovered….

  • Running CEPAR in one step produces limited scalability

  • WHY? At high processor counts, the number of idle workers and the time taken for communication operations grow as a result of load imbalance at the end of the run: most worker processors run out of tasks while only a few are still finishing their last assignment

  • Resource reservation systems on most public supercomputers reserve a block of processors, making it impossible to release them one by one



How to deal with Limited Scalability Issue

  • Idea (production mode): the number of processors assigned should be kept below a threshold number

  • Alternative: run in two steps instead of one

  • An early stopping condition causes the first of the two runs to abort when the accumulated average idle time of the workers exceeds a predefined amount (such as 20% of the total run time)

  • The remaining part of the calculation is then completed on a smaller number of processors



Two other problems….

  • Master processor congestion

  • Redundancy in assignments

  • How to avoid congestion….

    • Improve communications between processors
    • Implement advance buffering of assignments
    • Decrease the amount of disk I/O
    • Implement single-CPU optimization techniques



Keys to success

  • Detecting a match between a representative and an entity to avoid redundancy

  • It is important to sort representatives in decreasing order of their chance of being similar to the given entity

  • That chance is estimated by giving priority to representatives whose number of residues is within 10% of the current entity's, and by using similarity in amino acid content based on frequency profiles

  • The approach is approximate but provides performance gains over random or sequential choices of representatives



MPI Communication

  • At first, the efficiency of MPI communication appears to play an insignificant role in overall performance, since communication time is a small fraction of the overall CEPAR computation time. However, the time does add up, and the choice of MPI calls does matter.

  • Key: select the appropriate MPI send function for the hardware/software at hand.

  • Example: IBM's implementation of MPI's blocking send function MPI_Send() is not appropriate, because that implementation does not buffer the message for large message sizes.

  • MPI implementations that avoid buffering messages can cause deadlock in some cases.

  • In CEPAR no deadlocks occur; however, the master processor can be blocked while waiting for some worker processors to finish. Using the buffered send function MPI_Bsend() solves this problem.
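
In mpi4py terms (again only an illustration; CEPAR is C++), the buffered-send idea looks roughly like this, with an arbitrarily sized user buffer attached on the master:

    # Buffered sends with mpi4py, sketching the MPI_Bsend() fix described above.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    if comm.Get_rank() == 0:                          # master
        # Attach a user-space buffer; bsend() then copies the message and returns
        # immediately instead of blocking until the worker posts a matching recv.
        MPI.Attach_buffer(bytearray((1 << 20) + MPI.BSEND_OVERHEAD))
        for worker in range(1, comm.Get_size()):
            comm.bsend(("assignment", worker), dest=worker)
        for _ in range(1, comm.Get_size()):
            print(comm.recv(source=MPI.ANY_SOURCE))
        MPI.Detach_buffer()
    else:                                             # worker
        task = comm.recv(source=0)
        comm.send(("done", comm.Get_rank(), task), dest=0)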



Results

  • Family-First approach outperformed the Entity-first approach.

  • End-of-run load imbalance and processor allocation were addressed with the two-step approach

  • Careful Selection of MPI implementation

  • Overall CEPAR performance….



Advantages of CEPAR

  • Ensures optimal use of high-performance computing resources

  • Enables analysis of large amounts of data

  • Can be used on any distributed-memory platform

  • Can scale with the number of processors involved

  • Saves time & computational resources



Summary

  • Efficient use of resources depends on meticulous design of the algorithm and software, with performance and scalability given high priority

  • Organization of the data being fed to the processors

  • Optimization of algorithm for distribution of assignments



Proteomics



What is Proteomics?

  • The study of the proteome.

    • A proteome is “the set of proteins that can be expressed by the genetic material of an organism.”

    • In other words, the study of all proteins, the interactions between them, and “their role in physiological and pathophysiological functions”.

    • Hopefully will directly contribute to a full description of cellular function.



Challenges in Proteomics Research

  • Limited and variable sample material.

  • Sample Degradation.

  • Vast dynamic range.

    • For example, in human serum the concentration of albumin is 10 billion times greater than the concentration of the signaling protein interleukin-6.



Challenges in Proteomics Research (cont’d)

  • Plethora of post-translational modifications.

  • Nearly boundless tissue.

  • Developmental and temporal specificity.

  • Disease and drug perturbations.

  • “…these difficulties render any comprehensive proteomics project an inherently intimidating and often humbling exercise.”



Five Pillars of Proteomics Research

  • Mass spectrometry-based.

  • Proteome-wide biochemical arrays.

  • Systematic structural biology and imaging techniques.

  • Proteome informatics.

  • Clinical applications.



Mass spectrometry-based Proteomics

  • A primary driving force in proteomics.

  • Advancements allow the identification of smaller proteins in more complex mixtures.

  • Initially, research required separation of proteins by two-dimensional gel electrophoresis before using mass spectrometry.

    • Limited to the most abundant proteins.



Mass spectrometry-based Proteomics (cont’d)

  • Now, mass spectrometric analysis is used directly.

    • Advancements are increasing sensitivity, robustness and data handling.

    • Plenty of work to do…

      • Much higher throughput and sensitivity are needed for observing proteome dynamics and cellular response.

      • More complete sequence coverage.

      • Process and workflow refinement.

      • Automated protein identification.

      • Detection of post-translational modification.



Array-based Proteomics

  • Array of immobilized proteins on a support surface.

  • One of the most active areas in biotechnology.

    • Sensitive, high-throughput.

  • Wide range of applications.

    • Diagnostics.

    • Protein-protein interaction.

    • Protein expression profiling on a small or large scale.

    • Target identification and validation in the pharmaceutical industry.



Array-based Proteomics (cont’d)

  • Arrays give an abundance of data for a single experiment.

  • Data handling demands sophisticated software and data comparison analysis.

    • Some of the software used for DNA arrays is applicable, along with much of the hardware and detection systems.



Structural Proteomics

  • Systematically understanding the structural basis for protein interactions and function.

  • Full description of cell behavior requires structural information for all salient protein complexes and their organization at a cellular level.

  • Requires a wide scale of measurements…

    • From X-ray crystallography and nuclear magnetic resonance at the protein level…

    • …to electron microscopy of mega-complexes and electron tomography for high-resolution visualization of the entire cellular environment.

  • Modeling of dynamics and interaction through computer simulation.



Informatics

  • Proteomics research generates an enormous amount of data.

    • A “simple” experiment for a single microbe involving 90 biological samples could generate 18TB of proteomics data.

    • Sample documentation, rigorous process standards, and proper annotation are necessary.

    • Software development requires a collaborative and documented design process.

      • Data stored as XML with an agreed-upon schema.

      • HUPO (Human Proteome Organization) defines community standards for data representation: http://psidev.sourceforge.net/



Informatics (cont’d)

  • Considerable effort has been applied to interaction databases and systems biology software infrastructure.

  • A system for automating protein identification from mass spectral data is needed for generating databases.

    • Currently a manual and error-prone process.

  • Much was learned from DNA array analysis.



Informatics (cont’d)

  • Current equipment is far from optimal.

    • Manufacturers need time to build platforms tailored specifically for proteomics.

    • Mass spectrometry should improve significantly.

      • Large market for sensitive, affordable mass spectrometers.

    • Robotics for sample preparation.

  • Availability of large datasets will drive research.

    • Modeling cellular behavior.



Informatics (cont’d)

  • Open access for proteomics researchers is needed.

    • Academic institutions typically have the basic necessary tools, but suffer from:

      • Mismanagement of data.

      • Poor throughput.

      • Equipment is extremely expensive.

    • National proteome centers have been proposed to make expertise and equipment more available.



Informatics (cont’d)

  • Lessons learned from genome sequencing.

    • Raw data must be publicly accessible on-line to foster a sense of participation.

    • Agreements that mandate public accessibility and non-patenting of basic data

    • Large-scale efforts must be coordinated to avoid duplication.

      • Also, funding.



Clinical Proteomics

  • Proteomics impacts diagnostics as well as drug discovery.

    • Most drug targets are proteins.

  • Currently a variety of technological platforms in development.

    • Still undecided as to which methods will work best.

  • The robust and high-throughput nature of mass spectrometric instrumentation is eminently suited to clinical applications.



Clinical Proteomics (cont’d)

  • Protein- and antibody-based arrays with validated diagnostic readouts may also become amenable to the clinical setting.

  • Proteomics accelerates drug discovery.

    • Understanding biological networks within a cell will provide a basis for identifying suitable targets.



Computational Proteomics Examples

  • Protein Docking

    • In cellular biology, function is accomplished by proteins interacting with themselves and other molecular components.

    • Helps verify our understanding of the energetics of macromolecular interactions.

    • Characterization of the structures of protein-protein complexes.


RosettaDock



TreeDock

  • TreeDock uses a deterministic search

  • Can explore all orientations at a very fine resolution in a reasonable amount of time.



TreeDock

  • Searching for docking configurations…

    • Provide models of each molecule

    • Provide anchors for each molecule

      • Not necessary for small molecules, all atoms will be tried



TreeDock

  • One molecule has a fixed position; the other is movable

  • The movable molecule is translated and rotated while maintaining contact between the anchors

  • All positions are tried within a specified resolution



Synchrotron IR Analysis of Murine Abdominal Aortic Aneurysm



The Problem

  • Abdominal aortic aneurysms (AAAs) occur in 5-7% of people over age 60 in the US

  • Some individuals have aorta thickening but never have an AAA

  • Chemical precursors to AAA are unknown

  • Current drugs treat the symptoms not the cause



Purpose

  • Analysis of large 2D FTIR microspectroscopic data sets for anomalies to …

  • Determine why infusion of Angiotensin II (AngII) into Apolipoprotein E (apoE) -/- knockout mice causes aorta thickening in some mice and aneurysm in other mice…

  • Identify chemical precursors to AAA and ultimately…

  • Save Lives!



Data Analysis Issues with 2D FTIR Microspectroscopy

  • Spectral features are a blend of what is in each sample

  • Datasets are very continuous in nature (Principal Component Analysis (PCA) is often not sufficient to identify chemically similar clusters)

  • Subclusters within each PC may be overlooked

  • Large datasets (10s of GBs) require substantial computational resources for typical statistical analysis



Large Dataset Example



Scores Analysis with Quantile Quantile Plots(SAQQ) – The Concept

  • Principal Component Analysis (PCA)

  • Quantile-Quantile (QQ) Plotting of a single PC

  • Linear regression to find “normal” distributions

  • Average the original data to find multidimensional centers

  • Calculate loadings with inverse principal axis transformation



SAQQ – The Concept

  • Calculate QBEAST distances to all points from each cluster center

  • Reorganize distances into the original map configuration

  • Create “digitally stained” images based upon distance (highlight spectral deviations from the normal distribution)



Principal Component Analysis

  • Linear dimension-reduction technique

  • Points in multidimensional space are projected onto a space of fewer dimensions

  • Creates a new coordinate system based upon variance

  • The first axis (PC) has the greatest variance of any projection, the second has the second greatest orthogonal variance, and so on…
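
A compact numpy sketch of PCA by singular value decomposition on mean-centered data (a generic illustration, not the authors' pipeline):

    # PCA via SVD on mean-centered data (generic illustration).
    import numpy as np

    def pca_scores(X, n_components=2):
        """Project rows of X (samples x variables) onto the top principal axes."""
        Xc = X - X.mean(axis=0)                   # center each variable
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        scores = Xc @ Vt[:n_components].T         # coordinates on the new PC axes
        explained = (S ** 2) / (S ** 2).sum()     # fraction of variance per PC
        return scores, explained[:n_components]

    # Example: 500 synthetic "spectra" with 50 points each.
    X = np.random.default_rng(1).normal(size=(500, 50))
    scores, var = pca_scores(X)
    print(scores.shape, var)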



SAQQ – The Quantile-Quantile Plot

  • Plot the order statistics of the scores against quantiles of the normal cumulative distribution function
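
A small sketch of building such a plot for the scores along one PC; scipy's probplot is used as a convenience and returns the ordered scores paired with theoretical normal quantiles:

    # Quantile-quantile data for one PC's scores against a normal distribution.
    # Requires: pip install scipy
    import numpy as np
    from scipy import stats

    pc_scores = np.random.default_rng(2).normal(size=1000)    # stand-in PC scores
    (theoretical_q, ordered_scores), (slope, intercept, r) = stats.probplot(
        pc_scores, dist="norm")
    # theoretical_q: normal quantiles; ordered_scores: sorted PC scores.
    print(r)                                                   # ~1 for normal data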



SAQQ – Linear Regression of the QQ plot

  • Take the first (next) 10% of the data

  • Calculate r2 and compare to 0.9

  • If r2 > 0.9 add the next point and go to step 2

  • If r2 < 0.9 consider data a cluster and go to step 1
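
A possible implementation of those steps (the 10% seed and the r² = 0.9 cutoff come from the slide; the rest is a straightforward reading of the procedure):

    # Segment a QQ plot into "normal" clusters by growing a linear fit until
    # r^2 drops below 0.9, then restarting with the next 10% of the data.
    import numpy as np
    from scipy import stats

    def saqq_segments(theoretical_q, ordered_scores, r2_cut=0.9, seed_frac=0.1):
        n = len(ordered_scores)
        seed = max(3, int(seed_frac * n))
        segments, start = [], 0
        while start < n:
            end = min(start + seed, n)            # take the first/next 10%
            while end < n:
                fit = stats.linregress(theoretical_q[start:end + 1],
                                       ordered_scores[start:end + 1])
                if fit.rvalue ** 2 >= r2_cut:
                    end += 1                      # still linear: add the next point
                else:
                    break                         # linearity broke: close the cluster
            segments.append((start, end))         # indices [start, end) form a cluster
            start = end
        return segments

    scores = np.sort(np.random.default_rng(3).normal(size=200))
    q = stats.norm.ppf((np.arange(1, 201) - 0.5) / 200)   # normal plotting positions
    print(saqq_segments(q, scores))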



SAQQ – The Quantile-Quantile Plot

  • SAQQ must be applied to all PCs



SAQQ Continued

  • Average the original data to find multidimensional centers

  • Calculate loadings with inverse principal axis transformation

  • Calculate QBEAST distances to all points from each cluster center



SAQQ – QBEAST Distances

  • QBEAST takes into account skew as well as dispersion

  • QBEAST is faster than Mahalanobis as the number of samples n approaches the number of dimensions d

  • QQ plot parameterizes non-normal distributions

(Figures: QBEAST, Mahalanobis, and Euclidean distances compared)



SAQQ Continued

  • Reorganize distances into the original map configuration

  • Create “stained” images based upon distance (highlight spectral deviations from the normal distribution)



Cluster Analysis Using SAQQ



6.25 x 6.25 μm pixel size (113 pixels x 102 pixels x 410 spectral data points)



Separation of two Identical Gaussian Clusters

  • 3 SDs (cluster displacement)

  • 3 SDs (size increase)

  • 4 SDs (size decrease)



The Problem – FTIR Microspectroscopic Data Overload

  • Approximately 1 GB of raw data per hour collected

  • 100s of GB of data waiting to be analyzed

  • Massive array size (250,000 x 1000 double-precision)

  • Massive file sizes (~ 1 GB compressed binary)
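
For scale, the raw memory footprint of that array alone:

    # Rough in-memory size of a 250,000 x 1000 double-precision array.
    rows, cols, bytes_per_double = 250_000, 1_000, 8
    print(rows * cols * bytes_per_double / 1e9, "GB")     # 2.0 GB uncompressed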



Specific Aims

Identify precursors to AAA by using SAQQ to rapidly reduce data obtained from FTIR microspectroscopy, producing digitally stained images corresponding to the identified clusters.

Identify overlapping clusters of collagen I, collagen III, elastin, macrophages, and necrotic debris



SAQQ analysis of PC1 of x-bk-1



SAQQ analysis of PC2 of x-bk-1



Proposed Research on Abdominal Aortic Aneurysm

  • Process data with SAQQ

    • Understand vessel wall thickening

    • Identify biochemical pathways to aneurysm

  • Develop iterative SAQQ

    • Apply to reduce 60 “stained” images down to 1

  • Develop better linear fitting algorithms



Conclusions

  • SAQQ is a useful method as a digital staining technique

  • SAQQ “stains” based upon chemical significance

  • SAQQ allows progress in determining the chemical process behind AAA formation



References

  • BLAST - http://www.ncbi.nlm.nih.gov/BLAST/

  • CEPAR - http://www.sdsc.edu/ ; http://www.sdsc.edu/pb/papers/cepar.pdf

  • Protein Data Bank - http://www.rcsb.org/pdb

  • Bioinformatics Fields - http://www.bioplanet.com/bioinformatics_faq.html




References (cont'd)


  • http://www.answers.com/proteome

  • From Genomics to Proteomics. M. Tyers, M. Mann. Nature 2003 Mar;422(6928);193-7.

  • http://www.chem.agilent.com/cag/feature/02-04/Feb04_Serum.htm

  • http://www.functionalgenomics.org.uk/sections/resources/protein_arrays.htm

  • http://doegenomestolife.org/research/facilities/fac3table1.shtml

  • Treedock: A Tool for Protein Docking Based on Minimizing van der Waals Energies. A. Fahmy, G. Wagner. JACS 2002; Vol 124, No. 7

  • Protein-Protein Docking with Simultaneous Optimization of Rigid-body Displacement and Side-chain Conformations. J. Gray, S. Moughon, C. Wang, O. Schueler-Furman, B. Kuhlman, C. Rohl, D. Baker. JMB 2003; Vol 331;281-299

