
Bioinformatics Research: Developing Effective Algorithms for Genotype-Phenotype Correlation

This overview explores the use of bioinformatics and statistical learning methods to understand correlations between genotype and phenotype, with applications in protein function, drug therapy, and metabolic pathways.


Presentation Transcript


  1. Bioinformatics Research Overview (Li Liao) Develop new algorithms and (statistical) learning methods that are: • Capable of incorporating domain knowledge • Effective, expressive, interpretable

  2. Motivations • Understanding correlations between genotype and phenotype • Predicting genotype <=> phenotype • Phenotypes: • Protein function • Drug/therapy response • Drug-drug interactions for expression • Drug mechanism • Interacting pathways of metabolism

  3. Projects • Homology detection, protein family classification (funded by a DuPont S&E award) • Support vector machines • Hidden Markov models • Graph-theoretic methods • Probabilistic modeling for biosequences (funded by NIH) • HMMs, and beyond • Motif finding • Secondary structure • Comparative genomics • Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant) • Evolution of metabolic pathways • Tree and graph comparisons

  4. Detect remote homologues Attributes to be looked at: • Sequence similarity, aggregate statistics (e.g., protein families), patterns/motifs, and further attributes (e.g., presence in a phylogenetic tree). How can domain-specific knowledge be incorporated into the model so that a classifier becomes more accurate? Results: • Quasi-consensus based comparison of profile HMM for protein sequences (submitted to Bioinformatics) • Using extended phylogenetic profiles and support vector machines for protein family classification (SNPD 04) • Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)

  5. Support Vector Machines

  6. Data: phylogenetic profiles • How to account for correlations among profile components? • Profile extension (Narra & Liao, SNPD 04) [Figure: example binary profiles x, y, and z compared by Hamming distance and by tree-based distance]
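
As a point of reference for the Hamming distance named on this slide, here is a minimal Python sketch. The three profiles are hypothetical values chosen for illustration (1 = homolog present in a genome, 0 = absent); they are not taken from the slide's figure, and the tree-based distance of the profile extension is not reproduced here.

```python
def hamming_distance(p, q):
    """Count the positions at which two equal-length binary profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(p, q))


# Hypothetical profiles over five genomes (illustrative values only).
x = [0, 1, 1, 1, 1]
y = [1, 1, 1, 1, 1]
z = [1, 1, 1, 1, 0]

print(hamming_distance(x, y))  # 1
print(hamming_distance(x, z))  # 2
```

Hamming distance treats every genome position as independent, which is the limitation the slide's question about correlations among profile components points at.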

  7. Quasi-consensus based comparison of HMMs • From MSA to profile HMMs using existing packages (SAM-T99 or HMMER) • Generation of a quasi-consensus sequence from the model • Alignment of the consensus sequence of one model with the other model • Extraction of the two alignments, one in each direction [Figure: seed alignments Seed 1 and Seed 2, models M1 and M2, Consensus 1 and Consensus 2, scores S(c2|M1) and S(c1|M2), and the resulting alignments Aln12 and Aln21]
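
The slide does not spell out how the quasi-consensus sequence is generated. As a rough stand-in, the sketch below assumes a profile is available as a list of per-match-state emission probabilities and takes the most probable residue at each state. The function name `consensus_from_profile` and the toy probabilities are hypothetical, and this is not the authors' procedure or the SAM-T99/HMMER API.

```python
def consensus_from_profile(match_emissions):
    """Return the most probable residue at each match state as a string.

    match_emissions: list of dicts mapping residue -> emission probability,
    one dict per match state (an assumed, simplified profile representation).
    """
    return "".join(max(column, key=column.get) for column in match_emissions)


# Toy three-state profile with made-up probabilities.
toy_profile = [
    {"V": 0.7, "A": 0.2, "F": 0.1},
    {"G": 0.6, "K": 0.4},
    {"A": 0.9, "S": 0.1},
]

print(consensus_from_profile(toy_profile))  # VGA
```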

  8. Sequence Models (HMMs and beyond) Motivations: What is responsible for the function? • Patterns/motifs • Secondary structure To capture long range correlations of bio sequences • Transporter proteins • RNA secondary structure Methods: generative versus discriminative • Linear dependent processes • Stochastic grammars • Model equivalence

  9. TMMOD: An improved hidden Markov model for predicting transmembrane topology (to appear in IEEE ICTAI04)
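
To illustrate how an HMM assigns a topology-like labeling to a sequence, here is a generic Viterbi decoder on a toy two-state model. This is not TMMOD: the states ("M" for membrane, "L" for loop), the reduced alphabet ("h" hydrophobic, "p" polar), and all probabilities are assumptions made up for the example.

```python
import math


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for seq under a discrete HMM."""
    # Log space avoids numerical underflow on long sequences.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    back = []
    for symbol in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this position.
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            score[s] = (prev[best] + math.log(trans_p[best][s])
                        + math.log(emit_p[s][symbol]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model with invented parameters.
states = ["M", "L"]
start_p = {"M": 0.5, "L": 0.5}
trans_p = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
emit_p = {"M": {"h": 0.8, "p": 0.2}, "L": {"h": 0.3, "p": 0.7}}

print("".join(viterbi("hhhhpppp", states, start_p, trans_p, emit_p)))  # MMMMLLLL
```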

  10. Genomics study of enterobacterial BT agents (funded by the US Army via the Center for Biological Defense, USF) Goals: • Identification of genes and sequence tags as targets for novel diagnosis and therapy • BT agents: Yersinia pestis, Salmonella, Escherichia coli O157:H7 Methods: • Various bioinformatics tools and databases

  11. Comparative Genomics Motivation: • Evolution of metabolic pathways • Gene functions • De novo (alternative pathways) • Genetic engineering • Drug discovery Methods: • Put data into a context: knowledge/data representation • Trees, graphs, etc. • Learning models/methods

  12. Profiling: attribute-value pairs

            P1   P2   ...   Pn
      O1     1    0   ...    1
      O2     0    1   ...    0
      ...
      Om     1    0   ...    1
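
A minimal sketch of building such a 0/1 profile matrix from attribute observations follows. The slide does not say what the objects O and attributes P stand for, so the names and data below are placeholders, and `build_profiles` is a hypothetical helper.

```python
def build_profiles(observations, attributes):
    """Map each object to a 0/1 vector over a fixed attribute order.

    observations: dict of object name -> set of attributes present in it.
    attributes: list fixing the column order of the profile.
    """
    return {
        obj: [1 if attr in present else 0 for attr in attributes]
        for obj, present in observations.items()
    }


# Placeholder data: which attributes each object has.
attributes = ["P1", "P2", "P3"]
observations = {"O1": {"P1", "P3"}, "O2": {"P2"}}

profiles = build_profiles(observations, attributes)
print(profiles["O1"])  # [1, 0, 1]
print(profiles["O2"])  # [0, 1, 0]
```

Fixing the attribute order up front keeps every row comparable position by position, which distance measures over profiles rely on.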

  13. What we found: • An informative way to compare genomes • The majority of pathways (or rather their enzyme components) evolve in congruence with species

  14. What we do next: • Database and search engine • Off-line self-consistent iteration • Pathways in a network • Graph comparisons • Identify key components of networks • Small world topology • Cross-level interactions with regulatory networks
