1 / 56

Computing Patterns in Biology

Systems biology - figure out circuits. of regulation, predict outcome ... Very simple to find these patterns - can even use the

Kelvin_Ajay
Download Presentation

Computing Patterns in Biology

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Computing Patterns in Biology Stuart M. Brown New York University School of Medicine

  2. Why Compute Biological Patterns? • Because we can • (computer scientists love to find “interesting” problems) • patterns are beautiful • Its practical - helps with genecloning experiments, predict functions of new proteins • Systems biology - figure out circuits of regulation, predict outcome of changes, design new biological systems

  3. Overview • Restriction sites • Finding genes in DNA sequences • Regulatory sites in DNA • Protein signals (transport and processing) • Protein functional Motifs • Protein families • Protein 3-D structure

  4. Restriction Sites • Bacteria make restriction enzymes that cut DNA at specific sequences (4-8 base patterns) • Very simple to find these patterns - can even use the “Find” function of your web browser or word processor • Open any page of text and look for “CAT” • you now have a restriction site search program!

  5. NEBcutter2 http://tools.neb.com/NEBcutter2/

  6. Translate (in all 6 reading frames) and look for similarity to known protein sequences Look for long Open Reading Frames (ORFs) between start and stop codons (start=ATG, stop=TAA, TAG, TGA) Look for known gene markers TAATAA box, intron splice sites, etc. Statistical methods (codon preference) Finding Genes in Genomic DNA

  7. GCCACATGTAGATAATTGAAACTGGATCCTCATCCCTCGCCTTGTACAAAAATCAACTCCAGATGGATCTAAGATTTAAATCTAACACCTGAAACCATAAAAATTCTAGGAGATAACACTGGCAAAGCTATTCTAGACATTGGCTTAGGCAAAGAGTTCGTGACCAAGAACCCAAAAGCAAATGCAACAAAAACAAAAATAAATAGGTGGGACCTGATTAAACTGAAAAGCCTCTGCACAGCAAAAGAAATAATCAGCAGAGTAAACAGACAACCCACAGAATGAGAGAAAATATTTGCAAACCATGCATCTGATGACAAAGGACTAATATCCAGAATCTACAAGGAACTCAAACAAATCAGCAAGAAAAAAATAACCCCATCAAAAAGTGGGCAAAGGAATGAATAGACAATTCTCAAAATATACAAATGGCCAATAAACATACGAAAAACTGTTCAACATCACTAATTATCAGGGAAATGCAAATTAAAACCACAATGAGATGCCACCTTACTCCTGCAAGAATGGCCATAATAAAAAAAAATCAAAAAAGAATAAATGTTGGTGTGAATGTGGTGAAAAGAGAACACTTTGACACTGCTGGTGGGAATGGAAACTAGTACAACCACTGTGGAAAACAGTACCGAGATTTCTTAAAGAACTACAAGTAGAACTACCATTTGATCCAGCAATCCCACTACTGGGTATCTACCCAGAGGAAAAGAAGTCATTATTTGAAAAAGACACTTGTACATACATGTTTATAGCAGCACAATTTGCAATTGCAAAGATATGGAACCAGTCTAAATGCCCATCAACCAACAAATGGATAAAGAAAATATGGTATATATACACCATGGAACACTACTCAGCCATAAAAAGGAACAAAATAATGGCAACTCACAGATGGAGTTGGAGACCACTATTCTAAGTGAAATAACTCAGGAATGGAAAACCAAATATTGTATGTTCTCACTTATAAGTGGGAGCTAAGCTATGAGGACAAAAGGCATAAGAATTATACTATGGACTTTGGGGACTCGGGGGAAAGGGTGGGAGGGGGATGAGGGACAAAAGACTACACATTGGGTGCAGTGTACACTGCTGAGGTGATGGGTGCACCAAAATCTCAGAAATTACCACTAAAGAACTTATCCATGTAACTAAAAACCACCTCTACCCAAATAATTTTGAAATAAAAAATAAAAATATTTTAAAAAGAACTCTTTAAAATAAATAATGAAAAGCACCAACAGACTTATGAACAGGCAATAGAAAAAATGAGAAATAGAAAGGAATACAAATAAAAGTACAGAAAAAAAATATGGCAAGTTATTCAACCAAACTGGTAATTTGAAATCCAGATTGAAATAATGCAAAAAAAAGGCAATTTCTGGCACCATGGCAGACCAGGTACCTGGATGATCTGTTGCTGAAAACAACTGAAAATGCTGGTTAAAATATATTAACACATTCTTGAATACAGTCATGGCCAAAGGAAGTCACATGACTAAGCCCACAGTCAAGGAGTGAGAAAGTATTCTCTACCTACCATGAGGCCAGGGCAAGGGTGTGCACTTTTTTTTTTCTTCTGTTCATTGAATACAGTCACTGTGTATTTTACATACTTTCATTTAGTCTTATGACAATCCTATGAAACAAGTACTTTTAAAAAAATTGAGATAACAGTTGCATACCGTGAAATTCATCCATTTAAAGTGAGCAATTCACAGGTGCAGCTAGCTCAGTCAGCAGAGCATAAGACTCTTAAAGTGAACAATTCAGTGCTTTTTAGTATATTCACAGAGTTGTGCAACCATCACCACTATCTAATTGGTCTTAGTCTGTTTGGGCTGCCATAACAAAATACCACAAACTGGATAGCTCATAAACAACAGGCATTTATTGCTCACAGTTCTAGAGGCTGGAAGTGCAAGATTAAGATGCCAGCAGATTCTGTGTCTGCTGAGGGCCTGTTCCTCATAGAAGGTGCCCTCTTGCTGAATTCTCACATGGTGGAAGGGGGAAAACAAGCTTGCATTGCAAAGAGGTGGGCCTCTTTAATCCCAAAGGCCCCACCTCTAAAAGGCCCCACTTCTGAATACCATTACATTGAGAATTAAGTTTCAACATAGGAATTTGGGGGAACACAAATATCCAGACTGTAGCATAATTCCAGAACGGATTCATGCCACATGTAGATAATTGAAACTGGATCCTCATCCCTCGCCTTGTACAAAAATCAACTCCAGATGGATCTAAGATTTAAATCTAACACCTGAAACCATAAAAATTCTAGGAGATAACACTGGCAAAGCTATTCTAGACATTGGCTTAGGCAAAGAGTTCGTGACCAAGAACCCAAAAGCAAATGCAACAAAAACAAAAATAAATAGGTGGGACCTGATTAAACTGAAAAGCCTCTGCACAGCAAAAGAAATAATCAGCAGAGTAAACAGACAACCCACAGAATGAGAGAAAATATTTGCAAACCATGCATCTGATGACAAAGGACTAATATCCAGAATCTACAAGGAACTCAAACAAATCAGCAAGAAAAAAATAACCCCATCAAAAAGTGGGCAAAGGAATGAATAGACAATTCTCAAAATATACAAATGGCCAATAAACATACGAAAAACTGTTCAACATCACTAATTATCAGGGAAATGCAAATTAAAACCACAATGAGATGCCACCTTACTCCTGCAAGAATGGCCATAATAAAAAAAAATCAAAAAAGAATAAATGTTGGTGTGAATGTGGTGAAAAGAGAACACTTTGACACTGCTGGTGGGAATGGAAACTAGTACAACCACTGTGGAAAACAGTACCGAGATTTCTTAAAGAACTACAAGTAGAACTACCATTTGATCCAGCAATCCCACTACTGGGTATCTACCCAGAGGAAAAGAAGTCATTATTTGAAAAAGACACTTGTACATACATGTTTATAGCAGCACAATTTGCAATTGCAAAGATATGGAACCAGTCTAAATGCCCATCAACCAACAAATGGATAAAGAAAATATGGTATATATACACCATGGAACACTACTCAGCCATAAAAAGGAACAAAATAATGGCAACTCACAGATGGAGTTGGAGACCACTATTCTAAGTGAAATAACTCAGGAATGGAAAACCAAATATTGTATGTTCTCACTTATAAGTGGGAGCTAAGCTATGAGGACAAAAGGCATAAGAATTATACTATGGACTTTGGGGACTCGGGGGAAAGGGTGGGAGGGGGATGAGGGACAAAAGACTACACATTGGGTGCAGTGTACACTGCTGAGGTGATGGGTGCACCAAAATCTCAGAAATTACCACTAAAGAACTTATCCATGTAACTAAAAACCACCTCTACCCAAATAATTTTGAAATAAAAAATAAAAATATTTTAAAAAGAACTCTTTAAAATAAATAATGAAAAGCACCAACAGACTTATGAACAGGCAATAGAAAAAATGAGAAATAGAAAGGAATACAAATAAAAGTACAGAAAAAAAATATGGCAAGTTATTCAACCAAACTGGTAATTTGAAATCCAGATTGAAATAATGCAAAAAAAAGGCAATTTCTGGCACCATGGCAGACCAGGTACCTGGATGATCTGTTGCTGAAAACAACTGAAAATGCTGGTTAAAATATATTAACACATTCTTGAATACAGTCATGGCCAAAGGAAGTCACATGACTAAGCCCACAGTCAAGGAGTGAGAAAGTATTCTCTACCTACCATGAGGCCAGGGCAAGGGTGTGCACTTTTTTTTTTCTTCTGTTCATTGAATACAGTCACTGTGTATTTTACATACTTTCATTTAGTCTTATGACAATCCTATGAAACAAGTACTTTTAAAAAAATTGAGATAACAGTTGCATACCGTGAAATTCATCCATTTAAAGTGAGCAATTCACAGGTGCAGCTAGCTCAGTCAGCAGAGCATAAGACTCTTAAAGTGAACAATTCAGTGCTTTTTAGTATATTCACAGAGTTGTGCAACCATCACCACTATCTAATTGGTCTTAGTCTGTTTGGGCTGCCATAACAAAATACCACAAACTGGATAGCTCATAAACAACAGGCATTTATTGCTCACAGTTCTAGAGGCTGGAAGTGCAAGATTAAGATGCCAGCAGATTCTGTGTCTGCTGAGGGCCTGTTCCTCATAGAAGGTGCCCTCTTGCTGAATTCTCACATGGTGGAAGGGGGAAAACAAGCTTGCATTGCAAAGAGGTGGGCCTCTTTAATCCCAAAGGCCCCACCTCTAAAAGGCCCCACTTCTGAATACCATTACATTGAGAATTAAGTTTCAACATAGGAATTTGGGGGAACACAAATATCCAGACTGTAGCATAATTCCAGAACGGATTCAT

  8. Intron/Exon structure • Gene finding programs work well in bacteria • None of the gene prediction programs do an adequate job predicting intron/exon boundaries • The only reasonable gene models are based on alignment of cDNAs to genome sequence • Perhaps 50% of all human genes still do not have a correct coding sequence defined (transcription start, intron splice sites)

  9. GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN http://compbio.ornl.gov/grailexp ORFfinder: NCBI http://www.ncbi.nlm.nih.gov/gorf/gorf.html DNA translation:Univ. of Minnesota Med. School http://alces.med.umn.edu/webtrans.html GenLang http://cbil.humgen.upenn.edu/~sdong/genlang.html BCM GeneFinder:Baylor College of Medicine, Houston, TX http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html Gene Finding on the Web

  10. Genomic Sequence • Once each gene is located on the chromosome, it becomes possible to get upstream genomic sequence • This is where transcription factor (TF) binding sites are located • promoters and enhancers • Search for known TF sites, and discover new ones (among co-regulated genes)

  11. Phage CRO repressor bound to DNA Andrew Coulson & Roger Sayles with RasMol, Univ. of Edinburgh 1993

  12. Databases of promoters, enhancers, etc. TransFacthe Transcription Factor database 4342 entries w/ known protein binding and transcriptional regulatory functions Maintained by Gesellschaft for Biotechnologische Forschung mbH (Braunschweig, Germany) The Eukaryotic Promoter Database (EPD) Bucher & Trifonov. (1986) NAR 14: 10009-26 1314 entries taken directly from scientific literature Maintained by ISREC (Lausanne, Switzerland) as a subset of the EMBL Many DNA Regulatory Sequences are Known

  13. TF Binding sites lack information • Most TF binding sites are determined by just a few base pairs (typically 6) • Sequence is variable (consensus) • This is not enough information for proteins to locate unique promoters for each gene • TF's bind cooperatively and combinatorially • the key is in the location in relation to each other and to the transcription units of genes • Can use information from alignment of related genes

  14. Sequence Logos

  15. GCG: FINDPATTERNS with database file: TFSITES.DAT Macintosh (Signal Scan), PC/UNIX (Promoter Scan) Dr. Dan S. Prestridge, Univ. of Minnesota Tools to find TF sites in DNA

  16. Websites for Promoter finding Promoter Scan: NIH Bioinformatics (BIMAS) http://bimas.dcrt.nih.gov/molbio/proscan/ Promoter Scan II: Univ. of Minnesota & Axyx Pharmaceuticals http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm Signal Scan: NIH Bioinformatics (BIMAS) http://bimas.dcrt.nih.gov:80/molbio/signal/index.html Transcription Element Search (TESS): Center for Bioinformatics, Univ. of Pennsylvania http://www.cbil.upenn.edu/tess/ Search TransFac at GBF with MatInspector, PatSearch, and FunSiteP http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html TargetFinder: Telethon Inst.of Genetics and Medicine, Milan, Italy http://hercules.tigem.it/TargetFinder.html

  17. Protein Sequence

  18. Molecular properties (pH, mol. wt. isoelectric point, hydrophobicity) Motifs (signal peptide, coiled-coil, trans-membrane, etc.) Protein Families Secondary Structure (helix vs. beta-sheet) 3-D prediction, Threading Protein Sequence Analysis

  19. Proteins are linear polymers of 20 amino acids Chemical properties of the protein are determined by its amino acids Molecular wt., pH, isoelectric point are simple calculations from amino acid composition Hydrophobicity is a property of groups of amino acids - best examined as a graph Chemical Properties of Proteins

  20. Hydrophobicity Plot P53_HUMAN (P04637) human cellular tumor antigen p53 Kyte-Doolittle hydrophilicty, window=19

  21. Web Sites for Simple Protein Analysis • Protein Hydrophobicity Server: Bioinformatics Unit, Weizmann Institute of Science , Israel http://bioinformatics.weizmann.ac.il/hydroph/ • SAPS - statistical analysis of protein sequences: composition, charge, hydrophobic and transmembrane segments, cysteine spacings, repeats and periodicity http://www.isrec.isb-sib.ch/software/SAPS_form.html

  22. EMBOSS Protein Analysis Toolkit • plotorf:simple open reading frame finder • Garnier: predicts 2ndary structure • Charge: plot of protein charge • Octanol: hydrophobicity plot • Pepwindow: hydorpathy plot • pepinfo:plotsprotein secondary structure and hydrophobicity in parallel panels • tmap: predict transmembrane regions • Topo: draws a map of transmembrane protein • Pepwheel: shows protein sequence as helical wheel • Pepcoil: predicts coiled-coil domains • Helixturnhelix: predicts helix-turn-helix domains

  23. Common structural motifs Membrane spanning Signal peptide Coiled coil Helix-turn-helix Simple Motifs

  24. Common structural motifs Membrane spanning (GCG= TransMem) Signal peptide (GCG= SPScan) Coiled coil (GCG= CoilScan) Helix-turn-helix (GCG = HTHScan) “Super-secondary” Structure

  25. Predict Protein server: : EMBL Heidelberg http://www.embl-heidelberg.de/predictprotein/ SOSUI: Tokyo Univ. of Ag. & Tech., Japan http://www.tuat.ac.jp/~mitaku/adv_sosui/submit.html TMpred (transmembrane prediction): ISREC (Swiss Institute for Experimental Cancer Research) http://www.isrec.isb-sib.ch/software/TMPRED_form.html COILS (coiled coil prediction): ISREC http://www.isrec.isb-sib.ch/software/COILS_form.html SignalP (signal peptides): Tech. Univ. of Denmark http://www.cbs.dtu.dk/services/SignalP/ Web servers that predict these structures

  26. Protein Domains/Motifs • Proteins are built out of functional units know as domains (or motifs) • These domains have conserved sequences • Often much more similar than their respective proteins • Exon splicing theory (W. Gilbert) • Exons correspond to folding domains which in turn serve as functional units • Unrelated proteins may share a single similar exon (i.e.. ATPase or DNA binding function)

  27. Motifs are built from Multiple Alignmennts

  28. Protein Motif Databases • Known protein motifs have been collected in databases • Best database is PROSITE • The Dictionary of Protein Sites and Patterns • maintained by Amos Bairoch, at the Univ. of Geneva, Switzerland • contains a comprehensive list of documented protein domains constructed by expert molecular biologists • Alignments and patterns built by hand!

  29. PROSITE is based on Patterns Each domain is defined by a simple pattern • Patterns can have alternate amino acids in each position and defined spaces, but no gaps • Pattern searching is by exact matching, so any new variant will not be found (can allow mismatches, but this weakens the algorithm)

  30. Tools for PROSITE searches Free Mac program: MacPattern • ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx Free PC program (DOS): PATMAT • ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos GCG provides the program MOTIFS • Also in virtually all commercial programs: MacVector, OMIGA, LaserGene, etc.

  31. Websites for PROSITE Searches ScanProsite at ExPASy: Univ. of Geneva • http://expasy.hcuge.ch/sprot/scnpsit1.html Network Protein Sequence Analysis: Institut de Biologie et Chimie des Protéines, Lyon, France • http://pbil.ibcp.fr/NPSA/npsa_prosite.html PPSRCH:EBI, Cambridge, UK • http://www2.ebi.ac.uk/ppsearch/

  32. Profiles • Profiles are tables of amino acid frequencies at each position in a motif • They are built from multiple alignments • PROSITE entries also contain profiles built from an alignment of proteins that match the pattern • Profile searching is more sensitive than pattern searching - uses an alignment algorithm, allows gaps

  33. EMBOSS ProfileSearch • EMBOSS has a set of profile analysis tools. • Start with a multiple alignment • fuzzpro: protein pattern search • preg: regular expression search of a protein sequence • prophecy: create a profile • profit:scans a database with your profile • prophetmakes pairwise alignments between a single sequence and a profile • patmatmotifs: scan a query protein with the PROSITE motif database

  34. Websites for Profile searching • PROSITE ProfileScan: ExPASy, Geneva • http://www.isrec.isb-sib.ch/software/PFSCAN_form.html • BLOCKS (builds profiles from PROSITE entries and adds all matching sequences in SwissProt): Fred Hutchinson Cancer Research Center, Seattle, Washington, USA • http://www.blocks.fhcrc.org/blocks_search.html • PRINTS(profiles built from automatic alignments of OWL non-redundant protein databases): http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi

  35. More Protein Motif Databases • PFAM(1344 protein familyHMM profiles built by hand):Washington Univ., St. Louis • http://pfam.wustl.edu/hmmsearch.shtml • ProDom (profiles built from PSI-BLAST automatic multiple alignments of the SwissProt database): INRA, Toulouse, France • http://www.toulouse.inra.fr/prodom/doc/blast_form.html [This is my favorite protein database - nicely colored results]

  36. Sample ProDom Output

  37. Hidden Markov Models • Hidden Markov Models (HMMs) are a more sophisticated form of profile analysis. • Rather than build a table of amino acid frequencies at each position, they model the transition from one amino acid to the next. • Pfam is built with HMMs. • EMBOSS HMM tools (HMMER): ehmmBuild ehmmCalibrate ehmmSearch ehmmPfam ehmmAlign ehmmEmit ehmmFetch ehmmIndex

  38. Discovery of new Motifs • All of the tools discussed so far rely on a database of existing domains/motifs • How to discover new motifs • Start with a set of related proteins • Make a multiple alignment • Build a pattern or profile • You will need access to a fairly powerful UNIX computer to search databases with custom built profiles or HMMs.

  39. Patterns in Unaligned Sequences • Sometimes sequences may share just a small common region • transcription factors • MEME:San Diego Supercomputing Facility http://www.sdsc.edu/MEME/meme/website/meme.html • EMBOSSalso includes the MEME program

  40. Protein 3-D Structure

  41. Proteins self-assemble in solution All of the information necessary to determine the complex 3-D structure is in the amino acid sequences Structure determines function - lock & key model of enzyme function Know the sequence, know the function? Nearly infinite complexity Self-assembly

  42. Structure prediction • Protein Structure prediction is the “Holy Grail” of bioinformatics • Since structure = function, then structure prediction should allow protein design, design of inhibitors, etc. • Huge amounts of genome data - what are the functions of all of these proteins?

  43. Cannot be accurately predicted from sequence alone (known as ab initio) Levinthal’s paradox: a 100 aa protein has 3200possible backbone configurations - many orders of magnitude beyond the capacity of the fastest computers There are perhaps only a few hundred basic structures, but we don’t yet have this vocabulary or the ability to recognize variants on a theme 3-D Structure

  44. Secondary Structure • Protein secondary structure takes one of three forms: • Alpha helix • Beta pleated sheet • Turn • 2ndary structure is predicted within a small window • Many different algorithms, not highly accurate • Better predictions from a multiple alignment

  45. Secondary Structural Content Prediction (SSCP): EMBL, Heidelberg http://www.bork.embl-heidelberg.de/SSCP/sscp_seq.html BCM Search Launcher: Protein Secondary Structure Prediction: Baylor College of Medicine http://dot.imgen.bcm.tmc.edu:9331/seq-search/struc-predict.html PREDATOR: EMBL, Heidelberg http://www.embl-heidelberg.de/cgi/predator_serv.pl Structure Prediction on the Web

  46. Sample 2-D Structure Prediction

  47. Threading Protein Structures Best bet is to compare with similar sequences that have known structures >> Threading • Only works for proteins with >25% sequence similarity to a protein with known structure • Some websites offer quick approximations • Will improve as more 3-D structures are described

  48. Websites for 3-D structure prediction UCLA-DOE Protein Fold Recognition http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html SwissModel: ExPASy, Univ. of Geneva http://www.expasy.ch/swissmod/SWISS-MODEL.html CPHmodels: Technical Univ. of Denmark http://www.cbs.dtu.dk/services/CPHmodels/

More Related