
Introduction to Protein Sequence Analysis - Charlie Whittaker/Sebastian Hoersch



  1. Introduction to Protein Sequence Analysis - Charlie Whittaker/Sebastian Hoersch
     • Course Materials: http://luria.mit.edu/ICBPsessions
     • Topics
     • Lecture/Demo/Exercises
     • Accessing Protein Sequences and Information (UniProt)
     • Sequence Alignment and Phylogenetic Analysis (BLAST, ClustalX)
     • Protein Domain and Motif Analysis (SMART, InterPro, Scansite)
     • Evaluations: send email to charliew@mit.edu

  2. Accessing Protein Sequences and Information
     The large number of different databases and resources can make this difficult.
     • Different resources:
       • contain different data
       • use different identifier schemes
       • use different definitions of redundancy
     • Examples: Ensembl (genomes), NCBI protein (GenBank), IPI and UniProt.
     • UniProt may be the best place to begin.
       • Useful X_Y ID scheme (protein mnemonic, underscore, species mnemonic, e.g. ITA5_HUMAN)
       • Widespread usage (SMART, GO)
       • Abundant manual annotation and cross-referencing tools
       • Database is mirrored at multiple locations
     UniProt: http://www.pir.uniprot.org/
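The X_Y scheme makes entry names easy to handle programmatically. The sketch below splits names into their protein and species parts and groups them by species. The helper name `split_entry_name` is hypothetical and is not part of any UniProt tool. The entry names are ones that appear later in this presentation.

```python
def split_entry_name(entry_name):
    """Split an X_Y entry name such as 'ITA5_HUMAN' into (protein, species)."""
    protein, sep, species = entry_name.strip().partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not an X_Y entry name: {entry_name!r}")
    return protein, species


# Group a few entry names by their species mnemonic.
names = ["ITA5_HUMAN", "ITA2_DROME", "ITAV_HUMAN"]
by_species = {}
for name in names:
    protein, species = split_entry_name(name)
    by_species.setdefault(species, []).append(protein)

print(by_species)  # {'HUMAN': ['ITA5', 'ITAV'], 'DROME': ['ITA2']}
```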

  3. Local Sequence Alignment (BLAST)
     • Searching is done in a pairwise fashion, and reported alignments are restricted to the best-matching parts of the query-target relationship.
     • Multiple BLAST “flavors” allow alignments of protein and DNA in all combinations.
     • Relatively fast and sensitive, which makes BLAST the standard tool for searching large datasets by sequence similarity.
     • Ubiquitous: virtually all online protein resources have some kind of BLAST implementation.
     • NCBI may have the best online version of the tool.
     • http://www.ncbi.nlm.nih.gov/blast/
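To illustrate what "restricted to the best parts" means, here is a minimal sketch of Smith-Waterman local alignment scoring. This is not BLAST's algorithm: BLAST uses a faster heuristic, whereas Smith-Waterman is the exact dynamic-programming method for the same kind of alignment. The match, mismatch and gap values are arbitrary assumptions chosen for readability, and the two sequences are made up.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets an alignment restart anywhere, so
            # poorly matching flanks never drag the score down.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Only the shared core "CSVTCG" contributes: 6 matches x 2 = 12.
print(smith_waterman_score("XXXCSVTCGYYY", "QQCSVTCGPP"))  # 12
```

A real search would score residue pairs with a substitution matrix instead of a flat match/mismatch value.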

  4. Global Sequence Alignment (MSA)
     [Figure: portion of a multiple, global alignment created with ClustalX]
     The goal is to stack in columns amino acids that derive from an ancestral residue.
     The quality of pairwise and groupwise alignments is scored using substitution matrices.

  5. Protein Substitution Matrices

     # Matrix made by matblas from blosum62.iij
     # * column uses minimum score
     # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
     # Blocks Database = /data/blocks_5.0/blocks.dat
     # Cluster Percentage: >= 62
     # Entropy = 0.6979, Expected = -0.5209
        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
     A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
     R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
     N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
     D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
     C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
     Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
     E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
     G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
     H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
     I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
     L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
     K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
     M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
     F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
     P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
     S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
     T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
     W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
     Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
     V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
     B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
     Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
     X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
     * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

     BLOSUM62. Both local and global alignments use substitution matrices to quantify relationships between proteins.
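As a small worked example of how a substitution matrix quantifies a relationship, the sketch below sums BLOSUM62 scores over an ungapped pairwise alignment. Only the handful of matrix entries needed for the example are included. The names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are illustrative, and the second peptide is a made-up variant.

```python
# A few entries copied from the BLOSUM62 matrix (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("C", "C"): 9, ("S", "S"): 4, ("V", "V"): 4, ("T", "T"): 5,
    ("G", "G"): 6, ("I", "I"): 4, ("I", "V"): 3, ("G", "W"): -2,
}


def pair_score(x, y):
    """Look up the score for residues x and y in either order."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]  # KeyError if the pair is not in the subset


def ungapped_score(a, b):
    """Sum substitution scores over two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(ungapped_score("CSVTCG", "CSVTCG"))  # 9+4+4+5+9+6 = 37
print(ungapped_score("CSVTCG", "CSITCG"))  # V/I scores 3 instead of 4: 36
```

The conservative V-to-I substitution costs only one point, whereas a G-to-W substitution would score -2 in place of 6.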

  6. Phylogenetic Trees
     • Clustal uses the Neighbor-Joining (NJ) method
       • NJ is a distance-based method that repeatedly joins the pair of sequences (or clusters) that are closest once each one's total distance to all the others is taken into account.
       • The Phylip package is freely available and implements a wide range of different methods. http://evolution.genetics.washington.edu/phylip.html
     • Tree Reliability
       • The bootstrap method is used to add confidence levels to the groupings.
     • Visualization of the tree
       • NJplot
         • Draws unrooted phylogenetic trees in phenogram format
       • Other methods allow more control of format, for example: http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
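The first joining step of NJ can be sketched as follows. For each pair, NJ computes Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is a taxon's summed distance to all others, and joins the pair with the smallest Q. The function name `nj_first_join` and the five-taxon distance matrix are illustrative assumptions, not data from the course. This is a single step, not a full tree builder.

```python
from itertools import combinations


def nj_first_join(names, dist):
    """Return ((x, y), (branch_x, branch_y)) for the first NJ join.

    dist maps frozenset({x, y}) to the distance between taxa x and y.
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least 3 taxa")
    # r[x]: sum of distances from x to every other taxon.
    r = {x: sum(dist[frozenset((x, y))] for y in names if y != x) for x in names}

    def q(x, y):
        return (n - 2) * dist[frozenset((x, y))] - r[x] - r[y]

    x, y = min(combinations(names, 2), key=lambda pair: q(*pair))
    dxy = dist[frozenset((x, y))]
    # Branch lengths from x and y to the new internal node.
    branch_x = dxy / 2 + (r[x] - r[y]) / (2 * (n - 2))
    return (x, y), (branch_x, dxy - branch_x)


names = ["a", "b", "c", "d", "e"]
rows = {
    "a": [0, 5, 9, 9, 8],
    "b": [5, 0, 10, 10, 9],
    "c": [9, 10, 0, 8, 7],
    "d": [9, 10, 8, 0, 3],
    "e": [8, 9, 7, 3, 0],
}
dist = {frozenset((x, y)): rows[x][j]
        for x in names for j, y in enumerate(names) if x != y}

pair, branches = nj_first_join(names, dist)
print(pair, branches)                    # ('a', 'b') (2.0, 3.0)
print(sorted(min(dist, key=dist.get)))   # ['d', 'e'] has the smallest raw distance
```

In this example the pair with the smallest raw distance (d, e) is not the first pair joined. The Q correction favors a and b because both are far from everything else.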

  7. Assessing Tree Reliability using Bootstrapping
     [Figure: an actual alignment shown next to a bootstrap replicate]
     • Positions (columns) within the original alignment are randomly resampled with replacement to create a “pseudo replicate”.
     • Large numbers of pseudo replicates are generated.
     • The distances between species within each pseudo replicate are calculated, and a tree is drawn for each.
     • The stability of clades across the set of trees is calculated to identify clades that are present in most pseudo replicates.
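The resampling and counting steps can be sketched as below. The tree-building step is left out, and the toy alignment and the hand-written clade sets are made up for illustration. Both function names are hypothetical.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps sequence name -> aligned sequence (all the same length).
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def clade_support(clade, replicate_clades):
    """Fraction of replicate trees (given as sets of clades) containing clade."""
    clade = frozenset(clade)
    return sum(clade in clades for clades in replicate_clades) / len(replicate_clades)


rng = random.Random(0)  # fixed seed so the run is repeatable
toy = {"seqA": "ACDEFG", "seqB": "ACDQFG", "seqC": "TCDEYG"}
replicate = bootstrap_replicate(toy, rng)
print(replicate)

# Pretend four replicate trees were built; each is listed by its clades.
replicate_clades = [
    {frozenset({"seqA", "seqB"})},
    {frozenset({"seqA", "seqB"})},
    {frozenset({"seqA", "seqC"})},
    {frozenset({"seqA", "seqB"})},
]
print(clade_support({"seqA", "seqB"}, replicate_clades))  # 0.75
```

Whole columns are resampled, so each replicate keeps the residues of all sequences at a chosen position together.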

  8. Phylogenetic Tree Examples
     Tree in Newick format (branch length after each colon, bootstrap value in square brackets):

     (((ITA1_DROME:0.67741,(ITA6_HUMAN:0.42032,ITA7_HUMAN:0.31161):0.29176[1000]):0.11947[992],
     (ITA2_DROME:0.72000,(ITA5_HUMAN:0.37147,ITAV_HUMAN:0.43034):0.25993[1000]):0.09502[976]):0.12118[954],
     (ITA5_DROME:1.07810,((ITA10_HUMAN:0.70421,ITA2_HUMAN:0.73710):0.10612[857],ITAL_HUMAN:0.86603):0.18550[986]):0.02936[434],
     (ITA4_HUMAN:0.49064,ITA9_HUMAN:0.45807):0.35160[1000]);

     [Figure: drawing of this tree with bootstrap values labeling the internal nodes]
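A quick way to inspect a tree string like this one is to pull out the leaf names and bootstrap values with regular expressions, as sketched below. This is not a general Newick parser. It assumes the `:length[bootstrap]` layout shown on the slide. The cutoff of 700 is an arbitrary value chosen for illustration, not a recommended threshold.

```python
import re

NEWICK = (
    "(((ITA1_DROME:0.67741,(ITA6_HUMAN:0.42032,ITA7_HUMAN:0.31161):0.29176[1000]):0.11947[992],"
    "(ITA2_DROME:0.72000,(ITA5_HUMAN:0.37147,ITAV_HUMAN:0.43034):0.25993[1000]):0.09502[976]):0.12118[954],"
    "(ITA5_DROME:1.07810,((ITA10_HUMAN:0.70421,ITA2_HUMAN:0.73710):0.10612[857],ITAL_HUMAN:0.86603):0.18550[986]):0.02936[434],"
    "(ITA4_HUMAN:0.49064,ITA9_HUMAN:0.45807):0.35160[1000]);"
)

# A leaf is a name that directly follows "(" or "," and is followed by ":".
leaves = re.findall(r"[(,]\s*([A-Za-z0-9_]+)\s*:", NEWICK)

# An internal node is a ")" followed by ":length[bootstrap]".
supports = [int(s) for s in re.findall(r"\)\s*:\s*[0-9.]+\[(\d+)\]", NEWICK)]

CUTOFF = 700  # arbitrary illustrative cutoff
weak = [s for s in supports if s < CUTOFF]

print(len(leaves), "leaves")      # 12 leaves
print(supports)                   # [1000, 992, 1000, 976, 954, 857, 986, 434, 1000]
print("below cutoff:", weak)      # below cutoff: [434]
```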

  9. Homolog, Ortholog and Paralog
     [Figure: gene A in an ancestral organism. A speciation event gives orthologs xA and yA. A gene duplication gives paralogs yA’ and yA’’. All of these genes are homologs.]
     • There is no such thing as percent homology.
     • When there is any doubt, use the term homolog.
     • How do you identify homologs?
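Homology is a yes-or-no statement about shared ancestry, so the quantity an alignment actually lets you measure is percent identity (or similarity). The sketch below computes percent identity over aligned columns where neither sequence has a gap. This is one common convention among several, and the function name `percent_identity` and the example sequences are illustrative.

```python
def percent_identity(a, b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


print(percent_identity("AC-GT", "ACTGA"))  # 3 identical of 4 compared columns: 75.0
```

Other conventions divide by the full alignment length or by the shorter sequence length, so reported values should state which denominator was used.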

  10. Protein Domains and Motifs
      • Protein domains are modular units of sequence with consistent structure and function.
      • Evolution can produce both new domains and novel combinations of domains.
      • Protein motifs are short sequence patterns with functional implications.
      [Figure labels: Pan-Bilaterian Subgroup B Thrombospondin; Deuterostome-specific Subgroup A Thrombospondin; CSVTCG CD36-Binding Motif]

  11. Protein Domain and Motif Analysis
      • Models (hidden Markov models, HMMs) that describe domains are created from alignments. Those models are then used to scan proteins for the presence of domains.
      • Domains do not need to be characterized or understood to be detected (domains of unknown function, DUFs).
      • Motifs are analyzed in a similar way, or with simpler methods based on text pattern matching.
      • Proteins in public databases have already been analyzed for domain content, and these data are available from a number of sources.
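The "text pattern matching" approach to motifs can be sketched with a regular expression. The example searches for the CSVTCG motif. The test sequence is made up, the degenerate pattern `CS[VI]TCG` is a hypothetical variant used only to show how a character class works, and `find_motif` is an illustrative helper name.

```python
import re


def find_motif(sequence, pattern):
    """Return (1-based start position, matched text) for each non-overlapping hit."""
    return [(m.start() + 1, m.group()) for m in re.finditer(pattern, sequence)]


seq = "MKCSVTCGDGAWCSITCGK"  # made-up sequence for illustration

print(find_motif(seq, "CSVTCG"))     # [(3, 'CSVTCG')]
print(find_motif(seq, "CS[VI]TCG"))  # [(3, 'CSVTCG'), (13, 'CSITCG')]
```

Exact patterns like these are fast but rigid. Profile-based models such as HMMs tolerate variation that a fixed pattern would miss.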

  12. SMART - http://smart.embl-heidelberg.de/
      • SMART is an excellent resource for domain analysis
      • Integrates data from multiple sources
        • SMART and Pfam domain models
        • Gene Ontology
        • Taxonomic data
        • Genomic data (Ensembl)
      • Powerful Search Tools
      • Excellent Graphics

  13. InterProScan - http://www.ebi.ac.uk/InterProScan/
      • Includes some of the models and methods found in SMART, plus additional ones.
      • Software and data are freely available, allowing batch analysis of proteins on local computers.

  14. Scansite - http://scansite.mit.edu/
      • Search tool designed to identify substrates of a variety of protein kinases.
      • Other useful utilities are also available.

  15. Functional Annotation of Proteins
      • Database records
      • Literature
      • GO (http://www.geneontology.org/)
        “The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism. [...] The GO project has developed three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.”
        GO identifiers are now cross-referenced in many biological databases, including UniProt and SMART.
      The available information and the nomenclature used depend on the researcher, research focus, species, etc.
