Introduction to Protein Sequence Analysis - Charlie Whittaker/Sebastian Hoersch

• Course Materials: http://luria.mit.edu/ICBPsessions
• Topics
• Lecture/Demo/Exercises
  • Accessing Protein Sequences and Information (UniProt)
  • Sequence Alignment and Phylogenetic Analysis (BLAST, ClustalX)
  • Protein Domain and Motif Analysis (SMART, InterPro, Scansite)
• Evaluations: send email to charliew@mit.edu

Accessing Protein Sequences and Information
The large number of different databases and resources can make this difficult.
• Different resources:
  • contain different data
  • use different identifier schemes
  • use different definitions of redundancy
• Ensembl (genomes), NCBI protein (GenBank), IPI and UniProt.
• UniProt may be the best place to begin.
  • Useful X_Y ID scheme
  • Widespread usage (SMART, GO)
  • Abundant manual annotation and cross-referencing tools
  • Database is mirrored at multiple locations
UniProt: http://www.pir.uniprot.org/
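As a small illustration of why the X_Y ID scheme is useful, the sketch below splits entry names into their two halves and groups them by the second half. Reading X as a protein mnemonic and Y as a species mnemonic is an assumption made here; the function names are hypothetical, and the example identifiers are taken from the tree example in this deck.

```python
from collections import defaultdict


def split_entry_name(entry_name):
    """Split an X_Y style entry name into (X, Y).

    Assumption: X is a protein mnemonic and Y a species mnemonic.
    """
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not an X_Y entry name: {entry_name!r}")
    return protein, species


def group_by_species(entry_names):
    """Map each species mnemonic to the protein mnemonics seen for it."""
    groups = defaultdict(list)
    for name in entry_names:
        protein, species = split_entry_name(name)
        groups[species].append(protein)
    return dict(groups)


names = ["ITA5_HUMAN", "ITAV_HUMAN", "ITA2_DROME"]
print(group_by_species(names))
```

Because the species is encoded in the identifier itself, a list of hits can be sorted by organism without any database lookup.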

Local Sequence Alignment (BLAST)
• Searching is done in a pairwise fashion, and reported alignments are restricted to the best-matching parts of the query-target relationship.
• Multiple BLAST “flavors” allow alignments of protein and DNA in all different combinations.
• BLAST is relatively fast and sensitive, which makes it the standard tool for searching large datasets by sequence similarity.
• Ubiquitous: virtually all online protein resources have some kind of BLAST implementation.
• NCBI may have the best online version of the tool.
  • http://www.ncbi.nlm.nih.gov/blast/
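To make "restricted to the best parts" concrete, here is a minimal sketch of local alignment by dynamic programming (the Smith-Waterman approach). This is not how BLAST works internally; BLAST relies on faster heuristics. The match, mismatch and gap values and the two toy sequences are assumptions chosen only for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0: that is what makes the alignment local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score falls to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences (made up): only the shared core should be reported.
print(smith_waterman("AAACSVTCGAAA", "WWWCSVTCGWWW"))
```

The unrelated flanking residues are left out of the result, which is the behavior the first bullet describes.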

Global Sequence Alignment (MSA)
[Figure: portion of a multiple, global alignment created with ClustalX]
The goal is to stack in columns the amino acids that derive from the same ancestral residue. The quality of pairwise and groupwise alignments is scored using substitution matrices.

Protein Substitution Matrices

# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

BLOSUM62
Both local and global alignments use substitution matrices to quantify relationships between proteins.
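A minimal sketch of how such a matrix is applied: the score of an ungapped alignment is the sum of the matrix entries for each aligned residue pair. Only a handful of BLOSUM62 entries are copied here from the matrix above; the dictionary layout, the function names and the example peptides are assumptions made for illustration.

```python
# A few BLOSUM62 entries copied from the matrix above. The matrix is
# symmetric, so each pair is stored in one orientation only.
BLOSUM62_SUBSET = {
    ("C", "C"): 9, ("S", "S"): 4, ("V", "V"): 4, ("T", "T"): 5, ("G", "G"): 6,
    ("S", "T"): 1, ("V", "I"): 3, ("C", "S"): -1, ("G", "W"): -2,
}


def pair_score(x, y, matrix=BLOSUM62_SUBSET):
    """Look up the substitution score for residues x and y in either order."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    if (y, x) in matrix:
        return matrix[(y, x)]
    raise KeyError(f"pair {x}/{y} is not in this illustrative subset")


def ungapped_score(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum the substitution scores over all aligned positions (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(pair_score(x, y, matrix) for x, y in zip(seq1, seq2))


print(ungapped_score("CSVTCG", "CSVTCG"))  # identical peptides
print(ungapped_score("CSVTCG", "CSITCG"))  # conservative V/I substitution
print(ungapped_score("CSVTCG", "CTVTCG"))  # S/T substitution
```

A conservative substitution such as V/I costs only one point relative to identity, whereas S/T costs three, which is how the matrix encodes that some replacements are more plausible than others.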

Phylogenetic Trees
• Clustal uses the Neighbor-Joining (NJ) method.
  • NJ is a distance-based method that repeatedly groups the two most closely related sequences.
• The Phylip package is freely available and implements a wide range of different methods. http://evolution.genetics.washington.edu/phylip.html
• Tree reliability
  • The bootstrap method is used to add confidence levels to the groupings.
• Visualization of the tree
  • NJplot
    • Draws unrooted phylogenetic trees in phenogram format
  • Other methods allow more control of format, for example: http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
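The grouping step of NJ can be sketched as follows. Note that in the standard formulation, NJ does not simply pick the smallest raw distance: it picks the pair that minimizes a criterion Q that corrects each distance for how far the two sequences are from everything else. The function name and the five-taxon distance matrix are made up for illustration, and this shows a single join rather than Clustal's implementation.

```python
def nj_first_join(names, dist):
    """Perform the first neighbor-joining step on a distance matrix.

    Returns (name_i, name_j, branch_i, branch_j): the pair to join and the
    branch lengths from each member to the new internal node.
    Requires at least 3 taxa.
    """
    n = len(names)
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q corrects the raw distance for each taxon's overall divergence.
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    branch_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return names[i], names[j], branch_i, branch_j


# Toy distances (made up). d and e have the smallest raw distance (3),
# yet the Q criterion selects a and b for the first join.
taxa = ["a", "b", "c", "d", "e"]
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_join(taxa, distances))
```

A full NJ run would replace the joined pair with a new node, recompute distances to it, and repeat until the tree is resolved.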

Assessing Tree Reliability using Bootstrapping
[Figure: the actual alignment shown next to a bootstrap replicate]
• Positions within the original alignment are randomly resampled to create a “pseudo replicate”.
• Large numbers of pseudo replicates are generated.
• The distances between species within each pseudo replicate are calculated, and a tree is drawn for each.
• The stability of clades across the set of trees is calculated to identify clades that are present in most pseudo replicates.
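The resampling and counting steps above can be sketched as below. Sampling columns with replacement, so that each replicate has the same length as the original alignment, is the usual convention and is assumed here. The tree-building step is omitted; the function names, the toy alignment and the representation of a tree as a set of clades are hypothetical.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: list of aligned sequences of equal length.
    Returns (replicate, columns), where columns[k] is the original column
    index that was copied into position k of the replicate.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    replicate = ["".join(seq[c] for c in columns) for seq in alignment]
    return replicate, columns


def clade_support(clade, replicate_clades):
    """Fraction of replicate trees that contain the given clade.

    replicate_clades: one set of frozensets (clades) per replicate tree.
    """
    clade = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return hits / len(replicate_clades)


# Toy alignment (made up), resampled once with a fixed seed.
toy_alignment = ["MKV-LA", "MRV-LA", "MKIELG"]
replicate, columns = bootstrap_replicate(toy_alignment, random.Random(0))
print(columns)
print(replicate)
```

Whole columns are copied, so every sequence in a replicate stays aligned with the others; only the mix of positions changes.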

Phylogenetic Tree Examples

Tree file (branch lengths after each colon, bootstrap values in square brackets):

(
  (
    (ITA1_DROME:0.67741, (ITA6_HUMAN:0.42032, ITA7_HUMAN:0.31161):0.29176[1000]):0.11947[992],
    (ITA2_DROME:0.72000, (ITA5_HUMAN:0.37147, ITAV_HUMAN:0.43034):0.25993[1000]):0.09502[976]
  ):0.12118[954],
  (ITA5_DROME:1.07810, ((ITA10_HUMAN:0.70421, ITA2_HUMAN:0.73710):0.10612[857], ITAL_HUMAN:0.86603):0.18550[986]):0.02936[434],
  (ITA4_HUMAN:0.49064, ITA9_HUMAN:0.45807):0.35160[1000]
);

[Figure: the same tree drawn with the sequence names at the tips and the bootstrap values at the internal nodes]
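A minimal sketch of reading the numbers out of the tree text above with regular expressions: leaf names with their branch lengths, and the bracketed bootstrap values. It does not build a tree structure. The assumption that the values are counts out of 1000 replicates (suggested by the maximum of 1000) and the 70% cutoff are illustrative choices, and the function names are hypothetical.

```python
import re

# The example tree from this slide, with whitespace removed.
NEWICK = (
    "(((ITA1_DROME:0.67741,(ITA6_HUMAN:0.42032,ITA7_HUMAN:0.31161):0.29176[1000])"
    ":0.11947[992],(ITA2_DROME:0.72000,(ITA5_HUMAN:0.37147,ITAV_HUMAN:0.43034)"
    ":0.25993[1000]):0.09502[976]):0.12118[954],(ITA5_DROME:1.07810,"
    "((ITA10_HUMAN:0.70421,ITA2_HUMAN:0.73710):0.10612[857],ITAL_HUMAN:0.86603)"
    ":0.18550[986]):0.02936[434],(ITA4_HUMAN:0.49064,ITA9_HUMAN:0.45807)"
    ":0.35160[1000]);"
)


def leaf_branch_lengths(newick):
    """Map each leaf name to the length of the branch leading to it."""
    pairs = re.findall(r"([A-Za-z0-9_]+):([0-9.]+)", newick)
    return {name: float(length) for name, length in pairs}


def bootstrap_values(newick):
    """Return the bracketed bootstrap values in the order they appear."""
    return [int(v) for v in re.findall(r"\[(\d+)\]", newick)]


def weak_groupings(newick, replicates=1000, threshold=0.7):
    """Bootstrap values below the cutoff (replicates and threshold are assumptions)."""
    return [v for v in bootstrap_values(newick) if v / replicates < threshold]


print(len(leaf_branch_lengths(NEWICK)), "leaves")
print(bootstrap_values(NEWICK))
print(weak_groupings(NEWICK))
```

Under these assumptions only one grouping, the one with a value of 434, falls below the cutoff.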

Homolog, Ortholog and Paralog
[Figure: gene A in an ancestral organism. A speciation event gives the orthologs xA and yA; a gene duplication gives the paralogs yA’ and yA’’. All of these genes are homologs.]
• There is no such thing as percent homology.
• When there is any doubt, use the term homolog.
• How do you identify homologs?

Protein Domains and Motifs
• Protein domains are modular units of sequence with consistent structure and function.
• Evolution can produce both new domains and novel combinations of domains.
• Protein motifs are short sequence patterns with functional implications.
[Figure: domain architectures of the Pan-Bilaterian Subgroup B Thrombospondin and the Deuterostome-specific Subgroup A Thrombospondin, with the CSVTCG CD36-Binding Motif marked]

Protein Domain and Motif Analysis
• Models (HMMs) that describe domains are created from alignments. Those models are then used to scan proteins for the presence of domains.
• Domains do not need to be characterized or understood to be detected (DUFs).
• Motifs are analyzed in a similar way, or with simpler methods involving text pattern matching.
• Proteins in public databases have already been analyzed for domain content, and these data are available from a number of sources.
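The simpler text-pattern approach to motifs can be sketched with a regular expression. The CSVTCG motif comes from the thrombospondin example in this deck; the test sequence is made up, the function name is hypothetical, and the degenerate pattern in the second call is only an illustration of the syntax, not a published motif definition.

```python
import re


def find_motif(sequence, pattern):
    """Return the 1-based start position of every match, including overlaps.

    pattern is a regular expression; a lookahead is used so that
    overlapping occurrences are all reported.
    """
    return [m.start() + 1 for m in re.finditer(f"(?=({pattern}))", sequence)]


# Made-up sequence containing two copies of the CSVTCG motif.
toy_protein = "MACSVTCGPWSEWSPCSVTCGK"
print(find_motif(toy_protein, "CSVTCG"))
# A character class allows either V or I at the third position (illustrative).
print(find_motif(toy_protein, "CS[VI]TCG"))
```

Pattern matching of this kind is exact and fast but has no notion of partial similarity, which is where the HMM-based models described above come in.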

SMART - http://smart.embl-heidelberg.de/
• SMART is an excellent resource for domain analysis
• Integrates data from multiple sources
  • SMART and Pfam domain models
  • Gene Ontology
  • Taxonomic data
  • Genomic data (Ensembl)
• Powerful search tools
• Excellent graphics

InterProScan - http://www.ebi.ac.uk/InterProScan/
• Includes some of the things found in SMART, plus additional models and methods.
• Software and data are freely available, allowing batch analysis of proteins on local computers.

Scansite - http://scansite.mit.edu/
• Search tool designed to identify substrates of a variety of protein kinases.
• Other useful utilities are also available.

Functional Annotation of Proteins
• Database records
• Literature
• GO (http://www.geneontology.org/)
  “The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism. [...] The GO project has developed three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.”
  GO identifiers are now cross-referenced in many biological databases, including UniProt and SMART.
The available information and the nomenclature used depend on the researcher, the research focus, the species, etc.
