Introduction to Protein Sequence Analysis - Charlie Whittaker/Sebastian Hoersch

  • Course Materials: http://luria.mit.edu/ICBPsessions

  • Topics
    • Lecture/Demo/Exercises
      • Accessing Protein Sequences and Information (UniProt)
      • Sequence Alignment and Phylogenetic Analysis (BLAST, ClustalX)
      • Protein Domain and Motif Analysis (SMART, InterPro, Scansite)
  • Evaluations: send email to charliew@mit.edu

Accessing Protein Sequences and Information

The large number of different databases and resources can make this difficult.

  • Different resources:
    • contain different data
    • use different identifier schemes
    • use different definitions of redundancy
  • Examples include Ensembl (genomes), NCBI Protein (GenBank), IPI and UniProt.
  • UniProt may be the best place to begin.
    • Useful X_Y ID scheme (protein_species, e.g. ITA5_HUMAN)
    • Widespread Usage (SMART, GO)
    • Abundant manual annotation and cross-referencing tools
    • Database is mirrored at multiple locations

UniProt: http://www.pir.uniprot.org/
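The X_Y ID scheme can be handled with plain string operations. The sketch below is only an illustration: the helper names `split_entry_name` and `group_by_species` are hypothetical, and it assumes every identifier has the form protein_species, as in the ITA5_HUMAN style names used later in these slides.

```python
def split_entry_name(entry_name):
    """Split an X_Y style name such as 'ITA5_HUMAN' into (protein, species)."""
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not an X_Y style identifier: {entry_name!r}")
    return protein, species


def group_by_species(entry_names):
    """Group protein codes by the species code found after the underscore."""
    groups = {}
    for name in entry_names:
        protein, species = split_entry_name(name)
        groups.setdefault(species, []).append(protein)
    return groups


if __name__ == "__main__":
    print(group_by_species(["ITA5_HUMAN", "ITA1_DROME", "ITAV_HUMAN"]))
```

Grouping by the species code is a quick way to see which organisms are represented in a list of hits.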

Local Sequence Alignment (BLAST)
  • Searching is done in a pairwise fashion and reported alignments are restricted to the best parts of the query-target relationship.
  • Multiple BLAST “flavors” allow alignments of protein and DNA in all different combinations.
  • BLAST is relatively fast and sensitive, which makes it the standard tool for searching large datasets by sequence similarity.
  • Ubiquitous - Virtually all online protein resources have some kind of BLAST implementation.
  • NCBI may have the best on-line version of the tool.
  • http://www.ncbi.nlm.nih.gov/blast/
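The idea of reporting only the best part of a query-target relationship can be sketched with the Smith-Waterman recurrence. This is not how BLAST itself works (BLAST is a faster heuristic), and the match, mismatch and gap values below are arbitrary assumptions chosen for illustration rather than a real substitution matrix.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment start anywhere, which is what
            # restricts the result to the best-matching region.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Only the shared GHKL core contributes to the score.
    print(local_alignment_score("AAAGHKLAAA", "CCCGHKLCCC"))
```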
Global Sequence Alignment (MSA)

Portion of a multiple, global alignment created with ClustalX

The goal is to stack in columns amino acids that derive from an ancestral residue. The quality of pairwise and groupwise alignments is scored using substitution matrices.
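For two sequences, a global alignment score can be sketched with the Needleman-Wunsch recurrence, in which every residue of both sequences must be placed in a column. This is a simplified pairwise illustration, not the progressive multiple-alignment procedure used by ClustalX, and the scoring values are assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best end-to-end alignment score between a and b (Needleman-Wunsch)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # ACDE over AC-E: three matches and one gap.
    print(global_alignment_score("ACDE", "ACE"))
```

Unlike a local score, the value can be negative, because unmatched ends are penalized rather than ignored.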

Protein Substitution Matrices

# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

BLOSUM62

Both local and global alignments use substitution matrices to quantify relationships between proteins.
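As a small worked example, the sketch below scores two equal-length, ungapped sequences by summing substitution scores column by column. The dictionary holds only the I, L, D, E and K entries copied from the BLOSUM62 table on this slide; the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are hypothetical, and the test peptides are made up.

```python
# Scores for five residues, copied from the BLOSUM62 table (symmetric matrix,
# so each unordered pair is stored once).
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "D"): -3, ("I", "E"): -3, ("I", "K"): -3,
    ("L", "L"): 4, ("L", "D"): -4, ("L", "E"): -3, ("L", "K"): -2,
    ("D", "D"): 6, ("D", "E"): 2, ("D", "K"): -1,
    ("E", "E"): 5, ("E", "K"): 1,
    ("K", "K"): 5,
}


def pair_score(x, y):
    """Look up the score for residues x and y in either order."""
    key = (x, y) if (x, y) in BLOSUM62_SUBSET else (y, x)
    return BLOSUM62_SUBSET[key]


def ungapped_score(a, b):
    """Sum of substitution scores over the columns of an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(ungapped_score("ILDEK", "ILDEK"))  # identical residues
    print(ungapped_score("ILDEK", "LIEDK"))  # conservative substitutions
    print(ungapped_score("ILDEK", "DEKIL"))  # dissimilar residues
```

Conservative substitutions such as I/L or D/E still score positively, which is how the matrix captures relatedness beyond exact identity.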

Phylogenetic Trees
  • Clustal uses the Neighbor-Joining Method (NJ)
    • NJ is a distance-based method that repeatedly joins a pair of neighbors (closely related sequences or clusters), choosing the pair that minimizes the total branch length of the tree.
    • The Phylip package is freely available and implements a wide range of different methods.

http://evolution.genetics.washington.edu/phylip.html

  • Tree Reliability
    • The bootstrap method is used to add confidence levels to the groupings.
  • Visualization of the tree
    • NJ Plot
      • Draws unrooted phylogenetic trees in phenogram format
    • Other methods allow more control of format, for example: http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
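The pair-selection step of Neighbor-Joining can be sketched as follows. For n sequences with distances d(i,j), the standard criterion is Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r(i) is the sum of distances from i to all other sequences, and the pair with the smallest Q is joined. The distance matrix below is a made-up toy example, and `nj_pick_pair` is a hypothetical helper that performs only this one step, not the full tree construction done by Clustal.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of names with the smallest NJ Q value, and that value."""
    n = len(names)
    totals = [sum(row) for row in dist]  # r(i): summed distance to all others
    best_pair, best_q = None, None
    for i, j in combinations(range(n), 2):
        q = (n - 2) * dist[i][j] - totals[i] - totals[j]
        if best_q is None or q < best_q:
            best_pair, best_q = (names[i], names[j]), q
    return best_pair, best_q


if __name__ == "__main__":
    names = ["a", "b", "c", "d", "e"]
    dist = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_pick_pair(names, dist))
```

In this toy matrix the smallest raw distance is between d and e, yet the Q criterion selects a and b, because Q also accounts for how far each sequence is from everything else.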

Assessing Tree Reliability using Bootstrapping

[Figure: an actual alignment shown next to a bootstrap replicate of it]

  • Positions within the original alignment are randomly resampled to create a “pseudo replicate”.
  • Large numbers of pseudo replicates are generated.
  • The distances between species within each pseudo replicate are calculated and trees are drawn for each.
  • The stability of clades across the set of trees is calculated to identify clades that are present in most pseudo replicates.
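The resampling procedure above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the alignment is a made-up toy, distances are plain fractions of differing positions, and instead of drawing a full tree for each pseudo replicate the sketch only asks which pair of sequences is closest. All function names are hypothetical.

```python
import random
from itertools import combinations


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (same columns for every sequence)."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def p_distance(a, b):
    """Fraction of aligned positions at which a and b differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(alignment):
    """Pair of sequence names with the smallest p-distance."""
    return min(
        combinations(sorted(alignment), 2),
        key=lambda p: p_distance(alignment[p[0]], alignment[p[1]]),
    )


def pair_support(alignment, pair, replicates=200, seed=0):
    """Fraction of pseudo replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_replicate(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return hits / replicates


if __name__ == "__main__":
    toy = {"x": "ACDEFGHIKL", "y": "ACDEFGHIKL", "z": "LKIHGFEDCA"}
    print(pair_support(toy, ("x", "y")))
```

A grouping that survives in nearly all replicates is well supported by the alignment; one that appears in only a minority depends on a few columns.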
Phylogenetic Tree Examples

Tree in Newick format, with bootstrap values in brackets:

(((ITA1_DROME:0.67741,(ITA6_HUMAN:0.42032,ITA7_HUMAN:0.31161):0.29176[1000]):0.11947[992],
(ITA2_DROME:0.72000,(ITA5_HUMAN:0.37147,ITAV_HUMAN:0.43034):0.25993[1000]):0.09502[976]):0.12118[954],
(ITA5_DROME:1.07810,((ITA10_HUMAN:0.70421,ITA2_HUMAN:0.73710):0.10612[857],ITAL_HUMAN:0.86603):0.18550[986]):0.02936[434],
(ITA4_HUMAN:0.49064,ITA9_HUMAN:0.45807):0.35160[1000]);

[Figure: the same tree drawn with bootstrap values at the nodes]
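The parenthesized tree text on this slide can be inspected with simple pattern matching. The sketch below assumes the layout seen here, with `name:length` for leaves and `:length[bootstrap]` for internal branches. The helper names are hypothetical, and regular expressions are a shortcut for this one string, not a general Newick parser.

```python
import re

# The tree from this slide, joined into a single string.
TREE = (
    "(((ITA1_DROME:0.67741,(ITA6_HUMAN:0.42032,ITA7_HUMAN:0.31161):0.29176[1000])"
    ":0.11947[992],(ITA2_DROME:0.72000,(ITA5_HUMAN:0.37147,ITAV_HUMAN:0.43034)"
    ":0.25993[1000]):0.09502[976]):0.12118[954],(ITA5_DROME:1.07810,"
    "((ITA10_HUMAN:0.70421,ITA2_HUMAN:0.73710):0.10612[857],ITAL_HUMAN:0.86603)"
    ":0.18550[986]):0.02936[434],(ITA4_HUMAN:0.49064,ITA9_HUMAN:0.45807)"
    ":0.35160[1000]);"
)


def leaf_lengths(newick):
    """Map each leaf name to the length of the branch leading to it."""
    pairs = re.findall(r"([A-Za-z0-9_]+):([0-9.]+)", newick)
    return {name: float(length) for name, length in pairs}


def bootstrap_values(newick):
    """Bootstrap values in the order they appear in the string."""
    return [int(v) for v in re.findall(r"\[(\d+)\]", newick)]


if __name__ == "__main__":
    print(sorted(leaf_lengths(TREE)))
    print(min(bootstrap_values(TREE)))
```

The lowest value, 434, marks the least stable grouping in this tree, while the pairs supported at 1000 were recovered in every replicate counted.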

Homolog, Ortholog and Paralog

[Figure: gene A in an ancestral organism. A speciation event gives xA and yA, which are orthologs. A gene duplication gives yA’ and yA’’, which are paralogs. xA, yA’ and yA’’ are all homologs.]

  • There is no such thing as percent homology.
  • When there is any doubt, use the term homolog.
  • How do you identify homologs?
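Homology is a yes-or-no statement about common ancestry, so the quantity that can actually be computed from an alignment is percent identity. The sketch below shows one way to calculate it. The choice of denominator (here, columns where neither sequence has a gap) is an assumption, since other conventions exist, and the aligned sequences are made up.

```python
def percent_identity(a, b, gap="-"):
    """Percent of ungapped alignment columns in which a and b are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    print(round(percent_identity("ACDE-FG", "ACDQHFG"), 1))
```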
Protein Domains and Motifs
  • Protein domains are modular units of sequence with consistent structure and function.
  • Evolution can produce both new domains and novel combinations of domains.
  • Protein motifs are short sequence patterns with functional implications.

[Figure: domain architectures of a pan-bilaterian subgroup B thrombospondin and a deuterostome-specific subgroup A thrombospondin; CSVTCG is the CD36-binding motif]

Protein Domain and Motif Analysis
  • Models (HMMs) that describe domains are created from alignments. Those models are then used to scan proteins for the presence of domains.
  • Domains do not need to be characterized or understood to be detected, as with domains of unknown function (DUFs).
  • Motifs are analyzed in a similar way or using simpler methods involving text pattern matching.
  • Proteins in public databases have already been analyzed for domain content and these data are available from a number of sources.
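The simpler text-pattern approach to motifs can be sketched with a regular expression. The example searches for the CSVTCG motif named in these slides; the protein sequence and the degenerate pattern `CS[VT][TS]CG` are made up for illustration, and `find_motif` is a hypothetical helper.

```python
import re


def find_motif(sequence, pattern):
    """Return (1-based position, matched text) for every motif occurrence.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    return [
        (m.start() + 1, m.group(1))
        for m in re.finditer(f"(?=({pattern}))", sequence)
    ]


if __name__ == "__main__":
    seq = "MKCSVTCGAAWCSTSCGLL"  # made-up sequence
    print(find_motif(seq, "CSVTCG"))
    print(find_motif(seq, "CS[VT][TS]CG"))
```

Exact strings find only perfect copies of a motif, while character classes allow limited variation at chosen positions. HMM-based domain models generalize this idea by scoring every position probabilistically.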
SMART - http://smart.embl-heidelberg.de/
  • SMART is an excellent resource for domain analysis
    • Integrates data from multiple sources
      • SMART and Pfam domain models
      • Gene Ontology
      • Taxonomic data
      • Genomic data (Ensembl)
    • Powerful Search Tools
    • Excellent Graphics
InterProScan - http://www.ebi.ac.uk/InterProScan/
  • Includes some of the things found in SMART plus additional models and methods.
  • Software and data are freely available allowing batch analysis of proteins on local computers.
Scansite - http://scansite.mit.edu/
  • Search tool designed to identify substrates of a variety of protein kinases.
  • Other useful utilities are also available
Functional Annotation of Proteins
  • Database records
  • Literature
  • GO (http://www.geneontology.org/): “The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism. [...] The GO project has developed three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.”
    • GO identifiers are now cross-referenced in many biological databases, including UniProt and SMART.

The available information and the nomenclature used depend on the researcher, research focus, species, etc.
