Week 08
Applied Bioinformatics

Theory I

  • Protein Sequences

  • Protein Families

  • Protein Domains

  • Computer Learning

    • Garbage in -> Garbage out

  • Prediction based on learned Examples

Protein Sequence

  • Primary Sequence consisting of 20 amino acids

  • Secondary Structure consists of 3 types

    • Helix – Strand – Coil

  • Tertiary structure: combinations of secondary structures

    • Unlimited number of combinations possible

    • But only a limited number of motifs are found

    • Architectures are built hierarchically

  • Quaternary structure

    • Essentially protein-protein interactions; not part of this course





  • The PRIDE PRoteomics IDEntifications database is a centralized, standards compliant, public data repository for proteomics data

  • It contains experimental evidence for its entries

  • http://www.ebi.ac.uk/pride/

Protein Sequences

http://njms2.umdnj.edu/biochweb/education/bioweb/PreK2010/AminoAcids.htm

  • Swiss-Prot = the manually reviewed section of UniProtKB

    • http://www.expasy.ch/sprot

    • http://www.ebi.ac.uk/swissprot/

  • As with GenBank for nucleotide sequences, we need a unique identifier (accession number) for each protein sequence

  • Let’s look at EBI now


  • The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. (KB: Knowledge Base)

  • Often manually reviewed and annotated information


  • Including splice variants and isoforms

Protein Information

  • Clicking on the member name (accession number) provides detailed information about the protein


Machine Learning

  • For example clustering

    • UniRef90

    • UniRef50


  • Many Facts -> Rules/Knowledge

  • Learning = inferring general rules from many individual facts

  • Computer/Machine learning?

    • Same idea

Computer Learning

  • Neural Networks

  • Support Vector Machines

  • Naive Bayes Classifiers

  • Self Organizing Maps

  • Decision Trees

  • And many other algorithms


  • Training data needs to be chosen carefully

    • Example: subcellular targeting of proteins

  • What needs to be predicted?

    • Localization

    • Leader peptide cleavage site

  • Where does the data come from?

    • Best would be sequences validated by experimental results

  • How many?

    • Difficult to answer this one

    • More is good, but rare events will not be learned well

    • Better: curate the dataset manually, covering many possibilities without over-representing any of them


  • Yes! Preparing the dataset is crucial and takes most of the time

  • Applying the learner will not take long

  • All outcomes of the samples need to be known (target, cleavage site)

    • Negative examples are just as important

  • Divide the dataset into two parts

    • One will be used for learning

    • The other for validating the learned rules


  • The dataset can be automatically divided into different training and validation sets

  • This can be performed many times and the best result (rule set) can later be used to predict new sequences

  • That’s machine learning in brief

  • We just touched the surface of it
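The repeated division into training and validation sets described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure of any particular prediction tool: the function names (split_dataset, train_threshold, accuracy, best_of_repeated_splits), the 60/40 split and the one-feature toy data are all assumptions made for this example.

```python
import random

def split_dataset(samples, train_fraction=0.6, seed=0):
    """Shuffle the samples and divide them into a training and a validation part."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_threshold(training):
    """Toy learner (an assumption): a threshold halfway between the class means."""
    pos = [x for x, label in training if label]
    neg = [x for x, label in training if not label]
    if not pos or not neg:  # no boundary can be learned from a single class
        return sum(x for x, _ in training) / len(training)
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, validation):
    """Fraction of validation samples whose known outcome is predicted correctly."""
    hits = sum((x > threshold) == label for x, label in validation)
    return hits / len(validation)

def best_of_repeated_splits(samples, rounds=10):
    """Split, learn and validate several times; keep the best-scoring rule."""
    best_rule, best_score = None, -1.0
    for seed in range(rounds):
        training, validation = split_dataset(samples, seed=seed)
        rule = train_threshold(training)
        score = accuracy(rule, validation)
        if score > best_score:
            best_rule, best_score = rule, score
    return best_rule, best_score

# Hypothetical data: (feature value, known outcome); negative examples included.
SAMPLES = [(x, False) for x in range(5)] + [(x, True) for x in range(10, 15)]
rule, score = best_of_repeated_splits(SAMPLES)
print("learned threshold:", rule, "validation accuracy:", score)
```

Because the best rule is selected using the validation data, its validation score tends to be optimistic; a further set of sequences that was never used in any round gives a fairer estimate.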

Classification General Idea

Practical Considerations

  • You want to predict the subcellular target of a protein

    • Which species are you working with?

    • Which species did the training data come from?

    • You can try a few known examples

  • Read the publication

    • How precise is the prediction?

    • For localization

    • For prediction of the leader peptide

  • If possible, try different approaches

Clustering (Machine Learning)

  • Basically same idea as in MSA

    • Similar sequences are aligned first

    • Similar datasets are clustered first

  • The initial clusters are combined into super clusters (hierarchical clustering)

    • Similar to forming a guide tree

  • New measurements can be assigned to known clusters

    • Information can be inferred
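The clustering idea above (closest items first, then clusters combined into super clusters, then new measurements assigned to a known cluster) can be sketched as follows. This is a minimal single-linkage sketch under stated assumptions: the function names, the use of plain numbers as "measurements" and the absolute difference as distance are chosen for illustration only and do not describe how UniRef or any MSA program clusters its data.

```python
def hierarchical_clustering(items, distance):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[item] for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

def nearest_cluster(item, clusters, distance):
    """Assign a new measurement to the known cluster with the closest member."""
    return min(clusters, key=lambda c: min(distance(item, m) for m in c))

def absolute_difference(a, b):
    return abs(a - b)

# Hypothetical measurements.
MEASUREMENTS = [10, 12, 50, 51, 90]
merges = hierarchical_clustering(MEASUREMENTS, absolute_difference)
for left, right, d in merges:
    print(left, "+", right, "merged at distance", d)
print(nearest_cluster(48, [[10, 12], [50, 51]], absolute_difference))
```

The order of the merges plays the same role as a guide tree: the most similar items are joined first and the least similar last.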

Protein Families

  • Based on

    • Clusters of protein sequences

    • Domains (basically blocks of above)

  • Many domains are annotated

    • Good place to find these is

    • http://www.ebi.ac.uk/InterProScan

Practice I

Protein Information

  • In many cases we would like to get additional information about a protein

    • Molecular mass

    • pI

    • Subcellular targeting

  • http://www.expasy.org/tools

    • Many calculations, etc for proteins
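As an illustration of one such calculation, the molecular mass of a protein can be estimated by summing the average masses of its residues and adding one water molecule for the free termini. The sketch below is an assumption-laden example, not the code behind any ExPASy tool: the function name molecular_mass and the example peptide are hypothetical, and the residue masses are commonly tabulated average values that should be checked against a reference table before real use.

```python
# Average residue masses in daltons (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once for the N-terminal H and the C-terminal OH

def molecular_mass(sequence):
    """Estimate the average molecular mass of an unmodified protein chain."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER

# Hypothetical four-residue peptide.
print(round(molecular_mass("GASP"), 2))
```

Post-translational modifications and cleaved leader peptides change the real mass, so the result is only a starting point.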

Tools at Expasy

  • Prediction/ Characterizing Tools

  • Pattern and Profile searches

  • PTM predictions

  • Topology Prediction

  • Structure

    • Primary (Analysis)

    • Secondary (Prediction)

    • Tertiary (Prediction, Analysis)

Protein Information

  • You want to predict the subcellular localization of a protein

Let’s tackle this problem

  • Get a protein from Swiss-Prot

    • O82533 (Gene: AtFtsZ2-1)

  • Annotation: Chloroplast targeting

  • Try a few prediction tools to see if you can confirm the annotation

Localization Prediction

  • Choose tools from Expasy, for example:

  • ChloroP

  • SignalP

  • Predotar

Theory II

  • Substitution Matrices

First Substitution Matrices

  • Substitution Matrices

  • Sequence relationships may be hidden by changes in sequence

    • Mutations

    • Evolution

  • Approximate matches are needed

Selectionist Model

  • Some mutations are neutral

    • Not disturbing the function much

    • Not disturbing the structure much

  • These accumulate over time (evolution)

  • Some mutations are disruptive

    • L <> Q

    • Frameshift insertions or deletions

More elaborate Matrices

  • Format

    • Table 20 × 20

    • Probability of change for each combination

    • Symmetric

    • 190 distinct off-diagonal entries + 20 diagonal entries = 210

  • Examples

    • Unitary

    • GCM

    • BLOSUM

    • PAM

Genetic Code Matrix

  • Considers the minimum number of base changes (0, 1, 2, 3) needed to turn a codon of one amino acid into a codon of the other

  • Are amino acids whose codons differ in only one base chemically very different?

  • Not a very good matrix

    • Mutation occurs at the genetic level

    • But selection acts at the protein level

  • A priori

  • Example

    • Jukes Cantor Model
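The entries of a genetic code matrix can be computed directly from the standard codon table. The sketch below is a minimal illustration under stated assumptions: it uses the standard genetic code only, and the names CODONS_BY_AMINO_ACID and min_base_changes are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS_BY_AMINO_ACID = {}
for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_BY_AMINO_ACID.setdefault(amino_acid, []).append("".join(codon))

def min_base_changes(aa1, aa2):
    """Smallest number of differing bases between any codon of aa1 and of aa2."""
    return min(
        sum(b1 != b2 for b1, b2 in zip(c1, c2))
        for c1 in CODONS_BY_AMINO_ACID[aa1]
        for c2 in CODONS_BY_AMINO_ACID[aa2]
    )

for pair in [("F", "F"), ("L", "Q"), ("M", "W"), ("M", "Y")]:
    print(pair, min_base_changes(*pair))
```

Leucine (hydrophobic) and glutamine (polar) are a single base change apart, which illustrates why a matrix built from base changes alone scores chemically disruptive exchanges too favourably.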

Amino Acid Substitutions

  • A priori

    • driven by amino acid properties

      • Size

      • Hydrophobicity

      • Charge

      • ...

  • Determined from examples (empirically, from observed substitutions)

PAM matrices

  • Point Accepted Mutation (also read as Percent Accepted Mutation): unit of evolutionary change for protein sequences [Dayhoff78].

  • A PAM unit is the amount of evolution that will on average change 1% of the amino acids within a protein sequence.

PAM matrices: Assumptions

  • Only substitutions are allowed (no insertions or deletions)

  • Sites evolve independently

  • Evolution at each site occurs according to a simple (“first-order”) Markov process

    • The next mutation depends only on the current state and is independent of previous mutations

  • Mutation probabilities are given by a substitution matrix $M = [m_{XY}]$, where $m_{XY} = \mathrm{Prob}(X \to Y \text{ mutation}) = \mathrm{Prob}(Y \mid X)$

The PAM Family

Define a family of substitution matrices — PAM 1, PAM 2, etc. — where PAM n is used to compare sequences at distance n PAM.

$\mathrm{PAM}\,n = (\mathrm{PAM}\,1)^n$

Do not confuse with scoring matrices!

Scoring matrices are derived from PAM matrices to yield log-odds scores.
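The two steps above, raising PAM 1 to the n-th power and then converting probabilities to log-odds scores, can be sketched with a toy matrix. Everything numeric here is an assumption for illustration: the three-letter alphabet, the values in PAM1_TOY and BACKGROUND_TOY, and the function names. The scaling 10 · log10 is one commonly described convention for Dayhoff-style scores; real matrices are 20 × 20 and are additionally rounded to integers.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def pam(pam1, n):
    """Return PAM n = (PAM 1)^n by repeated multiplication."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result

def log_odds(matrix, background):
    """Score s[x][y] = 10 * log10(Prob(Y|X) / f[y]), with f the background frequency."""
    size = len(matrix)
    return [[10 * math.log10(matrix[x][y] / background[y]) for y in range(size)]
            for x in range(size)]

# Hypothetical PAM 1 for a three-letter alphabet; row X holds Prob(Y|X).
PAM1_TOY = [[0.990, 0.007, 0.003],
            [0.006, 0.990, 0.004],
            [0.002, 0.008, 0.990]]
BACKGROUND_TOY = [0.3, 0.4, 0.3]  # assumed background frequencies

pam250 = pam(PAM1_TOY, 250)
for row in log_odds(pam250, BACKGROUND_TOY):
    print([round(score, 1) for score in row])
```

The probability matrix answers "how likely is X to have become Y after n PAM units", while the log-odds matrix answers "is this pairing more or less likely than chance", which is what an alignment program adds up.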

Generating PAM matrices

  • Idea: Find amino acid substitution statistics by comparing evolutionarily close sequences that are highly similar

    • Easier than for distant sequences, since only few insertions and deletions took place.

  • Computing PAM 1 (Dayhoff’s approach):

    • Start with highly similar aligned sequences, with known evolutionary trees (71 trees total).

    • Collect substitution statistics (1572 exchanges total).

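    • Let $m_{ij}$ = observed frequency (= estimated probability) of amino acid $A_i$ mutating into amino acid $A_j$ during one PAM unit

    • Result: a 20 × 20 real matrix; with this indexing each row adds up to 1 (Dayhoff's published table is laid out transposed, so there the columns add up to 1).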

Dayhoff’s PAM matrix

All entries $\times 10^4$

Calculate a substitution frequency matrix

PAM250 (log odds)

BLOSUM matrices

  • Blocks Substitution Matrix. Scores for each position are obtained from the frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff92].

  • For example BLOSUM62 is derived from sequence alignments with no more than 62% identity.

BLOSUM Scoring Matrices

  • BLOck SUbstitution Matrix

  • Based on comparisons of blocks of sequences derived from the Blocks database

  • The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment versus global alignment)

  • BLOSUM matrices are derived from blocks whose sequences share at most the percent identity given by the BLOSUM matrix number

Conserved blocks in alignments

Constructing BLOSUM r

  • To avoid bias in favor of a certain protein, first eliminate sequences that are more than r% identical

  • The elimination is done by either

    • removing sequences from the block, or

    • finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster.

  • BLOSUM r is the matrix built from blocks with no more than r% identity

    • E.g., BLOSUM62 is the matrix built using sequences with no more than 62% identity.

    • Note: BLOSUM 62 is the default matrix for protein BLAST
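Once redundant sequences have been removed, the scores themselves come from counting amino acid pairs in each column of a block and comparing the observed pair frequencies with those expected by chance. The sketch below illustrates only that counting step under stated assumptions: the four-sequence block is invented, the function name blosum_scores is hypothetical, the r% elimination step is taken as already done, and the scaling to half-bits (2 · log2, rounded) is the convention commonly described for BLOSUM.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_scores(block):
    """Log-odds scores from one ungapped block of equal-length sequences."""
    # Count every pair of residues that share a column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Frequency of each residue among all counted pairs.
    residue_freq = Counter()
    for (a, b), q in observed.items():
        if a == b:
            residue_freq[a] += q
        else:
            residue_freq[a] += q / 2
            residue_freq[b] += q / 2

    scores = {}
    for (a, b), q in observed.items():
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = round(2 * math.log2(q / expected))
    return scores

# Hypothetical block: two conserved columns and one column with a substitution.
TOY_BLOCK = ["ASA", "ASA", "ASA", "ASS"]
scores = blosum_scores(TOY_BLOCK)
for pair in sorted(scores):
    print(pair, scores[pair])
```

Pairs that never occur in the block get no score in this sketch; a real matrix is built from many thousands of blocks, so every pair is observed.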


  • PAM is based on an evolutionary model using phylogenetic trees

  • BLOSUM assumes no evolutionary model, but rather conserved “blocks” of proteins

Equivalent PAM and BLOSUM matrices (according to H)

  • PAM100 ==> BLOSUM90

  • PAM120 ==> BLOSUM80

  • PAM160 ==> BLOSUM60

  • PAM200 ==> BLOSUM52

  • PAM250 ==> BLOSUM45

PAM versus BLOSUM

Source: http://www.csit.fsu.edu/~swofford/bioinformatics_spring05/lectures/lecture03-blosum.pdf

Practice II


  • Advantages

    • Meta predictor

    • Uses many tools

      • BlastProDom

      • HMMPfam

      • HMMTigr

    • Returns all results for your analysis



  • http://www.ebi.ac.uk/InterProScan

Which Sequence

  • Go to Swiss-Prot

  • Select a sequence of interest

    • E.g.: a translocase

    • Should have some annotated function

  • Paste the FASTA sequence

  • Run InterProScan

InterProScan Results


  • First glance

    • The colored highlights over the sequence are domains

    • The more the merrier

    • None? New protein? Be happy!

  • Boxes refer to a record in InterPro database

  • The IPR… link summarizes the results


  • Look at the IPR summary, if any

  • Select Table: For all matching proteins

  • Select the FASTA option on the following page

  • Add your original sequence to the FASTA collection

  • Make an MSA
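Adding the original sequence to the downloaded FASTA collection can be done by hand in a text editor, or with a few lines of Python. The sketch below is a minimal example under stated assumptions: the records, headers and sequences are invented placeholders (not real InterPro matches), and parse_fasta and format_fasta are hypothetical helper names.

```python
def parse_fasta(text):
    """Read FASTA text into a dict that maps each header to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records

def format_fasta(records, width=60):
    """Write records back as FASTA text with wrapped sequence lines."""
    lines = []
    for header, sequence in records.items():
        lines.append(">" + header)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"

# Placeholder for the FASTA file downloaded from the results page.
DOWNLOADED = """>match_1 hypothetical matching protein
MKTAYIAK
QRQISFVK
>match_2 hypothetical matching protein
MKTAYLAKQR
"""

records = parse_fasta(DOWNLOADED)
records["query hypothetical original sequence"] = "MKSAYIAKQR"
print(format_fasta(records))
```

The printed text can be pasted into any multiple sequence alignment tool that accepts FASTA input.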

Protein Domains

  • Use the same sequence

    • Do the same analysis using NCBI CD server

    • http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

  • NCBI may have domains that InterProScan doesn’t have and vice versa

CD Server

See the difference

More to test

  • http://www.expasy.org/tools/#pattern

    • Try

      • Hits

      • HamapScan

      • SMART

      • ScanProsite (if time allows)