## PowerPoint Slideshow about ' Applied Bioinformatics' - soyala


Week 08

Applied Bioinformatics: Theory I

- Protein Sequences
- Protein Families
- Protein Domains
- Computer Learning
- Garbage in -> Garbage out

- Prediction based on learned Examples

Protein Sequence

- Primary Sequence consisting of 20 amino acids
- Secondary Structure consists of 3 types
- Helix – Strand – Coil

- Tertiary structure: combinations of secondary structures
- Unlimited number of combinations possible
- But only a limited number of motifs are found
- Architectures are built hierarchically

- Quaternary structure
- AKA protein-protein interactions; not part of this course

http://njms2.umdnj.edu/biochweb/education/bioweb/PreK2010/AminoAcids.htm

http://www.usermeds.com/medications/amino-acids

http://www.weightlossandnutritionsecrets.com/all-about-amino-acids/

PRIDE

- The PRIDE (PRoteomics IDEntifications) database is a centralized, standards-compliant, public data repository for proteomics data
- It contains experimental evidence for its entries
- http://www.ebi.ac.uk/pride/

Protein Sequences

- Swiss-Prot = the manually reviewed section of UniProtKB
- http://www.expasy.ch/sprot
- http://www.ebi.ac.uk/swissprot/

- As with GenBank for nucleotide sequences, we need a unique identifier for each protein sequence
- Let’s look at EBI now

UniProtKB

- The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. (KB: Knowledge Base)
- Often manually reviewed and annotated information

UniProt

Including splice variants and isoforms

Protein Information

Clicking on the member name (accession number) will provide detailed information about the protein


Machine Learning

- For example clustering
- UniRef90
- UniRef50

Learning

- Many Facts -> Rules/Knowledge
- Learning = Deducing rules from facts
- Computer/Machine learning?
- Same idea

Computer Learning

- Neural Networks
- Support Vector Machines
- Naive Bayes Classifiers
- Self Organizing Maps
- Decision Trees
- And many other algorithms

Data

- Training data needs to be chosen carefully
- Example: subcellular targeting of proteins

- What needs to be predicted?
- Localization
- Leader peptide cleavage site

- Where does the data come from?
- Best would be sequences validated by experimental results

- How many?
- Difficult to answer this one
- More is good, but rare events will not be learned well
- Better: curate the dataset manually, covering many possibilities without over-representing any of them

Data

- Yes! Preparing the dataset is crucial and takes most of the time
- Applying the learner will not take long
- All outcomes of the samples need to be known (target, cleavage site)
- Negative examples are just as important

- Divide the dataset into two parts
- One will be used for learning
- The other for validating the learned rules
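The split described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_dataset`, the 30% validation fraction, and the toy labels are assumptions, not part of the course material. The split is done per label so that negative examples stay represented in both parts.

```python
import random

def split_dataset(samples, validation_fraction=0.3, seed=0):
    """Split (sequence, label) pairs into a training and a validation set.

    The split is done separately for each label, so positives and
    negatives keep roughly the same proportions in both parts.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)

    training, validation = [], []
    for group in by_label.values():
        group = group[:]                 # do not modify the caller's list
        rng.shuffle(group)
        n_val = round(len(group) * validation_fraction)
        validation.extend(group[:n_val])
        training.extend(group[n_val:])
    return training, validation

# Hypothetical toy data: sequences and labels are made up for illustration.
toy = [("MKT", "chloroplast")] * 10 + [("MAS", "other")] * 10
train, val = split_dataset(toy)
```

With 10 samples per label and a fraction of 0.3, each label contributes 3 samples to the validation set and 7 to the training set.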

Validation

- The dataset can be automatically divided into different training and validation sets
- This can be performed many times and the best result (rule set) can later be used to predict new sequences
- That’s machine learning in brief
- We just touched the surface of it
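A minimal sketch of the repeated division into training and validation sets, assuming the learner is supplied as two hypothetical callables, `train_fn` and `score_fn`. The majority-label "learner" below is a stand-in to keep the example self-contained; it is not one of the algorithms from the course.

```python
import random
from collections import Counter

def repeated_validation(samples, train_fn, score_fn, rounds=10,
                        validation_fraction=0.3, seed=0):
    """Re-split the data several times and keep the best-scoring model."""
    rng = random.Random(seed)
    best_model, best_score = None, None
    for _ in range(rounds):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_val = max(1, round(len(shuffled) * validation_fraction))
        validation, training = shuffled[:n_val], shuffled[n_val:]
        model = train_fn(training)
        score = score_fn(model, validation)
        if best_score is None or score > best_score:
            best_model, best_score = model, score
    return best_model, best_score

# Stand-in learner (an assumption): always predict the most common label.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]

def accuracy(model, validation):
    return sum(label == model for _, label in validation) / len(validation)

# Hypothetical toy data with an over-represented class.
toy = [("SEQ", "A")] * 8 + [("SEQ", "B")] * 2
best_model, best_score = repeated_validation(toy, train_majority, accuracy)
```

Note that the score of the best of many splits tends to be optimistic; a separate set of sequences that was never used for selection gives a fairer estimate.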

Classification: General Idea

Practical Considerations

- You want to predict the subcellular target of a protein
- Which species are you working with?
- Which species did the training data come from?
- You can try a few known examples

- Read the publication
- How precise is the prediction?
- For localization
- For prediction of the leader peptide

- If possible, try different approaches

Clustering (Machine Learning)

- Basically same idea as in MSA
- Similar sequences are aligned first
- Similar datasets are clustered first

- The initial clusters are combined into super clusters (hierarchical clustering)
- Similar to forming a guide tree

- New measurements can be assigned to known clusters
- Information can be inferred
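The clustering idea above can be sketched as follows. This is a simplified illustration that assumes equal-length sequences, a mismatch-fraction distance, and single linkage; real tools use alignments and other linkage rules. All function names are hypothetical.

```python
def distance(a, b):
    """Fraction of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster_distance(c1, c2):
    """Single linkage: distance between the closest pair of members."""
    return min(distance(a, b) for a in c1 for b in c2)

def hierarchical_clustering(items):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]],
                                                         clusters[p[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

def assign(item, clusters):
    """Assign a new measurement to the nearest known cluster."""
    return min(clusters, key=lambda c: cluster_distance((item,), c))

# Toy sequences, made up for illustration.
merges = hierarchical_clustering(["AAAA", "AAAT", "GGGG", "GGGC"])
nearest = assign("AAAC", [("AAAA", "AAAT"), ("GGGG", "GGGC")])
```

The merge order plays the role of the guide tree: the most similar items are joined first, and the resulting clusters are then combined into super clusters.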

Protein Families

- Based on
- Clusters of protein sequences
- Domains (basically blocks of above)

- Many domains are annotated
- Good place to find these is
- http://www.ebi.ac.uk/InterProScan

Practice I

Protein Information

- In many cases we would like to get additional information about a protein
- Molecular mass
- pI
- Subcellular targeting

- http://www.expasy.org/tools
- Many calculations, etc for proteins
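As an illustration of what such a calculation involves, the molecular mass of a protein can be estimated by summing average residue masses and adding one water molecule for the free termini. The sketch below uses commonly tabulated average residue masses; the results of the Expasy tools may differ slightly depending on the mass table they use, and the function name is an assumption.

```python
# Commonly tabulated average residue masses in Da (amino acid minus water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one H2O for the free N- and C-terminus

def molecular_mass(sequence):
    """Average molecular mass in Da of an unmodified protein sequence.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS

mass_of_glycine = molecular_mass("G")
```

Post-translational modifications and cleaved leader peptides are not considered here, so the value for a mature protein can differ from the value for the full-length sequence.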

Tools at Expasy

- Prediction / characterization tools
- Pattern and Profile searches
- PTM predictions
- Topology Prediction
- Structure
- Primary (Analysis)
- Secondary (Prediction)
- Tertiary (Prediction, Analysis)

- …

Protein Information

Localization

- You want to predict the subcellular localization of a protein

Let’s tackle this problem

- Get a protein from Swiss-Prot
- O82533 (Gene: AtFtsZ2-1)

- Annotation: Chloroplast targeting
- Try a few prediction tools to see if you can confirm the annotation
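The sequence can also be retrieved programmatically before pasting it into the prediction tools. The sketch below assumes the current UniProt REST address pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, which is newer than the links used in these slides; the helper names are hypothetical.

```python
from urllib.request import urlopen

def fasta_url(accession):
    """Build the download URL (assumed UniProt REST address pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"

def parse_fasta(text):
    """Return (header, sequence) of the first record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break                    # a second record starts here
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)

def fetch_protein(accession):
    """Download one entry and parse it (needs network access)."""
    with urlopen(fasta_url(accession)) as response:
        return parse_fasta(response.read().decode("utf-8"))

# Example call for the protein used in this exercise:
# header, sequence = fetch_protein("O82533")
```

The call is left commented out because it needs a network connection; `parse_fasta` can be tried on any FASTA text copied from the entry page.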

Localization Prediction

- Choose tools from Expasy for example
- ChloroP
- SignalP
- Predotar

Theory II

- Substitution Matrices

First Substitution Matrices

- Substitution Matrices
- Sequence relationships may be hidden by changes in sequence
- Mutations
- Evolution

- Approximate matches are needed

Selectionist Model

- Some mutations are neutral
- Not disturbing the function much
- Not disturbing the structure much

- These accumulate over time (evolution)
- Some mutations are disruptive
- L <> Q
- Frameshift insertions or deletions

More Elaborate Matrices

- Format
- Table 20 × 20
- Probability of change for each combination
- Symmetric
- 190 distinct off-diagonal entries + 20 diagonal entries

- Examples
- Unitary
- GCM
- BLOSUM
- PAM
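The simplest of these examples, the unitary (identity) matrix, can be written down directly, and the same snippet confirms the entry count of a symmetric 20 × 20 table. The match and mismatch values of 1 and 0 are an assumption for illustration, as are the function names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def unitary_matrix(match=1, mismatch=0):
    """Identity scoring: every match scores the same, every mismatch too."""
    return {(a, b): (match if a == b else mismatch)
            for a in AMINO_ACIDS for b in AMINO_ACIDS}

def score(seq_a, seq_b, matrix):
    """Sum the matrix scores over the positions of an ungapped alignment."""
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))

matrix = unitary_matrix()
n = len(AMINO_ACIDS)
off_diagonal_pairs = n * (n - 1) // 2    # unordered pairs of different residues
diagonal_entries = n                     # each residue paired with itself
```

For 20 amino acids this gives 190 off-diagonal pairs plus 20 diagonal entries, that is 210 distinct values in a symmetric table.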

Genetic Code Matrix

- Considers the minimum number of base changes (0, 1, 2, 3) needed to convert one amino acid into another
- Are amino acids whose codons differ in only one base chemically very different?
- Not a very good matrix
- Mutation happens at the genetic level
- But selection acts at the protein level

- A priori
- Example
- Jukes Cantor Model
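A minimal sketch of how the entries of a genetic code matrix could be computed, assuming the standard genetic code: for each pair of amino acids, take the smallest number of differing bases over all pairs of their codons. The function name is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (and stop) to the list of codons that encode it.
CODONS = {}
for bases, amino_acid in zip(product(BASES, repeat=3), CODE):
    CODONS.setdefault(amino_acid, []).append("".join(bases))

def min_base_changes(aa1, aa2):
    """Minimum number of base changes (0 to 3) turning aa1 into aa2."""
    return min(
        sum(x != y for x, y in zip(codon1, codon2))
        for codon1 in CODONS[aa1]
        for codon2 in CODONS[aa2]
    )

leu_to_gln = min_base_changes("L", "Q")
```

The L <> Q exchange listed as disruptive under the selectionist model needs only a single base change (for example CTA to CAA), which illustrates why base-change counts alone are a poor guide to how well a substitution is tolerated.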

Amino Acid Substitutions

- A priori
- Driven by amino acid properties
- Size
- Hydrophobicity
- Charge
- ...

- Determined from examples

PAM matrices

- Point Accepted Mutation (often also expanded as Percent Accepted Mutation): unit of evolutionary change for protein sequences [Dayhoff78].
- A PAM unit is the amount of evolution that will on average change 1% of the amino acids within a protein sequence.

PAM matrices: Assumptions

- Only point mutations (substitutions) are allowed; no insertions or deletions
- Sites evolve independently
- Evolution at each site occurs according to a simple (“first-order”) Markov process
- Next mutation depends only on current state and is independent of previous mutations

- Mutation probabilities are given by a substitution matrix $M = [m_{XY}]$, where $m_{XY} = \mathrm{Prob}(X \to Y \text{ mutation}) = \mathrm{Prob}(Y \mid X)$

The PAM Family

Define a family of substitution matrices — PAM 1, PAM 2, etc. — where PAM n is used to compare sequences at distance n PAM.

$\mathrm{PAM}\,n = (\mathrm{PAM}\,1)^n$

Do not confuse with scoring matrices!

Scoring matrices are derived from PAM matrices to yield log-odds scores.
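Both steps can be illustrated with a toy example: raising a PAM 1 probability matrix to the n-th power, then turning the result into log-odds scores. The two-letter alphabet, its probabilities, and the background frequencies are made up, and the 10 · log10 scaling is an assumption in the style of Dayhoff-type scoring matrices.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Return m multiplied by itself n times (the identity for n = 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

def log_odds(m, background):
    """Score = 10 * log10(P(i -> j) / background frequency of j)."""
    return [[10 * math.log10(m[i][j] / background[j]) for j in range(len(m))]
            for i in range(len(m))]

# Toy two-letter alphabet (made up): row i holds P(i -> j), rows sum to 1.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
background = [0.5, 0.5]

pam250 = mat_power(pam1, 250)
scores250 = log_odds(pam250, background)
```

In this toy case the probability matrix approaches the background frequencies as n grows, so the log-odds scores shrink towards zero, while matches still score above zero and mismatches below.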

Generating PAM matrices

- Idea: Find amino acid substitution statistics by comparing evolutionarily close sequences that are highly similar
- Easier than for distant sequences, since only a few insertions and deletions took place.

- Computing PAM 1 (Dayhoff’s approach):
- Start with highly similar aligned sequences, with known evolutionary trees (71 trees total).
- Collect substitution statistics (1572 exchanges total).
- Let $m_{ij}$ = observed frequency (= estimated probability) of amino acid $A_i$ mutating into amino acid $A_j$ during one PAM unit
- Result: a 20 × 20 real matrix where each row adds up to 1 (Dayhoff’s published table is laid out as the transpose, so there the columns add up to 1).
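A simplified sketch of the normalization step, not Dayhoff's full procedure: exchange counts are converted into per-residue rates and scaled so that, on average, 1% of positions change. The three-letter alphabet and all counts are made up, and the function name is hypothetical. With $m_{ij}$ indexed as $A_i$ mutating into $A_j$, each row of the result sums to 1.

```python
def pam1_from_counts(exchange_counts, occurrences):
    """Estimate a 1-PAM probability matrix from toy substitution counts.

    exchange_counts[i][j]: observed exchanges between residues i and j
                           (symmetric, zero diagonal).
    occurrences[i]:        how often residue i occurs in the alignments.
    """
    n = len(occurrences)
    total = sum(occurrences)
    freqs = [count / total for count in occurrences]

    # Unscaled rates: exchanges per occurrence of the original residue i.
    rate = [[exchange_counts[i][j] / occurrences[i] if i != j else 0.0
             for j in range(n)] for i in range(n)]

    # Scale so that the expected fraction of changed positions is 1%.
    expected_change = sum(freqs[i] * sum(rate[i]) for i in range(n))
    scale = 0.01 / expected_change

    matrix = [[rate[i][j] * scale for j in range(n)] for i in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0 - sum(matrix[i])   # probability of no change
    return matrix, freqs

# Made-up counts for a three-letter alphabet.
counts = [[0, 2, 1],
          [2, 0, 3],
          [1, 3, 0]]
pam1, freqs = pam1_from_counts(counts, [100, 200, 300])
```

The diagonal carries almost all of the probability, as expected for sequences that are only one PAM unit apart.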

Dayhoff’s PAM matrix

All entries $\times 10^4$

Calculate a substitution frequency matrix

PAM250 (log odds)

BLOSUM matrices

- Blocks Substitution Matrix. Scores for each position are obtained from frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff92].
- For example BLOSUM62 is derived from sequence alignments with no more than 62% identity.

BLOSUM Scoring Matrices

- BLOcks SUbstitution Matrix
- Based on comparisons of blocks of sequences derived from the Blocks database
- The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment versus global alignment)
- BLOSUM matrices are derived from blocks whose percent identity threshold corresponds to the BLOSUM matrix number

Conserved blocks in alignments

AABCDA...BBCDA

DABCDA.A.BBCBB

BBBCDABA.BCCAA

AAACDAC.DCBCDB

CCBADAB.DBBDCC

AAACAA...BBCCC

Constructing BLOSUM r

- To avoid bias in favor of a certain protein, first eliminate sequences that are more than r% identical
- The elimination is done by either
- removing sequences from the block, or
- finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster.

- BLOSUM r is the matrix built from blocks with no more than r% identity
- E.g., BLOSUM62 is the matrix built using sequences with no more than 62% identity.
- Note: BLOSUM62 is the default matrix for protein BLAST
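The first elimination option (removing sequences from the block) can be sketched as a greedy filter on percent identity. This is an illustration only: the toy block is made up, the function names are hypothetical, and the published BLOSUM procedure uses clustering rather than plain removal.

```python
def percent_identity(a, b):
    """Percentage of identical positions in two ungapped, equal-length segments."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def filter_block(block, r):
    """Keep a segment only if it is at most r% identical to all kept ones."""
    kept = []
    for segment in block:
        if all(percent_identity(segment, other) <= r for other in kept):
            kept.append(segment)
    return kept

# Made-up block of aligned, ungapped segments.
block = ["AVLKG", "AVLKA", "TVIRG", "SMIRD"]
kept_62 = filter_block(block, 62)
kept_30 = filter_block(block, 30)
```

Lowering r removes more segments, so matrices with a low number are built from more divergent sequences and suit more distant comparisons.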

Comparison

- PAM is based on an evolutionary model using phylogenetic trees
- BLOSUM assumes no evolutionary model, but rather conserved “blocks” of proteins

Equivalent PAM and BLOSUM matrices (according to H)

- PAM100 ==> BLOSUM90
- PAM120 ==> BLOSUM80
- PAM160 ==> BLOSUM60
- PAM200 ==> BLOSUM52
- PAM250 ==> BLOSUM45

PAM versus BLOSUM

Source: http://www.csit.fsu.edu/~swofford/bioinformatics_spring05/lectures/lecture03-blosum.pdf

Practice II

InterProScan

- Advantages
- Meta predictor
- Uses many tools
- BlastProDom
- HMMPfam
- HMMTigr
- …

- Returns all results for your analysis

http://genome.cshlp.org/content/12/1/47/F1.expansion

InterProScan

- http://www.ebi.ac.uk/InterProScan

Which Sequence

- Go to Swiss-Prot
- Select a sequence of interest
- E.g.: a translocase
- Should have some annotated function

- Paste the FASTA sequence
- Run InterProScan

InterProScan Results

So?

- First glance
- The colored highlights over the sequence are domains
- The more the merrier
- None? New protein? Be happy!

- Boxes refer to a record in InterPro database
- The IPR… link summarizes the results

Summary

- Look at the IPR summary, if any
- Select Table: For all matching proteins
- Select the FASTA option on the following page
- Add your original sequence to the FASTA collection
- Make an MSA

Protein Domains

- Use the same sequence
- Do the same analysis using the NCBI CD server
- http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

- NCBI may have domains that InterProScan doesn’t have and vice versa

CD Server

See the difference

More to test

- http://www.expasy.org/tools/#pattern
- Try
- Hits
- HamapScan
- SMART
- ScanProsite (if time allows)
