Bioinformatics


Lecture 6

Sequence Alignment

Bioinformatics

Dr. Aladdin Hamwieh

Khalid Al-shamaa

Abdulqader Jighly

Aleppo University

Faculty of Technical Engineering

Department of Biotechnology

2010-2011


Gene prediction: Methods

  • Gene prediction can be based upon:

    • Coding statistics (statistical approach)

    • Gene structure (statistical approach)

    • Comparison (similarity-based approach)



Alignment

  • Sequence alignment involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.

  • Dynamic programming is the standard approach to sequence alignment

  • Global alignment: optimize the overall similarity of the two sequences

  • Local alignment: find only relatively conserved subsequences

  • Pairwise alignment: the alignment of two sequences

  • Multiple alignment: the alignment of more than two sequences


Methods of alignment:

  • Dot matrix

  • Distance Matrix


Dot Plot Algorithm

  • Take two sequences (A and B); write sequence A out as a row (length = m) and sequence B as a column (length = n)

  • Create a table or “matrix” of “m” columns and “n” rows

  • Compare each letter of sequence A with every letter of sequence B. If there is a match, mark the cell with a dot; if not, leave it blank
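The three steps above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function names `dot_plot` and `render` and the two example sequences are assumptions made for the demonstration.

```python
def dot_plot(seq_a, seq_b):
    """Return the dot matrix as a list of strings, one string per letter of seq_b.

    seq_a runs along the columns (length m) and seq_b along the rows
    (length n). A '.' marks a match and a space marks a mismatch.
    """
    return ["".join("." if a == b else " " for a in seq_a) for b in seq_b]


def render(seq_a, seq_b):
    """Format the dot matrix with the sequence letters as row and column labels."""
    lines = ["  " + " ".join(seq_a)]
    for letter, row in zip(seq_b, dot_plot(seq_a, seq_b)):
        lines.append(letter + " " + " ".join(row))
    return "\n".join(lines)


# Hypothetical example sequences chosen for this sketch.
print(render("ACDEFGHGG", "ACDEFGHGA"))
```

Identical regions show up as runs of dots along a diagonal, which is why the plot makes insertions, deletions and repeats easy to spot by eye.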


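Dot Plot Algorithm

Figure: dot plot example. Sequence along the row: A C D E F G H G G. Sequence along the column: A C D E F G H G A. Legend entries: "Complete identity" and "X: Not Matched".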


Dot Plots & Internal Repeats


Advantages: Highlighting Information

  • The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.

  • The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.


Scoring Matrices

  • Scoring matrices are created based on biological evidence.

  • To generalize scoring for nucleotide sequences, consider a (4+1) x (4+1) scoring matrix δ.

  • In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size.

  • The addition of 1 is to include the score for comparison of a gap character “-”.


Scoring Matrix Elements

Input: two sequences over the same alphabet

Output: an alignment of the two sequences

Example:

  • GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA

  • A possible alignment:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Three elements:

  • Perfect matches

  • Mismatches

  • Insertions & deletions (indel)


Scoring scheme

      A    G    C    T    -
A    +1   -1   -1   -1   -2
G    -1   +1   -1   -1   -2
C    -1   -1   +1   -1   -2
T    -1   -1   -1   +1   -2
-    -2   -2   -2   -2    *

Score each position independently:

  • Match: +1

  • Mismatch: -1

  • Indel: -2

    The score of an alignment is the sum of its position scores.

Example:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Score: (+1 x 13) + (-1 x 2) + (-2 x 4) = 3

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Score: (+1 x 5) + (-1 x 6) + (-2 x 12) = -25
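The column-by-column scoring can be checked with a short Python sketch. The function name `score_alignment` and its parameter names are assumptions for illustration; the default values are the +1 / -1 / -2 scheme stated above. Counting the columns of the second alignment as printed gives 12 indel columns (6 gaps in each row), so the sketch reports -25 for it.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, indel=-2, gap="-"):
    """Sum the per-column scores of two aligned rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair two gap characters")
        if a == gap or b == gap:
            total += indel      # insertion or deletion
        elif a == b:
            total += match      # perfect match
        else:
            total += mismatch   # substitution
    return total


first = score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
second = score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print(first, second)  # 3 -25
```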


Transition and Transversion

  • Matrix example: transitions (A↔G, C↔T) are penalized less (-1) than transversions (-2)

      A    C    G    T
A    +3   -2   -1   -2
C    -2   +3   -2   -1
G    -1   -2   +3   -2
T    -2   -1   -2   +3
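A matrix like the one above can be generated from the purine/pyrimidine classes instead of being typed in by hand. The sketch below is an illustration only; the names `substitution_score` and `build_matrix` are assumptions, and the default scores are simply the values in the example matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=3, transition=-1, transversion=-2):
    """Score a nucleotide pair, penalizing transversions more than transitions."""
    if x == y:
        return match
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def build_matrix(alphabet="ACGT"):
    """Build the full scoring matrix as a dict of dicts."""
    return {x: {y: substitution_score(x, y) for y in alphabet} for x in alphabet}


matrix = build_matrix()
print(matrix["A"]["G"], matrix["A"]["C"])  # -1 -2
```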


The Global Alignment Problem

Find the best alignment between two strings under a given scoring schema

Input : Strings v and w and a scoring schema

Output : Alignment of maximum score

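In the alignment grid (V along the rows, W along the columns), vertical and horizontal edges (↑, ←) score -σ; a diagonal edge scores +1 for a match and -μ for a mismatch.

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma
\end{cases}
\]

μ : mismatch penalty

σ : indel penalty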
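The recurrence can be turned into a small dynamic programming routine. The following is a minimal sketch under stated assumptions, not the lecture's implementation: the function name `global_alignment`, the boundary initialization (a prefix aligned entirely against gaps costs σ per letter) and the default penalties μ = 1, σ = 2 are choices made here for illustration.

```python
def global_alignment(v, w, match=1, mu=1, sigma=2):
    """Global alignment by dynamic programming with a linear indel penalty.

    mu is the mismatch penalty and sigma the indel penalty; both are
    subtracted from the score. Returns (score, aligned_v, aligned_w).
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: v prefix against gaps
        s[i][0] = -sigma * i
    for j in range(1, m + 1):          # first row: w prefix against gaps
        s[0][j] = -sigma * j

    def diagonal(i, j):
        return s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else -mu)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(diagonal(i, j), s[i - 1][j] - sigma, s[i][j - 1] - sigma)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == diagonal(i, j):
            top.append(v[i - 1]); bottom.append(w[j - 1])
            i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, a, b = global_alignment("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA")
print(score)
print(a)
print(b)
```

When several alignments share the maximum score, the traceback returns only one of them; which one depends on the order in which the three moves are tested.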


Longest Common Subsequences – Practice 1

  • Mismatches are not allowed (μ = ∞)

  • No indel penalty (σ = 0)

  • Matches are rewarded with +1

  • V = ATCTGAT

  • W = TGCATA



Longest Common Subsequences – Practice 10

  • Computing similarity: s(V,W) = 4

  • Computing distance: d(V,W) = n + m – 2 s(V,W) = 5, where n and m are the lengths of V and W

  • Alignment:

W: – T G C A T – A –

V: A T – C – T G A T
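The practice result can be reproduced with a short Python sketch of the longest common subsequence recurrence (matches +1, no mismatches, free indels). The function names are assumptions for illustration. The example takes W = TGCATA, the six letters that appear in the W row of the alignment, which is also the length that makes d(V,W) = 7 + 6 – 2·4 = 5.

```python
def lcs_length(v, w):
    """Length of the longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])       # indel, no penalty
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)  # match, +1
            s[i][j] = best
    return s[n][m]


def indel_distance(v, w):
    """Number of indels in an optimal alignment: n + m - 2 * s(V, W)."""
    return len(v) + len(w) - 2 * lcs_length(v, w)


print(lcs_length("ATCTGAT", "TGCATA"))      # 4
print(indel_distance("ATCTGAT", "TGCATA"))  # 5
```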


Protein Substitution Matrix

Identity Scoring Matrix

Percent Accepted Mutation (PAM)

Blocks Substitution Matrix (BLOSUM)


Identity Scoring Matrix


Percent Accepted Mutation (PAM)

  • 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues.

  • The PAM250 matrix is optimized for sequences separated by 250 PAM, i.e. 250 substitutions per 100 amino acids (a longer evolutionary time). Because the same position can change more than once, such sequences still share some identical residues.

  • To derive a mutational probability matrix for a protein sequence that has undergone N percent accepted mutations (a PAM-N matrix), the PAM-1 matrix is raised to the N-th power, i.e. N copies of it are multiplied together

  • PAM250 is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences.
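The extrapolation from PAM-1 to PAM-N is ordinary matrix exponentiation, which the sketch below illustrates. The 3 x 3 matrix `TOY_PAM1` is a made-up example over a hypothetical three-letter alphabet, not real PAM data (the real PAM-1 is a 20 x 20 matrix), and the helper names are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in matrix]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical "PAM-1" for a three-letter alphabet: every row sums to 1 and
# about 99% of the probability stays on the diagonal (no change).
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

In the toy result the diagonal entries are much smaller than in the starting matrix, which mirrors why a high PAM number suits distantly related sequences. The matrices used for scoring alignments are log-odds scores derived from such probabilities, a step this sketch does not cover.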


Selecting a PAM Matrix

  • Low PAM numbers: short sequences, strong local similarities.

  • High PAM numbers: long sequences, weak similarities.

    • PAM60 for close relations (60% identity)

    • PAM120 recommended for general use (40% identity)

    • PAM250 for distant relations (20% identity)

  • If uncertain, try several different matrices

    • PAM40, PAM120, PAM250 recommended.


A Better Matrix - PAM250


BLOSUM: Blocks Substitution Matrix

  • Based on BLOCKS database

    • ~2000 blocks from 500 families of related proteins

    • Families of proteins with identical function

  • Blocks are short conserved patterns, 3 to 60 amino acids long, without gaps

  • Each block represents a sequence alignment with a different identity percentage

AABCDA … BBCDA

DABCDA. A. BBCBB

BBBCDABA.BCCAA

AAACDAC.DCBCDB

CCBADAB.DBBDCC

AAACAA … BBCCC


BLOSUM Matrices

  • For each block, the amino acid substitution rates were calculated to create the BLOSUM matrix

  • Different BLOSUMn matrices are calculated independently from BLOCKS

  • BLOSUMn is built after clustering sequences that share at least n percent identity, so that each cluster counts as a single sequence

  • BLOSUM62 represents closer sequences than BLOSUM45


Selecting a BLOSUM Matrix

  • For BLOSUMn, a higher n is suitable for sequences that are more similar

    • BLOSUM62 recommended for general use

    • BLOSUM80 for close relations

    • BLOSUM45 for distant relations


Equivalent PAM and BLOSUM matrices

The following matrices are roughly equivalent, ordered from less divergent to more divergent:

  • PAM100 and BLOSUM90

  • PAM120 and BLOSUM80

  • PAM160 and BLOSUM60

  • PAM200 and BLOSUM52

  • PAM250 and BLOSUM45

Generally speaking:

  • The BLOSUM matrices are best for detecting local alignments.

  • The BLOSUM62 matrix is the best for detecting the majority of weak protein similarities.

  • The BLOSUM45 matrix is the best for detecting long and weak alignments.


BLOSUM62

Common amino acids have low weights

Rare amino acids have high weights

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X



Positive for more likely substitution



Negative for less likely substitution


alignment score


…PQG…

…PQG…

Score: 7 + 5 + 6 = 18

…PQG…

…PEG…

Score: 7 + 2 + 6 = 15

…PQG…

…PQA…

Score: 7 + 5 + 0 = 12
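The three sums can be reproduced by looking up each aligned pair in the matrix. The sketch below stores only the five BLOSUM62 entries these examples need (P/P = 7, Q/Q = 5, G/G = 6, Q/E = 2, G/A = 0); the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are assumptions for illustration.

```python
# Only the BLOSUM62 entries used by the three examples.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("G", "A"): 0,
}


def pair_score(a, b, table=BLOSUM62_SUBSET):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


def ungapped_score(seq1, seq2):
    """Score two aligned segments of equal length without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("segments must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(ungapped_score("PQG", "PQG"))  # 18
print(ungapped_score("PQG", "PEG"))  # 15
print(ungapped_score("PQG", "PQA"))  # 12
```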


Affine Gap Penalties

  • In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events:

ATA__GC

ATATTGC

This is more likely.

ATAG_GC

AT_GTGC

This is less likely.

Normal scoring would give the same score for both alignments.


Accounting for Gaps

  • Gap: a contiguous sequence of spaces in one of the rows

  • The score for a gap of length x is:

    -(ρ + σx)

    where ρ > 0 is the gap opening penalty (the penalty for introducing a gap) and σ is the gap extension penalty (the penalty for each position in the gap).

    ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap.
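A short sketch shows how the formula -(ρ + σx) separates one long gap from several short ones. The function names are assumptions, and the values ρ = 4, σ = 1 are arbitrary illustrative choices, not values from the lecture. Setting ρ = 0 recovers linear scoring, under which both alignments get the same gap score.

```python
import re


def gap_lengths(row, gap="_"):
    """Lengths of the maximal runs of gap characters in one aligned row."""
    return [len(run) for run in re.findall(re.escape(gap) + "+", row)]


def total_gap_score(row1, row2, rho, sigma, gap="_"):
    """Sum of -(rho + sigma * x) over every gap of length x in both rows."""
    runs = gap_lengths(row1, gap) + gap_lengths(row2, gap)
    return -sum(rho + sigma * x for x in runs)


single_event = ("ATA__GC", "ATATTGC")   # one gap of length 2
two_events = ("ATAG_GC", "AT_GTGC")     # two gaps of length 1

# Linear scoring (rho = 0, sigma = 2): both alignments are penalized equally.
print(total_gap_score(*single_event, rho=0, sigma=2))  # -4
print(total_gap_score(*two_events, rho=0, sigma=2))    # -4

# Affine scoring with illustrative values rho = 4, sigma = 1.
print(total_gap_score(*single_event, rho=4, sigma=1))  # -6
print(total_gap_score(*two_events, rho=4, sigma=1))    # -10
```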


Multiple Sequence Alignment

  • All sequences are compared to each other (pairwise alignments)

  • A dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity (stored in a file).

  • The final multiple alignment is carried out, using the dendrogram as a guide.


Applications of multiple alignments


Thank you

