# Bioinformatics

Lecture 6: Sequence Alignment. Bioinformatics. Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly. Aleppo University, Faculty of technical engineering, Department of Biotechnology. 2010-2011.

Lecture 6

Sequence Alignment

### Bioinformatics

Dr. Aladdin Hamwieh Khalid Al-shamaa

Abdulqader Jighly

Aleppo University

Faculty of technical engineering

Department of Biotechnology

2010-2011

Gene prediction: Methods
• Gene Prediction can be based upon:
• Coding statistics
• Gene structure
• Comparison

Statistical approach

Similarity-based approach

Alignment
• Sequence alignment involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.
• Dynamic programming is the standard approach to sequence alignment.
• Global alignment: optimizes the overall similarity of the two sequences.
• Local alignment: finds only relatively conserved subsequences.
• Pairwise alignment: the alignment of two sequences.
• Multiple alignment: the alignment of more than two sequences.
Methods of alignment:
• Dot matrix
• Distance Matrix
Dot Plot Algorithm
• Take two sequences (A & B), write sequence A out as a row (length=m) and sequence B as a column (length =n)
• Create a table or “matrix” of “m” columns and “n” rows
• Compare each letter of sequence A with every letter in sequence B. If there’s a match mark it with a dot, if not, leave blank
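The three steps above can be sketched in a few lines of Python. This is only an illustration: the function names `dot_plot` and `render` are made up for this example, and the two short sequences are arbitrary.

```python
def dot_plot(seq_a, seq_b):
    """Return an n x m grid of booleans.

    Columns follow seq_a (length m), rows follow seq_b (length n).
    A cell is True when the two letters match.
    """
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the grid as text: a dot for a match, a blank otherwise."""
    grid = dot_plot(seq_a, seq_b)
    lines = ["  " + " ".join(seq_a)]
    for letter, row in zip(seq_b, grid):
        cells = " ".join("." if hit else " " for hit in row)
        lines.append(letter + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Identical stretches show up as diagonal runs of dots.
    print(render("ACDEFGHGG", "ACDEFGHGA"))
```

Every letter of A is compared with every letter of B, so the work grows as m x n.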
Dot Plot Algorithm

[Figure: dot plot with the letters A C D E F G H G G along the top and A C D E F G H G A down the side. Labels: "Complete identity", "X", "Not Matched".]

The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.

Advantages:

Highlighting Information

The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.

Scoring Matrices
• Scoring matrices are created based on biological evidence.
• To generalize scoring for nucleotide sequences, consider a (4+1) x (4+1) scoring matrix δ.
• In the case of an amino acid sequence alignment, the scoring matrix would have size (20+1) x (20+1).
• The addition of 1 is to include the score for comparison of a gap character “-”.
Scoring Matrix Elements

Input: two sequences over the same alphabet

Output: an alignment of the two sequences

Example:

• GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
• A possible alignment:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Three elements:

• Perfect matches
• Mismatches
• Insertions & deletions (indel)
Scoring scheme

     A    G    C    T    -
A   +1   -1   -1   -1   -2
G   -1   +1   -1   -1   -2
C   -1   -1   +1   -1   -2
T   -1   -1   -1   +1   -2
-   -2   -2   -2   -2    *

Score each position independently:

• Match: +1
• Mismatch: -1
• Indel: -2

The score of an alignment is the sum of its position scores.

Example:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Score: (+1 x 13) + (-1 x 2) + (-2 x 4) = 3

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Score: (+1 x 5) + (-1 x 6) + (-2 x 12) = -25
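The position-by-position scoring can be checked with a short Python sketch. The function name `score_alignment` is illustrative; the default values are the +1 / -1 / -2 scheme above. Counting columns directly, the second alignment has 5 matches, 6 mismatches and 12 indel columns.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, indel=-2):
    """Sum the column scores of a gapped pairwise alignment ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += indel       # insertion or deletion
        elif a == b:
            total += match       # perfect match
        else:
            total += mismatch    # substitution
    return total


if __name__ == "__main__":
    print(score_alignment("-GCGC-ATGGATTGAGCGA",
                          "TGCGCCATTGAT-GACC-A"))       # 3
    print(score_alignment("------GCGCATGGATTGAGCGA",
                          "TGCGCC----ATTGATGACCA--"))   # -25
```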

Transition and Transversion
• Matrix example (transitions A↔G and C↔T score -1, transversions score -2):

     A    C    G    T
A   +3   -2   -1   -2
C   -2   +3   -2   -1
G   -1   -2   +3   -2
T   -2   -1   -2   +3
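The pattern in this matrix follows from the two base classes: a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), and a transversion crosses the two classes. Below is a minimal Python sketch that rebuilds the example matrix from that rule; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=3, transition=-1, transversion=-2):
    """Score one nucleotide pair using the example values from the matrix."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def build_matrix(alphabet="ACGT"):
    """Return the full matrix as a dict keyed by (row, column) letters."""
    return {(x, y): substitution_score(x, y) for x in alphabet for y in alphabet}


if __name__ == "__main__":
    matrix = build_matrix()
    for x in "ACGT":
        print(x, " ".join(f"{matrix[(x, y)]:+d}" for y in "ACGT"))
```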

The Global Alignment Problem

Find the best alignment between two strings under a given scoring schema

Input : Strings v and w and a scoring schema

Output : Alignment of maximum score

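$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma
\end{cases}
$$

μ : mismatch penalty

σ : indel penalty

Diagonal moves score +1 for a match and -μ for a mismatch; vertical (↑) and horizontal (←) moves score -σ.

[Figure: one cell of the alignment grid, between v_{i-1}, v_i of sequence V and w_{j-1}, w_j of sequence W.]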
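The recurrence for s_{i,j} can be filled in row by row. The sketch below is a minimal Python version that returns only the optimal score. The function name and the default penalties are illustrative, and the boundary values s_{i,0} = -iσ and s_{0,j} = -jσ are the usual convention, assumed here because the slide does not state them.

```python
def global_alignment_score(v, w, mu=1.0, sigma=2.0):
    """Best global alignment score: match +1, mismatch -mu, indel -sigma."""
    n, m = len(v), len(w)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: aligning a prefix against nothing costs one indel per letter.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] - sigma
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] - sigma
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(
                s[i - 1][j - 1] + step,   # diagonal: match or mismatch
                s[i - 1][j] - sigma,      # vertical: indel
                s[i][j - 1] - sigma,      # horizontal: indel
            )
    return s[n][m]


if __name__ == "__main__":
    print(global_alignment_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))
```

The table has (n+1) x (m+1) cells and each cell looks at three neighbours, so time and memory both grow as n x m.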

Longest Common Subsequences - Practice 1
• Mismatches are not allowed (μ = ∞)
• No indel penalty (σ = 0)
• Matches are rewarded with +1
• V = ATCTGAT
• W = TGCATA
Longest Common Subsequences - Practice 10
• Computing similarity: s(V,W) = 4
• Computing distance: d(V,W) = n + m - 2 s(V,W) = 7 + 6 - 8 = 5
• Alignment:

W: - T G C A T - A -
V: A T - C - T G A T
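With mismatches forbidden and indels free, the recurrence reduces to the classic longest-common-subsequence table. Below is a minimal Python sketch, taking W = TGCATA, the six letters that appear in the W row of the alignment; the function names are illustrative.

```python
def lcs_length(v, w):
    """Length of the longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1              # match: +1
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])    # free indel
    return s[n][m]


def indel_distance(v, w):
    """d(V,W) = n + m - 2 s(V,W): letters left unmatched in either sequence."""
    return len(v) + len(w) - 2 * lcs_length(v, w)


if __name__ == "__main__":
    print(lcs_length("ATCTGAT", "TGCATA"))      # 4
    print(indel_distance("ATCTGAT", "TGCATA"))  # 5
```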
Protein Substitution Matrix

Identity Scoring Matrix

Percent Accepted Mutation (PAM)

Blocks Substitution Matrix (BLOSUM)

Percent Accepted Mutation (PAM)
• 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues.
• The PAM250 matrix is optimized for sequences separated by 250 PAM, i.e. an average of 250 substitutions per 100 amino acids (a longer evolutionary time); a single site may change more than once.
• To derive a mutational probability matrix for a protein sequence that has undergone N percent accepted mutations (a PAM-N matrix), the PAM-1 matrix is raised to the N-th power, i.e. multiplied by itself N times.
• PAM250 is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences.
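The extrapolation from PAM-1 to PAM-N is plain matrix multiplication. The Python sketch below shows the mechanics on a made-up 3-letter alphabet: `TOY_PAM1` is invented for illustration and is not real PAM data, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Invented mutation probabilities: each row sums to 1 and roughly 1% of
# residues change per step. These numbers are NOT taken from any real matrix.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

if __name__ == "__main__":
    for n in (1, 120, 250):
        powered = mat_power(TOY_PAM1, n)
        print(n, [round(x, 3) for x in powered[0]])
```

As N grows, the probability of a residue staying unchanged falls, which is why high PAM numbers suit distantly related sequences.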
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local similarities.
• High PAM numbers: long sequences, weak similarities.
• PAM60 for close relations (60% identity)
• PAM120 recommended for general use (40% identity)
• PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
• PAM40, PAM120, PAM250 recommended.
BLOSUM: Blocks Substitution Matrix
• Based on the BLOCKS database
• ~2000 blocks from 500 families of related proteins
• Families of proteins with identical function
• Blocks are short conserved patterns, 3-60 amino acids long, without gaps
• Each block represents a sequence alignment with a different identity percentage

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

BLOSUM Matrices
• For each block, the amino acid substitution rates were calculated to create the BLOSUM matrix
• Different BLOSUMn matrices are calculated independently from BLOCKS
• BLOSUMn is built after clustering sequences that are at least n percent identical, so each such cluster counts as a single sequence
• BLOSUM62 represents closer sequences than BLOSUM45
Selecting a BLOSUM Matrix
• For BLOSUMn, a higher n is suitable for sequences that are more similar
• BLOSUM62 recommended for general use
• BLOSUM80 for close relations
• BLOSUM45 for distant relations

Equivalent PAM and Blosum matrices

The following matrices are roughly equivalent...

• PAM100 Blosum90
• PAM120 Blosum80
• PAM160 Blosum60
• PAM200 Blosum52
• PAM250 Blosum45

Generally speaking...
• The Blosum matrices are best for detecting local alignments.
• The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
• The Blosum45 matrix is the best for detecting long and weak alignments.

Less divergent

More divergent

Common amino acids have low weights

Rare amino acids have high weights

BLOSUM62

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Positive for more likely substitution

Negative for less likely substitution

alignment score

...PQG...
...PQG...    7 + 5 + 6 = 18

...PQG...
...PEG...    7 + 2 + 6 = 15

...PQG...
...PQA...    7 + 5 + 0 = 12

A higher score marks the more likely alignment; a lower score marks the less likely one.
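These three sums can be reproduced with a lookup into the matrix. The Python sketch below copies only the five BLOSUM62 entries the examples need (P-P, Q-Q, G-G, Q-E, G-A); the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are illustrative.

```python
# Only the entries used by the three examples, taken from the BLOSUM62 table.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("G", "A"): 0,
}


def pair_score(x, y, table=BLOSUM62_SUBSET):
    """Look up a symmetric substitution score; raises KeyError if it is absent."""
    if (x, y) in table:
        return table[(x, y)]
    return table[(y, x)]


def ungapped_score(seq_a, seq_b):
    """Sum the substitution scores of two equal-length, ungapped segments."""
    if len(seq_a) != len(seq_b):
        raise ValueError("segments must have equal length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


if __name__ == "__main__":
    print(ungapped_score("PQG", "PQG"))  # 18
    print(ungapped_score("PQG", "PEG"))  # 15
    print(ungapped_score("PQG", "PQA"))  # 12
```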

Affine Gap Penalties
• In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events:

ATA__GC
ATATTGC

ATAG_GC
AT_GTGC

Normal scoring would give the same score to both alignments.

Accounting for Gaps
• Gap: a contiguous sequence of spaces in one of the rows
• The score for a gap of length x is:

-(ρ + σx)

where ρ > 0 is the gap opening penalty (the penalty for introducing a gap) and σ is the gap extension penalty. ρ is large relative to σ because you do not want to add too much of a penalty for extending a gap.
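The effect of the gap score -(ρ + σx) shows up on the two small alignments ATA__GC / ATATTGC and ATAG_GC / AT_GTGC. The Python sketch below is illustrative only: the values ρ = 2 and σ = 1 and the +1 / -1 match and mismatch scores are assumptions, and `affine_score` is a made-up name.

```python
GAP_CHARS = "-_"


def affine_score(row_a, row_b, match=1, mismatch=-1, rho=2, sigma=1):
    """Score an alignment where a gap of length x costs rho + sigma * x."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    open_in = None  # row holding the gap run we are currently inside, if any
    for a, b in zip(row_a, row_b):
        if a in GAP_CHARS:
            gap_row = "a"
        elif b in GAP_CHARS:
            gap_row = "b"
        else:
            gap_row = None
        if gap_row is None:
            total += match if a == b else mismatch
        else:
            if gap_row != open_in:
                total -= rho      # a new gap starts: pay the opening penalty
            total -= sigma        # every gap column pays the extension penalty
        open_in = gap_row
    return total


if __name__ == "__main__":
    print(affine_score("ATA__GC", "ATATTGC"))  # one gap of length 2:  1
    print(affine_score("ATAG_GC", "AT_GTGC"))  # two gaps of length 1: -1
```

Setting `rho=0` recovers linear gap scoring, under which both alignments score the same.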

Multiple Sequence Alignment
• All sequences are compared to each other (pairwise alignments)
• A dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity (stored in a file).
• The final multiple alignment is carried out, using the dendrogram as a guide.
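The first two steps (all-against-all comparison, then a dendrogram) can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the pairwise distance is the indel distance derived from the longest common subsequence, clusters are joined by average linkage, and the names `pairwise_distance` and `guide_tree` are hypothetical. The final alignment step is not shown.

```python
from itertools import combinations


def pairwise_distance(v, w):
    """Assumed distance: n + m - 2 * LCS(v, w)."""
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])
    return len(v) + len(w) - 2 * s[-1][-1]


def guide_tree(seqs):
    """Build a nested-tuple dendrogram from a dict of name -> sequence."""
    dist = {frozenset(pair): pairwise_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def cluster_distance(x, y):
        values = [dist[frozenset((a, b))] for a in x[1] for b in y[1]]
        return sum(values) / len(values)  # average linkage (assumption)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


if __name__ == "__main__":
    example = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
    print(guide_tree(example))  # ('s3', ('s1', 's2'))
```

The most similar sequences are joined first, so in a progressive alignment they would also be aligned first.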