Bioinformatics

Lecture 6

Sequence Alignment

Dr. Aladdin Hamwieh
Khalid Al-shamaa
Abdulqader Jighly

Aleppo University
Faculty of Technical Engineering
Department of Biotechnology

2010-2011


Gene prediction: Methods

  • Gene prediction can be based upon:

    • Coding statistics (statistical approach)

    • Gene structure (statistical approach)

    • Comparison (similarity-based approach)

Alignment

  • Sequence alignment involves identifying the correct locations of the deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.

  • Dynamic programming is the standard approach to sequence alignment.

  • Global alignment: optimizes the overall similarity of the two sequences.

  • Local alignment: finds only relatively conserved subsequences.

  • Pairwise alignment: the alignment of two sequences.

  • Multiple alignment: the alignment of more than two sequences.


Methods of alignment:

  • Dot matrix

  • Distance Matrix


Dot Plot Algorithm

  • Take two sequences (A and B). Write sequence A out as a row (length = m) and sequence B as a column (length = n).

  • Create a table, or “matrix”, of m columns and n rows.

  • Compare each letter of sequence A with every letter of sequence B. If there is a match, mark the cell with a dot; if not, leave it blank.
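The three steps above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function names `dot_plot` and `render` are made up for the example, and the two input strings are the row and column sequences shown on the example slide.

```python
def dot_plot(seq_a, seq_b):
    """Return an n x m grid of booleans: rows follow seq_b, columns follow seq_a."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the grid as text: a dot for a match, a blank for no match."""
    grid = dot_plot(seq_a, seq_b)
    lines = ["  " + " ".join(seq_a)]  # header row: sequence A
    for b, row in zip(seq_b, grid):
        lines.append(b + " " + " ".join("." if hit else " " for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("ACDEFGHGG", "ACDEFGHGA"))
```

Identical stretches show up as an unbroken diagonal of dots, while repeated letters (the three G's here) add isolated off-diagonal dots.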


Dot Plot Algorithm

Figure: dot plot with the sequence A C D E F G H G G as the row and A C D E F G H G A as the column. Figure labels: Complete identity; X = Not Matched.



Advantages: Highlighting Information

  • The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.

  • The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.


Scoring Matrices

  • Scoring matrices are created based on biological evidence.

  • To generalize scoring for nucleotide sequences, consider a (4+1) x (4+1) scoring matrix δ.

  • For an amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1).

  • The additional 1 accounts for the score of a comparison with the gap character “-”.


Scoring Matrix Elements

Input: two sequences over the same alphabet

Output: an alignment of the two sequences

Example:

  • GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA

  • A possible alignment:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Three elements:

  • Perfect matches

  • Mismatches

  • Insertions & deletions (indels)


Scoring scheme

      A    G    C    T    -
A    +1   -1   -1   -1   -2
G    -1   +1   -1   -1   -2
C    -1   -1   +1   -1   -2
T    -1   -1   -1   +1   -2
-    -2   -2   -2   -2    *

Score each position independently:

  • Match: +1

  • Mismatch: -1

  • Indel: -2

The score of an alignment is the sum of its position scores.

Example:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Score: (+1 x 13) + (-1 x 2) + (-2 x 4) = 3

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Score: (+1 x 5) + (-1 x 6) + (-2 x 12) = -25
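Counting matches, mismatches and indels by hand is error-prone, so the column-by-column sum can be checked mechanically. The sketch below assumes the +1 / -1 / -2 scheme from this slide; the function name `score_alignment` and its parameters are illustrative, not part of the lecture.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, indel=-2, gap="-"):
    """Sum the per-column scores of two equal-length aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair two gaps")
        if a == gap or b == gap:
            total += indel      # one letter against a gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
print(score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--"))
```

The second call scores the poorer alignment of the same two sequences, which makes the gap between a good and a bad placement of indels easy to see.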


Transition and Transversion

  • Matrix example: matches score +3, transitions (A↔G, C↔T) score -1, and transversions score -2.

      A    C    G    T
A    +3   -2   -1   -2
C    -2   +3   -2   -1
G    -1   -2   +3   -2
T    -2   -1   -2   +3
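A matrix like this one does not have to be typed in by hand. A transition swaps two purines (A, G) or two pyrimidines (C, T), and every other substitution is a transversion, so the table can be generated from that rule. The sketch below is illustrative; `substitution_score` and its default values are assumptions chosen to match the numbers on the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=3, transition=-1, transversion=-2):
    """Score one nucleotide pair by its substitution class."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# Nested dictionary: matrix[row][column]
matrix = {x: {y: substitution_score(x, y) for y in "ACGT"} for x in "ACGT"}

for x in "ACGT":
    print(x, [matrix[x][y] for y in "ACGT"])
```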


The Global Alignment Problem

Find the best alignment between two strings under a given scoring scheme.

Input: strings v and w and a scoring scheme

Output: an alignment of maximum score

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma
\end{cases}
\]

μ: mismatch penalty

σ: indel penalty

In the alignment grid (v along one axis, w along the other), vertical and horizontal edges (↑, ←) have weight -σ; diagonal edges have weight +1 for a match and -μ for a mismatch.
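The recurrence for s(i,j) translates almost line for line into a table-filling loop. The sketch below is a minimal version under two assumptions that the slide does not state: the first row and column are initialised to multiples of -σ (the usual convention for global alignment), and the defaults μ = 1 and σ = 2 are borrowed from the earlier scoring scheme. The function name is illustrative.

```python
def global_alignment_score(v, w, mu=1, sigma=2):
    """Fill the dynamic programming table s and return s[len(v)][len(w)]."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]

    # Assumed boundary: a prefix aligned against nothing costs one indel per letter.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] - sigma
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] - sigma

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diagonal,               # match or mismatch
                          s[i - 1][j] - sigma,    # v[i-1] against a gap
                          s[i][j - 1] - sigma)    # w[j-1] against a gap
    return s[n][m]


print(global_alignment_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))
```

This returns only the optimal score; recovering the alignment itself would need a traceback through the table, which is omitted here.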


Longest Common Subsequences – Practice 1

  • Mismatches are not allowed (μ = ∞)

  • No indel penalties (σ = 0)

  • Matches are rewarded with +1

  • V = ATCTGAT

  • W = TGCATA


Longest Common Subsequences – Practice 2 to 9 (figure-only slides)

Longest Common Subsequences – Practice 10

  • Computing similarity: s(V,W) = 4

  • Computing distance: d(V,W) = n + m – 2 s(V,W) = 7 + 6 – 8 = 5

  • Alignment:

W: - T G C A T - A -

V: A T - C - T G A T
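The practice result can be reproduced with the same recurrence restricted to matches (+1) and free indels. The sketch below is illustrative only; the function names are made up, and it takes V = ATCTGAT and W = TGCATA, the six-letter W that the alignment rows and the distance of 5 imply.

```python
def lcs_length(v, w):
    """Length of the longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])      # indels cost nothing
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)  # match rewarded with +1
            s[i][j] = best
    return s[n][m]


def indel_distance(v, w):
    """d(V,W) = n + m - 2 s(V,W): letters left unmatched in either string."""
    return len(v) + len(w) - 2 * lcs_length(v, w)


print(lcs_length("ATCTGAT", "TGCATA"))      # 4
print(indel_distance("ATCTGAT", "TGCATA"))  # 5
```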


Protein Substitution Matrix

  • Identity Scoring Matrix

  • Percent Accepted Mutation (PAM)

  • Blocks Substitution Matrix (BLOSUM)


Identity Scoring Matrix


Percent Accepted Mutation (PAM)

  • 1 PAM is the amount of evolutionary change that yields, on average, one substitution per 100 amino acid residues.

  • The PAM250 matrix is optimized for sequences separated by 250 PAM, i.e. an average of 250 substitutions per 100 amino acids (a longer evolutionary time). Because the same site can be substituted more than once, such sequences still share some identical residues.

  • To derive a mutational probability matrix for a protein sequence that has undergone N percent accepted mutations (a PAM-N matrix), the PAM-1 matrix is raised to the N-th power, i.e. multiplied by itself repeatedly.

  • PAM250 is suitable for comparing distantly related sequences, while a lower PAM number is suitable for comparing more closely related sequences.
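The step from PAM-1 to PAM-N is a plain matrix power, which a toy example makes concrete. The sketch below uses a hypothetical 3-letter alphabet with invented probabilities (`TOY_PAM1` is not real PAM data); only the multiplication procedure mirrors the slide.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, n):
    """Return matrix raised to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical mutation probabilities: each letter stays unchanged with
# probability 0.99 per PAM unit (about 1 change per 100 positions).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after 250 steps, but the diagonal has dropped well below 0.99, which is why a high PAM number suits distant relatives.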


Selecting a PAM Matrix

  • Low PAM numbers: short sequences, strong local similarities.

  • High PAM numbers: long sequences, weak similarities.

    • PAM60 for close relations (60% identity)

    • PAM120 recommended for general use (40% identity)

    • PAM250 for distant relations (20% identity)

  • If uncertain, try several different matrices

    • PAM40, PAM120, PAM250 recommended.


A Better Matrix - PAM250


BLOSUM: Blocks Substitution Matrix

  • Based on the BLOCKS database

    • ~2000 blocks from 500 families of related proteins

    • Families of proteins with identical function

  • Blocks are short conserved patterns, 3-60 amino acids long, without gaps

  • Each block represents a sequence alignment with a different identity percentage

AABCDA...BBCDA

DABCDA.A.BBCBB

BBBCDABA.BCCAA

AAACDAC.DCBCDB

CCBADAB.DBBDCC

AAACAA...BBCCC


BLOSUM Matrices

  • For each block, the amino acid substitution rates were calculated to create the BLOSUM matrix

  • The different BLOSUMn matrices are calculated independently from BLOCKS

  • BLOSUMn is computed after clustering sequences that share at least n percent identity, so each cluster counts as a single sequence

  • BLOSUM62 represents closer sequences than BLOSUM45
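How pair counts in a block become matrix entries can be shown on a toy block. The sketch below assumes the usual log-odds form (observed pair frequency over the frequency expected by chance, in half-bit units) and leaves out the clustering step entirely; the block `TOY_BLOCK`, its two-letter alphabet and the function name are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(block):
    """Half-bit log-odds scores from the columns of an ungapped block."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())

    # Observed pair frequencies q and the residue frequencies p they imply.
    q = {pair: count / total for pair, count in pair_counts.items()}
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


TOY_BLOCK = ["AAB", "AAB", "ABB", "AAB"]  # four sequences, three columns
print(log_odds_scores(TOY_BLOCK))
```

In this toy block the rarer letter B gets a higher self-score than the common letter A, and the A/B pair, seen less often than chance predicts, comes out negative.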


Selecting a BLOSUM Matrix

  • For BLOSUMn, a higher n is suitable for sequences that are more similar

    • BLOSUM62 recommended for general use

    • BLOSUM80 for close relations

    • BLOSUM45 for distant relations


Equivalent PAM and BLOSUM Matrices

The following matrices are roughly equivalent:

  • PAM100 and BLOSUM90 (less divergent)

  • PAM120 and BLOSUM80

  • PAM160 and BLOSUM60

  • PAM200 and BLOSUM52

  • PAM250 and BLOSUM45 (more divergent)

Generally speaking:

  • The BLOSUM matrices are best for detecting local alignments.

  • The BLOSUM62 matrix is the best for detecting the majority of weak protein similarities.

  • The BLOSUM45 matrix is the best for detecting long and weak alignments.


BLOSUM62

  • Common amino acids have low weights

  • Rare amino acids have high weights

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X



Positive scores: more likely substitutions



Negative scores: less likely substitutions


Alignment score

The score of an ungapped alignment is the sum of the BLOSUM62 scores of its aligned pairs:

…PQG…
…PQG…
7 + 5 + 6 = 18

…PQG…
…PEG…
7 + 2 + 6 = 15

…PQG…
…PQA…
7 + 5 + 0 = 12
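The three sums above are simple table lookups. The sketch below copies only the five BLOSUM62 entries these examples need (P/P, Q/Q, G/G, Q/E and G/A, as listed in the matrix); `BLOSUM62_SUBSET` and the function names are illustrative, and a real program would load the full matrix instead.

```python
# Only the entries needed for the examples, keyed by alphabetically sorted pair.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("E", "Q"): 2,
    ("A", "G"): 0,
}


def pair_score(a, b):
    """Look up a symmetric substitution score."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def ungapped_score(x, y):
    """Sum the pair scores of two equal-length, ungapped segments."""
    if len(x) != len(y):
        raise ValueError("segments must have the same length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


print(ungapped_score("PQG", "PQG"))  # 18
print(ungapped_score("PQG", "PEG"))  # 15
print(ungapped_score("PQG", "PQA"))  # 12
```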


Affine Gap Penalties

  • In nature, a series of k indels often comes as a single event rather than a series of k single-nucleotide events:

ATA__GC
ATATTGC
This is more likely

ATAG_GC
AT_GTGC
This is less likely

  • Normal scoring would give the same score to both alignments


Accounting for Gaps

  • Gap: a contiguous sequence of spaces in one of the rows

  • The score for a gap of length x is:

    -(ρ + σx)

    where ρ > 0 is the gap opening penalty (the penalty for introducing a gap) and σ is the gap extension penalty.

  • ρ is large relative to σ, because extending an existing gap should not add too much of a penalty.
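The effect of the -(ρ + σx) rule is easiest to see on the two example alignments from the Affine Gap Penalties slide. The sketch below is an illustration with assumed values (match +1, mismatch -1, ρ = 4, σ = 1); the function name and the treatment of both "-" and "_" as gap characters are choices made for this example.

```python
def affine_score(row_a, row_b, match=1, mismatch=-1, rho=4, sigma=1, gaps="-_"):
    """Score an alignment where a gap of length x costs rho + sigma * x."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    open_in = None  # row holding the currently open gap: "a", "b" or None
    for a, b in zip(row_a, row_b):
        if a in gaps or b in gaps:
            side = "a" if a in gaps else "b"
            if side != open_in:
                total -= rho      # opening a new gap
                open_in = side
            total -= sigma        # every gap position pays the extension cost
        else:
            open_in = None
            total += match if a == b else mismatch
    return total


print(affine_score("ATA__GC", "ATATTGC"))  # one gap of length 2: -1
print(affine_score("ATAG_GC", "AT_GTGC"))  # two gaps of length 1: -5
```

Setting `rho=0` recovers the linear scheme, under which both alignments score the same.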


Multiple Sequence Alignment

  • All sequences are compared to each other (pairwise alignments)

  • A dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity (stored in a file).

  • The final multiple alignment is carried out, using the dendrogram as a guide.
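The first two steps (all-against-all comparison, then a dendrogram) can be sketched as follows. Everything specific here is an assumption made for illustration: the slide does not name a distance measure or a clustering rule, so the sketch uses the LCS-based indel distance and average linkage, and the sequence names are invented. The final progressive alignment step is not shown.

```python
from itertools import combinations


def lcs_length(v, w):
    """Length of the longest common subsequence (row-by-row table)."""
    prev = [0] * (len(w) + 1)
    for a in v:
        cur = [0]
        for j, b in enumerate(w, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def distance(v, w):
    """Indel distance: letters left unmatched in either sequence."""
    return len(v) + len(w) - 2 * lcs_length(v, w)


def guide_tree(seqs):
    """Average-linkage clustering; returns a nested tuple of sequence names."""
    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)
    pair_dist = {frozenset((a, b)): distance(seqs[a], seqs[b])
                 for a, b in combinations(seqs, 2)}

    def cluster_distance(x, y):
        total = sum(pair_dist[frozenset((a, b))] for a in x[1] for b in y[1])
        return total / (len(x[1]) * len(y[1]))

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c is not x and c is not y]
        clusters.append(((x[0], y[0]), x[1] + y[1]))
    return clusters[0][0]


example = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
print(guide_tree(example))
```

The two similar sequences are joined first, and the nested tuple gives the order in which a progressive aligner would merge them.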


Applications of multiple alignments


Thank you