Profiles and multiple sequence alignments

Profiles and multiple Sequence alignments. Understanding Bioinformatics 9 th KIAS winter school Lee, Juyong. Contents. Defining profile PSSM by PSI-BLAST Profile HMM Aligning profiles PSSM & Profile HMM Generate multiple sequence alignment Progressive Other methods. What is Profile?.

Presentation Transcript

Profiles and multiple Sequence alignments

Understanding Bioinformatics

9th KIAS winter school

Lee, Juyong


Contents

  • Defining profile

    • PSSM by PSI-BLAST

    • Profile HMM

  • Aligning profiles

    • PSSM & Profile HMM

  • Generate multiple sequence alignment

    • Progressive

    • Other methods


What is Profile?

  • Represent general properties of the set of sequences

    • A set of sequences contains more information than a single sequence

    • Environment is being considered

  • Two types

    • Position Specific Scoring Matrix

    • Profile Hidden Markov Model



Position specific scoring matrix

$> blastpgp -b 0 -j 3 -h 0.001 -d myDB -i mySEQ.fasta -Q myPSSM.mtx -o myMSA.bla


A set of sequences has more information

Two sequences alone:

K-IAS-

KAI-ST

Are K, I and S meaningful?

Are A and T meaningless?

A set of related sequences:

K-IAS--

KAI-ST-

K-I-ST-

KRISS--

K-I-STI

K, I and S are highly conserved!

T at the sixth column is also conserved

2nd and 4th columns do not show preference
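The conservation argument can be checked by counting residues per column. The following is a minimal Python sketch; the helper name `column_summary` is hypothetical, and ignoring gaps when counting is an assumption made for illustration.

```python
from collections import Counter

# The five aligned sequences from the slide; '-' marks a gap.
msa = ["K-IAS--", "KAI-ST-", "K-I-ST-", "KRISS--", "K-I-STI"]

def column_summary(msa):
    """Return (most common residue, its count) for every column, ignoring gaps."""
    summary = []
    for column in zip(*msa):
        residues = Counter(c for c in column if c != "-")
        summary.append(residues.most_common(1)[0] if residues else ("-", 0))
    return summary

for position, (residue, count) in enumerate(column_summary(msa), start=1):
    print(position, residue, f"{count}/{len(msa)}")
```

Columns 1, 3 and 5 come out fully conserved (K, I, S), column 6 shows T in three of the five sequences, and columns 2 and 4 have no residue seen more than once.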


Generating PSSM

Multiple sequence alignment

Log-odds score m of amino acid a at position u

If a is not observed at position u, m → -∞

Not good!

Lack of information must be handled!
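A common way to write this log-odds score is sketched here. The symbols q_{u,a} (estimated probability of amino acid a in column u), p_a (background frequency of a), n_{u,a} (observed count) and N (number of sequences) are notation assumed for illustration.

```latex
m_{u,a} = \log \frac{q_{u,a}}{p_a},
\qquad
q_{u,a} = \frac{n_{u,a}}{N}
\quad\Longrightarrow\quad
n_{u,a} = 0 \;\Rightarrow\; m_{u,a} \to -\infty
```

With raw fractions, a single unobserved amino acid gives a score of minus infinity, which would forbid that residue at that position in any future alignment.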


Generating PSSM (2): Pseudo-counts

Fraction of amino acid a at position u

Distribution of amino acid a

α and β are scaling parameters
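A minimal Python sketch of pseudo-count smoothing follows. It assumes the mixing form q = (α f + β p) / (α + β), a uniform background distribution p, and illustrative values for α and β; none of these choices are taken from the slide, and `pssm_column` is a hypothetical helper.

```python
import math
from collections import Counter

msa = ["K-IAS--", "KAI-ST-", "K-I-ST-", "KRISS--", "K-I-STI"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; real tools use database-wide frequencies.
background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}

def pssm_column(column, alpha, beta):
    """Log-odds scores for one column, mixing observed fractions with the background."""
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    n = len(residues) or 1                      # avoid dividing by zero on all-gap columns
    scores = {}
    for a in AMINO_ACIDS:
        f = counts[a] / n                       # observed fraction of a in this column
        q = (alpha * f + beta * background[a]) / (alpha + beta)
        scores[a] = math.log(q / background[a]) if q > 0 else float("-inf")
    return scores

first_column = [seq[0] for seq in msa]
raw = pssm_column(first_column, alpha=1.0, beta=0.0)        # no pseudo-counts
smoothed = pssm_column(first_column, alpha=4.0, beta=10.0)  # with pseudo-counts
print(raw["K"], raw["A"])            # A is unobserved, so its raw score is -inf
print(smoothed["K"], smoothed["A"])  # every amino acid now has a finite score
```

A larger β pulls the column toward the background distribution, which is appropriate when only a few sequences are available.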


Generating PSSM (3): More realistic pseudo-counts

Use substitution matrix information rather than random alignment!

Pseudo-count of amino acid a

F: frequency of amino acid b at position u

Formula used in PSI-BLAST
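The PSI-BLAST pseudo-count scheme is usually written as follows. The notation is assumed for illustration: f_{u,b} is the observed frequency of amino acid b in column u, p_b its background frequency, t_{ab} the target frequency of the pair (a, b) implied by the substitution matrix, g_{u,a} the pseudo-count and q_{u,a} the final column estimate.

```latex
g_{u,a} = \sum_{b} \frac{f_{u,b}}{p_b}\, t_{ab},
\qquad
q_{u,a} = \frac{\alpha f_{u,a} + \beta g_{u,a}}{\alpha + \beta},
\qquad
m_{u,a} = \log \frac{q_{u,a}}{p_a}
```

Residues that the substitution matrix considers similar to the observed ones receive larger pseudo-counts than unrelated residues, which is what makes these pseudo-counts more realistic than a flat background.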



PSI-BLAST is a sequence DB searching program

  • Goal : Find sequence homologs!

  • First, perform regular BLAST local search

  • Build PSSM based on the first round result

  • Align sequences against PSSM

  • Update sequence alignment!

  • Do these iteratively!



Profile HMM

  • Represents the general properties of a set of sequences with a hidden Markov model

[Figure: HMM state diagram; arrows carry transition probabilities (values between 0.1 and 0.7) and each state emits an amino acid]


Profile HMM (2)

KIA-S-

K-AIST

KI--ST

[Figure: profile HMM with Start, match states M1-M4, insert states I0-I4, delete states D1-D4 and END; emitted residues shown: A, S, K, T, I]


Profile HMM (3)

KIA-S-

K-AIST

KI--ST

KIAS-

KI-ST

[Figure: profile HMM with Start, match states M1-M4, insert states I0-I4, delete states D1-D4 and END; emitted residues shown: I, S, T, K]


Estimate probabilities

Transition probability between states

Amino acid emission probability
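Both quantities are typically estimated by counting events in the training alignment. A sketch in assumed notation: n_{s → s'} is the number of times state s is followed by state s' across the aligned sequences, and n_{u,a} is the number of times match state M_u emits amino acid a.

```latex
t(s \to s') = \frac{n_{s \to s'}}{\sum_{s''} n_{s \to s''}},
\qquad
e_{M_u}(a) = \frac{n_{u,a}}{\sum_{a'} n_{u,a'}}
```

As with the PSSM, pseudo-counts are usually added to these counts so that unobserved transitions and emissions do not get probability zero.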


Profile HMM requires a lot of data

  • Many parameters to be trained

    • Transition probabilities ~ N * 9, where N is the sequence length

    • Amino acid emission probabilities ~ N * 20

    • For a 100-residue sequence,

      • ~3000 parameters to be tuned

    • Generally at least 20~30 related sequences are required to build an accurate profile HMM
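The parameter estimate is simple arithmetic. A short Python sketch, assuming 9 transitions (match, insert and delete each going to match, insert or delete) and 20 emission probabilities per position; the function name is hypothetical.

```python
def profile_hmm_parameters(length, transitions_per_position=9, emissions_per_position=20):
    """Rough parameter count for a profile HMM covering `length` positions."""
    return length * (transitions_per_position + emissions_per_position)

print(profile_hmm_parameters(100))  # 2900, i.e. roughly 3000
```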


Many possible paths! We need to score them……

QUERY: KRISS

[Figure: two copies of the profile HMM (Start, M1-M4, I0-I4, D1-D4, END), each showing a different path that emits the query residues K, R, I, S, S]


How to score a sequence to profile HMM

  • Two ways of evaluating fitness of a sequence to profile HMM

    • Through the Most probable path

      • Viterbi algorithm

      • Faster, less accurate

    • Consider all possible paths !

      • Forward ( Backward ) algorithm

      • Slower, more accurate


Viterbi algorithm

Equivalent to the dynamic programming of pairwise alignment


Forward algorithm

  • Consider all possible paths!

Probability of emitting x_i at state S_u
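The difference between the two scoring methods can be sketched in Python on a toy model. This is a generic two-state HMM without the silent delete states of a real profile HMM, and every probability is hypothetical. Viterbi keeps only the best path (max), while Forward adds up all paths (sum).

```python
def viterbi(obs, states, start, trans, emit):
    """Return (probability of the single best state path, that path)."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for x in obs[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][x], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

def forward(obs, states, start, trans, emit):
    """Return the total probability of obs, summed over all state paths."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states) for s in states}
    return sum(f.values())

# Hypothetical toy model: "M" behaves like a match state, "I" like an insert state.
states = ("M", "I")
start = {"M": 0.8, "I": 0.2}
trans = {"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.6, "I": 0.4}}
emit = {
    "M": {"K": 0.4, "R": 0.1, "I": 0.3, "S": 0.2},
    "I": {"K": 0.25, "R": 0.25, "I": 0.25, "S": 0.25},
}

query = "KRISS"
p_best, path = viterbi(query, states, start, trans, emit)
p_all = forward(query, states, start, trans, emit)
print(path, p_best, p_all)
```

The Forward probability is never smaller than the Viterbi probability, because the best path is one of the terms in the sum. Practical implementations work with log probabilities to avoid numerical underflow on long sequences.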


Summary

  • Profile: general property of a query sequence, derived from a set of related sequences

  • Position specific Scoring Matrix

  • Profile Hidden Markov Model

  • Can find remote sequence homologs

    • Those that cannot be detected by pairwise alignment of sequences


Aligning Profiles

  • Comparing PSSM

    • LAMA : no gaps allowed, use Pearson correlation of scores

    • Prof_sim : gaps allowed, use amino acid distribution at each column

    • COMPASS : gaps allowed, pseudocounts are used, similar to PSI-BLAST
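The column comparison used by a LAMA-style method can be sketched as a Pearson correlation between two PSSM columns. This Python example shows only that column score, not the full ungapped alignment, and the score vectors are hypothetical and truncated to four amino acids for brevity.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical PSSM columns (scores for four amino acids only).
col_a = [5, -1, -2, 0]
col_b = [4, 0, -3, -1]    # similar preferences to col_a
col_c = [-2, 3, 1, -1]    # different preferences

print(pearson(col_a, col_b), pearson(col_a, col_c))
```

Columns with similar residue preferences correlate strongly, while columns with different preferences give a low or negative correlation.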


Aligning profile HMMs

COACH and HHsearch are available

Can find very remote homologs

Position dependent gap scoring is possible



Why is MSA difficult?

  • DP for pairwise alignment is easy and applicable

  • Only three cases

  • If three sequences……

    • Seven cases……

  • For six sequences……

    • 60TB memory required

    • DP is impossible

[Figure: the possible column cases for residues A, V and L, each either present or replaced by a gap]
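The growth in cases and memory can be reproduced with a few lines of Python. The sequence length of 200 and the cost of one byte per DP cell are assumptions, since neither is stated on the slide; both function names are hypothetical.

```python
def column_cases(n_sequences):
    """Ways a column can mix residues and gaps (the all-gap column is excluded)."""
    return 2 ** n_sequences - 1

def dp_cells(n_sequences, length):
    """Cells in the full DP lattice for sequences of equal length."""
    return (length + 1) ** n_sequences

for n in (2, 3, 6):
    print(n, column_cases(n), dp_cells(n, 200))
```

Two sequences give 3 cases and three sequences give 7, matching the slide. Under the stated assumptions, six sequences need more than 6 × 10^13 cells, which is tens of terabytes even at one byte per cell.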


Methods to align sequences

  • Progressive method

    • Add a sequence at a time

    • ClustalW, T-COFFEE, etc.

  • Iterative method

    • Deletion, realigning steps are introduced

    • Prrp, DIALIGN, MUSCLE, etc.


Order is important!

Let's align the following sequences

D-G-D

G-G

D-G-G

Case 1

--D-G-D

--G-G--

D-G-G--

Case 2

D-G-G

G-G

D-G-D


Determine order !

Build a phylogenetic tree from the matrix of all pairwise distances
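A minimal Python sketch of turning a distance matrix into a join order follows. It uses UPGMA (average linkage) because it is the simplest clustering to show; this is a substitute for illustration (ClustalW itself builds its guide tree by neighbour-joining), and the sequence names and distances are hypothetical.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster by average linkage; return a nested tuple giving the join order.

    dist maps frozenset({a, b}) -> distance for every pair of names.
    """
    clusters = {name: (name, 1) for name in names}   # label -> (subtree, size)
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance from the merged cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    return next(iter(clusters.values()))[0]

names = ["s1", "s2", "s3", "s4"]
dist = {
    frozenset(("s1", "s2")): 0.1, frozenset(("s1", "s3")): 0.4,
    frozenset(("s1", "s4")): 0.5, frozenset(("s2", "s3")): 0.4,
    frozenset(("s2", "s4")): 0.5, frozenset(("s3", "s4")): 0.2,
}
print(upgma(names, dist))
```

The nested tuple reads as an alignment order: the closest pair is aligned first, and more distant sequences or groups are merged later.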


Which MSA is better? Scoring scheme

Usually the sum-of-pairs score is used
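A sum-of-pairs score adds up the pairwise score of every pair of rows in every column. The Python sketch below uses hypothetical match, mismatch and gap values in place of a real substitution matrix and gap model.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols (hypothetical scoring values)."""
    if a == "-" and b == "-":
        return 0                 # two gaps are conventionally not penalised
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(msa, **scores):
    """Sum pair scores over all row pairs in all columns of the alignment."""
    return sum(
        pair_score(a, b, **scores)
        for column in zip(*msa)
        for a, b in combinations(column, 2)
    )

msa = ["K-IAS--", "KAI-ST-", "K-I-ST-", "KRISS--", "K-I-STI"]
print(sum_of_pairs(msa))
```

Two candidate alignments of the same sequences can then be compared by their scores, with the higher score preferred under the chosen scoring scheme.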


Scores

  • ClustalW

    • Similar to schemes for pairwise alignment

    • Employ residue-specific gap opening


Scores (2)

  • T-COFFEE

    • Score if aligned column is present in the Library

    • Diverse alignment

      • Local & Global


Library Extension of T-COFFEE

Different Weights for individual columns


Other methods - DIALIGN

  • Construct whole alignment from ungapped local alignments

  • Find all ungapped alignments and weight them !

  • Key Idea : pairwise alignment can miss biologically important regions


Other methods - SAGA

  • Genetic Algorithm

  • Alignment → generation

  • Evolve through mutation & Crossover



