
# Profiles and multiple sequence alignments

Profiles and multiple Sequence alignments. Understanding Bioinformatics 9 th KIAS winter school Lee, Juyong. Contents. Defining profile PSSM by PSI-BLAST Profile HMM Aligning profiles PSSM & Profile HMM Generate multiple sequence alignment Progressive Other methods. What is Profile?.


### Profiles and multiple Sequence alignments

Understanding Bioinformatics

9th KIAS winter school

Lee, Juyong

• Defining profile

• PSSM by PSI-BLAST

• Profile HMM

• Aligning profiles

• PSSM & Profile HMM

• Generate multiple sequence alignment

• Progressive

• Other methods

• Represent general properties of the set of sequences

• A set of sequences contains more information than a single sequence

• Environment is being considered

• Two types

• Position Specific Scoring Matrix

• Profile Hidden Markov Model

\$> blastpgp -b 0 -j 3 -h 0.001 -d myDB -i mySEQ.fasta -Q myPSSM.mtx -o myMSA.bla

Are K, I and S meaningful?

Are A & T meaningless?

K, I and S are highly conserved!

T at the sixth column is also conserved

2nd and 4th columns do not show preference

K-IAS--

KAI-ST-

K-I-ST-

KRISS--

K-I-STI

K-IAS-

KAI-ST
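The observations above (conserved K, I and S; a partly conserved T; no preference in columns 2 and 4) can be checked by counting residues per column. The following is a minimal sketch, not the presenter's code; the function name `column_counts` is hypothetical and gaps are simply skipped.

```python
from collections import Counter

# Toy alignment copied from the slide; '-' marks a gap.
msa = ["K-IAS--", "KAI-ST-", "K-I-ST-", "KRISS--", "K-I-STI"]

def column_counts(alignment):
    """Return one Counter of residues (gaps skipped) per alignment column."""
    length = len(alignment[0])
    assert all(len(row) == length for row in alignment), "rows must have equal length"
    return [Counter(row[u] for row in alignment if row[u] != "-")
            for u in range(length)]

counts = column_counts(msa)
for u, c in enumerate(counts, start=1):
    print(u, dict(c))
```

Columns 1, 3 and 5 contain a single residue type in every sequence, while columns 2 and 4 contain two different residues once each, which is why they carry no clear preference.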

Log-odds score of amino acid a at position u

Multiple sequence alignment

Lack of information should be treated!

Not Good !

If a is not observed, m → -∞

Generating PSSM (2): Pseudo-counts

Fraction of amino acid a at position u

Background distribution of amino acid a

α & β are scaling parameters
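The effect of pseudo-counts can be sketched numerically. The code below assumes the common mixture form q = (α·f + β·p) / (α + β), with f the observed fraction and p the background frequency, and a log-odds score m = log(q / p); the slide's own equation did not survive extraction, so this form, the default choice of α, and the function name `log_odds` are assumptions.

```python
import math

def log_odds(count_a, n_seqs, p_a, alpha=None, beta=0.0):
    """Log-odds score m = log(q / p_a) for one amino acid in one column.

    With beta == 0, q is the raw fraction f = count_a / n_seqs.
    Otherwise q is the assumed pseudo-count mixture
    (alpha * f + beta * p_a) / (alpha + beta).
    """
    f = count_a / n_seqs
    if alpha is None:
        alpha = n_seqs  # assumption: weight the data by the number of sequences
    q = (alpha * f + beta * p_a) / (alpha + beta)
    return math.log(q / p_a) if q > 0 else float("-inf")

# Amino acid never observed in a column of 5 sequences, background 0.05:
print(log_odds(0, 5, 0.05))            # no pseudo-counts
print(log_odds(0, 5, 0.05, beta=1.0))  # with pseudo-counts
```

Without pseudo-counts an unobserved amino acid scores minus infinity; with any β > 0 the score stays finite, so a single unseen residue no longer vetoes a match.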

Generating PSSM (3): More realistic pseudocounts

Use substitution matrix information rather than random alignment!

Pseudo count of amino acid a

F : frequency of amino acid b at u

Formula used in PSI-BLAST
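The equation itself is missing from the transcript. For reference, the pseudocount scheme published for PSI-BLAST can be written as follows; the symbol names are assumptions chosen to match the slide text, with f_{u,b} the observed frequency of amino acid b at position u (the F above), p_b the background frequency, and q_{ab} the target frequency of the pair (a, b) implied by the substitution matrix.

```latex
g_{u,a} = \sum_{b} \frac{f_{u,b}}{p_b}\, q_{ab},
\qquad
Q_{u,a} = \frac{\alpha\, f_{u,a} + \beta\, g_{u,a}}{\alpha + \beta},
\qquad
m_{u,a} = \log \frac{Q_{u,a}}{p_a}
```

Here g_{u,a} is the pseudocount of amino acid a, Q_{u,a} is the estimated probability of a at position u, and m_{u,a} is the resulting log-odds score up to a scaling constant.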

• Goal : Find sequence homologs!

• First, perform regular BLAST local search

• Build PSSM based on the first round result

• Align sequences against PSSM

• Update sequence alignment!

• Do these iteratively!
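The "align sequences against PSSM" step can be illustrated with a deliberately simplified scan. This sketch slides a PSSM along a sequence without gaps, whereas PSI-BLAST itself performs gapped local alignment; the function name, the toy scores, and the default score of -1 for unlisted residues are all hypothetical.

```python
def best_ungapped_hit(pssm, seq, default=-1.0):
    """Slide the PSSM along seq without gaps; return (best_score, offset).

    pssm is a list of dicts, one per column, mapping residue -> score.
    Residues missing from a column receive the default score.
    """
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(seq) - width + 1):
        score = sum(pssm[u].get(seq[start + u], default) for u in range(width))
        best = max(best, (score, start))
    return best

# Hypothetical 3-column PSSM with made-up scores.
toy_pssm = [{"K": 2.0}, {"I": 2.0}, {"S": 2.0}]
print(best_ungapped_hit(toy_pssm, "AAKISTT"))
```

In an iterative search, sequences whose best score passes a threshold would be added to the alignment and the PSSM rebuilt before the next round.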

• Represents the general properties of a set of sequences using a Hidden Markov Model

Figure: example Markov model. Arrows carry transition probabilities (values between 0.1 and 0.7 in the original diagram), and each state emits an amino acid.

Figure: profile HMM for the alignment

KIA-S-

K-AIST

KI--ST

with states Start, M1 to M4, I0 to I4, D1 to D4 and END. Amino acids shown in the diagram: A, S, K, T, I.

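Figure: the alignment KIA-S-, K-AIST, KI--ST and the rows KIAS-, KI-ST, mapped onto the profile HMM states Start, M1 to M4, I0 to I4, D1 to D4 and END. Amino acids shown in the diagram: I, S, T, K.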

Transition probability between states

Amino acid emission probability

• Many parameters to be trained

• Transition probabilities ~ Nseq * 9

• Amino acid emission probabilities ~ Nseq * 20

• For a 100-residue sequence (Nseq = 100),

• ~3000 parameters to be tuned (100 * 9 + 100 * 20 = 2900)

• Generally, at least 20 to 30 related sequences are required to build an accurate profile HMM

Many possible paths! We need to score them……

QUERY : KRISS

Figure: two alternative paths for the query through the profile HMM (states Start, M1 to M4, I0 to I4, D1 to D4, END), each assigning the residues K, R, I, S, S to different states.

• Two ways of evaluating fitness of a sequence to profile HMM

• Through the Most probable path

• Viterbi algorithm

• Faster, less accurate

• Consider all possible paths !

• Forward ( Backward ) algorithm

• Slower, more accurate

Equivalent to the dynamic programming of pairwise alignment

• Consider all possible paths!

Probability of emitting xi at state Su
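The difference between the two scoring schemes can be sketched on a toy model. The code below uses a plain two-state HMM rather than a full profile HMM (there are no silent delete states), and every probability in it is made up for illustration; the state names "M" and "I" only loosely stand for match-like and insert-like states.

```python
def viterbi_and_forward(obs, states, start_p, trans_p, emit_p):
    """Return (probability of the best single path, total probability over all paths)."""
    # Initialisation: probability of starting in s and emitting the first symbol.
    best = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
    total = dict(best)
    for x in obs[1:]:
        # Viterbi keeps only the best predecessor (max) ...
        best = {s: emit_p[s].get(x, 0.0) * max(best[r] * trans_p[r][s] for r in states)
                for s in states}
        # ... while the forward algorithm sums over all predecessors.
        total = {s: emit_p[s].get(x, 0.0) * sum(total[r] * trans_p[r][s] for r in states)
                 for s in states}
    return max(best.values()), sum(total.values())

# Hypothetical toy parameters.
states = ["M", "I"]
start_p = {"M": 0.9, "I": 0.1}
trans_p = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
emit_p = {"M": {"K": 0.5, "I": 0.3, "S": 0.2},
          "I": {"K": 0.2, "I": 0.2, "S": 0.2, "R": 0.4}}

v, f = viterbi_and_forward("KRIS", states, start_p, trans_p, emit_p)
print(v, f)
```

The only difference between the two recursions is `max` versus `sum`, so the forward probability is always at least as large as the Viterbi probability.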

• Profile → General property of a query sequence derived from a set of related sequences

• Position Specific Scoring Matrix

• Profile Hidden Markov Model

• Can find remote sequence homologs

• These cannot be detected by pairwise alignment of sequences

• Comparing PSSM

• LAMA : no gaps allowed, use Pearson correlation of scores

• Prof_sim : gaps allowed, use amino acid distribution at each column

• COMPASS : gaps allowed, pseudocounts are used, similar to PSI-BLAST

COACH and HHsearch are available
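As a rough illustration of comparing PSSM columns by the Pearson correlation of their scores, in the spirit of the LAMA entry above, the sketch below correlates two score vectors. It is not LAMA's actual procedure; the function name and the four-value score vectors are hypothetical (real columns hold 20 scores).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # raises ZeroDivisionError for a constant column

# Hypothetical column scores over a reduced four-letter alphabet.
col_a = [2.0, -1.0, 0.5, -2.0]
col_b = [1.8, -0.5, 0.2, -1.5]
print(pearson(col_a, col_b))
```

Columns that favour the same residues give a correlation near 1, and columns with opposite preferences give a correlation near -1.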

Can find very remote homologs

Position dependent gap scoring is possible

• DP for pairwise alignment is easy and applicable

• Only three cases

• With three sequences……

• Seven cases……

• For six sequences……

• 60TB memory required

• DP is impossible in practice
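The case counts above follow from a simple rule: in one alignment column each of the k sequences contributes either a residue or a gap, and the all-gap column is excluded, giving 2^k - 1 cases. The sketch below also estimates the size of a full k-dimensional DP table; the sequence length of 200 is an assumption, since the slide does not state the length or cell size behind its 60TB figure.

```python
def column_cases(k):
    """Ways to fill one alignment column for k sequences (all-gap column excluded)."""
    return 2 ** k - 1

def dp_cells(k, length):
    """Number of cells in a full k-dimensional DP table for sequences of this length."""
    return (length + 1) ** k

for k in (2, 3, 6):
    print(k, column_cases(k), dp_cells(k, 200))
```

For six sequences of length 200 the table has roughly 6.6e13 cells, so even at one byte per cell it needs tens of terabytes, the same order of magnitude as the figure on the slide.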

Figure: the possible alignment columns for residues A, V and L, where each sequence contributes either its residue or a gap (-).

• Progressive method

• Add a sequence at a time

• ClustalW, T-COFFEE, etc.

• Iterative method

• Deletion, realigning steps are introduced

• Prrp, DIALIGN, MUSCLE, etc.

Case 1

Let’s align the following

--D-G-D

D-G-D

--G-G--

G-G

D-G-G

D-G-G--

Case 2

D-G-G

G-G

D-G-D

Build a phylogenetic tree from the matrix of all pairwise distances

Which MSA is better? Scoring scheme

Usually the sum-of-pairs score is used
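A sum-of-pairs score adds up the pairwise scores of every pair of rows in every column. The following is a minimal sketch with made-up match, mismatch and gap values (real tools use a substitution matrix and affine gap penalties); the two candidate alignments are hypothetical and only loosely based on the sequences used earlier.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the scores of all row pairs over all columns of an alignment."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue          # gap-gap pairs contribute nothing
            elif a == "-" or b == "-":
                total += gap      # residue aligned to a gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total

# Two hypothetical alignments of the same three sequences.
aln1 = ["DGD", "-GG", "DGG"]
aln2 = ["DGD", "GG-", "DGG"]
print(sum_of_pairs(aln1), sum_of_pairs(aln2))
```

Under this toy scheme the first alignment scores higher, so it would be the preferred one.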

• ClustalW

• Similar to schemes for pairwise alignment

• Employ residue-specific gap opening

• T-COFFEE

• Score if aligned column is present in the Library

• Diverse alignment

• Local & Global

Different Weights for individual columns

• Construct whole alignment from ungapped local alignments

• Find all ungapped alignments and weight them !

• Key Idea : pairwise alignment can miss biologically important region

• Genetic Algorithm

• Alignment → generation

• Evolve through mutation & Crossover