
Some topics in Bioinformatics: An introduction 1, Primary mathematical statistics 2, Sequence motif discovery 2.1 Imbedded Markov chain 2.2 Search for patterns by alignment 3, HMM for protein secondary structure prediction 4, Protein conformational letters and structure alignment





An introduction

1, Primary mathematical statistics

2, Sequence motif discovery

2.1 Imbedded Markov chain

2.2 Search for patterns by alignment

3, HMM for protein secondary structure prediction

4, Protein conformational letters and structure alignment

4.1 Conformational letters and substitution matrix

4.2 Fast pair structure alignment

4.3 Fast multiple structure alignment

An Introduction

Primary Mathematical Statistics

Mathematics: given cause, `calculate' outcomes.

Science: given outcomes + available prior knowledge, infer possible causes.

Mathematical statistics:

Draw a conclusion from inconclusive data by plausible reasoning.

‘Probability and statistics’ unfortunate dichotomy:

Probability makes sense, but is just a game; statistics, a bewildering collection of tests, is useful, but has little reason.

It doesn't have to be like this.

The Bayesian approach uses only probability theory, making a separate subject of statistics superfluous. It provides the logical justification for many of the prevalent statistical tests, making explicit the conditions and approximations implicitly assumed in their use.

Nb. = NbA + NbB, N.A = NbA + NgA, N = Nb. + Ng. = N.A + N.B.

P(bA) = NbA / N, P(b) = Nb. / N, P(b) + P(g) = 1,

P(A|b) = NbA / Nb.,

P(bA) = P(b)*P(A|b).

P(b) = P(bA) + P(bB).

Sum rule:

Product rule:

I: the relevant background information at hand.

Marginalization

is a powerful device for dealing with nuisance parameters, which are necessary but of no intrinsic interest (e.g. a background signal).

Bayes' theorem

posterior ∝ likelihood × prior

By simply reinterpreting X as Model or Hypothesis and Y as Data, the formula becomes

which forms the foundation for data analysis. In some situations, such as model selection, the omitted denominator plays a crucial role, so is given the special name of evidence.
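The sum rule, product rule and Bayes' theorem above can be checked on a small 2×2 table of counts (the numbers below are hypothetical, in the b/g and A/B notation of the text):

```python
# Checking the sum rule, product rule and Bayes' theorem on a 2x2
# table of counts (hypothetical numbers; b/g and A/B as in the text).
N = {('b', 'A'): 30, ('b', 'B'): 10, ('g', 'A'): 20, ('g', 'B'): 40}
total = sum(N.values())

Pb = (N[('b', 'A')] + N[('b', 'B')]) / total   # sum rule: P(b) = P(bA) + P(bB)
PA = (N[('b', 'A')] + N[('g', 'A')]) / total
PA_given_b = N[('b', 'A')] / (N[('b', 'A')] + N[('b', 'B')])
Pb_given_A = N[('b', 'A')] / (N[('b', 'A')] + N[('g', 'A')])

joint_bA = Pb * PA_given_b                      # product rule: P(bA) = P(b) P(A|b)
bayes_b_given_A = PA_given_b * Pb / PA          # Bayes: P(b|A) = P(A|b) P(b) / P(A)
```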

Parameter estimation:

Example: Student's t-distribution and the χ²-distribution

The MaxEnt distribution under the restriction of given mean and variance is the (Gaussian) normal distribution

Maximal entropy = maximal uncertainty in prerequisite or prior conditions.

Suppose {xi} are samples from a normal distribution with unknown parameters m and s. The likelihood function is

Simple flat prior

gives the posterior P(m,s|{x}) = P({x}|m,s) P(m,s) / P({x})

The estimate of parameter m is

Introducing the decomposition

implies

which is a t-distribution. Another marginal posterior

is a χ²-distribution.

Example: a DNA sequence viewed as an outcome of die tossing

A nucleotide die is a tetrahedron: {a, c, g, t}. The probability of observing nucleotide a is p(a) = pa. Under the independence model, the probability of observing the sequence s1s2…sn is

P(s1s2…sn | p) = p(s1)p(s2)…p(sn) = ∏i p(si)

Long repeats

Suppose the probability for nucleotide a is p. The length Y of a repeat of a has the geometric distribution

Thus, for the longest Ymax in n repeats, we have

For a sequence of length N, the number of runs is estimated as n = (1−p)N. We have the

P-value = Prob(Ymax ≥ ymax)

This discussion can also be used for word matches between two sequences.
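The P-value estimate can be made concrete. A sketch, assuming the standard approximation P(Ymax ≥ y) ≈ 1 − (1 − p^y)^n with n = (1−p)N runs, checked against an exact dynamic program over run lengths (two-letter alphabet, toy parameters):

```python
# P-value for the longest run of a given letter: the geometric
# approximation from the text vs. an exact DP (binary alphabet).
def p_value_estimate(p, N, y):
    # Ymax over roughly n = (1-p)*N geometric runs:
    # P(Ymax >= y) ~ 1 - (1 - p**y)**n  (assumed standard approximation)
    n = (1.0 - p) * N
    return 1.0 - (1.0 - p ** y) ** n

def p_value_exact(p, N, y):
    # DP over (position, current run length); mass that ever reaches a
    # run of length y is absorbed (dropped from the "no long run" part).
    probs = {0: 1.0}
    for _ in range(N):
        nxt = {}
        for run, pr in probs.items():
            if run + 1 < y:                       # extend the run, still short
                nxt[run + 1] = nxt.get(run + 1, 0.0) + pr * p
            nxt[0] = nxt.get(0, 0.0) + pr * (1.0 - p)   # run broken
        probs = nxt
    return 1.0 - sum(probs.values())
```

For p = 0.5, N = 12, y = 5, the exact value is 1 − 3525/4096 ≈ 0.139, and the approximation gives ≈ 0.173, close enough for a significance estimate.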

Entropies

Shannon entropy:

Relative entropy

distance of {p} from {q}

The equality is valid iff P and Q are identical. H measures the distance between P and Q. A related form (as leading terms of H) is

Conditional entropy

Mutual information (to measure independency or correlation)

Maximal entropy ⇔ minimal relative entropy

Maximal uncertainty to `unknown' (premise); Minimal uncertainty to `known' (outcome).
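The entropy quantities above are easy to compute directly; a minimal sketch with toy distributions (all numbers hypothetical):

```python
# Shannon entropy, relative entropy (KL) and mutual information, in bits.
from math import log2

def H(p):                       # Shannon entropy
    return -sum(x * log2(x) for x in p if x > 0)

def KL(p, q):                   # relative entropy: distance of {p} from {q}
    return sum(x * log2(x / y) for x, y in zip(p, q) if x > 0)

def MI(joint):                  # I(X;Y) = KL(joint || product of marginals)
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pij * log2(pij / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pij in enumerate(row) if pij > 0)

uniform4 = [0.25] * 4
indep = [[0.2 * 0.5, 0.2 * 0.5], [0.8 * 0.5, 0.8 * 0.5]]  # independent X, Y
```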

Replace a single data point (δ-function) by a kernel function:

where d is the dimensionality, and kernel K satisfies some conditions. The Gaussian kernel is

For covariance S the distance is Mahalanobis; for the identity S it is Euclidean. The optimal scale factor h may be fixed by minimizing r or maximizing m:

Pseudocounts viewed as a ‘kernel’.
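A kernel (Parzen-window) density estimate in one dimension, as a sketch: d = 1, Gaussian kernel with identity covariance, toy data and a hypothetical scale h:

```python
# Parzen-window density estimate with a Gaussian kernel in 1D.
from math import exp, pi, sqrt

def kde(x, data, h):
    # average of Gaussian kernels centered at the data points
    norm = 1.0 / (sqrt(2.0 * pi) * h)
    return sum(norm * exp(-0.5 * ((x - d) / h) ** 2) for d in data) / len(data)

data = [-1.0, 0.0, 1.0]
h = 0.5                                    # hypothetical scale factor
grid = [i * 0.01 for i in range(-800, 801)]
mass = sum(kde(x, data, h) for x in grid) * 0.01   # numerical total mass
```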

Sampling equilibrium distribution decomposition

The global balance equation and the detailed balance condition are

There are two sets of probabilities: the visiting (proposal) probability from old state i to new state j, and the acceptance probability of accepting j.

Hastings-Metropolis Algorithm uses

For Metropolis a symmetric visiting probability is assumed, while for Gibbs the proposal is the conditional distribution and every move is accepted.
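A minimal Metropolis sampler for a small discrete target (toy target, symmetric uniform proposal), showing that the empirical visit frequencies converge to the equilibrium distribution:

```python
# Metropolis sampling of a toy 3-state target with a symmetric proposal.
import random

random.seed(0)
target = [0.2, 0.3, 0.5]          # hypothetical equilibrium distribution
state = 0
counts = [0, 0, 0]
steps = 30000
for _ in range(steps):
    proposal = random.randrange(3)                       # q(i->j) = 1/3, symmetric
    accept = min(1.0, target[proposal] / target[state])  # Metropolis ratio
    if random.random() < accept:
        state = proposal
    counts[state] += 1            # count the state after each step
freqs = [c / steps for c in counts]
```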

Jensen inequality decomposition

For a random variable X and a continuous convex function g(x) on (a, b), we have E[g(X)] ≥ g(E[X]).

Proof: Suppose that x0 = E(X) and g’(x0) = k. From the convexity we have

g(x) ≥ g(x0) + k(x − x0),

which, by taking x to be X, leads to g(X) ≥ g(x0) + k(X − x0).

By taking expectation, this yields the inequality.
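A quick numerical check of Jensen's inequality for the convex g(x) = x·log x used later in the alphabet-reduction argument (toy distribution):

```python
# Numerical check of Jensen's inequality E[g(X)] >= g(E[X])
# for the convex function g(x) = x*log(x), x > 0.
from math import log

def g(x):
    return x * log(x)

xs = [0.5, 1.0, 2.0, 4.0]       # hypothetical values of X
ps = [0.1, 0.2, 0.3, 0.4]       # their probabilities
EX = sum(p * x for p, x in zip(ps, xs))
EgX = sum(p * g(x) for p, x in zip(ps, xs))
```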

Discriminant analysis and cluster analysis

Supervised learning and unsupervised learning

Example: Reduction of amino acid alphabet

We are interested in the mutual information I(C;a) between conformation and amino acids

The difference between values of I after and before clustering is given by

which, by introducing

and defining is proportional to

with f(x) = x log x. From Jensen's theorem for the convex function x log x we have ΔI ≤ 0, so I never increases after any step of clustering.
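The Jensen argument can be checked numerically: merge two letters of a toy conformation-by-letter count table (hypothetical numbers) and verify that the mutual information does not increase:

```python
# Merging two letters and checking that I(C; a) never increases.
from math import log2

def mutual_info(joint):
    tot = sum(sum(row) for row in joint)
    P = [[v / tot for v in row] for row in joint]
    pc = [sum(row) for row in P]                 # marginal over conformations
    pa = [sum(col) for col in zip(*P)]           # marginal over letters
    return sum(p * log2(p / (pc[i] * pa[j]))
               for i, row in enumerate(P)
               for j, p in enumerate(row) if p > 0)

# rows: conformations (H, E); columns: letters (a, b, c) -- toy counts
counts = [[30, 10, 5],
          [5, 20, 30]]
I_before = mutual_info(counts)
merged = [[row[0], row[1] + row[2]] for row in counts]   # cluster b and c
I_after = mutual_info(merged)
```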

(Conditional) probabilities represent logical connections rather than causal ones. Example of Jaynes (1989): Given a bag of balls, five red and seven green. When a ball is selected at random, the probability of drawing a red is 5/12. If the first ball is not returned to the bag, then it seems reasonable that the probability of obtaining a red or green ball on the second draw will depend on the outcome of the first. Now suppose that we are not told the outcome of the first draw, but are given the result of the second draw. Does the probability of the first draw being red or green change with the knowledge of the second? The answer ‘no' is incorrect. The error becomes obvious in the extreme case of one red and one green.

Probability as a degree of belief or plausibility, or as a long-run relative frequency? Randomness is central to orthodox statistics, while uncertainty is inherent in Bayesian theory.

Necessity and contingency / certainty and randomness

Deterministic and stochastic

Deterministic and predictable

Motif Discovery

Wei-Mou Zheng

Introduction

Identifying motifs or signals is a very common task.

Patterns: Shed light on structure and function, used for classifying

An example: the transcription regulatory sites (TFBSs).

Transcription factors and their binding sites (TFBS)

gene regulatory proteins

RNA polymerase

TATA box

Transcription start site (TSS)

• Such motifs are

• generally short,

• quite variable, and

• often dispersed over a large distance.

• Two ways for representing motifs:

• the consensus sequence, or

• a position-specific weight matrix (PSWM)

• Profile HMM

• Methods for Motif Discovery

• Consensus: Counting & evaluating

• PSWM: Alignment & motif matrix updating

• Imbedded Markov chain (IMC)

• Consensus:

• TATA-box: TATAAATA, Leucine zipper: L-X(6)-L-X(6)-L-X(6)-L

• Counting & evaluating

• Motifs are identified as over- or under-represented oligomers.

• Count all possible DNA words of a certain length in an input set and then use statistics to evaluate over-represented words.

• Compare observed (in input set) with expected (in a background model)

• possible k-tuples increases exponentially

• limited to simple, short motifs with a highly conserved core.

• complication due to overlaps of words.

• Independence approximation: ignore the auto-correlation correction due to pattern overlaps; assume the motif occurs independently at each position.

• DNA alphabet size is only 4.

• Simple patterns such as AAAA are easily found.

• Are they significant?

• Masking, high-order Markov (e.g. M3) models

• Many mathematical results, but few easily implemented algorithms

Numerical enumerative techniques

Imbedded Markov Chain (IMC)

(Kleffe & Langbecker, 1990; Fu & Koutras, 1994)

Example: x = 11011011, {0, 1} Overlap

Ending states: prefixes arranged according to lengths. If a sequence belongs to state k, it must end with the string of state k, but must not end with any string of an index larger than k.

State(110110) = 6, State(110111) = 2

[Slide graphic: 11011011 aligned against itself, showing self-overlaps at shifts 3 and 6]

• Numerical enumerative techniques

• The IMC states are complete and exclusive.

• State transition matrix T (Boolean)

• P(n,l,k): the probability for a sequence of length l with ending state k to have n occurrences

• Recursion relations

• P(n,l,j) = ∑i mj Tij P(n,l−1,i), if j ≠ 8,

• P(n,l,8) = ∑i m8 Ti8 P(n−1,l−1,i),

• mj denotes the probability for the last letter of the string of state j.

• Generalizations

• Alphabet size: binary → quaternary; number of states + 2

• M0 to Mm: independent probability -> conditional probability,

• initialization, number of states = |A|**m + |x| - m

• Multiple words: States = Union of {single word states}

• DNA M3 model, x: TATAAA

• States: TATAAA, TATAA, TATA, 64 triplets including TAT;

• 64+(6-3) = 67
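The recursion over ending states can be sketched in code. This is in the spirit of the imbedded Markov chain; as an assumption of this sketch, the prefix states and overlap handling are implemented with the KMP failure function rather than the slide's explicit state list, and the result is checked against brute-force enumeration:

```python
# Distribution of the occurrence count of a word in an i.i.d. binary
# sequence, via a prefix-automaton DP (overlaps handled via the KMP
# failure function); checked against brute force.
def prefix_function(w):
    m = len(w)
    pi = [0] * (m + 1)               # pi[i]: longest proper border of w[:i]
    k = 0
    for i in range(1, m):
        while k > 0 and w[i] != w[k]:
            k = pi[k]
        if w[i] == w[k]:
            k += 1
        pi[i + 1] = k
    return pi

def count_distribution(word, length, p1):
    pi = prefix_function(word)
    m = len(word)
    probs = {(0, 0): 1.0}            # (matched-prefix state, count) -> prob
    for _ in range(length):
        nxt = {}
        for (s, n), pr in probs.items():
            for c, pc in (('0', 1.0 - p1), ('1', p1)):
                t = s
                while t > 0 and word[t] != c:
                    t = pi[t]
                if word[t] == c:
                    t += 1
                k = n
                if t == m:           # full match: count it, follow the overlap
                    k += 1
                    t = pi[m]
                nxt[(t, k)] = nxt.get((t, k), 0.0) + pr * pc
        probs = nxt
    dist = {}
    for (s, n), pr in probs.items():
        dist[n] = dist.get(n, 0.0) + pr
    return dist

def brute(word, length, p1):
    dist = {}
    for i in range(2 ** length):
        s = format(i, '0{}b'.format(length))
        n = sum(1 for j in range(length - len(word) + 1)
                if s[j:j + len(word)] == word)
        w = 1.0
        for ch in s:
            w *= p1 if ch == '1' else 1.0 - p1
        dist[n] = dist.get(n, 0.0) + w
    return dist

dist_dp = count_distribution('11', 5, 0.5)
dist_bf = brute('11', 5, 0.5)
```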

Numerical enumerative techniques

Known result for

Calculation of q (the probability of hits), the mean ⟨n⟩ and the variance is not simple.

Introduce Q(1,L,k), which determines q,

R(1,L,k), which determines ⟨n⟩, and

S(1,L,k), which determines σ.

Numerical enumerative techniques

Recursion relations

and similar relations for S. Using the recursion relations for P, Q, R and S, we can calculate q, ⟨n⟩ and σ.

Search for patterns by alignment

Block: Position Specific Weight Matrix (PSWM)

Count number of occurrences of every letter at each position,

A simple probabilistic model of sequence patterns;

log likelihood or log odds;

Alignment without gaps.

Alignment score to determine optimal alignment for motif discovery

Match score for locating known motif in a new sequence

AAATGACTCA

AGATGAGTCA

GAGAGAGTCA

AGAAGAGTCA

GAATGAGTCA

AAATGACTCA

AAGAGAGTCA

GAATGAGTCA

AAATGAGTCA consensus

AAAWGAGTCA consensus with degenerate letter: W = A or T

TGACTCWTTT motif on the complement strand
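Building a PSWM from the aligned sites above can be sketched directly (the eight 10-mers are those listed on the slide; the pseudocount of 1 per letter and uniform background are assumptions of this sketch):

```python
# PSWM (log-odds) and consensus from the eight aligned sites above.
from math import log2

sites = ["AAATGACTCA", "AGATGAGTCA", "GAGAGAGTCA", "AGAAGAGTCA",
         "GAATGAGTCA", "AAATGACTCA", "AAGAGAGTCA", "GAATGAGTCA"]
alphabet = "ACGT"
background = {a: 0.25 for a in alphabet}     # assumed uniform background
W = len(sites[0])

counts = [{a: 1 for a in alphabet} for _ in range(W)]   # pseudocount of 1
for s in sites:
    for i, a in enumerate(s):
        counts[i][a] += 1

pswm = [{a: log2((counts[i][a] / (len(sites) + 4)) / background[a])
         for a in alphabet} for i in range(W)]           # log-odds matrix

consensus = "".join(max(alphabet, key=lambda a: counts[i][a]) for i in range(W))

def score(seq):
    # match score for locating the motif in a new sequence
    return sum(pswm[i][a] for i, a in enumerate(seq))
```

The computed consensus reproduces the slide's AAATGAGTCA, and the consensus scores far above an unrelated word.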

• Motif search, a self-consistent problem

• Motif positions weight matrix

• If positions are known, the determination of matrix is straightforward.

• If the matrix is known, we can scan sequences for positions with the matrix.

• However, at the beginning we know neither the positions, nor the matrix.

• Break the loop by iteration

• Make an initial guess of motif positions,

• Estimate weight matrix,

• Update motif positions,

• Converged? End; otherwise go to 2.

• Updating: Greedy, Fuzzy, Random

• Two clusters of points on the plane

• Euclidean distance

• Two centers L and R

• initializing & updating

• Distances dL and dR from point X to the centers

• probability measure Pr(L|X) ∝ exp(- dL² )

• Greedy updating

• if dL < dR, X belongs to L

• Fuzzy updating

• mixture: membership ∝ probability

• Stochastic or random updating

• sampling according to probabilities

• Initialization strongly affects the final result.

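The greedy rule above can be illustrated on toy 1D data with two centers; this sketch iterates assignment (nearest center) and center updates until convergence:

```python
# Greedy updating for two 1D clusters ("if dL < dR, X belongs to L").
def greedy_two_means(points, cL, cR, iters=10):
    for _ in range(iters):
        left = [x for x in points if abs(x - cL) < abs(x - cR)]
        right = [x for x in points if abs(x - cL) >= abs(x - cR)]
        if left:
            cL = sum(left) / len(left)      # update center L
        if right:
            cR = sum(right) / len(right)    # update center R
    return cL, cR

points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]  # toy data, two obvious clusters
cL, cR = greedy_two_means(points, 0.0, 3.0) # initialization matters
```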

Single motif in 30 sequences

VLHVLSDREKQIIDLTYIQNK 225 SQKETGDILGISQMHVSR LQRKAVKKLREALIEDPSMELM

LDDREKEVIVGRFGLDLKKEK 198 TQREIAKELGISRSYVSR IEKRALMKMFHEFYRAEKEKRK

MELRDLDLNLLVVFNQLLVDR 22 RVSITAENLGLTQPAVSN ALKRLRTSLQDPLFVRTHQGME

TRYQTLELEKEFHFNRYLTRR 326 RRIEIAHALCLTERQIKI WFQNRRMKWKKENKTKGEPGSG

METKNLTIGERIRYRRKNLKH 22 TQRSLAKALKISHVSVSQ WERGDSEPTGKNLFALSKVLQC

MNAY 5 TVSRLALDAGVSVHIVRD YLLRGLLRPVACTTGGYGLFDD

ELVLAEVEQPLLDMVMQYTRG 73 NQTRAALMMGINRGTLRK KLKKYGMN

SPQARAFLEQVFRRKQSLNSK 99 EKEEVAKKCGITPLQVRV WFINKRMRSK

ANKRNEALRIESALLNKIAML 25 GTEKTAEAVGVDKSQISR WKRDWIPKFSMLLAVLEWGVVD

LLNLAKQPDAMTHPDGMQIKI 169 TRQEIGQIVGCSRETVGR ILKMLEDQNLISAHGKTIVVYGTR

MYKKDVIDHFG 12 TQRAVAKALGISDAAVSQ WKEVIPEKDAYRLEIVTAGALKYQ

YNLSRRFAQRGFSPREFRLTM 196 TRGDIGNYLGLTVETISR LLGRFQKSGMLAVKGKYITIENND

SEAQPEMERTLLTTALRHTQG 444 HKQEAARLLGWGRNTLTR KLKELGME

MKAKKQETAA 11 TMKDVALKAKVSTATVSR ALMNPDKVSQATRNRVEKAAREVG

ETRREERIGQLLQELKRSDKL 23 HLKDAAALLGVSEMTIRR DLNNHSAPVVLLGGYIVLEPRSAS

MA 3 TIKDVARLAGVSVATVSR VINNSPKASEASRLAVHSAMESLS

MKPV 5 TLYDVAEYAGVSYQTVSR VVNQASHVSAKTREKVEAAMAELN

DKSKVINSALELLNEVGIEGL 26 TTRKLAQKLGVEQPTLYW HVKNKRALLDALAIEMLDRHHTHF

DEREALGTRVRIVEELLRGEM 67 SQRELKNELGAGIATITR GSNSLKAAPVELRQWLEEVLLKSD

WLDNSLDERQRLIAALEKAGW 495 VQAKAARLLGMTPRQVAY RIQIMDITMPRL

RRPKLTPEQWAQAGRLIAAGT 160 PRQKVAIIYDVGVSTLYK RFPAGDK

MA 3 TIKDVAKRANVSTTTVSH VINKTRFVAEETRNAVWAAIKELH

MA 3 TLKDIAIEAGVSLATVSR VLNDDPTLNVKEETKHRILEIAEK

ARQQEVFDLIRDHISQTGMPP 27 TRAEIAQRLGFRSPNAAE EHLKALARKGVIEIVSGASRGIRL

BLOck SUbstitution Matrix BLOSUM62

a matrix of amino acid similarity

SQKETGDILGISQMHVSR

TQREIAKELGISRSYVSR

RVSITAENLGLTQPAVSN

RRIEIAHALCLTERQIKI

TQRSLAKALKISHVSVSQ

TVSRLALDAGVSVHIVRD

NQTRAALMMGINRGTLRK

EKEEVAKKCGITPLQVRV

GTEKTAEAVGVDKSQISR

TRQEIGQIVGCSRETVGR

. . .

weight matrix substitution matrix

A R N D C Q E G ...

A 4 -1 -2 -2 0 -1 -1 0

R -1 5 0 -2 -3 1 0 -2

N -2 0 6 1 -3 0 0 0

D -2 -2 1 6 -3 0 2 -1

C 0 -3 -3 -3 9 -3 -4 -3

Q -1 1 0 0 -3 5 2 -2

E -1 0 0 2 -4 2 5 -2

G 0 -2 0 -1 -3 -2 -2 6

...

Initialization by similarity

VLHVLSDREKQIIDLTYIQNKSQKETGDILGISQMHVSRLQRKAVKKLREALIEDPSMELM

LDDREKEVIVGRFGLDLKKEKTQREIAKELGISRSYVSRIEKRALMKMFHEFYRAEKEKRK

Start motif search with BLOSUM62.

Two segments with a high similarity score are close neighbors.

Motif is regarded as a group of close neighbors.

Take each window of width 18 as a seed; skipping the sequence the seed lies in, search every other sequence for the neighbor most similar to the seed (with similarity not less than s0), and compute the score sum of these neighbors.

A seed and its near neighbors form a block or star tree, from which a primitive weight matrix and a substitution matrix can be deduced. The weight matrix or the substitution matrix (with the seed) is then used to scan each sequence for updating positions.

Top 10 seeds

Block of seed a2=198 (23 sequences) weight matrix

new block new weight matrix … (convergence)

28/30 correct; seed a25=205: 29/30 correct

Block of seed a2=198 New substitution matrix

new block(27 sequences) weight matrix … (convergence)

30/30 correct;

Only greedy updating in use.

The motif might be wider. a2 = 198, 199, 197, 201,

Two motifs in 5 sequences

MOTIF A MOTIF B

KPVN 17 DFDLSAFAGAWHEIAK LPLE...NLVP 104 WVLATDYKNYAINYNC DYHP

QTMK 25 GLDIQKVAGTWYSLAM AASD...ENKV 109 LVLDTDYKKYLLFCME NSAE

KPVD 16 NFDWSNYHGKWWEVAK YPNS...ENVF 100 NVLSTDNKNYIIGYYC KYDE

RVKE 14 NFDKARFAGTWYAMAK KDPE...NDDH 105 WIIDTDYETFAVQYSC RLLN

STGR 27 NFNVEKINGEWHTIIL ASDK...FNTF 109 TIPKTDYDNFLMAHLI NEKD

Top six seeds

Two motifs are seen, with the second shifted by 1.

If , two blocks have been correct.

New substitution matrix discovers the correct blocks.

Repeating motifs

>>>SVLEIYDDEKNIEPALTKEFHKMYLDVAFEISLPPQMTALDASQPW 109 MLYWIANSLKVM DRDW

LSDD 129 TKRKIVDKLFTI SPSG 145 GPFGGGPGQLSH LA 159 STYAAINALSLC DNIDGC

WDRI 181 DRKGIYQWLISL KEPN 197 GGFKTCLEVGEV DTR 212 GIYCALSIATLL NI

LTEE 230 LTEGVLNYLKNC QNYE 246 GGFGSCPHVDEA HGG 261 YTFCATASLAIL RS

MDQI 279 NVEKLLEWSSAR QLQEE 296 RGFCGRSNKLVD GC 310 YSFWVGGSAAIL EAFGY

GQCF 331 NKHALRDYILYC CQEKEQ 349 PGLRDKPGAHSD FY 363 HTNYCLLGLAVA E

376 SSYSCTPNDSPH NIKCTPDRLIGSSKLTDVNPVYG

LPIE 415 NVRKIIHYFKSN LSSPS

>>>L 74 QREKHFHYLKRG LRQLTDAYE

CLDA 99 SRPWLCYWILHS LELLDEP

IPQI 122 VATDVCQFLELC QSPD 138 GGFGGGPGQYPH LA 152 PTYAAVNALCII GTEEA

YNVI 173 NREKLLQYLYSL KQPD 189 GSFLMHVGGEVD VR 203 SAYCAASVASLT NI

ITPD 221 LFEGTAEWIARC QNWE 237 GGIGGVPGMEAH GG 251 YTFCGLAALVIL KK

ERSL 269 NLKSLLQWVTSR QMRFE 286 GGFQGRCNKLVD GC 300 YSFWQAGLLPLL HRAL...

HWMF 331 HQQALQEYILMC CQCPA 348 GGLLDKPGKSRD FY 362 HTCYCLSGLSIA QHFG...

>>>L 8 LKEKHIRYIESL DTKKHNFEYWLTEHLRLN 38 GIYWGLTALCVL DS

PETF 56 VKEEVISFVLSC WDDKY 73 GAFAPFPRHDAH LL 87 TTLSAVQILATY DALDV

LGKD 108 RKVRLISFIRGN QLED 124 GSFQGDRFGEVD TR 138 FVYTALSALSIL GE

LTSE 156 VVDPAVDFVLKC YNFD 172 GGFGLCPNAESH AA 186 QAFTCLGALAIA NKLDM

LSDD 207 QLEEIGWWLCER QLPE 223 GGLNGRPSKLPD VC 237 YSWWVLSSLAII GR

LDWI 255 NYEKLTEFILKC QDEKK 272 GGISDRPENEVD VF 286 HTVFGVAGLSLM GYDN...

>>>L 19 LLEKHADYIASY GSKKDDYEYCMSEYLRMS 49 GVYWGLTVMDLM GQ

LHRM 67 NKEEILVFIKSC QHEC 83 GGVSASIGHDPH LL 97 YTLSAVQILTLY DS

IHVI 115 NVDKVVAYVQSL QKED 131 GSFAGDIWGEID TR 145 FSFCAVATLALL GK

LDAI 163 NVEKAIEFVLSC MNFD 179 GGFGCRPGSESH AG 193 QIYCCTGFLAIT SQ

LHQV 211 NSDLLGWWLCER QLPS 227 GGLNGRPEKLPD VC 241 YSWWVLASLKII GR

LHWI 259 DREKLRSFILAC QDEET 276 GGFADRPGDMVD PF 290 HTLFGIAGLSLL GEEQ...

>>>V 12 VTKKHRKFFERH LQLLPSSHQGHDVNRMAIIFYSI...TINLPNTLFALLSMIMLRDYEYF

ETIL 127 DKRSLARFVSKC QRPDRGSFVSCLDYKTNCGSSVDSDDLRFCYIAVAILYICGCRSKEDF

DEYI 191 DTEKLLGYIMSQ QCYN 207 GAFGAHNEPHSG 219 YTSCALSTLALL SSLEK

LSDK 240 FKEDTITWLLHR QV..DD 275 GGFQGRENKFAD TC 289 YAFWCLNSLHLL TKDW

KMLC 309 QTELVTNYLLDR TQKTLT 327 GGFSKNDEEDAD LY 341 HSCLGSAALALI EGKF...

Non-repeating motif

Repeating motif

Top seeds for W=12, s0=18

From significant word to PSWM: Test on A. thaliana promoters

Collect neighbors by similarity. Extend the boundaries of the aligned block to widen the PSWM. A measure for the distance between site and non-site is

where q(i) for putative signal site, p0(i) for background non-site.

PSWMs from words TATAAAT and AACCCTA

• Hidden Markov model (HMM)

• Statistical summary representation for conserved region

• Model stores probabilities of match, mismatch, insertion, and deletion at each position in the sequence

• Alignment of conserved region not necessary, but helpful

• delete

• AAC insert

• ACG

• A-T

• AAT

• 123 match

Some topics in Bioinformatics:

An Introduction

Hidden Markov Model for

Protein Secondary Structure Prediction

Outline

• Protein structure

• A brief review of secondary structure prediction

• Hidden Markov model: simple-minded

• Hidden Markov model: realistic

• Discussion

• References

L-form

CORN-rule

[Slide graphic: mirror-image L- and D-configurations of an amino acid around the Cα atom, illustrating the CORN rule]

Protein sequences are written in 20 letters (the 20 naturally occurring amino acid residues): AVCDE FGHIW KLMNY PQRST

[Slide graphic: Venn-diagram classification of the 20 amino acids into hydrophobic, polar, charged (positive/negative), aromatic, aliphatic, small and tiny groups]

Residues form a directed chain

Cis-

Trans-

~5% Pro + 0.03% non-Pro

< 0.3%

Rasmol ribbon diagram of GB1

Helix, sheets and coil

Hydrogen-bond network

3D structure → secondary structure written in three letters: H, E, C.

H: E: C = 34.9: 21.8: 43.3

Amino acid sequence

a1a2 …

secondary structure

…EEEEEECCCCCEEEEECCCCCCCHHHHHHHHHHHHHCCCEEEEEEECCCCCEEEEE…

(In prediction stage)

                        Database   Alignment
Single sequence based      -          -
Nearest Neighbor           +          -
Profile                    +          +

Discriminant analysis

Neural network (NN) approach

Support vector machine (SVM)

Probabilistic modelling

Bayes formula

Count of

Generally, P(x, y) = P(x|y)P(y),

Protein sequence A, {ai}, i=1,2,…,n

Secondary structure sequence S, {si}, i=1,2,…,n

Secondary structure prediction:

1D amino acid sequences → 1D secondary structure sequence

An old problem for more than 30 years

Discriminant analysis in the early stage

1. Simple Chou-Fasman approach

Chou-Fasman’s propensity of amino acid to conformational state

+ independence approximation

One residue, one state

Parameter Training

Propensities q(a,s)

Counts (20x3) from a database: N(a, s)

sum over a → N(s),

sum over s → N(a),

sum over a and s → N

q(a,s) = [N(a,s) N] / [N(a) N(s)].
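The propensity formula q(a,s) = N(a,s)·N / (N(a)·N(s)) can be sketched on toy counts (the 2×3 count table below is hypothetical; a real table is 20×3):

```python
# Chou-Fasman-style propensities from a toy count table.
# rows: amino acids E (Glu), G (Gly); columns: states H, E, C.
counts = {('E', 'H'): 80, ('E', 'E'): 10, ('E', 'C'): 30,
          ('G', 'H'): 20, ('G', 'E'): 15, ('G', 'C'): 65}

N = sum(counts.values())
Na = {a: sum(v for (x, s), v in counts.items() if x == a) for a in ('E', 'G')}
Ns = {s: sum(v for (a, x), v in counts.items() if x == s) for s in ('H', 'E', 'C')}

def q(a, s):
    # q > 1: the residue favors state s; q < 1: it avoids s
    return counts[(a, s)] * N / (Na[a] * Ns[s])
```

With these toy numbers, Glu comes out as a helix former (q > 1) and Gly as a helix breaker (q < 1), as in the slide's ordering.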

• Propensity of amino acids to secondary structures

• helix EAL HMQWVF KI DTSRC NY PG

• strand MVI CYFQLTW A RGD KSHNP E

• Garnier-Osguthorpe-Robson Window version of propensity to the state at the center

• −8 −7 −6 −5 −4 −3 −2 −1 0 +1 +2 +3 +4 +5 +6 +7 +8

• W R Q I C T V N A F L C E H S Y K

• HEC

• Based on assumption that each amino acid individually influences the propensity of the central residue to adopt a particular secondary structure.

• Each flanking position evaluated independently like a PSSM.

• 2. Garnier-Osguthorpe-Robson (GOR) window version

Conditional

Independency

Weight matrix (20x17)x3 P(W|s)

3. Improved GOR (20x20x16x3, to include pair correlation)

Width determined by mutual information.

Hidden Markov Model (HMM): simple-minded

Bayesian formula: P(S|A) = P(S,A)/P(A) ~ P(S,A) = P(A|S) P(S)

Simple version

emitting ai at si

Markov chain according to P(a|s)

For hidden sequence

Forward and backward functions

Tail

a1

a2

a3

s1

s2

s3

Initial conditions and recursion relations

Partition function

Prob(si=s, si+1=s’) = Ai(s) tss’ P(ai+1|s’) Bi+1(s’)/Z

This marginalization can be done for any segment, e.g. Prob(si:j)

Number of different S increases exponentially with |S|.

Linear algorithm: Dynamic programming.

Two interpretations of ‘operator’

as ‘sum’: Baum-Welch

as ‘max’: Viterbi
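The forward and backward functions and the partition function Z can be sketched for a tiny two-state HMM (the H/C states, transition and emission numbers below are toy values, not trained parameters), with Z checked against a brute-force sum over all hidden paths:

```python
# Forward-backward for a toy two-state HMM; Z checked by brute force.
from itertools import product

states = ['H', 'C']
init = {'H': 0.5, 'C': 0.5}
trans = {('H', 'H'): 0.8, ('H', 'C'): 0.2, ('C', 'H'): 0.3, ('C', 'C'): 0.7}
emit = {('H', 'A'): 0.6, ('H', 'G'): 0.4, ('C', 'A'): 0.2, ('C', 'G'): 0.8}
obs = ['A', 'G', 'A']

# forward: A_i(s) = P(a1..ai, s_i = s)
A = [{s: init[s] * emit[(s, obs[0])] for s in states}]
for a in obs[1:]:
    A.append({s: sum(A[-1][t] * trans[(t, s)] for t in states) * emit[(s, a)]
              for s in states})
Z = sum(A[-1].values())                      # partition function

# backward: B_i(s) = P(a_{i+1}..a_n | s_i = s)
B = [dict() for _ in obs]
B[-1] = {s: 1.0 for s in states}
for i in range(len(obs) - 2, -1, -1):
    B[i] = {s: sum(trans[(s, t)] * emit[(t, obs[i + 1])] * B[i + 1][t]
                   for t in states) for s in states}

# posterior marginals: Prob(s_i = s) = A_i(s) B_i(s) / Z
posterior = [{s: A[i][s] * B[i][s] / Z for s in states} for i in range(len(obs))]

# brute force over all 2^3 hidden paths
Z_bf = 0.0
for path in product(states, repeat=len(obs)):
    p = init[path[0]] * emit[(path[0], obs[0])]
    for i in range(1, len(obs)):
        p *= trans[(path[i - 1], path[i])] * emit[(path[i], obs[i])]
    Z_bf += p
```

The dynamic program is linear in the sequence length, while the brute-force sum grows exponentially, which is the point made above.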

• Hidden Markov Model: Realistic

• 1) Strong correlation in conformational states: at least two consecutive E and three consecutive H

• refined conformational states (243 → 75)

• 2) Emission probabilities → improved window scores

• Proportion of accurately predicted sites ~ 70% (compared with < 65% for prediction based on a single sequence)

• No post-prediction filtering

• Integrated (overall) estimation of refined conformation states

• Measure of prediction confidence

Discussion

• HMM using refined conformational states and window scores is efficient for protein secondary structure prediction.

• Better score system should cover more correlation between conformation and sequence.

• Combining homologous information will improve the prediction accuracy.

• From secondary structure to 3D structure (structure codes: discretized 3D conformational states)

• Three main ingredients of an HMM are

• hidden states

• Markov graph

• scores

• Model training is to determine the hidden parameters from the `observable' data. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition. An HMM can be considered as the simplest dynamic Bayesian network.

The length distribution of helices

t(hhhhh → hhhhh) from string counting is 0.88;

from length statistics:

t (hhhhh → hhhhh) =

is 0.82.

A simple approximation is to adjust the value at 0.83.

A more sophisticated way is to introduce more states for helices of different lengths, and make the geometric approximation only after certain length.

incoming states: 18 chhhh, 42 ehhhh,

outgoing states: 65 hhhhc, 66 hhhhe.

Other states of multi-h: [d = c or e]

67 dhhhhh, 68 hhhhhhh,

69 dhhhhhhd, 70 dhhhhhhhd, 71 dhhhhhhhhd, and 72 dhhhhhhhhh.

l = 6 7 8 ≥ 9

Letters in red are pre- or post-letters of a state.

Scoring based on reduction of amino acid alphabets

Pair sequence alignment viewed from HMM

VAPLSAAQAALVKSSWEEFNA--NIPKHTH template

ddd..................ii....... M0 hidden sequence in m,d,i

More on HMM

the profile HMM

VGA--HAGEY

V----NVDEV

VEA--DVAGH

VKG----PKD

VYS--TYETS

*** ***** template

Hidden states: M, I, D

Align a sequence to a profile.

D

I

V G A H A G E Y

V N V D E V

I A G- N G A G V

M

• Hidden Markov model

• A run in model M follows a Markovian path of states and generates a string S over a finite alphabet with probability PM(S).

• Typical HMM problems

• Annotation: Given a model M and an observed string S, what is the most probable path through M that generates/outputs S? — Viterbi algorithm

• Classification: Given a model M and an observed string S, what is the total probability PM(S) of M generating S? — Forward algorithm

• Training: Given a set of training strings and a model structure, find transition and emission probabilities that make the training set most probable.

• [Silent state: emits no symbols.]

• Other HMM problems

• Comparison: Given two models, what is a measure of their likeness?

• Consensus: Given a model M, find the string S that has the highest probability under the model.

Some topics in Bioinformatics:

An Introduction

Protein Conformational Letters

and Structure Alignment

Conformational letters and substitution matrix

• Protein structure alphabet: discretization of 3D continuous structure states

• Bridge 2’ structure and 3D structure

• Enhance correlation between sequence and structure

• Fast structure comparison

• Structure prediction (transplant 2’ structure prediction methods for 3D)

• Discrete conformational states

• Secondary structure states: helix + strand ~30+22% & Loop

• Single secondary structure may vary significantly in 3D.

• Phase space partition: Ramachandran plot (Rooman, Kocher & Wodak, 1991)

• Library of representative fragments (Park & Levitt, 1995)

• representative points in the phase space

• General clustering (Oligons; Micheletti et al, 2000)

• Hidden Markov models (Camproux et al., 1999)

• distributions as modes

• pros: connectivity effect (correlation)

• cons: many parameters, inconvenient for assigning states to

• short segments

• Mixture model: clustering based on peaks in probability

• distribution


E: extended

B: beta

I,H: alpha

L,M: left

T,U: turn

P: Pro-rich

G: Gly-rich

N: NH-NH conflict

O: CO-CO conflict

S: side chain conflict

*: NH-CO conflict

[Slide graphic: locations of the conformational states listed above in the (φ, ψ) torsion-angle plane]

AN APPROACH TO DETECTION OF PROTEIN STRUCTURAL MOTIFS USING AN ENCODING SCHEME OF BACKBONE CONFORMATIONS

H. Matsuda, F. Taniguchi, A. Hashimoto

The placement of a protein backbone chain with an icosahedron in the Cartesian coordinates.

Representation of 3D backbone: C-alpha pseudobond angles

Four-residue fundamental unit: btb r1r2r3r4

[Slide graphic: four-residue fundamental unit with pseudobond bend (b) and torsion (t) angles, e.g. b2t3b3; Cα–Cα pseudobond length ~3.8 Å]

Sliding windows S1, S2

Pseudobonds: bend and torsion angles

For the dominant trans form, |r| = 3.8 Å

Bend angle in [0, π]

Torsion: dihedral angle, with sign

n residues: btbtb...tb

n−2 bend angles + n−3 torsion angles = 2n−5 angles

The origin is at r0, r01 falls on the x-axis, and r12 lies in the xy-plane; t1 ≡ 0.

coordinates → angles

Bend and torsion angles of pseudobonds [Slide graphic: angle distributions; peaks at 1.55 and 1.10]

Mixture model for the angle probability distribution

c: the number of the normal distribution categories

Bayes formula:

Fuzzy vs. greedy assignment of structural codes

Mixture model for the angle probability distribution

EM updating

Objective functions

EM~ Fuzzy

(Baum-Welch)

“Most possible”

(Viterbi) Greedy

• Downhill search for distribution density peaks

• Objective function: counts in a box centered at the given point.

• Grid of initial points (2*5*2 *5*2)

• Box size (0.1~0.2 radians) ~ Parzen window width ~ resolution

• Three-angle space (4-residue unit) vs. five-angle space (5-residue unit)

• b1t1b2t2b3 ~ b1t1b2 + b2t2b3

• EM training (fixed category number, approximate category centers)

• Tracing the Viterbi likelihood of the ‘most possible’ category states

• ‘Good’ centers survive after the training.

Downhill to determine category number c


Grid for initial points: theta = 1.10, 1.55; tau = −2.80, −2.05, −1.00, 0.00, 0.87

btb phase space & btbtb phase space

Width of Parzen window


17 structure states from the mixture model

Conformational alphabet and CLESUM

Conformational letters = discretized states of 3D segmental conformations.

A letter = a cluster of combinations of the three angles formed by the Cα pseudobonds of four contiguous residues (obtained by clustering according to the probability distribution).

Centers of 17 conformational letters

2’ structure and codes

Mutual Information

between

Codes & 2’ structures

= 0.731

Forward transition rates

Errors of conformational codes

distance: root mean squared deviation

Mean ~ 0.330

FSSP tree (arrows point to a representative template; 2860 representatives)

same first 3 family indices, 1.14M structural amino acid pairs

FSSP representative pairs with the same first three family indices are used to construct CLESUM.

amino acids

a.b.c.u.v.w avpetRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRS...

a.b.c.x.y.z ahLTVKKIFVGGIKEDT....EEHHLRDYFEQYGKIEVIEIMTDRGS

conformational letters

a.b.c.u.v.w CCPMCEALEEEENGCPJGCCIHHHHHHHHIKMJILQEPLDEEEBGAIK

a.b.c.x.y.z ...BBEBGEDEENMFNML....FAHHHHHKKMJJLCEBLDEBCECAKK

NAB++; NBA++;

Similarity between conformational letters

CLESUM: Conformational LEtter SUbstitution Matrix

typical helix

evolutionary

+ geometric

typical sheet

Mij = 20 · log2[Pij/(Pi Pj)] ~ BLOSUM83, H ~ 1.05

constructed using FSSP representatives.

Example: FSSP alignment and CLESUM alignment

1urnA avpetRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRS...
1ha1b ahLTVKKIFVGGIKEDT....EEHHLRDYFEQYGKIEVIEIMTDRGS

1urnA CCPMCEALEEEENGCPJGCCIHHHHHHHHIKMJILQEPLDEEEBGAIK

1ha1b ...BBEBGEDEENMFNMLFA....HHHHHKKMJJLCEBLDEBCECAKK

1urnA LKMRGQAFVIFKEVSSATNALRSMqGFPFYDKPMRIQYAKTDSDIIAKM

1ha1b GKKRGFAFVTFDDHDSVDKIVIQ.kYHTVNGHNCEVRKAL

1urnA ...GNGEDBEEALAJHHHHHHIKKGNGCENOGCCEFECCALCCAHIJH

1ha1b AGCPOLEDEEEALBJHHHHI.IJGALEEENOGBFDEECC.........

gap penalty: (−12, −4)

CLESUM alignment

For 1urnA and 2bopA there are three common segments,

and for 1urnA and 1mli two common segments longer than 8.

1urnA

1ha1

Sequence alignment

for 1urnA and 1ha1

two segments of lengths 13 and 21 of the sequence alignment coincide with the FSSP

no coincidence is seen in the other two cases.

2bopA

1mli

Entropic h-p clustering: AVCFIWLMY and DEGHKNPQRST

CLESUM-hh (lower left) and CLESUM-pp (upper right)

CLESUM-hp for type h-p (row-column)

• Fast pair structure alignment

• Protein structure comparison: an extremely important problem in structural and evolutionary biology.

• detection of local or global structural similarity

• prediction of a new protein's function

• structures are better conserved → remote homology detection

• Structural comparison →

• organizing and classifying structures

• discovering structure patterns

• discovering correlation between structures & sequences →

• structure prediction

• CLePAPS: Pairwise structure alignment

• Structure alignment --- a self-consistent problem

• Correspondence ⇄ Rigid transformation

• However, when aligning two protein structures, at the beginning we know neither the transformation nor the correspondence.

• DALI, CE

• VAST

• STRUCTAL, ProSup

• CLePAPS: Conformational Letters based Pairwise Alignment of Protein Structures

• Initialization + iteration

• Similar Fragment Pairs (SFPs);

• Anchor-based;

• Alignment = As many consistent SFPs as possible

Rigid transformation for superimposition

Finding the rotation R and the translation T that minimize

E = Σ_i | R x_i + T − y_i |^2

If the two sets of points are shifted with their centers of mass at the origin, T = 0. Let X (3×n) and Y (3×n) be the sets after the shift. Introduce the correlation matrix

C = Y X^T

The objective function is

F = Tr[ (RX − Y)^T (RX − Y) ] + g (|R| − 1) + Tr[ L (R^T R − I) ],

where the Lagrange multipliers are g and the symmetric matrix L, representing the conditions for R to be an orthogonal and proper rotation matrix.

Minimization leads to the constraint C = RM, where M is symmetric. With the singular value decomposition C = U D V^T, take R = U S V^T with S = diag(s_i), s_i = 1 or -1. Then |C| = |R||M| = |M| = |D||S|.

Singular values are non-negative, so |D| > 0. Finally, |S| = sgn(|C|), and R = U diag(1, 1, sgn|C|) V^T.
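The derivation above can be sketched directly in code; `kabsch` below is a minimal NumPy implementation of the SVD solution (the function name and the 3×n array layout are my own choices, and the degenerate case |C| = 0 is not handled).

```python
import numpy as np

def kabsch(X, Y):
    """Rotation R and translation T minimizing sum_i |R x_i + T - y_i|^2.
    X, Y: 3 x n arrays of matched coordinates."""
    xc = X.mean(axis=1, keepdims=True)       # centers of mass
    yc = Y.mean(axis=1, keepdims=True)
    C = (Y - yc) @ (X - xc).T                # 3x3 correlation matrix C = Y X^T
    U, D, Vt = np.linalg.svd(C)
    s = np.sign(np.linalg.det(C))            # |S| = sgn(|C|): proper rotation
    R = U @ np.diag([1.0, 1.0, s]) @ Vt      # R = U diag(1, 1, sgn|C|) V^T
    T = yc - R @ xc
    return R, T
```

Applying a known rotation and translation to a point set and feeding both sets to `kabsch` recovers the transformation exactly, which is a convenient sanity check.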

DALI & CE

Similar 3D structures have similar intra-molecular distances (or similar sub-patterns of their distance matrices) → Similar Fragment Pairs

VAST

Create vectors for SSEs

Align SSE

Refine residue alignment

SSE:

secondary structure element

SFPs

Anchor-based superposition

consistent

anchor SFP

inconsistent

Collect as many consistent SFPs as possible
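One way to read "consistent" here: an SFP agrees with the anchor if, under the anchor's rigid transformation, its two fragments already superimpose within a distance cutoff. A minimal sketch, with the cutoff value and function name assumed:

```python
import numpy as np

def is_consistent(R, T, X_frag, Y_frag, cutoff=4.0):
    """An SFP is taken as consistent with the anchor if, under the anchor's
    transformation (R, T), its two fragments superimpose within the cutoff.

    X_frag, Y_frag: 3 x w fragment coordinates; cutoff is an assumed value
    for the maximal coordinate difference."""
    diff = np.abs(R @ X_frag + T - Y_frag).max()
    return bool(diff <= cutoff)
```

Collecting the alignment then amounts to keeping every SFP for which this predicate holds under the anchor's superposition.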

• SFP = highly scored string pair

• Fast search for SFPs by string comparison

• CLESUM similarity score ranks the importance of SFPs

• Guided by CLESUM scores, only the top few SFPs need to be examined to determine the superposition for alignment, and hence a reliable greedy strategy becomes possible.
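A brute-force version of the SFP search by string comparison can be sketched as follows; the `submat` lookup, the fixed fragment `width`, and the `(score, i, j)` tuple layout are illustrative assumptions, not CLePAPS's actual data structures.

```python
def find_sfps(s1, s2, submat, width=8, top_k=10):
    """Score every fragment pair of a fixed width in two conformational-letter
    strings and return the top_k highest scoring ones as (score, i, j)."""
    hits = []
    for i in range(len(s1) - width + 1):
        for j in range(len(s2) - width + 1):
            # CLESUM-style similarity score of the two fragments
            score = sum(submat[s1[i + k], s2[j + k]] for k in range(width))
            hits.append((score, i, j))
    hits.sort(reverse=True)            # best scoring fragment pairs first
    return hits[:top_k]
```

Because comparison happens on short letter strings rather than coordinates, this scan is cheap, which is what makes the greedy top-K examination practical.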

Redundancy removal: [Figure: among similar overlapping SFPs, the seed is kept and the smaller one is shaved.]

Selection of optimal anchor

[Figure: Example with Top K, K = 2; J = 5. Five SFPs ranked 1–5. With the rank-1 SFP as anchor, # of consistent SFPs = 4; with the rank-2 SFP as anchor, # of consistent SFPs = 1.]

The Top-1 SFP is globally supported by three other SFPs, while the Top-2 SFP is supported only by itself.

‘Zoom-in’

[Figure: anchor superposition refined with successively reduced cutoffs d1 > d2 > d3 for the maximal coordinate difference.]
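The zoom-in loop can be sketched as below: superimpose on the currently kept pairs, drop pairs whose maximal coordinate difference exceeds the cutoff, and repeat with a tighter cutoff. The cutoff values and the least-squares helper are illustrative assumptions, not CLePAPS's actual parameters.

```python
import numpy as np

def superpose(X, Y):
    """Least-squares rotation/translation taking X onto Y (Kabsch-style)."""
    xc, yc = X.mean(axis=1, keepdims=True), Y.mean(axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd((Y - yc) @ (X - xc).T)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ S @ Vt
    return R, yc - R @ xc

def zoom_in(X, Y, cutoffs=(8.0, 4.0, 2.0)):
    """Refine a correspondence by superposing and dropping residue pairs whose
    maximal coordinate difference exceeds a successively reduced cutoff.

    X, Y: 3 x n matched coordinates. Returns the kept column indices."""
    keep = np.arange(X.shape[1])
    for d in cutoffs:                  # d1 > d2 > d3
        R, T = superpose(X[:, keep], Y[:, keep])
        diff = np.abs(R @ X + T - Y).max(axis=0)
        keep = np.nonzero(diff <= d)[0]
    return keep
```

Each pass raises specificity (badly placed pairs are shaved off) while the good pairs survive all cutoffs.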

[Figure: specificity vs. sensitivity illustration; values 1/2, 5/6, 1.]

Flow chart of the CLePAPS algorithm

• Finding SFPs of high CLESUM similarity scores

• The greedy `zoom-in' strategy

• Refinement by elongation and shrinking

• The Fischer benchmark test

• Database search with CLePAPS

• Multiple solutions of alignments: repeats, symmetry, domain move

• Non-topological alignment and domain shuffling

• Fast multiple structure alignment

• Multiple alignment carries significantly more information than pairwise alignment, and hence is a much more powerful tool for classifying proteins, detecting evolutionary relationship and common structural motifs, and assisting structure/function prediction.

• Most existing methods of multiple structural alignment combine a pairwise alignment and some heuristic with a progressive-type layout to merge pairwise alignments into a multiple alignment.

• like CLUSTAL-W and T-Coffee: MAMMOTH-mult, CE-MC

• slow

• alignments which are optimal for the whole input set might be missed

• A few true multiple alignment tools: MASS, MultiProt

• CE-MC: a multiple protein structure alignment server

• CE all-against-all pairwise alignments;

• guide tree using the UPGMA in terms of Z-scores;

• progressive alignment by sequentially aligning structures according to the tree;

• highest scoring structure pair between two clusters is used to guide the alignment of the clusters;

• An eligible column in the alignment contains residues (not gaps) in at least one third of its rows. Column distance is defined as the geometric distance averaged over residue pairs in each column.

• Scoring function

• M = 20; l: total number of eligible columns; A = 0 if di < d0, A = 10 if di > d0; gap penalties G = 15 (open), 7 (extend).

• 7) Random trial moves are performed on the alignment one residue at a time or one column at a time and a new score is calculated for each trial move.

• Maximum number of structures = 25.

Vertical equivalency and horizontal consistency

vertical: local similarity among structures; horizontal: consistent spatial arrangement for a pair

• MultiProt – A Multiple Protein Structural Alignment Algorithm

• largest common point (LCP) set detection

• Multiple LCP: for each r, 2 ≤ r ≤ m, find the k largest e-congruent multiple alignments containing exactly r molecules.

• Detect structurally similar fragments of maximal length.

• 0) iteratively choose every molecule to be the pivot one.

• establish an initial, local correspondence, and the 3-D transformation

• calculate the global similarity based on some distance measure.

• a) Multiple Structure Fragment Alignment. (protein, start, width), {p,i,l}

• b) Global Multiple Alignment. given fragment pair → pairwise correspondence → multiple correspondence → iterative improvement.

• c) Bio-Core detection. hydrophobic (A,V,I,L,M,C), polar/charged (S,T,P,D,E,K,R,H,Q,N), aromatic (F,Y,W), glycine (G).

• Multiple Alignment by Secondary Structures (MASS)

• two-level alignment using both secondary structure and atomic representation: SSE assignment → SSE representation → local basis alignment → grouping → global extension → filtering and scoring

• initial local alignments are obtained based on SSE coarse representation. Atomic coordinates are then used to refine and extend the initial alignments and so to obtain global atomic superpositions.

Highly similar fragment block (HSFB) Algorithm

Attributes of HSFB: width, positions, depth, score, consensus

Horizontal consistency of two HSFBs

[Figure: anchor-based multiple superposition — seed and template HSFBs; proteins a, b, c superimposed on the pivot; consistent (a&b, a&c) vs. inconsistent fragment pairs; anchor HSFB.]

1. Creating HSFBs

create HSFBs using the shortest protein as a template

sort HSFBs according to depths, then to scores

derive redundancy-removed HSFBs by examining position overlap

If a new HSFB has a high proportion of positions overlapping with existing HSFBs, remove it.

2. Selecting optimal HSFB

for each HSFB in top K

select the pivot protein based on the HSFB consensus;

superimpose anchored proteins on the pivot;

find consistent HSFBs;

A consistent HSFB contains at least 3 consistent SFPs.

select the best HSFB which admits most consistent HSFBs;

3. Building scaffold

build a primary scaffold from the consistent HSFBs;

update the transformation using the consistent HSFBs;

recruit aligned fragments;

improve the scaffold;

create average template;

4. Dealing with unanchored proteins

Unanchored protein: one with no member in the anchor HSFB supported by a sufficient number of consistent SFPs.

for each unanchored protein

if (members are found in colored HSFBs) find top K members;

(try to use ‘colored’ HSFBs other than the anchor HSFB)

else search for fragments similar to the scaffold, and select top K;

pairwise-align the protein on the template;

5. Finding missing motifs

find missing motifs by registering atoms in spatial cells;

Only patterns shared by the pivot protein have a chance to be discovered above. There are two ways to discover ‘missing motifs’: searching for maximal star-trees, and registering atoms in spatial cells. The latter: divide the space occupied by the structures after superimposition into uniform cubic cells of a finite size, say 6 Å. The depth of a cell is the number of different proteins with atoms in it. Sort cells in descending order of depth.

find aligned fragments
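The cell-registration step above can be sketched as follows; the 6 Å cell size comes from the text, while the dictionary-of-sets layout, function name, and input format are my own assumptions.

```python
from collections import defaultdict

def register_cells(structures, cell_size=6.0):
    """Register atoms of superimposed structures in cubic spatial cells.

    structures: {protein_id: iterable of (x, y, z) coordinates}.
    Returns (cell_key, protein_set) pairs sorted by descending depth, where
    the depth of a cell is the number of different proteins in it."""
    cells = defaultdict(set)
    for pid, coords in structures.items():
        for x, y, z in coords:
            # integer cell index along each axis (// floors negatives too)
            key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
            cells[key].add(pid)
    return sorted(cells.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Deep cells (occupied by many proteins) then point at candidate motifs that the pivot-based pass may have missed.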

6. Final refinement

refine the alignment and the average template;

• Conclusion

• CLePAPS and BLOMAPS distinguish themselves from other existing algorithms for structure alignment in the use of conformational letters.

• Conformational alphabet: aptly balance precision with simplicity

• CLESUM: a proper measure of similarity between states

• fit the e-congruent problem

• CLESUM, extracted from the FSSP database, contains structure-database statistics, which reduces the chance of accidentally matching two irrelevant helices. Evolutionary + geometric = specificity gain.

• For example, two frequent helices are geometrically very similar, but their score is relatively low.

• CLESUM similarity scores can be used to sort SFPs/HSFBs by importance for a greedy algorithm; only the top few SFPs/HSFBs need to be examined.

Conclusion

Greedy strategies:

Use the shortest protein to generate HSFB

Use consensus to select pivot

Top K --- guided by scores

Optimal anchor HSFB

Missing motifs

Tested on 17 structure datasets

Faster than MASS by 3 orders of magnitude

The End