Some topics in Bioinformatics: An introduction 1, Primary mathematical statistics

# Some topics in Bioinformatics: An introduction 1, Primary mathematical statistics

## Some topics in Bioinformatics: An introduction 1, Primary mathematical statistics

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
##### Presentation Transcript

1. Some topics in Bioinformatics: An introduction 1, Primary mathematical statistics 2, Sequence motif discovery 2.1 Imbedded Markov chain 2.2 Search for patterns by alignment 3, HMM for protein secondary structure prediction 4, Protein conformational letters and structure alignment 4.1 Conformational letters and substitution matrix 4.2 Fast pair structure alignment 4.3 Fast multiple structure alignment

2. Some topics in Bioinformatics: An Introduction Primary Mathematical Statistics

3. Mathematics: given cause, `calculate' outcomes. Science: given outcomes + available prior knowledge, infer possible causes. Mathematical statistics: Draw a conclusion from inconclusive data by plausible reasoning. ‘Probability and statistics’ unfortunate dichotomy: Probability makes sense, but is just a game; statistics, a bewildering collection of tests, is useful, but has little reason. It doesn't have to be like this. The Bayesian approach uses only probability theory, and makes the topic of statistics entirely superfluous. It provides the logic justification for many of the prevalent statistical tests, making explicit the conditions and approximations implicitly assumed in their use.

4. Contingency table Nb. = NbA + NbB, N.A = NbA + NgA, N = Nb. + Ng. = N.A + N.B. P(bA) = NbA / N, P(b) = Nb. / N, P(b) + P(g) = 1, P(A|b) = NbA / Nb., P(bA) = P(b)*P(A|b). P(b) = P(bA) + P(bB).

5. Three Formulae & Bayes’ Theorem Sum rule: Product rule: I: the relevant background information at hand. Marginalization is powerful device in dealing with nuisance parameters, which are necessary but of no intrinsic interest (e.g. background signal). Bayes' theorem posterior likelihood prior

6. By simple reinterpreting X as Model or Hypothesis and Y as Data, the formula becomes which forms the foundation for data analysis. In some situations, such as model selection, the omitted denominator plays a crucial role, so is given the special name of evidence. Parameter estimation:

7. Example: Student t-distribution and -distribution The MaxEnt distribution under the restriction of given mean and variance is the (Gaussian) normal distribution Maximal entropy = maximal uncertainty in prerequisite or prior conditions. Suppose { xi }are samples from a normal distribution with unknown parameters mand s. The likelihood function is Simple flat prior gives the marginal posterior P({x}|m,s) P(m,s) / P({x}) The estimate of parameter mis

8. Introducing the decomposition implies which is a t-distribution. Another marginal posterior is a -distribution.

9. Example: DNA sequence viewed as a outcome of die tossing A Nucleotide Die is a tetrahedron: {a, c, g, t}. The probability of observing a nucleotide a is p(a) = pa. Under the independent nonidentical distribution model, the probability to observe sequence s1s2…sn is P(s1s2…sn | p) = p(s1)p(s2)…p(sn) = Long repeats Suppose the probability for nucleotide a is p. The length Y of a repeat of a has the geometric distribution Thus, for the longest Ymax in n repeats, we have For an N long sequence, it is estimated that n = (1-p)N. We have the P-value = Prob(Ymax ≥ ymax) ≈ This discussion can be also used for word match between two sequences.

10. Entropies Shannon entropy: Relative entropy distance of {p} from {q} The equality is valid iff P and Q are identical. H measures the distance between P and Q. A related form (as leading terms of H) is Conditional entropy Mutual information (to measure independency or correlation) Maximal entropy ⇔ Minimal likelihood a relative entropy Maximal uncertainty to `unknown' (premise); Minimal uncertainty to `known' (outcome).

11. Nonparametric estimate of distribution (Parzen kernel) Replace a single data point (d-function) by a kernel function: where d is the dimensionality, and kernel K satisfies some conditions. The Gaussian kernel is For covariance S the distance is one of Mahalanobis; for identity S it is Euclidean. The optimal scale factor h may be fixed by minimize r or maximize m: Pseudocounts viewed as a ‘kernel’.

12. Sampling equilibrium distribution The global balance equation and the detail balance condition are There are two sets of probabilities: visiting probability from old state i to new j, and acceptance probability to accept j. Hastings-Metropolis Algorithm uses For Metropolis is assumed while for Gibbs

13. Jensen inequality For a random variable X and a continuous convex function g(x) on (a; b), we have Proof: Suppose that and g’(x0) = k. From the convexity we have which, by taking to be leads to By taking expectation, this yields the inequality. Discriminant analysis and cluster analysis Supervised learning and unsupervised learning

14. Example: Reduction of amino acid alphabet We are interested in the mutual information I(C;a) between conformation and amino acids Clustering amino acids leads to The difference between values of I after and before clustering is given by which, by introducing and defining is proportional to with f(x) = x*log x. From the Jensen theorem, for the convex function x log x here we have , so I never increases after any step of clustering.

15. (Conditional) probabilities represent logical connections rather than causal ones. Example of Jaynes (1989): Given a bag of balls, five red and seven green. When a ball is selected at `random', the probability of drawing a red is 5/12. If the first ball is not returned to the bag, then it seems reasonable that the probability of obtaining a red or green ball on the second draw will depend on the outcome of the first. Now suppose that we are not told the outcome of the first draw, but are given the result of the second draw. Does the probability of the first draw being red or green change with the knowledge of the second? The answer ‘no' is incorrect. The error becomes obvious in the extreme case of one red and one green. Probability as a degree-of-belief or plausibility; or a long-run relative frequency? Randomness is central to orthodox statistics and uncertainty inherent in Bayesian theory. Necessity and contingency / certainty and randomness Deterministic and stochastic Deterministic and predictable

16. Motif Discovery Wei-Mou Zheng Inst Theor Phys, Academia Sinica zheng@itp.ac.cn

17. Introduction Identifying motifs or signals is a very common task. Patterns: Shed light on structure and function, used for classifying An example: the transcription regulatory sites (TFBSs). Transcription factors and their binding sites (TFBS) gene regulatory proteins RNA polymerase TATA box Transcription start site (TSS)

18. Such motifs are • generally short, • quite variable, and • often dispersed over a large distance. • Two ways for representing motifs: • the consensus sequence, or • a position-specific weight matrix (PSWM) • Profile HMM • Methods for Motif Discovery • Consensus: Counting & evaluating • PSWM: Alignment & motif matrix updating

19. Imbedded Markov chain (IMC) • Consensus: • TATA-box:TATAAATA, Leucine zipper:L-X(6)-L-X(6)-L-X(6)-L • Counting & evaluating • Motifs are identified as over- or under-represented oligomers. • Count all possible DNA words of a certain length in an input set and then use statistics to evaluate over-represented words. • Compare observed (in input set) with expected (in a background model) • possible k-tuples increases exponentially • limited to simple, short motifs with a highly conserved core. • complication due to overlaps of words. • Independence approximation: ignore the correction of auto-correlation of pattern overlap, assume motifs occur independently at each position. • DNA alphabet size is only 4. • Simple patterns such as AAAA are easily found. • Are they significant? • Masking, high-order Markov (e.g. M3) models • Many mathematical results, lack of easy implementary algorithms

20. Numerical enumerative techniques Imbedded Markov Chain (IMC) (Kleffe & Langbecker, 1990; Fu & Koutras, 1994) Example: x = 11011011, {0, 1} Overlap Ending States: prefixes arranged according to lengths. If a sequence belongs to state k, it must end with the string of state k, but not end with any strings with an index larger than k. State(110110) = 6, State(110111) = 2 11011011 11011011 11011011 11011011

21. Numerical enumerative techniques • The IMC states are complete and exclusive. • State transition matrix T (Boolean) • P(n,l,k), the probability for an l long sequence with ending state kto • have noccurrences • Recursion relations • P(n,l,j) = ∑imjTijP(n,l-1,i), if j≠ 8, • P(n,l,8) = ∑im8Ti8P(n-1,l-1,i), • mj denotes the probability for the last letter of the string of state j. • Generalizations • Alphabet size: binary quaternary, number of states + 2 • M0 to Mm: independent probability -> conditional probability, • initialization, number of states = |A|**m + |x| - m • Multiple words: States = Union of {single word states} • DNA M3 model, x: TATAAA • States: TATAAA, TATAA, TATA, 64 triplets including TAT; • 64+(6-3) = 67

22. Numerical enumerative techniques Known result for Calculation of q, the probability with hits, and variation is not simple. Introduce(Q(1,L,k) determines q) (R(1,L,k)determines <n>) (S(1,L,k)determiness)

23. Numerical enumerative techniques Recursion relations and similar relations for S. Using recursion relations for P, Q, R and S, we can calculate and to obtain q, and .

24. Search for pattern by alignment Block: Position Specific Weight Matrix (PSWM) Count number of occurrences of every letter at each position, A simple probabilistic models of sequence patterns; log likelihood or log odds; Alignment without gaps. Alignment score to determine optimal alignment for motif discovery Match score for locating known motif in a new sequence AAATGACTCA AGATGAGTCA GAGAGAGTCA AGAAGAGTCA GAATGAGTCA AAATGACTCA AAGAGAGTCA GAATGAGTCA AAATGAGTCA consensus AAAWGAGTCA consensus with degenerate letter: W = A or T TGACTCWTTT motif on the complement strand

25. Motif search, a self-consistent problem • Motif positions weight matrix • If positions are known, the determination of matrix is straightforward. • If the matrix is known, we can scan sequences for positions with the matrix. • However, at the beginning we know neither the positions, nor the matrix. • Break the loop by iteration • Make an initial guess of motif positions, • Estimate weight matrix, • Update motif positions, • Convergent? End : Go to 2.

26. Updating: Greedy, Fuzzy, Random • Two clusters of points on the plane • Euclidian distance • Two centers L and R • initializing & updating • Distances dL and dR from point X to the centers • probability measure Pr(L|X) ∝ exp(- dL² ) • Greedy updating • if dL > dR, X belongs to L • Fuzzy updating • mixture: membership ∝ probability • Stochastic or random updating • sampling according to probabilities • Initialization strongly affects the final result. R L R L R L

27. Single motif in 30 sequences VLHVLSDREKQIIDLTYIQNK 225 SQKETGDILGISQMHVSR LQRKAVKKLREALIEDPSMELM LDDREKEVIVGRFGLDLKKEK 198 TQREIAKELGISRSYVSR IEKRALMKMFHEFYRAEKEKRK MELRDLDLNLLVVFNQLLVDR 22 RVSITAENLGLTQPAVSN ALKRLRTSLQDPLFVRTHQGME TRYQTLELEKEFHFNRYLTRR 326 RRIEIAHALCLTERQIKI WFQNRRMKWKKENKTKGEPGSG HRILKEIEIPLLTAALAATRG 449 NQIRAADLLGLNRNTLRK KIRDLDIQVYRSGG METKNLTIGERIRYRRKNLKH 22 TQRSLAKALKISHVSVSQ WERGDSEPTGKNLFALSKVLQC MNAY 5 TVSRLALDAGVSVHIVRD YLLRGLLRPVACTTGGYGLFDD ELVLAEVEQPLLDMVMQYTRG 73 NQTRAALMMGINRGTLRK KLKKYGMN SPQARAFLEQVFRRKQSLNSK 99 EKEEVAKKCGITPLQVRV WFINKRMRSK ANKRNEALRIESALLNKIAML 25 GTEKTAEAVGVDKSQISR WKRDWIPKFSMLLAVLEWGVVD LLNLAKQPDAMTHPDGMQIKI 169 TRQEIGQIVGCSRETVGR ILKMLEDQNLISAHGKTIVVYGTR MEQRITLKDYAMRF 15 GQTKTAKDLGVYQSAINK AIHAGRKIFLTINADGSVYAEEVK MYKKDVIDHFG 12 TQRAVAKALGISDAAVSQ WKEVIPEKDAYRLEIVTAGALKYQ MDNRVREACQYISDHLADSNF 196 DIASVAQHVCLSPSRLSH LFRQQLGISVLSWREDQRISQAKL YNLSRRFAQRGFSPREFRLTM 196 TRGDIGNYLGLTVETISR LLGRFQKSGMLAVKGKYITIENND GLDERSQDIIRARWLDEDNKS 252 TLQELADRYGVSAERVRQ LEKNAMKKLRAAIEA SEAQPEMERTLLTTALRHTQG 444 HKQEAARLLGWGRNTLTR KLKELGME MKAKKQETAA 11 TMKDVALKAKVSTATVSR ALMNPDKVSQATRNRVEKAAREVG ETRREERIGQLLQELKRSDKL 23 HLKDAAALLGVSEMTIRR DLNNHSAPVVLLGGYIVLEPRSAS MA 3 TIKDVARLAGVSVATVSR VINNSPKASEASRLAVHSAMESLS MKPV 5 TLYDVAEYAGVSYQTVSR VVNQASHVSAKTREKVEAAMAELN DKSKVINSALELLNEVGIEGL 26 TTRKLAQKLGVEQPTLYW HVKNKRALLDALAIEMLDRHHTHF DEREALGTRVRIVEELLRGEM 67 SQRELKNELGAGIATITR GSNSLKAAPVELRQWLEEVLLKSD WLDNSLDERQRLIAALEKAGW 495 VQAKAARLLGMTPRQVAY RIQIMDITMPRL LNEREKQIMELRFGLVGEEEK 205 TQKDVADMMGISQSYISR LEKRIIKRLRKEFNKMV RRPKLTPEQWAQAGRLIAAGT 160 PRQKVAIIYDVGVSTLYK RFPAGDK MA 3 TIKDVAKRANVSTTTVSH VINKTRFVAEETRNAVWAAIKELH MA 3 TLKDIAIEAGVSLATVSR VLNDDPTLNVKEETKHRILEIAEK ARQQEVFDLIRDHISQTGMPP 27 TRAEIAQRLGFRSPNAAE EHLKALARKGVIEIVSGASRGIRL TSTRKKANAITSSILNRIAIR 25 GQRKVADALGINESQISR WKGDFIPKMGMLLAVLEWGVEDEE

28. BLOck SUubstitution Matrix BLOSUM62 a matrix of amino acid similarity SQKETGDILGISQMHVSR TQREIAKELGISRSYVSR RVSITAENLGLTQPAVSN RRIEIAHALCLTERQIKI NQIRAADLLGLNRNTLRK TQRSLAKALKISHVSVSQ TVSRLALDAGVSVHIVRD NQTRAALMMGINRGTLRK EKEEVAKKCGITPLQVRV GTEKTAEAVGVDKSQISR TRQEIGQIVGCSRETVGR . . . weight matrix substitution matrix A R N D C Q E G ... A 4 -1 -2 -2 0 -1 -1 0 R -1 5 0 -2 -3 1 0 -2 N -2 0 6 1 -3 0 0 0 D -2 -2 1 6 -3 0 2 -1 C 0 -3 -3 -3 9 -3 -4 -3 Q -1 1 0 0 -3 5 2 -2 E -1 0 0 2 -4 2 5 -2 G 0 -2 0 -1 -3 -2 -2 6 ...

29. Initialization by similarity VLHVLSDREKQIIDLTYIQNKSQKETGDILGISQMHVSRLQRKAVKKLREALIEDPSMELM LDDREKEVIVGRFGLDLKKEKTQREIAKELGISRSYVSRIEKRALMKMFHEFYRAEKEKRK Start motif search with BLOSUM62. Two segments with a high similarity score are close neighbors. Motif is regarded as a group of close neighbors. Take each window of width 18 as a seed, skip the sequence that the seed is in, search every sequence for the most similar neighbor of the seed in each sequence which has a similarity not less than s0, and find the score sum of these neighbors. A seed and its near neighbors form a block or star tree, from which a primitive weight matrix and a substitution matrix can be deduced. The weight matrix or the substitution matrix (with the seed) is then used to scan each sequence for updating positions.

30. Top 10 seeds Block of seed a2=198 (23 sequences) weight matrix new block new weight matrix … (convergence) 28/30 correct; seed a25=205: 29/30 correct Block of seed a2=198 New substitution matrix new block(27 sequences) weight matrix … (convergence) 30/30 correct; Only greedy updating in use. The motif might be wider. a2 = 198, 199, 197, 201,

31. Two motifs in 5 sequences MOTIF A MOTIF B KPVN 17 DFDLSAFAGAWHEIAK LPLE...NLVP 104 WVLATDYKNYAINYNC DYHP QTMK 25 GLDIQKVAGTWYSLAM AASD...ENKV 109 LVLDTDYKKYLLFCME NSAE KPVD 16 NFDWSNYHGKWWEVAK YPNS...ENVF 100 NVLSTDNKNYIIGYYC KYDE RVKE 14 NFDKARFAGTWYAMAK KDPE...NDDH 105 WIIDTDYETFAVQYSC RLLN STGR 27 NFNVEKINGEWHTIIL ASDK...FNTF 109 TIPKTDYDNFLMAHLI NEKD

32. Top six seeds Two motifs are seen, with the second shifted by 1. If , two blocks have been correct. New substitution matrix discovers the correct blocks.

33. Repeating motifs >>>SVLEIYDDEKNIEPALTKEFHKMYLDVAFEISLPPQMTALDASQPW 109 MLYWIANSLKVM DRDW LSDD 129 TKRKIVDKLFTI SPSG 145 GPFGGGPGQLSH LA 159 STYAAINALSLC DNIDGC WDRI 181 DRKGIYQWLISL KEPN 197 GGFKTCLEVGEV DTR 212 GIYCALSIATLL NI LTEE 230 LTEGVLNYLKNC QNYE 246 GGFGSCPHVDEA HGG 261 YTFCATASLAIL RS MDQI 279 NVEKLLEWSSAR QLQEE 296 RGFCGRSNKLVD GC 310 YSFWVGGSAAIL EAFGY GQCF 331 NKHALRDYILYC CQEKEQ 349 PGLRDKPGAHSD FY 363 HTNYCLLGLAVA E 376 SSYSCTPNDSPH NIKCTPDRLIGSSKLTDVNPVYG LPIE 415 NVRKIIHYFKSN LSSPS >>>L 74 QREKHFHYLKRG LRQLTDAYE CLDA 99 SRPWLCYWILHS LELLDEP IPQI 122 VATDVCQFLELC QSPD 138 GGFGGGPGQYPH LA 152 PTYAAVNALCII GTEEA YNVI 173 NREKLLQYLYSL KQPD 189 GSFLMHVGGEVD VR 203 SAYCAASVASLT NI ITPD 221 LFEGTAEWIARC QNWE 237 GGIGGVPGMEAH GG 251 YTFCGLAALVIL KK ERSL 269 NLKSLLQWVTSR QMRFE 286 GGFQGRCNKLVD GC 300 YSFWQAGLLPLL HRAL... HWMF 331 HQQALQEYILMC CQCPA 348 GGLLDKPGKSRD FY 362 HTCYCLSGLSIA QHFG... >>>L 8 LKEKHIRYIESL DTKKHNFEYWLTEHLRLN 38 GIYWGLTALCVL DS PETF 56 VKEEVISFVLSC WDDKY 73 GAFAPFPRHDAH LL 87 TTLSAVQILATY DALDV LGKD 108 RKVRLISFIRGN QLED 124 GSFQGDRFGEVD TR 138 FVYTALSALSIL GE LTSE 156 VVDPAVDFVLKC YNFD 172 GGFGLCPNAESH AA 186 QAFTCLGALAIA NKLDM LSDD 207 QLEEIGWWLCER QLPE 223 GGLNGRPSKLPD VC 237 YSWWVLSSLAII GR LDWI 255 NYEKLTEFILKC QDEKK 272 GGISDRPENEVD VF 286 HTVFGVAGLSLM GYDN... >>>L 19 LLEKHADYIASY GSKKDDYEYCMSEYLRMS 49 GVYWGLTVMDLM GQ LHRM 67 NKEEILVFIKSC QHEC 83 GGVSASIGHDPH LL 97 YTLSAVQILTLY DS IHVI 115 NVDKVVAYVQSL QKED 131 GSFAGDIWGEID TR 145 FSFCAVATLALL GK LDAI 163 NVEKAIEFVLSC MNFD 179 GGFGCRPGSESH AG 193 QIYCCTGFLAIT SQ LHQV 211 NSDLLGWWLCER QLPS 227 GGLNGRPEKLPD VC 241 YSWWVLASLKII GR LHWI 259 DREKLRSFILAC QDEET 276 GGFADRPGDMVD PF 290 HTLFGIAGLSLL GEEQ... >>>V 12 VTKKHRKFFERH LQLLPSSHQGHDVNRMAIIFYSI...TINLPNTLFALLSMIMLRDYEYF ETIL 127 DKRSLARFVSKC QRPDRGSFVSCLDYKTNCGSSVDSDDLRFCYIAVAILYICGCRSKEDF DEYI 191 DTEKLLGYIMSQ QCYN 207 GAFGAHNEPHSG 219 YTSCALSTLALL SSLEK LSDK 240 FKEDTITWLLHR QV..DD 275 GGFQGRENKFAD TC 289 YAFWCLNSLHLL TKDW KMLC 309 QTELVTNYLLDR TQKTLT 327 GGFSKNDEEDAD LY 341 HSCLGSAALALI EGKF...

34. Non-repeating motif Repeating motif

35. Top seeds for W=12 s0=18

36. Possible multiple motifs

37. From significant word to PSWM:Test on A. thaliana promoters Collect neighbors by similarity. Extend the boundaries of aligned block to widen the PSWM. A measure for distance between site to non-site where q(i) for putative signal site, p0(i) for background non-site. PSWMs from words TATAAAT and AACCCTA

38. Hidden Markov model (HMM) • Statistical summary representation for conserved region • Model stores probability of match, mismatch, insertions, and • deletions at each position in sequence • Alignment of conserved region not necessary, but helpful • delete • AAC insert • ACG • A-T • AAT • 123 match

39. Some topics in Bioinformatics: An Introduction Hidden Markov Model for Protein Secondary Structure Prediction

40. Outline • Protein structure • A brief review of secondary structure prediction • Hidden Markov model: simple-minded • Hidden Markov model: realistic • Discussion • References

41. L-form CORN-rule R R H ca OH H ca N c H N CO H O R H O H N c ca H ca H OH OH N c H R H O

42. Protein sequences are written in 20 letters (20 Naturally-occurring amino acid residues): AVCDE FGHIW KLMNY PQRST Hydrophobic Charged+- Polar

43. Small P Tiny G I A V Aliphatic L C S N T D Q M E Y K F H R Negative W Positive Aromatic Hydrophobic Polar

44. Residues form a directed chain Cis- Trans- ~5%Pro + 0.03%non-Pro < 0.3%

45. Rasmol ribbon diagram of GB1 Helix, sheets and coil Hydrogen-bond network 3D structure → secondary structure written in three letters: H, E, C. H: E: C = 34.9: 21.8: 43.3 Amino acid sequence a1a2 … → secondary structure …EEEEEECCCCCEEEEECCCCCCCHHHHHHHHHHHHHCCCEEEEEEECCCCCEEEEE…

46. (In prediction stage) Database Alignment Single sequence based - - Nearest Neighbor + - Profile + + Discriminant analysis Neural network (NN) approach Support vector machine (SVM) Probabilistic modelling

47. Bayes formula Count of Generally, P(x, y) = P(x|y)P(y),

48. Protein sequence A, {ai}, i=1,2,…,n Secondary structure sequence S, {si}, i=1,2,…,n Secendary structure prediction: 1D amino acid sequences → 1D secondary structure sequence An old problem for more than 30 years Discriminant analysis in the early stage 1. Simple Chou-fasman approach Chou-Fasman’s propensity of amino acid to conformational state + independence approximation One residue, one state

49. Parameter Training Propensities q(a,s) Counts (20x3) from a database: N(a, s) sum over a → N(s), sum over s → N(a), sum over a and s → N q(a,s) = [N(a,s) N] / [N(a) N(s)].

50. Propensity of amono acids to secondary structures • helix EAL HMQWVF KI DTSRC NY PG • strand MVI CYFQLTW A RGD KSHNP E • Garnier-Osguthorpe-Robson Window version of propensity to the state at the center • -8-7-6-5-4-3-2-1 0+1+2+3+4+5+6+7+8 • W R Q I C T V N A F L C E H S Y K • HEC • Based on assumption that each amino acid individually influences the propensity of the central residue to adopt a particular secondary structure. • Each flanking position evaluated independently like a PSSM.