Biological Sequence Analysis

Lecture 26, Statistics 246

April 27, 2004

Synopsis

  • Some biological background

  • A progression of models

  • Acknowledgements

  • References

The objects of our study

  • DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units.

  • DNA: units are the nucleotide residues A, C, G and T

  • RNA: units are the nucleotide residues A, C, G and U

  • Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.

  • To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

The central dogma




Motifs - Sites - Signals - Domains

  • For this lecture, I’ll use these terms interchangeably to describe recurring elements of interest to us.

  • In PROTEINS we have: transmembrane domains, coiled-coil domains, EGF-like domains, signal peptides, phosphorylation sites, antigenic determinants, ...

  • In DNA / RNA we have: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres, ...

Motifs and models

  • Motifs typically represent regions of structural significance with specific biological function.

  • Are generalisations from known examples.

  • The models can be highly specific.

  • Multiple models can be used to give higher sensitivity & specificity in their detection.

  • Can sometimes be generated automatically from examples or multiple alignments.

The use of stochastic models for motifs

  • Can be descriptive, predictive or anything in between... almost business as usual.

  • However, stochastic mechanisms should never be taken literally; nevertheless, they can be amazingly useful.

  • Care is always needed: a model or method can break down at any time without notice.

  • Biological confirmation of predictions is almost always necessary.

Transcription initiation in E. coli: RNA polymerase-promoter interactions

In E. coli, transcription is initiated at the promoter, whose sequence is recognised by the sigma factor of RNA polymerase.

Transcription initiation in E. coli, cont.


Determinism 1: consensus sequences

  • σ factor | Promoter consensus sequence (-35 and -10 regions)

  • Similarly for σ32, σ38 and σ54.

  • Consensus sequences have the obvious limitation: there is usually some deviation from them.

The human transcription factor Sp1

has 3 Cys-Cys-His-His zinc finger DNA binding domains

Determinism 2: regular expressions

  • The characteristic motif of a Cys-Cys-His-His zinc finger DNA binding domain has the regular expression

  • C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H

  • Here, as in algebra, X is unknown. The 29 a.a. sequence of our example domain 1SP1 is as follows, clearly fitting the model.


Prosite patterns

  • An early effort at collecting descriptors for functionally important protein motifs. They do not attempt to describe a complete domain or protein, but simply try to identify the most important residue combinations, such as the catalytic site of an enzyme. They use regular expression syntax, and focus on the most highly conserved residues in a protein family.


More on Prosite patterns

The pattern

<A-x-[ST](2)-x(0,1)-V-{LI}

which must occur at the N-terminus of the sequence (‘<’), means:

Ala - any - [Ser or Thr] - [Ser or Thr] - (any or none) - Val - (any but Leu, Ile)
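The two patterns shown so far can be turned into ordinary regular expressions mechanically. The following is a minimal sketch, assuming only the syntax elements that appear on these slides (x, [..], {..}, repeat counts, and the '<' and '>' anchors); the function name `prosite_to_regex` and the test sequences are made up for illustration and are not real protein data.

```python
import re

# One pattern element: a residue, 'x', a [set] or an {excluded set},
# optionally followed by a repeat count such as (3) or (2,4).
ELEMENT = re.compile(r"([A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    pattern = pattern.replace(" ", "")
    prefix = "^" if pattern.startswith("<") else ""   # N-terminal anchor
    suffix = "$" if pattern.endswith(">") else ""     # C-terminal anchor
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):
            core = "."                                # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"            # any residue but these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix


zinc_finger = prosite_to_regex("C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H")
n_terminal = prosite_to_regex("<A - x - [ST] (2) - x(0,1) - V - {LI}")

# Synthetic sequences, built only to exercise the patterns.
synthetic_hit = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(zinc_finger, bool(re.search(zinc_finger, synthetic_hit)))
print(n_terminal, bool(re.search(n_terminal, "ASTSVK")))
```

Because the translation is purely syntactic, every alternative inside a bracket is treated as equally acceptable, which is exactly the limitation that the later weight matrix slides address.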

Searching with regular expressions

PatternFind output

[ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001

gp|AF234161|7188808|01AEB01ABAC4F945 nuclear protein NP94b [Homo sapiens] Occurences: 2

Regular expressions can be limiting

  • The regular expression syntax is still too rigid to represent many highly divergent protein motifs.

  • Also, short patterns are sometimes insufficient with today’s large databases. Even when requiring perfect matches you might find many false positives. On the other hand, some real sites might not be perfect matches.

  • We need to go beyond apparently equally likely alternatives, and ranges for gaps. We deal with the former first, having a distribution at each position.

Cys-Cys-His-His profile: sequence logo form

A sequence logo is a scaled position-specific a.a. distribution. Scaling is by a measure of a position’s information content.

Weight matrix model (WMM) = Stochastic consensus sequence

Counts from 242 known σ70 sites

Base     1     2     3     4     5     6
A        9   214    63   142   118     8
C       22     7    26    31    52    13
G       18     2    29    38    29     5
T      193    19   124    31    43   216

Relative frequencies: $f_{bl}$

Base     1     2     3     4     5     6
A     0.04  0.88  0.26  0.59  0.49  0.03
C     0.09  0.03  0.11  0.13  0.21  0.05
G     0.07  0.01  0.12  0.16  0.12  0.02
T     0.80  0.08  0.51  0.13  0.18  0.89

Scores: $10 \log_2 (f_{bl} / p_b)$

Base     1     2     3     4     5     6
A      -38    19     1    12    10   -48
C      -15   -38    -8   -10    -3   -32
G      -13   -48    -6    -7   -10   -40
T       17   -32     8    -9    -6    19

Informativeness: $2 + \sum_b p_{bl} \log_2 p_{bl}$
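The step from counts to relative frequencies, scores and informativeness can be sketched in a few lines. This is an illustration under stated assumptions: the rows of the count matrix are taken to be A, C, G, T in that order, the background is taken to be uniform (p_b = 0.25), and the function names are made up. The tabulated scores may rest on a different background or rounding, so the computed values need not match them entry for entry.

```python
import math

# Counts as read from the slide; the row order A, C, G, T is an assumption.
COUNTS = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}


def relative_frequencies(counts):
    """Divide each count by its column total to get f_bl."""
    width = len(counts["A"])
    totals = [sum(row[l] for row in counts.values()) for l in range(width)]
    return {b: [row[l] / totals[l] for l in range(width)] for b, row in counts.items()}


def log_odds(freqs, background=0.25, scale=10.0):
    """scale * log2(f_bl / p_b); assumes no zero frequencies (no pseudocounts here)."""
    return {b: [scale * math.log2(f / background) for f in row] for b, row in freqs.items()}


def informativeness(freqs):
    """2 + sum_b f_bl log2 f_bl per position: 0 for a uniform column, 2 for a fixed base."""
    width = len(freqs["A"])
    return [
        2.0 + sum(row[l] * math.log2(row[l]) for row in freqs.values() if row[l] > 0)
        for l in range(width)
    ]


freqs = relative_frequencies(COUNTS)
scores = log_odds(freqs)
info = informativeness(freqs)
# Most frequent base at each position.
consensus = "".join(max(COUNTS, key=lambda b: COUNTS[b][l]) for l in range(6))

print(consensus)
print([round(v, 2) for v in info])
```

The consensus read off the counts is the deterministic summary from the earlier slides; the frequency and score matrices keep the information that the consensus throws away.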

Interpretation of weight matrix entries

candidate sequence CTATAATC....

aligned position 123456

S = site (and independence)

R = random (equiprobable, independence)

$$\log_2 \frac{\mathrm{pr}(\text{candidate} \mid S)}{\mathrm{pr}(\text{candidate} \mid R)} = \log_2 \frac{.09 \times \cdots \times .01}{.25 \times \cdots \times .25} = (2 + \log_2 .09) + \cdots + (2 + \log_2 .01)$$

Generally, score $s_{bl} = \log (f_{bl} / p_b)$

l = position, b = base

$p_b$ = background frequency

Use of the matrix to find sites

Move the matrix along the sequence and score each window. Peaks should occur at the true sites.

Of course, in general any threshold will have some false positive and false negative rate.
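Sliding the score matrix along a sequence amounts to summing one matrix entry per position in every window. Below is a minimal sketch using the score matrix from the slides; the row order A, C, G, T is an assumption, and the example sequence, threshold and function names are invented for illustration.

```python
# Score matrix from the slides; the row order A, C, G, T is an assumption.
SCORES = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -40],
    "T": [17, -32, 8, -9, -6, 19],
}
WIDTH = 6


def scan(sequence):
    """Score every window of length WIDTH; entry i is the score of the window starting at i."""
    return [
        sum(SCORES[sequence[i + l]][l] for l in range(WIDTH))
        for i in range(len(sequence) - WIDTH + 1)
    ]


def call_sites(sequence, threshold):
    """Start positions of windows whose score reaches the threshold."""
    return [i for i, score in enumerate(scan(sequence)) if score >= threshold]


example = "GGCGCTATAATGCGCC"   # synthetic sequence with one planted TATAAT
window_scores = scan(example)
best = max(range(len(window_scores)), key=window_scores.__getitem__)

print(window_scores)
print(best, call_sites(example, threshold=0))
```

Raising the threshold trades false positives for false negatives; the threshold of 0 used here is arbitrary.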




Profiles

Profiles are a variation of the position-specific scoring matrix approach just described. They are calculated slightly differently to reflect amino acid substitutions and the possibility of gaps, but are used in the same way.

In general, a profile entry $M_{la}$ for location l and amino acid a is calculated by

$$M_{la} = \sum_b w_{lb} S_{ab}$$

where b ranges over amino acids, $w_{lb}$ is a weight (e.g. the observed frequency of a.a. b in position l) and $S_{ab}$ is the (a,b)-entry of a substitution matrix (e.g. PAM or BLOSUM) calculated as a likelihood ratio.

Position specific gap penalties can also be included.
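The profile formula is a weighted average of substitution scores, which the following sketch spells out. It assumes a toy three-letter alphabet and a hypothetical substitution matrix whose values are invented for illustration (they are not PAM or BLOSUM entries); the weights are the observed column frequencies, and gap penalties are left out.

```python
ALPHABET = ["A", "C", "K"]   # toy alphabet, not the full 20 amino acids

# Hypothetical substitution scores S[a][b]; the values are made up.
S = {
    "A": {"A": 4, "C": 0, "K": -1},
    "C": {"A": 0, "C": 9, "K": -3},
    "K": {"A": -1, "C": -3, "K": 5},
}


def column_weights(column):
    """w_lb: observed frequency of each residue b in one alignment column."""
    return {b: column.count(b) / len(column) for b in ALPHABET}


def profile(alignment):
    """M[l][a] = sum_b w_lb * S[a][b] for every column l and residue a."""
    rows = []
    for l in range(len(alignment[0])):
        weights = column_weights([seq[l] for seq in alignment])
        rows.append({a: sum(weights[b] * S[a][b] for b in ALPHABET) for a in ALPHABET})
    return rows


alignment = ["CKA", "CKA", "CAA", "CKK"]   # toy ungapped alignment
M = profile(alignment)
for l, row in enumerate(M, start=1):
    print(l, row)
```

A fully conserved column reproduces the corresponding row of the substitution matrix, while a mixed column gives intermediate scores, so residues similar to the observed ones are rewarded even if they never occur in the alignment.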

Cons     A     B     C     D     E   ...   Gap   Len
13C     30   -40   150   -50   -60   ...   100   100
14K      4    18   -11    17    17   ...   100   100
15A     53    11    19    12    12   ...    30    30
16S     28    21    28    19    16   ...   100   100

We search with a profile in the same way as we did before.

These days profiles of this kind have been largely replaced by Hidden Markov Models for sequence searching and alignment.

Modelling motifs: the next steps

  • Missing from the weight matrix models of motifs and profiles are good ways of dealing with:

  • Length distributions for insertions/deletions

  • Local and non-local association of amino acids

  • Hidden Markov Models (HMM) help with the first. Dealing with the second remains a hard unsolved problem.

Hidden Markov Models

  • Processes $\{(S_t, O_t),\ t = 1, \dots\}$, where $S_t$ is the hidden state and $O_t$ the observation at time t, such that

  • $\mathrm{pr}(S_t \mid S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, \dots) = \mathrm{pr}(S_t \mid S_{t-1})$

  • $\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, \dots) = \mathrm{pr}(O_t \mid S_t, S_{t-1})$

  • The basics of HMM were laid bare in a series of beautiful papers by L E Baum and colleagues around 1970, and their formulation has been used almost unchanged to this day.

Hidden Markov Models: extensions

  • Many variants are now used. For example, the distribution of $O_t$ may depend only on the current state $S_t$, or also on the previous state and observation:

  • $\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, \dots) = \mathrm{pr}(O_t \mid S_t)$, or

  • $\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, \dots) = \mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1})$.

  • Most importantly for us, the times of S and O may be decoupled, permitting the observation corresponding to state time t to be a string whose length and composition depend on $S_t$ (and possibly $S_{t-1}$ and part or all of the previous observations). This is called a hidden semi-Markov or generalized hidden Markov model.

Some current applications of HMM to biology

  • mapping chromosomes

  • aligning biological sequences

  • predicting sequence structure

  • inferring evolutionary relationships

  • finding genes in DNA sequence

Some early applications of HMM

  • finance, but we never saw them

  • speech recognition

  • modelling ion channels

In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.

The algorithms

  • As the name suggests, with an HMM the series $O = (O_1, O_2, O_3, \dots, O_T)$ is observed, while the states $S = (S_1, S_2, S_3, \dots, S_T)$ are not.

  • There are elegant algorithms for calculating $\mathrm{pr}(O \mid \theta)$, $\arg\max_\theta \mathrm{pr}(O \mid \theta)$ in certain special cases, and $\arg\max_S \mathrm{pr}(S \mid O, \theta)$.

  • Here $\theta$ denotes the parameters of the model, e.g. transition and observation probabilities.
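Two of these calculations are short enough to sketch: the forward recursion for the probability of the observations given the parameters, and the Viterbi recursion for the most probable state path. The sketch assumes the simplest variant, in which each observation depends only on the current state, and a hypothetical two-state model whose probabilities are invented for illustration. Parameter estimation (usually done with the Baum-Welch algorithm) is not shown.

```python
from itertools import product

# Hypothetical two-state HMM over DNA; every number here is illustrative.
STATES = ("site", "background")
INITIAL = {"site": 0.1, "background": 0.9}
TRANSITION = {
    "site": {"site": 0.8, "background": 0.2},
    "background": {"site": 0.1, "background": 0.9},
}
EMISSION = {
    "site": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def forward(obs):
    """pr(O | parameters), summing over all hidden state paths."""
    alpha = {s: INITIAL[s] * EMISSION[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            s: EMISSION[s][symbol] * sum(alpha[r] * TRANSITION[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(obs):
    """Most probable state path and its joint probability with the observations."""
    delta = {s: INITIAL[s] * EMISSION[s][obs[0]] for s in STATES}
    pointers = []
    for symbol in obs[1:]:
        step, new_delta = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: delta[r] * TRANSITION[r][s])
            step[s] = prev
            new_delta[s] = delta[prev] * TRANSITION[prev][s] * EMISSION[s][symbol]
        pointers.append(step)
        delta = new_delta
    state = max(STATES, key=lambda s: delta[s])
    path = [state]
    for step in reversed(pointers):      # trace the best path backwards
        state = step[state]
        path.append(state)
    return path[::-1], max(delta.values())


def joint(states, obs):
    """pr(S, O | parameters) for one explicit path; used only as a brute-force check."""
    p = INITIAL[states[0]] * EMISSION[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= TRANSITION[states[t - 1]][states[t]] * EMISSION[states[t]][obs[t]]
    return p


obs = "GCTATAATGC"
likelihood = forward(obs)
best_path, best_prob = viterbi(obs)
print(likelihood)
print(best_path)
```

Both recursions take time proportional to the sequence length times the square of the number of states, whereas the brute-force sum in `joint` would have to visit every one of the exponentially many paths.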

Profile HMM = stochastic regular expressions

M = Match state, I = Insert state, D = Delete state.

To operate, go from left to right. I and M states output amino acids; B, D and E states are “silent”.

How profile HMM are used

  • Instances of the motif are identified by calculating

  • log{pr(sequence | M) / pr(sequence | B)},

  • where M and B are the motif and background HMM.

  • Alignments of instances of the motif to the HMM are found by calculating

  • $\arg\max_{\text{states}} \mathrm{pr}(\text{states} \mid \text{instance}, M)$.

  • Estimation of HMM parameters is by calculating

  • $\arg\max_{\text{parameters}} \mathrm{pr}(\text{sequences} \mid M, \text{parameters})$.

  • In all cases, we use the efficient HMM algorithms.

Pfam domain-HMM

  • Pfam is a library of models of recurrent protein domains. They are constructed semi-automatically using profile hidden Markov models.

  • Pfam families have permanent accession numbers and contain functional annotation and cross-references to other databases, while Pfam-B families are re-generated at each release and are unannotated.

  • See

Outline

  • Some Biological Background

  • Models and Modelling

  • Results and Discussions

  • Future Work


Cartoon of gene structure

(5’ Splice site)

(3’ Splice site)

12 examples of 5’ splice (donor) sites















Probability models for short DNA motifs

  • Short: ~6-20 base pairs

  • DNA motifs: splice sites, transcription factor binding sites, translation initiation sites, enhancers, promoters, ...

  • Why probability models?

  • to characterize the motifs

  • to help identify them

  • for incorporation into larger models

  • Aim: given a number of instances of a DNA sequence motif, we want a model for the probability of that motif.

Weight matrix models, Staden (1984)

Base    -3    -2    -1     0    +1    +2    +3    +4    +5
A       33    61    10     0     0    53    71     7    16
C       37    13     3     0     0     3     8     6    16
G       18    12    80   100     0    42    12    81    22
T       12    14     7     0   100     2     9     6    46

A weight matrix for donor sites; each column sums to 100.

Essentially a mutual independence model.

An improvement over the consensus CAGGTAAGT.

But we have to go beyond independence…
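Under mutual independence, the probability of a candidate donor site is simply the product of one matrix entry per position. The sketch below uses the donor-site matrix as read from the table above, interpreting the entries as percentages over positions -3 to +5 (an assumption based on every column summing to 100); the function names are made up.

```python
# Donor-site weight matrix as read from the slide, positions -3 .. +5.
# Treating the entries as percentages is an assumption (each column sums to 100).
PERCENT = {
    "A": [33, 61, 10, 0, 0, 53, 71, 7, 16],
    "C": [37, 13, 3, 0, 0, 3, 8, 6, 16],
    "G": [18, 12, 80, 100, 0, 42, 12, 81, 22],
    "T": [12, 14, 7, 0, 100, 2, 9, 6, 46],
}
LENGTH = 9


def wmm_probability(site):
    """P(site) under the mutual independence model: a product over positions."""
    p = 1.0
    for l, base in enumerate(site):
        p *= PERCENT[base][l] / 100
    return p


def consensus():
    """Most frequent base at each position."""
    return "".join(max("ACGT", key=lambda b: PERCENT[b][l]) for l in range(LENGTH))


print(consensus())
print(wmm_probability("CAGGTAAGT"), wmm_probability("AAGGTAAGT"))
print(wmm_probability("CAGCTAAGT"))   # no G at position 0, so probability 0
```

Unlike the consensus, the matrix ranks near-consensus sites against each other, but it cannot express that a mismatch at one position makes a match at another more important, which is the motivation for the models that follow.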

Beyond independence

  • Weight array matrices, Zhang & Marr (1993) consider dependencies between adjacent positions, i.e. (non-stationary) first-order Markov models. The number of parameters increases exponentially if we restrict to full higher-order Markovian models.

  • Variable length Markov models, Rissanen (1986), Buhlmann & Wyner (1999), help us get over this problem. In the last few years, many variants have appeared: all make use of trees.

  • [The interpolated Markov models of Salzberg et al (1998) address the same problem.]
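A weight array model in the sense described above can be sketched as a position-specific first-order Markov chain: one distribution for the first base and one transition table per later position. The example below is a minimal illustration, assuming add-one pseudocounts and a handful of invented training sites; it is not the estimator used in any of the cited papers.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def fit_wam(sites, pseudocount=1.0):
    """Estimate P(X1) and, for each later position, P(X_l | X_{l-1})."""
    length = len(sites[0])
    first = Counter(site[0] for site in sites)
    total = len(sites) + 4 * pseudocount
    initial = {b: (first[b] + pseudocount) / total for b in BASES}
    transitions = []
    for l in range(1, length):
        pairs = Counter((site[l - 1], site[l]) for site in sites)
        table = {}
        for prev in BASES:
            row_total = sum(pairs[(prev, b)] for b in BASES) + 4 * pseudocount
            table[prev] = {b: (pairs[(prev, b)] + pseudocount) / row_total for b in BASES}
        transitions.append(table)       # one table per position: non-stationary
    return initial, transitions


def wam_probability(site, model):
    """P(site) as P(x1) times the chain of position-specific transitions."""
    initial, transitions = model
    p = initial[site[0]]
    for l in range(1, len(site)):
        p *= transitions[l - 1][site[l - 1]][site[l]]
    return p


training = ["ACG", "ACG", "ATA", "ATA"]   # toy sites: the last base depends on the middle one
model = fit_wam(training)
print(wam_probability("ACG", model), wam_probability("ACA", model))
```

Each extra order of dependence multiplies the number of transition rows by four, which is the exponential growth in parameters that variable length Markov models are meant to tame.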

Some notation

  • L: length of the DNA sequence motif.

  • $X_i$: discrete random variable at position i, taking values in the set {A, C, G, T}.

  • $x = (x_1, \dots, x_L)$: a DNA sequence motif of length L.

  • $x_1^j$ (j > 1): the sequence $(x_1, x_2, \dots, x_j)$.

  • P(x): probability of x.

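Variable Length Markov Models (Rissanen (1986), Buhlmann & Wyner (1999))

  • Factorize P(x) in the usual telescopic way:

$$P(x) = P(x_1) \prod_{l=2}^{L} P(x_l \mid x_1^{l-1}).$$

  • Simplify this using context functions $c_l$, $l = 2, \dots, L$, each mapping the history $x_1^{l-1}$ to the part of it that matters for position l, to

$$P(x) = P(x_1) \prod_{l=2}^{L} P(x_l \mid c_l(x_1^{l-1})).$$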

VLMM cont.

  • A VLMM for a DNA sequence motif of length L is specified by

  • a distribution for X1, and, for l = 2,…L,

  • a constrained conditional distribution for Xl given Xl-1,…,X1.

  • That is, we need L-1 context functions.

VLMM: an illustrative example

[Figure: two context trees with levels labelled X2 and X1]

Pruned context: P(X3 | X2 = A, X1) = P(X3 | X2 = A), etc.
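The pruning in this example can be made concrete with a motif of length 3 whose third position ignores X1 whenever X2 = A. The sketch below is an illustration only: the context function hard-codes that single pruning rule, the training sites are invented, and add-one pseudocounts are an assumption. In practice the contexts would be chosen from the data rather than fixed by hand.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "ACGT"


def context(x1, x2):
    """Context function for position 3: X1 is dropped whenever X2 == 'A'."""
    return (x2,) if x2 == "A" else (x2, x1)


def conditional(counts, pseudocount=1.0):
    """Smoothed distribution over the four bases from a Counter."""
    total = sum(counts[b] for b in BASES) + 4 * pseudocount
    return {b: (counts[b] + pseudocount) / total for b in BASES}


def fit_vlmm(sites):
    """Estimate P(X1), P(X2 | X1) and P(X3 | context(X1, X2)) from length-3 sites."""
    c1 = Counter(site[0] for site in sites)
    c2, c3 = defaultdict(Counter), defaultdict(Counter)
    for x1, x2, x3 in sites:
        c2[x1][x2] += 1
        c3[context(x1, x2)][x3] += 1
    p1 = conditional(c1)
    p2 = {x1: conditional(c2[x1]) for x1 in BASES}
    p3 = {context(x1, x2): conditional(c3[context(x1, x2)])
          for x1, x2 in product(BASES, repeat=2)}
    return p1, p2, p3


def vlmm_probability(site, model):
    p1, p2, p3 = model
    x1, x2, x3 = site
    return p1[x1] * p2[x1][x2] * p3[context(x1, x2)][x3]


training = ["CAG", "TAG", "GAG", "CTA", "CTA", "GTC"]   # toy sites
model = fit_vlmm(training)
print(len(model[2]))                      # number of distinct contexts for X3
print(vlmm_probability("CAG", model))
```

With the pruning rule, position 3 needs 13 conditional distributions instead of the 16 of a full second-order model, and the three sites with X2 = A pool their evidence about X3.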

Sequence dependencies (interactions) are not always local

3-dimensional folding; DNA, RNA & protein interactions

The methods outlined so far all fail to incorporate long-range (≥ 4 bp) interactions. New model types are needed!

Modelling “long-range” dependency

  • The principal work in this area is Burge & Karlin’s (1997) maximal dependence decomposition (MDD) model.

  • Cai et al (2000) and Barash et al (2003) used Bayesian networks (BN).

  • Yeo & Burge (2003) applied maximum entropy models.

  • Ellrott et al (2002) ordered the sequence and applied Markov models to the motif.

  • We have adapted this last idea, to give permuted variable length Markov models (PVLMM).

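Permuted variable length MMs

Let $\pi = (\pi_1, \dots, \pi_L)$ be a permutation vector of the positions $1, \dots, L$, and factorize P(x) along the permuted order:

$$P(x) = P(x_{\pi_1}) \prod_{l=2}^{L} P(x_{\pi_l} \mid x_{\pi_1}, \dots, x_{\pi_{l-1}}).$$

By using VLMM, simplify the context terms in the equation. The simplified equation has the form:

$$P(x) = P(x_{\pi_1}) \prod_{l=2}^{L} P(x_{\pi_l} \mid c_l(x_{\pi_1}, \dots, x_{\pi_{l-1}})),$$

where $c_l$ is the context function for the l-th position in the permuted order.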

Part of a context (decision) tree for position -2 of a splice donor PVLMM

Node #s: counts

Edge #s: split variables

Sequence order: +2 +5 -1 +4 -2 +3 -3

Maximal Dependence Decomposition

  • MDD starts with a mutual independence model as with WMMs. The data are then iteratively subdivided, at each stage splitting on the most dependent position, suitably defined. At the tips of the tree so defined, a mutual independence model across all remaining positions is used.

  • The details can vary according to the splitting criterion (Burge & Karlin used χ²), the actual splits (binary, etc), and the stopping rule.

  • However, the result is always a single tree.
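One round of the subdivision can be sketched as follows. This is a simplified illustration rather than Burge & Karlin's exact procedure: dependence between two positions is measured with a chi-square statistic on the full base-by-base contingency table, a position's total dependence is the sum of its statistics against all other positions, and the split is into the most frequent base versus the rest. The toy sites and function names are invented.

```python
from collections import Counter


def chi_square(sites, i, j):
    """Chi-square statistic for independence of the bases at positions i and j."""
    n = len(sites)
    joint = Counter((site[i], site[j]) for site in sites)
    row = Counter(site[i] for site in sites)
    col = Counter(site[j] for site in sites)
    stat = 0.0
    for a in row:                      # only observed bases, so expected > 0
        for b in col:
            expected = row[a] * col[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat


def most_dependent_position(sites):
    """Position with the largest summed chi-square against all other positions."""
    length = len(sites[0])
    totals = [
        sum(chi_square(sites, i, j) for j in range(length) if j != i)
        for i in range(length)
    ]
    best = max(range(length), key=totals.__getitem__)
    return best, totals


def split_on_most_frequent(sites, position):
    """Binary split: sites carrying the most frequent base at `position` vs the others."""
    top = Counter(site[position] for site in sites).most_common(1)[0][0]
    match = [site for site in sites if site[position] == top]
    rest = [site for site in sites if site[position] != top]
    return match, rest


sites = ["AAG", "AAT", "CCG", "CCT"]   # toy data: positions 0 and 1 move together
best, totals = most_dependent_position(sites)
match, rest = split_on_most_frequent(sites, best)
print(best, totals)
print(match, rest)
```

Applying the same two steps recursively to `match` and `rest`, with a stopping rule on subset size or on the size of the statistic, would grow the tree; a weight matrix is then fitted at each tip.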


Parts of MDD trees for splice donors

In each case, splits are into the most frequent nt vs the others.

Issues in statistical modelling of short DNA motifs

  • In any study of this kind, essential items are:

  • the model class (e.g. WMM, PVLMM, or MDD)

  • the way we search through the model class (e.g. by forward selection & simulated annealing)

  • the way we compare models when searching (e.g. by AIC, BIC, or MDL), and finally,

  • the way we assess the final model in relation to our aims (e.g. by cross-validation).

  • We always need interesting, high-quality datasets.

Model assessment: Stand-alone recognition

  • M: motif model; B: background model

  • Given a sequence $x = (x_1, \dots, x_L)$, we predict x to be a motif if

  • log {P(x | M) / P(x | B)} > c,

  • for a suitably chosen threshold value c.

Model assessment: terms

  • TP: true positives, FP: false positives

  • TN: true negatives, FN: false negatives

  • Sensitivity (sn) and specificity (sp) are:

  • sn = TP / [TP + FN], sp = TP / [TP + FP].
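These definitions can be applied directly to a list of log-odds scores once a threshold c is fixed. The sketch below uses invented scores and labels; note that sp is computed as defined on this slide, TP / (TP + FP).

```python
def sensitivity_specificity(scores, is_true_site, threshold):
    """sn = TP / (TP + FN) and sp = TP / (TP + FP), predicting a site when score > threshold."""
    tp = fp = fn = 0
    for score, truth in zip(scores, is_true_site):
        predicted = score > threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp


# Invented log-odds scores log{P(x | M) / P(x | B)} with their true labels.
scores = [5.1, 3.2, 0.4, -1.0, -2.5, 2.2]
is_true_site = [True, True, False, True, False, False]

sn, sp = sensitivity_specificity(scores, is_true_site, threshold=1.0)
print(sn, sp)
```

Lowering the threshold can only keep or raise sn, while sp may move in either direction, so the two are usually reported together or along a range of thresholds.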

Transcription factor binding sites

  • These are of great interest, their signals are very weak, and we typically have only a few instances.

  • TRANSFAC (Wingender et al 2000).

  • We have studied 61 TFs with effective length ≤ 9 and ≥ 20 instances, 2,238 sites in all.

  • In 25/61 cases we are able to improve upon WMM.

  • In the remaining cases, PVLMM performs similarly to WMM.


Assessing our models for TFBS recognition

We randomly inserted each of the 2,238 sites into a background sequence of length 1,000 simulated from a stationary 3rd order MM, with parameters estimated from a large collection of human sequence upstream of genes.

10-fold cross-validation (CV) was used for each TF. At each round of CV, 90% of the true sites were used to fit the models (PVLMM, MDD, WMM). We then scanned the sequences containing the remaining 10% of the sites, and selected the N top-scoring sequences as putative binding sites, where N was the size of the remaining 10%. Since the number of predicted sites then equals the number of true sites, FP = FN and thus sn = sp.

Three transcription factors

Model    TF 1    TF 2    TF 3
wmm      0.37    0.38    0.54
pvlmm    0.54    0.55    0.72
mdd      0.37    0.45    0.64

Splice donor sites

  • Mammalian donor sequences from SpliceDB, Burset et al (2001).

  • 15,155 canonical donor sites of length 9, with GT conserved at positions 0 and +1.

Mammalian donor site

Beginning of the splicing process

Splice donor



Interpretation of PVLMM model selected

Mammalian donor site

Base    -3    -2    -1     0    +1    +2    +3    +4    +5
A       33    61    10     0     0    53    71     7    16
C       37    13     3     0     0     3     8     6    16
G       18    12    80   100     0    42    12    81    22
T       12    14     7     0   100     2     9     6    46

U1 snRNA: G U C C A U U C A


“Long-range” dependence in the model:

5’/3’ compensation

U1 sn RNA: G U C C A U U C A

Optimal permutation: +2 +3 +4 -1 -2 +5 -3

Integrating into a gene finder

  • SLAM (Pachter et al, 2002): a eukaryotic cross-species gene finder (generalized pair hidden Markov model (GPHMM)).

Results at the exon level

  • Systematic framework for probabilistic modeling of short DNA motifs

  • Statistical modeling issues

  • Better prediction performance by modeling local & non-local dependence

  • Good biological interpretations

Future work

  • Model the dependence in protein motifs

  • Extend to find de novo motifs

  • Find a way to deal with indels

  • Incorporate local sequence information into PVLMM-based predictions

  • Jointly model multiple motifs in one species and sites across species

References

  • R Durbin, S Eddy, A Krogh and G Mitchison. Biological Sequence Analysis. Cambridge University Press, 1998.

  • P Baldi and S Brunak. Bioinformatics: The machine learning approach. The MIT Press, 1998.

  • M Kanehisa. Post-Genome Informatics. Oxford University Press, 2000.

  • X Zhao, H Huang and T Speed. Finding short DNA motifs using permuted Markov models. RECOMB 2004.

Acknowledgements

  • Terry Speed, UCB & WEHI

  • Haiyan Huang, UCB

  • The SLAM team:

  • Simon Cawley, Affymetrix

  • Lior Pachter, UCB

  • Marina Alexandersson, FCC

  • Sourav Chatterji, UCB

  • Mauro Delorenzi (ISREC)

  • WEHI bioinformatics lab