sequence analysis of nucleic acids and proteins part 2 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Sequence analysis of nucleic acids and proteins: part 2 PowerPoint Presentation
Download Presentation
Sequence analysis of nucleic acids and proteins: part 2

Loading in 2 Seconds...

play fullscreen
1 / 50

Sequence analysis of nucleic acids and proteins: part 2 - PowerPoint PPT Presentation


  • 120 Views
  • Uploaded on

Sequence analysis of nucleic acids and proteins: part 2. Prediction of structure and function. Based on Chapter 3 of Post-genome bioinformatics by Minoru Kanehisa Oxford University Press, 2000. Search and learning problems in sequence analysis. Thermodynamic principle.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Sequence analysis of nucleic acids and proteins: part 2' - geri


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
sequence analysis of nucleic acids and proteins part 2

Sequence analysis of nucleic acids and proteins: part 2

Prediction of structure and function

Based on Chapter 3 of

Post-genome bioinformatics

by Minoru Kanehisa

Oxford University Press, 2000

thermodynamic principle
Thermodynamic principle

The amino acid sequence contains all the information necessary to fold a protein molecule into its native 3D state under physiological conditions: fold, denature, spontaneously refold, called Anfinsen’s thermodynamic principle

Thus it should be possible to predict 3D structure computationally by minimizing a suitable conformational energy function, but difficult to define, difficult to minimize (globally), called ab initio

In practice, structures determined by X-ray crystallography and nuclear magnetic resonance (NMR) are used to give empirical structure-function relationships.

a schematic illustration of rna secondary structure elements

RNA secondary structure can be predicted ab initio using an energy

function and DP to minimize it, in a process similar to alignment

A schematic illustration of RNA secondary structure elements.

Hairpin loop

Stem

Pseudo knot

Bulge loop

Internal loop

Branch loop

slide5

A

C

C

A

G.C

C.G

G.C

G.U

A.U

U.A

U.A C U

G ACAC A

G

C

Yeast alanyl transfer RNA

slide6

Prediction of protein secondary structure: many methods

The definition of a dihedral angle and the three backbone dihedral angles, f, y, w, in a protein. Because w is around 180O, the backbone configuration can be specified byf and y, for each peptide unit.

C’

f

Ca

C’

H

O

H

N

R

H

R

Ca

N

C’

Ca

y

f

w

N

C’

N

C’

Ca

H

H

O

R

H

O

Peptide unit

prediction of protein secondary structure
Prediction of protein secondary structure

The options are -helix, -strand and coil.

Many 2º structure prediction methods exist, with ones by Chou-Fasman and another due to Garnier,Osguthorpe and Robson being widely used. These are position&structure-specific scoring matrices based on modest or large numbers of proteins. On the next page we display the GOR PSSM for -helices.

These days one can choose from methods based on almost every major machine learning approach: ANN, HMM, etc.

slide9
Two architectures of the hierarchical neural network: (a) the perceptron and (b) the back-propagation neural network.

Input layer

Output layer

Input

Layer

Hidden

Layer

Output

Layer

prediction of transmembrane domains
Prediction of transmembrane domains

Membrane proteins are very common, perhaps 25% of all. Membranes are hydrophobic and so a transmembrane domain typically has hydrophobic residues, about 20 to span the membrane.

There are a number of rules for detecting them: Kyte-Doolittle hydropathy scores work fairly well, and the Klein-Kanehisa-DeLisi discriminant function does even better.

slide11

Three-dimensional structures of two membrane proteins

Photosynthetic reaction centre

(PDB:1PRC)

Outer membrane protein: porin

(PDB: 1OMF)

hidden markov models hmms
Hidden Markov Models (HMMs)

S = States {s0,s1,…..,sn}

V = Output alphabet {v0,v1,…..,vm}

A = { aij} = transition probability from si sj

B = {bi(j)} = probability outputting vj in state si

  • What is the probability of a sequence of observations?
  • What are the maximum likelihood estimates of parameters in an HMM?
  • What is the most likely sequence of states that produced a given sequence of observations?
a hidden markov model for sequence analysis

d1

d2

d3

d4

I0

I1

I2

I3

I4

m0

m1

m2

m3

m4

m5

End

Start

A hidden Markov model for sequence analysis

m=match state (output), I=insert state (output), d=delete state (no output)

prediction of protein 3d structures
Prediction of protein 3D structures

Knowledge based prediction of protein 3D or 3º structure can be classified into two categories: comparative modelling and fold recognition. The first can work well when there is significant sequence similarity to a protein with known 3D structure. By contrast, fold recognition is used when no significant sequence similarity exists, and makes use of the knowledge and analysis of all protein structures. One such method due to Eisenberg and colleagues, involves 3D-1Dalignment. Another such is threading.

slide15

P2

B3

B2

The 3D-1D method for prediction of protein 3D structures involves the construction of a library of 3D profiles for the known protein structures.

Main chain

Side chain

Inside or outside

a

E

b

P1

Polar or apolar

B1

Environmental class

Residue number

B1a B1b B1 . . . .

1 2 3 . . . . . . . . . .

N

A

R

.

.

.

.

.

Y

W

-0.66 -0.79 -0.91 . . . .

-1.67 -1.16 -2.16 . . . .

. . .

. . .

. . .

. . .

. . .

0.18 0.07 0.17 . . . .

1.00 1.17 1.05 . . . .

A

R

.

.

.

.

.

Y

W

12 -66 46 . . . . . . . . . .

-32 -80 -34 . . . . . . . . . .

. . .

. . .

. . .

. . .

. . .

-94 112 -210 . . . . . . . . . .

-214 102 -135 . . . . . . . . . .

Amino acids

3D-1D score

3D profile

gene structure i
Gene Structure I

DNA - - - - agacgagataaatcgattacagtca - - - -

Transcription

RNA - - - - agacgagauaaaucgauuacaguca - - - -

Splicing

Translation

Protein - - - - - DEI - - - -

Exon Intron Exon Intron Exon

Protein Folding

Problem

Protein

gene structure ii
Gene Structure II

Exon 1

Exon 2

Exon 3

Exon 4

Intron 1

Intron 2

Intron 3

5’

3’

DNA

TRANSCRIPTION

pre-mRNA

SPLICING

mRNA

TRANSLATION

AUG - X1…Xn - STOP

protein sequence

protein 3D structure

gene structure iii
Gene Structure III

Exon 1

Exon 2

Exon 3

Exon 4

DNA

Intron 1

Intron 2

Intron 3

5’

3’

Promoter

TATA

Splice site

GGTGAG

Pyrimidine

tract

polyA signal

Splice site

CAG

Translation

Initiation

ATG

Branchpoint

CTGAC

Stop codon

TAG/TGA/TAA

additional difficulties
Additional Difficulties

pre-mRNA

  • Alternative splicing

ALTERNATIVE

SPLICING

SPLICING

mRNA

TRANSLATION

TRANSLATION

Protein I

Protein II

  • Pseudo genes

DNA

approaches to gene recognition
Approaches to Gene Recognition
  • Homology

BLASTN, TBLASTX,

Procrustes

  • Statistical de novo

GRAIL, FGENEH, Genscan, Genie, Glimmer

  • Hybrid

GenomeScan, Genie

F(*,*,*,…)

example glimmer gene finding in microbial dna
Example: GlimmerGene Finding in Microbial DNA
  • No introns
  • 90% coding
  • Shorter genomes (less than 10 million bp)
  • Lots of data
slide22

Gene Structure in Prokaryotes

ORF

Translation

Initiation

ATG

Stop codon

TAG/TGA/TAA

simplest hidden markov gene model
Simplest Hidden Markov Gene Model

A 0.9

C 0.03

G 0.04

T 0.03

Coding

1

0.1

0.9

ATG

TAA

1

0.1

Intergene

A 0.25

C 0.25

G 0.25

T 0.25

0.9

the viterbi algorithm
The Viterbi Algorithm

A A C A G T G A C T C T

example genscan gene finding in human dna
Example: GenscanGene Finding in Human DNA
  • Introns
  • 5% coding
  • Large genome (3 billion bp)
  • Alternative splicing
protein sorting prediction
Protein sorting prediction

The final step in informational expression of proteins involves their sorting to the appropriate location within or outside the cell. The information for correct localization is usually located within the protein itself.

slide31

Sequence Alignment Problem

  • Task:find common patterns shared by multiple Protein sequences
  • Importance:understanding function and structures; revealing evolutionary relationship, data organizing …
  • Types:Pairwisevs.Multiple; Globalvs.Local.
  • Approaches:criteria-based (extension of pairwise methods) versus model-based (EM, Gibbs, HMM)
outline of liu lawrence approach
Outline of Liu-Lawrence approach
  • Local alignment --- Examples, the Gibbs sampling algorithm
  • A simple multinomial model for block-motifs and the Bayesian missing-data formulation.

Possible but not covered here:

  • Motif sampler: repeated motifs.
  • The hidden Markov model (its decoupling)
  • The propagation model and beyond
example search for regulatory binding sites
Example: search for regulatory binding sites
  • Gene Transcription and Regulation
    • Transcription initiated by RNA polymerase binding at the so-called promoter region (TATA-box; or -10, -35)
    • Regulated by some (regulatory) proteinson DNA “near” the promoter region.
    • These binding sites on DNA are often “similar” in composition.

RNA

polymerase

Enhancers and repressors

Starting codon

3’

5’

AUG

Promoter region

Translation start

the particular dataset
The particular dataset
  • 18 DNA segments, each of length 105 bps.
  • There are at least one CRP binding sites, known experimentally, in each sequence.
  • The binding sites are about 16-19 base pairs long, with considerable variability in their contents.
  • Interested in seeing if we can find these sites computationally.
example h t h proteins
Example: H-T-H proteins
  • HTH: sequence-specific DNA binding, gene regulation.
  • Motifs occur as local isolated structures. The whole 3-D structures are known and very different.
  • 30 sequences with known HTH positions chosen. The set represents a typically diverse cross section of HTH seq.
  • Width of the motif pattern is assumed to be in the range from 17 to 22. The criterion “information per parameter” is used to determine the optimal width, 21.
  • Heuristic convergence developed (multiple restarts with IPP monitored)
  • Finding
slide40

Local Alignment of Multiple Sequences

Local

Motif

a1

a2

width = w

ak

length nk

Alignment variable:A={a1, a2, …, ak}

Objective:find the “best” common patterns.

motif alignment model
Motif Alignment Model

Motif

a1

a2

width = w

ak

length nk

The missing data: Alignment variable: A={a1, a2, …, ak}

  • Every non-site positions follows a common multinomial
  • with p0=(p0,1 ,…, p0,20)
  • Every position i in the motif element follows probability
  • distribution pi=(pi,1 ,…, pi,20)
the tricky part the alignment variable a a 1 a 2 a k is not observable
The Tricky Part: The alignment variableA={a1, a2, …, ak}is not observable
  • General Missing Data problem:
    • Unobserved data in each datum
    • Object of the DP optimization (path)
    • Potentially observable
    • Examples
      • Alignment
      • RNA structure
      • Protein secondary structure
statistical models
Statistical Models
  • How do we describe patterns?
    • frequencies of amino acid types.
    • multinomial distribution --- more generally a “model”

A typical

aligned motif

multinomial distribution
Multinomial Distribution

A total of

k sequences

Model Mi for i-th column:

(ki,1, ki,2, …, ki,20) ~ Multinom (k, pi)

wherepi=(pi,1 ,…, pi,20)

estimation for the pattern
Estimation for the “pattern”
  • The maximum likelihood:
  • Bayesian estimate:
    • Prior: pi ~ Dirichlet (ai,1, ..., ai,20),“pseudo-counts”
    • Posterior: [pi | obs ]~ Dirichlet (ai,1,+ki,1,…, ai,20 +ki,20)
    • Posterior Mean:
    • Posterior Distribution:
dealing with the missing data

a1

a2

a3

ak?

Dealing with the missing data
  • Let Q=(p0,p1 , … , pw), “parameter”, A={a1, a2, …, aK}
  • Iterative sampling: P(Q | A, Data); P(A | Q, Data)
    • Draw from [Q | A, Data], then draw from [A | Q, Data]
  • Predictive Updating:pretend that K-1sequences have been aligned. We stochastically predict for the K-th sequence!!
the algorithm
The Algorithm
  • Initialized by choosing random starting positions
  • Iterate the following steps many times:
    • Randomly or systematically choose a sequence, say, sequence k, to exclude.
    • Carry out the predictive-updating step to update ak
  • Stop when not much change observed, or some criterion met.
the pu step
The PU-Step

a1

a2

a3

ak?

1. Compute predictive frequencies of each position i in motif

cij= count of amino acid typejat positioni.

c0j = count of amino acid type j in all non-site positions.

qij=(cij+bj)/(K-1+B),B=b1+ · · ·+ bK “pseudo-counts”

2. Sample from the predictive distriubtion ofak.

phase shift and fragmentation

: True motif locations

ak?

Phase-shift and Fragmentation
  • Sometimes get stuck in a local shift optimum
  • How to “escape” from this local optimum?
    • Simultaneous move: A ®A+d, A+d={a1+d, … , aK+d}
    • Use a Metropolis step: accept the move with prob=p,

Compare entropies between

new columns and left-out ones.

acknowledgements for slides used
Acknowledgementsfor slides used

PDB: protein figures

Lior Pachter: gene finding

Jun Liu: Gibbs sampler