
CS5238 Combinatorial methods in bioinformatics - PowerPoint PPT Presentation



CS5238 Combinatorial methods in bioinformatics. Topic: Gene Finding – Promoter Recognition Cen Cen, Er Inn Inn, Miao Xiaoping, Piyush Kanti Bhunre, Yin Jun. 1 November 2002. Outline of Presentation. Biological Background Gene Finding Promoter Recognition Dragon Promoter Finder


CS5238 Combinatorial methods in bioinformatics

Topic: Gene Finding –

Promoter Recognition

Cen Cen, Er Inn Inn, Miao Xiaoping,

Piyush Kanti Bhunre, Yin Jun

1 November 2002


Outline of Presentation

  • Biological Background

  • Gene Finding

  • Promoter Recognition

  • Dragon Promoter Finder

  • Open Problem and Future Research

  • New Algorithm

  • Conclusion


Biological Background

  • What is a gene?

    • A sequence of DNA that encodes a protein or an RNA molecule.

    • A gene has 4 regions: coding region, 5’ UTR, 3’ UTR and regulatory region (the promoter, which regulates the transcription process)

    • Human genome: about 3 Gbp, but only about 3% is coding region.


Central Dogma

  • Central Dogma: the process by which a DNA sequence gives rise to a protein

    • Transcription & Translation

  • Promoter – responsible for initiation and regulation of transcription

  • RNA polymerase binds at a TATA sequence in the promoter region



Promoter Region

  • Core Promoter –

    • TATA-box

    • Initiator (Inr)

    • Downstream promoter element

  • 3 types of core promoter

    • TATA-box

    • TATA-less, Inr-containing

    • Inr + DPE

  • Upstream promoter elements

  • TSS (transcription start site): where transcription starts on the DNA

Source: "The biology of eukaryotic promoter prediction – a review" by Pedersen, A.G. et al.


What is Gene Finding?

  • Generate predictions of gene locations from primary genomic sequence (DNA sequence) by computational methods.

  • Task of gene finding – separate the coding regions, non-coding regions and intergenic regions.

    • Input: a DNA sequence X = x1x2…xn, where each xi belongs to {A, C, G, T}

    • Output: a correct labeling of each element of X as belonging to a coding region (CR), a non-coding region (NCR) or an intergenic region


Gene Finding

  • 3 major kinds of gene finding strategies:

    • Content-based – uses overall properties of the sequence when making predictions

    • Site-based – make use of presence or absence of a specific sequence, pattern or consensus

    • Comparative – sequence homology (database searching)

  • Combinatorial approach - GeneMachine

  • GRAIL, FGENEH, MZEF, GenScan, GeneID, GeneParser, HMMgene and so on.


Gene Finding – Open Problems

  • Overlapping genes: no existing method can deal with this problem

  • Alternative splicing and alternative transcription/translation

  • Sequencing errors

  • Promoter regions (PR) and polyA signals are difficult to identify (a high true-positive rate comes with a high false-positive rate)


Promoter Recognition

  • Accurate PR can help to:

    • Detect a respective gene more easily

    • Determine the 5’ ends of the respective gene more precisely

    • Localize the regions that contain numerous different transcription control components

  • Developing a perfect predictive model of PR is challenging


Main Approach to PR

  • Pattern-driven strategy

    • Collect a set of real binding sites and build a characteristic definition, representation, pattern or profile from them

    • Recognize individual potential binding sites by using their characteristic profiles

    • Assemble the candidate binding sites according to descriptions and rules about how they should be arranged.


Problem:

  • Given a collection of known binding sites, how can we develop a representation of those sites that is useful for finding them in new sequences?

    • Consensus sequences

    • Positional Weight Matrices (PWM)

    • Hidden Markov profiles

    • Multilayer neural networks and so on


Promoter Recognition Program

  • Statistical approach + artificial intelligence techniques -

    • Dragon Promoter Finder (DPF)

    • PromoterInspector

    • Promoter 2.0


Accuracy Metric for PR

A common measure of prediction accuracy: sensitivity (SE) and specificity (SP)

$$\mathrm{SE} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}$$

  • Evaluation largely influenced by training set and test sets


Prediction of Promoter

2 x 2 contingency table


Example of Prediction - DPF

Promoter positions - exact positions of the TSS

2360, 2585, 4125, 5026, 5734, 7090,8567, 10641,

-2700, -12561, -12855

PREDICTED TRANSCRIPTION START SITES:

gi_59865_emb_X02138.1_HEHSV1SU Herpes simplex virus type 1 _HSV1_ short unique region DNA

Sequence length: 12979 # of bases: A=2286, C=4271, G=4078, T=2344

Predicted TSS

Forward strand

4125 5733 7093 8567 10641

# of guesses = 5

Reverse complement strand

-12561 -2698

# of guesses = 2


Measurement: Dragon Promoter Finder, BIC-KRDL Singapore

SE = 7/11 = 0.64

SP = 6479/6479 = 1
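
The sensitivity figure above can be reproduced from the two position lists on the previous slide. The sketch below is only an illustration: the function name `match_predictions` and the matching tolerance of 5 bp are assumptions, since the slides do not state how close a prediction must be to a real TSS to count as a true positive.

```python
def match_predictions(true_tss, predicted_tss, tolerance=5):
    """Count TP, FP and FN by pairing each prediction with an unused
    true TSS that lies within `tolerance` bp (tolerance is an assumption)."""
    unmatched = set(true_tss)
    tp = fp = 0
    for pred in predicted_tss:
        hit = next((t for t in sorted(unmatched) if abs(t - pred) <= tolerance), None)
        if hit is None:
            fp += 1          # prediction with no real TSS nearby
        else:
            unmatched.remove(hit)
            tp += 1          # prediction explains one real TSS
    return tp, fp, len(unmatched)  # leftover real TSS are false negatives


# Positions copied from the DPF example; negative values are the reverse strand.
true_tss = [2360, 2585, 4125, 5026, 5734, 7090, 8567, 10641, -2700, -12561, -12855]
predicted = [4125, 5733, 7093, 8567, 10641, -12561, -2698]

tp, fp, fn = match_predictions(true_tss, predicted)
sensitivity = tp / (tp + fn)
print(tp, fp, fn, round(sensitivity, 2))
```

Under this matching rule all 7 predictions pair with a real TSS, which gives SE = 7/11. Because there are no false positives, SP = TN / (TN + 0) = 1 for any non-zero count of true negatives.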


Dragon Promoter Finder - Introduction

  • Dragon Promoter Finder (DPF)

    • locates RNA polymerase II promoters in DNA sequences of vertebrates

    • predicts Transcription Start Site (TSS) positions.

  • strand specific

  • Components:

    • nonlinear promoter recognition models

    • signal processing

    • artificial neural networks (ANNs)

    • sensors.


Introduction (cont)

  • The latest version

    • Dragon Promoter Finder Ver. 1.3

  • Main difference in new version

    • models are now specialized for C+G-rich and for C+G-poor sequences.


Structure

  • Overall Model

    • comprises a collection of basic models

  • Basic Model

    • made up of two sub-models, A and B

    • trained for different ranges of system sensitivity

    • trained separately for the best performance.

  • Sub-Model



Basic Model

  • The overall model is a composite collection of basic models, each of which

    • Possesses an identical structure

    • Is trained for a narrow specificity range.

    • Processes data in an analogous way.




Sub-model

  • Three Sensors

    • Specific functional regions of a gene: promoter, coding-exon, intron

    • Represented as positional distributions of overlapping pentamers

  • ANNs


Sensors

  • Pentamers :

    • All sequences of 5 consecutive nucleotides.

    • AAAAA, AAAAC, AAAAG, …: 4^5 = 1024 pentamers

    • Selected the most significant 256 pentamers from 1024 pentamers according to statistical relevance

  • Positional weight matrices (PWM):

    • The positional distribution of selected pentamers

    • Generate PWMs for each of the 3 functional groups, promoter, exon & intron, by counting the frequencies of all selected pentamers at each position.
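
The two steps above, enumerating pentamers and counting them per position, can be sketched as follows. This is a minimal illustration and not DPF's implementation: the function names are hypothetical, the training sequences are assumed to be aligned, and the statistical selection of the 256 most significant pentamers is not reproduced (the caller passes in whatever set it has selected).

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def all_pentamers():
    """Return the 4^5 = 1024 pentamers in lexicographic order."""
    return ["".join(t) for t in product(ALPHABET, repeat=5)]


def positional_counts(sequences, selected):
    """counts[i][p] = number of aligned sequences in which the selected
    pentamer p starts at position i (assumes sequences are aligned)."""
    selected = set(selected)
    length = min(len(s) for s in sequences)
    counts = [Counter() for _ in range(length - 4)]
    for s in sequences:
        for i in range(length - 4):
            p = s[i:i + 5]
            if p in selected:
                counts[i][p] += 1
    return counts


pentamers = all_pentamers()
print(len(pentamers), pentamers[:3])

# Toy "promoter" training set; real training data would be much larger.
counts = positional_counts(["ACGTAC", "ACGTAG"], selected=pentamers)
print(counts[0]["ACGTA"], counts[1]["CGTAC"])
```

One such table would be built for each functional group (promoter, exon, intron), each from its own training sequences.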


How to analyze the content of a data window:

  • Sequence W = n1n2…nL-1nL, where each ni belongs to {A, C, G, T}

  • Sequence P of successive overlapping pentamers pj: P = p1p2…pL–5pL–4

  • S = score for each data window, computed from the jth pentamer at position i and the frequency of the jth pentamer at position i

  • The higher S is, the more likely the data window represents the respective functional region.

  • These scores are input to a nonlinear signal processing block (SPB)

  • The output from the SPB is then input to the ANN
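
A sensor score of this kind can be sketched as below. The exact formula did not survive in the transcript, so the normalization used here (matched positional frequencies divided by the best achievable sum) is an assumption, and `window_score` and the toy matrix are hypothetical names and values for illustration only.

```python
def window_score(window, pwm):
    """Score a data window against a positional pentamer matrix.

    pwm[i] maps a pentamer to its frequency at position i.
    Assumed form: sum of the frequencies of the pentamers actually seen,
    divided by the sum of the largest frequency at each position, so that
    the score lies in [0, 1].
    """
    pents = [window[i:i + 5] for i in range(len(window) - 4)]
    columns = pwm[:len(pents)]
    matched = sum(col.get(p, 0.0) for p, col in zip(pents, columns))
    best = sum(max(col.values(), default=0.0) for col in columns)
    return matched / best if best > 0 else 0.0


# Toy matrix for a window of length 6 (two pentamer positions).
toy_pwm = [{"ACGTA": 0.5, "TTTTT": 0.25}, {"CGTAC": 0.75}]
print(window_score("ACGTAC", toy_pwm))  # window made of the most frequent pentamers
print(window_score("TTTTTT", toy_pwm))  # window that matches only weakly
```

Each of the three sensors (promoter, exon, intron) would score the same window with its own matrix, and the three scores would then go through the SPB to the ANN.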


ANNs

  • Inputs: scores (outputs of sensors)

  • A multi-sensor integration.

  • Trained by the Bayesian regularization method to separate promoter regions from the non-promoter regions.

  • The threshold that best separated promoters from non-promoters was selected

  • If the ANN output exceeds the threshold, the window is predicted to be a promoter region, with the TSS placed 50 bp before the data window’s end
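
The decision rule in the last bullet can be sketched in a few lines. The function name `predict_tss`, the window end positions, the ANN outputs and the threshold value 0.5 are all made up for illustration; only the 50 bp offset comes from the slide.

```python
def predict_tss(window_outputs, threshold, offset=50):
    """Turn per-window ANN outputs into predicted TSS positions.

    window_outputs: iterable of (window_end_position, ann_output) pairs.
    A window whose output exceeds the threshold yields a TSS `offset` bp
    before the window's end.
    """
    return [end - offset for end, out in window_outputs if out > threshold]


# Hypothetical outputs for three windows ending at positions 250, 300 and 350.
outputs = [(250, 0.1), (300, 0.9), (350, 0.6)]
print(predict_tss(outputs, threshold=0.5))
```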


Evaluation

  • Successfully recognizes both CpG island-related and CpG island-nonrelated promoters.

  • Its performance on several large sets (A, B, and human chromosome 22) is reasonably consistent

  • On average, its expected maximum sensitivity is approximately 66 percent.

  • In general, DPF produces many times fewer FP predictions than the systems it was compared with at the same sensitivity level.



Open Problem & Future Research

  • Open problem:

    • Lack of biological information on transcription process

    • Promoter characteristics are hard to capture, which leads to low prediction accuracy

  • Future research work:

    • Designing specific algorithms for particular classes of promoters or for species-specific promoters

    • Comparative sequence analysis

    • Combinatorial approach

    • Data mining tools



Gene Recognition Algorithm

Using Dynamic Programming Approach

Presented by: Yin Jun


Dynamic Programming Algorithm

Existing Dynamic Programming Algorithms for Gene Finding

  • Snyder and Stormo’s method

    • GeneParser

  • Solovyev et al.’s method

    • FGENEH

  • MORGAN’s DP algorithm


Goal of These Algorithms

  • Divide the DNA sequence into alternating intron and exon regions.

  • Define a score for each possible division and find the division with the maximum score. The higher the score, the better the division.


Advantage and Disadvantage of Snyder and Stormo’s algorithm

  • Advantage

    • the donor and the acceptor site

    • HMM hidden status

  • Disadvantage

    • Cannot recognize promoter

    • 3-mer based


Our Algorithm

  • Combine the ideas of “Dragon Promoter Finder” and “Snyder and Stormo’s algorithm”

  • Can deal with promoters

  • Uses pentamers instead of 3-mers, which is more efficient

  • Dynamic Programming


Training Phase

  • Pentamer – 5 consecutive bases

    • For example: “ACGGT”

    • There are 4^5 = 1024 different kinds of pentamers

  • Divide a DNA sequence into pentamers

  • From training data, we can obtain the probability that each kind of pentamer belongs to a promoter, an intron or an exon
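
The training step can be sketched as a simple counting procedure. Several points here are assumptions, since the slides do not give details: the sequence is cut into non-overlapping pentamers, each training sequence carries a single region label, and the probability is estimated as the fraction of a pentamer's occurrences that fall in each region. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def split_pentamers(seq):
    """Cut a sequence into consecutive non-overlapping pentamers
    (assumption); trailing bases that do not fill a pentamer are dropped."""
    return [seq[i:i + 5] for i in range(0, len(seq) - 4, 5)]


def estimate_probabilities(labeled_sequences):
    """labeled_sequences: iterable of (sequence, region) pairs.
    Returns P[pentamer][region] = share of that pentamer's occurrences
    observed in the given region."""
    counts = defaultdict(Counter)
    for seq, region in labeled_sequences:
        for p in split_pentamers(seq):
            counts[p][region] += 1
    return {
        p: {r: c / sum(by_region.values()) for r, c in by_region.items()}
        for p, by_region in counts.items()
    }


# Toy training data, made up for illustration.
training = [("ACGGTACGGT", "promoter"), ("ACGGTTTTTT", "exon")]
P = estimate_probabilities(training)
print(P["ACGGT"], P["TTTTT"])
```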



Principle of Division (1)

  • Good (red: promoter; green: intron; blue: exon)

  • Bad (low sum of probability)

[Figure: the pentamer sequence C C A B B C B A D D D shown twice, coloured by region for the good and the bad division]

Principle of Division (2)

  • Good (red: promoter; green: intron; blue: exon)

  • Bad (too frequent mutation)

[Figure: the pentamer sequence C C A B B C B A D D D shown twice, coloured by region for the good and the bad division]

Mutation Penalty

  • M(x, x) should be 0, x∈ {1, 2, 3}

    • 1: promoter

    • 2: intron

    • 3: exon

  • Example


Notation

  • P(p, r) – Probability that pentamer p belongs to region r

    • Obtain from training data

  • M(s, t) – Mutation penalty

    • Parameters to specify

  • pi (1 ≤ i ≤ n) – The i-th pentamer in the DNA sequence

    • Input data (testing data)

  • a(pi) – Region assignment result; a(pi)∈{1, 2, 3}

    • Output data


Score Function

  • For division assignment a, its score is

  • We use a dynamic programming algorithm to find the best division assignment, i.e. the one with the highest score
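
The score formula itself is missing from this transcript. One form that is consistent with the notation and with the two principles of division (reward a high sum of probabilities, penalize frequent region changes) is given below; it is an assumed reconstruction, not necessarily the formula on the original slide.

```latex
% Assumed score of a division assignment a over pentamers p_1, ..., p_n:
% total region probability minus the mutation penalties between neighbours.
S(a) = \sum_{i=1}^{n} P\bigl(p_i, a(p_i)\bigr)
     - \sum_{i=1}^{n-1} M\bigl(a(p_i), a(p_{i+1})\bigr)
```

With M(x, x) = 0, a stretch of pentamers assigned to the same region incurs no penalty, so only the boundaries between regions cost anything.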


Bases

  • Let F(i, j, s, t) be the optimal score for the consecutive segment of pentamers from the i-th to the j-th, where the i-th pentamer is assigned region s and the j-th pentamer is assigned region t

  • Bases


Recursive Definition

  • Recursive Definition

  • Finally, we get F(1, n, s, t) where s, t ∈{1, 2, 3}

  • Pick the highest of the 9 scores
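
The base case and the recurrence are missing from this transcript, so the sketch below fills them in under stated assumptions: the score of a division is the sum of P(pi, a(pi)) minus the mutation penalties between neighbouring pentamers, the base case is F(i, i, s, s) = P(pi, s), and a segment is built by joining two shorter segments at a split point k. Function names and the toy numbers are hypothetical.

```python
REGIONS = (1, 2, 3)  # 1: promoter, 2: intron, 3: exon
NEG_INF = float("-inf")


def division_score(assignment, P, M):
    """Assumed score: sum of region probabilities minus mutation penalties."""
    gain = sum(P[i][r] for i, r in enumerate(assignment))
    penalty = sum(M[s, t] for s, t in zip(assignment, assignment[1:]))
    return gain - penalty


def best_division_score(P, M):
    """Interval DP over F(i, j, s, t); P[i][r] is the probability that the
    i-th pentamer (0-indexed) belongs to region r, M[s, t] the penalty."""
    n = len(P)
    F = {}
    for i in range(n):  # base case: one pentamer, start region == end region
        for s in REGIONS:
            for t in REGIONS:
                F[i, i, s, t] = P[i][s] if s == t else NEG_INF
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for s in REGIONS:
                for t in REGIONS:
                    F[i, j, s, t] = max(
                        F[i, k, s, u] + F[k + 1, j, v, t] - M[u, v]
                        for k in range(i, j)  # split point
                        for u in REGIONS      # region ending the left part
                        for v in REGIONS      # region starting the right part
                    )
    return max(F[0, n - 1, s, t] for s in REGIONS for t in REGIONS)


# Toy input: four pentamers, flat penalty of 0.5 for any change of region.
P = [{1: 0.7, 2: 0.2, 3: 0.1}, {1: 0.6, 2: 0.3, 3: 0.1},
     {1: 0.1, 2: 0.2, 3: 0.7}, {1: 0.1, 2: 0.1, 3: 0.8}]
M = {(s, t): (0.0 if s == t else 0.5) for s in REGIONS for t in REGIONS}
print(round(best_division_score(P, M), 3))
print(round(division_score((1, 1, 3, 3), P, M), 3))
```

On the toy input the best division is promoter, promoter, exon, exon. Recovering the assignment itself, rather than only its score, would need back-pointers, which are left out to keep the sketch short.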


Time Complexity

  • There are 9n²/2 = O(n²) entries in the dynamic programming table

  • Filling each entry takes about n/2 = O(n) time on average

  • The total time complexity is O(n³)


Conclusion

  • Significant achievements in promoter recognition techniques and algorithms contribute to major advances in gene finding.

  • There is still room for improvement in promoter recognition.

  • A new algorithm is proposed for gene recognition.

