Gene Feature Recognition

For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician


Limsoon Wong



Recognition of Splice Sites

A simple example to start the day 



Splice Sites

Donor

Acceptor


Acceptor Site (Human Genome)

Image credit: Xu

  • If we align all known acceptor sites (with their splice junction site aligned), we have the following nucleotide distribution

  • Acceptor site: CAG or TAG | coding region


Donor Site (Human Genome)

Image credit: Xu

  • If we align all known donor sites (with their splice junction site aligned), we have the following nucleotide distribution

  • Donor site: coding region | GT



What Positions Have “High” Information Content?

  • For a weight matrix, information content of each column is calculated as

    – X{A,C,G,T}Prob(X)*log (Prob(X)/0.25)

  • When a column has evenly distributed nucleotides, its information content is lowest

  • Only need to look at positions having high information content
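The column measure above can be computed directly. A minimal sketch, assuming base-10 logarithms (the base that reproduces the worked totals on the next slide):

```python
from math import log10

def column_information(probs, background=0.25):
    """Relative entropy of one weight-matrix column against a
    uniform background distribution (base-10 log assumed)."""
    return sum(p * log10(p / background) for p in probs if p > 0)

# Column -3 of the human donor-site matrix (A, C, G, T frequencies)
print(round(column_information([0.34, 0.363, 0.183, 0.114]), 2))  # 0.04

# A column with evenly distributed nucleotides carries no information
print(column_information([0.25, 0.25, 0.25, 0.25]))               # 0.0
```

As the second call shows, a uniform column scores exactly zero, which is why only high-scoring positions are worth keeping.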


Information Content Around Donor Sites in Human Genome

Image credit: Xu

  • Information content (log taken to base 10)

    • column –3 = .34*log(.34/.25) + .363*log(.363/.25) + .183*log(.183/.25) + .114*log(.114/.25) ≈ 0.04

    • column –1 = .092*log(.092/.25) + .033*log(.033/.25) + .803*log(.803/.25) + .073*log(.073/.25) ≈ 0.30


Weight Matrix Model for Splice Sites

Image credit: Xu

  • Weight matrix model

    • build a weight matrix for donor, acceptor, translation start site, respectively

    • use positions of high information content


Splice Site Prediction: A Procedure

Image credit: Xu

  • Add up freq of corr letter in corr positions:

  • Make prediction on splice site based on some threshold

AAGGTAAGT: .34 + .60 + .80 +1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24

TGTGTCTCA: .11 + .12 + .03 +1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56
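The scoring procedure can be sketched as follows. The matrix fragment below holds only the entries exercised by the slide's two candidates (all other entries are omitted), and the decision threshold is illustrative, not a value from the lecture:

```python
# Fragment of a donor-site weight matrix: one dict of letter -> frequency
# per position; only the entries needed for the two examples are filled in.
matrix = [
    {"A": 0.34, "T": 0.11},
    {"A": 0.60, "G": 0.12},
    {"G": 0.80, "T": 0.03},
    {"G": 1.00},            # invariant G of the GT donor dinucleotide
    {"T": 1.00},            # invariant T
    {"A": 0.52, "C": 0.02},
    {"A": 0.71, "T": 0.07},
    {"G": 0.81, "C": 0.05},
    {"T": 0.46, "A": 0.16},
]

def score(candidate, matrix):
    """Add up the frequency of the corresponding letter in each position."""
    return sum(col.get(base, 0.0) for col, base in zip(matrix, candidate))

def is_donor_site(candidate, threshold=5.0):  # threshold is illustrative
    return score(candidate, matrix) >= threshold

print(round(score("AAGGTAAGT", matrix), 2))   # 6.24
print(round(score("TGTGTCTCA", matrix), 2))   # 2.56
```

With the illustrative threshold, the first candidate is called a donor site and the second is rejected, matching the slide's intent.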



Recognition of Translation Initiation Sites

An introduction to the world’s simplest TIS recognition system

A simple approach to accuracy and understandability



Translation Initiation Site



A Sample cDNA

  • What makes the second ATG the TIS?

299 HSU27655.1 CAT U27655 Homo sapiens

CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80

CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160

GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240

CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

............................................................ 80

................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160

EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240

EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE



Approach

  • Training data gathering

  • Signal generation

    • k-grams, distance, domain know-how, ...

  • Signal selection

    • Entropy, χ², CFS, t-test, domain know-how...

  • Signal integration

    • SVM, ANN, PCL, CART, C4.5, kNN, ...



Training & Testing Data

  • Vertebrate dataset of Pedersen & Nielsen [ISMB’97]

  • 3312 sequences

  • 13503 ATG sites

  • 3312 (24.5%) are TIS

  • 10191 (75.5%) are non-TIS

  • Used for 3-fold cross-validation experiments



Signal Generation

  • K-grams (i.e., k consecutive letters)

    • K = 1, 2, 3, 4, 5, …

    • Window size vs. fixed position

    • Upstream, downstream vs. anywhere in window

    • In-frame vs. any frame



Signal Generation: An Example

299 HSU27655.1 CAT U27655 Homo sapiens

CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80

CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160

GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240

CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

  • Window = 100 bases

  • In-frame, downstream

    • GCT = 1, TTT = 1, ATG = 1…

  • Any-frame, downstream

    • GCT = 3, TTT = 2, ATG = 2…

  • In-frame, upstream

    • GCT = 2, TTT = 0, ATG = 0, ...
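A minimal sketch of the in-frame vs. any-frame counting distinction (the sequence below is a toy input anchored at an assumed reading frame, not the HSU27655.1 example):

```python
def kgram_counts(seq, k=3, in_frame=True):
    """Count k-grams in seq. If in_frame, only positions that are
    multiples of k are counted (frame anchored at the start of seq);
    otherwise a sliding window counts k-grams in every frame."""
    step = k if in_frame else 1
    counts = {}
    for i in range(0, len(seq) - k + 1, step):
        gram = seq[i:i + k]
        counts[gram] = counts.get(gram, 0) + 1
    return counts

# Toy stretch assumed to start right after an ATG
downstream = "GCTGAAGCT"
print(kgram_counts(downstream, in_frame=True))          # {'GCT': 2, 'GAA': 1}
print(kgram_counts(downstream, in_frame=False)["GCT"])  # 2
```

The same k-gram can thus get different counts depending on whether only the coding frame or all three frames are scanned, which is exactly the distinction the feature generation step encodes.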



An Example File Resulting From Feature Generation



Too Many Signals

  • For each value of k, there are 4^k * 3 * 2 k-grams

  • If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!

  • This is too many for most machine learning algorithms
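The feature-count arithmetic checks out; the sketch below assumes (based on the signal-generation slide) that the 3 stands for the region choices (upstream, downstream, anywhere) and the 2 for the frame choices (in-frame, any frame):

```python
# 4^k possible k-grams, times 3 region choices and 2 frame choices
# (the 3 x 2 interpretation is an assumption from the preceding slide).
per_k = [4**k * 3 * 2 for k in range(1, 6)]
total = sum(per_k)

print(per_k)   # [24, 96, 384, 1536, 6144]
print(total)   # 8184
```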



Signal Selection (Basic Idea)

  • Choose a signal w/ low intra-class distance

  • Choose a signal w/ high inter-class distance


Signal Selection (e.g., t-statistics)


Signal Selection (e.g., χ²)


Signal Selection (e.g., CFS)

  • Instead of scoring individual signals, how about scoring a group of signals as a whole?

  • CFS

    • Correlation-based Feature Selection

    • A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
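CFS ranks whole subsets with a merit heuristic built from average correlations; a sketch of the standard correlation-based merit, where the correlation values passed in are hypothetical inputs rather than values computed from the TIS data:

```python
from math import sqrt

def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """CFS merit of a k-feature subset: rewards high feature-class
    correlation, penalises redundancy among the features themselves."""
    return (k * avg_feature_class_corr) / sqrt(
        k + k * (k - 1) * avg_feature_feature_corr)

# Two hypothetical 10-feature subsets, equally correlated with the class:
print(round(cfs_merit(10, 0.5, 0.1), 2))  # 1.15 (features mostly independent)
print(round(cfs_merit(10, 0.5, 0.8), 2))  # 0.55 (features highly redundant)
```

The redundant subset scores far lower even though its features are individually just as class-correlated, which is the point of scoring groups rather than individual signals.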


Sample k-grams Selected by CFS

  • Position –3 (Kozak consensus)

  • In-frame upstream ATG (leaky scanning)

  • In-frame downstream

    • TAA, TAG, TGA (stop codons)

    • CTG, GAC, GAG, and GCC (codon bias?)


Signal integration

Signal Integration

  • kNN

    • Given a test sample, find the k training samples that are most similar to it. Let the majority class win

  • SVM

    • Given a group of training samples from two classes, determine a separating plane that maximises the margin of separation between the classes

  • Naïve Bayes, ANN, C4.5, ...
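The kNN integrator described above can be sketched in a few lines; the 2-D points and labels below are toy data, not TIS feature vectors:

```python
from collections import Counter

def knn_predict(test, train, k=8):
    """train: list of (feature_vector, label) pairs. Find the k training
    samples most similar to test (Euclidean distance) and let the
    majority class win."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda sample: dist(sample[0], test))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: TIS-like points near (1, 1), non-TIS points near (0, 0)
train = [((1.0, 1.1), "TIS"), ((0.9, 1.0), "TIS"), ((1.1, 0.9), "TIS"),
         ((0.0, 0.1), "non-TIS"), ((0.1, 0.0), "non-TIS")]

print(knn_predict((0.95, 1.0), train, k=3))   # TIS
print(knn_predict((0.05, 0.05), train, k=3))  # non-TIS
```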


Illustration of kNN (k=8)

Image credit: Zaki

[Figure: the k = 8 neighbourhood of the test sample contains 5 samples of one class and 3 of the other, so the majority class wins; a typical “distance” measure is also shown]



Using WEKA for TIS Prediction


Results (3-fold cross-validation)

* Using top 20 χ²-selected features from amino-acid features


Validation Results (on Chr X and Chr 21)

  • Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset

  • [Figure: our method compared with ATGpr]


Technique Comparisons

  • Pedersen & Nielsen [ISMB’97]

    • 85% accuracy

    • Neural network

    • No explicit features

  • Zien [Bioinformatics’00]

    • 88% accuracy

    • SVM + kernel engineering

    • No explicit features

  • Hatzigeorgiou [Bioinformatics’02]

    • 94% accuracy (with scanning rule)

    • Multiple neural networks

    • No explicit features

  • Our approach

    • 89% accuracy (94% with scanning rule)

    • Explicit feature generation

    • Explicit feature selection

    • Use any machine learning method without any form of complicated tuning



Recognition of Transcription Start Sites

An introduction to the world’s best TSS recognition system

A heavy tuning approach



Transcription Start Site


Structure of Dragon Promoter Finder

  • –200 to +50 window size

  • Model selected based on desired sensitivity


Each model has two submodels based on GC content

  • GC-rich submodel vs. GC-poor submodel

  • (C+G) = (#C + #G) / window size
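The GC-content routing can be sketched directly from the (C+G) definition; the 0.5 cut-off between the two submodels is an illustrative assumption, not a value from the lecture:

```python
def gc_content(window):
    """(C+G) = (#C + #G) / window size."""
    return (window.count("C") + window.count("G")) / len(window)

def pick_submodel(window, threshold=0.5):  # threshold is illustrative
    """Route the input window to the GC-rich or GC-poor submodel."""
    return "GC-rich" if gc_content(window) >= threshold else "GC-poor"

print(round(gc_content("GCGCAT"), 3))  # 0.667
print(pick_submodel("GCGCAT"))         # GC-rich
print(pick_submodel("ATATGC"))         # GC-poor
```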


Data Analysis Within Submodel

[Figure: within each submodel, promoter (sp), exon (se), and intron (si) sensors are applied; each is a k-gram (k = 5) positional weight matrix]


Promoter, Exon, Intron Sensors

  • These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers)

  • They are calculated as the score s below, using promoter, exon, and intron data respectively

  • s = the sum, over positions i of the input window, of the frequency (in the training windows) of the jth pentamer at the ith position, where j is the pentamer observed at the ith position of the input
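The sensor score can be sketched as a positional lookup-and-sum; the window and the per-position pentamer frequencies below are made-up illustrative values, not trained promoter/exon/intron statistics:

```python
def sensor_score(window, freq, k=5):
    """Positional weight-matrix sensor: for each position i, add the
    training frequency of the pentamer observed at i in the input.
    freq[i] maps pentamer -> frequency at position i."""
    return sum(freq[i].get(window[i:i + k], 0.0)
               for i in range(len(window) - k + 1))

# Tiny illustration: a 7-base window has 3 pentamer positions
freq = [{"ACGTA": 0.4}, {"CGTAC": 0.3}, {"GTACG": 0.2}]
print(round(sensor_score("ACGTACG", freq), 1))  # 0.9
```

Training three such tables on promoter, exon, and intron windows respectively yields the three sensor signals fed into the next stage.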


Data Preprocessing & ANN

  • The sensor signals sE, sI, and sIE are combined into net = Σ si * wi, where the weights wi are tuning parameters

  • The output tanh(net), with tanh(x) = (e^x − e^−x) / (e^x + e^−x), is compared against a tuned threshold

  • The combining network is a simple feedforward ANN trained by the Bayesian regularisation method
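The combination step above amounts to a weighted sum squashed by tanh; the signal values, weights, and threshold in this sketch are made-up illustrative numbers, not Dragon's tuned parameters:

```python
from math import tanh

def ann_output(signals, weights, threshold):
    """net = sum of si * wi; the tanh-squashed net is compared
    against a tuned decision threshold."""
    net = sum(s * w for s, w in zip(signals, weights))
    return tanh(net), tanh(net) >= threshold

# sE, sI, sIE sensor signals with illustrative weights and threshold
out, is_promoter = ann_output([0.8, 0.2, 0.4], [1.5, -1.0, -0.5], 0.5)
print(round(out, 3), is_promoter)  # 0.664 True
```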


Accuracy Comparisons

[Figure: accuracy with C+G submodels vs. without C+G submodels]



Notes



References (TIS Recognition)

  • A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226--233, 1997

  • H. Liu, L. Wong, “Data Mining Tools for Biological Sequences”, Journal of Bioinformatics and Computational Biology, 1(1):139--168, 2003

  • A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16:799--807, 2000

  • A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18:343--350, 2002



References (TSS Recognition)

  • V. B. Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21:323--332, 2003

  • J. W. Fickett, A. G. Hatzigeorgiou, “Eukaryotic promoter recognition”, Genome Res. 7:861--878, 1997

  • A. G. Pedersen et al., “The biology of eukaryotic promoter prediction---a review”, Computers & Chemistry 23:191--207, 1999

  • M. Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297:599--606, 2000

