Gene Recognition

PowerPoint Slideshow about 'Gene Recognition' - maylin


Presentation Transcript
Gene structure

[Figure: a gene drawn as exon1, intron1, exon2, intron2, exon3, processed by transcription, then splicing, then translation.]

Codon: a triplet of nucleotides that is translated into one amino acid

exon = protein-coding

intron = non-coding
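
As a rough illustration of the splicing step on this slide, the sketch below removes the introns from a toy gene and splits the result into codons. The sequence, the exon coordinates, and the helper names `splice` and `codons` are invented for the example and do not come from the slides.

```python
def splice(dna, exons):
    """Concatenate exon intervals (0-based, end-exclusive), dropping the introns."""
    return "".join(dna[start:end] for start, end in exons)


def codons(mrna):
    """Split a spliced coding sequence into consecutive nucleotide triplets."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


# Toy gene (hypothetical): exons in upper case, introns in lower case.
gene = "ATGGC" + "gtaagtccag" + "TTGGAC" + "gtgagtttag" + "CTAA"
exons = [(0, 5), (15, 21), (31, 35)]

mrna = splice(gene, exons)
print(mrna)          # spliced coding sequence
print(codons(mrna))  # triplets read in frame from the first base
```

Note that the second codon spans the exon1/exon2 junction, which is why gene finders have to track the reading frame across introns.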

Gene Finding

[Figure: exons (each labeled EXON) scattered along a stretch of genomic sequence.]

  • Classes of Gene predictors
    • Ab initio
      • Only look at the genomic DNA of target genome
    • De novo
      • Target genome + aligned informant genome(s)
    • EST/cDNA-based & combined approaches
      • Use aligned ESTs or cDNAs + any other kind of evidence

Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta

Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta

Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca

Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca

Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta

Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta

Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta

Armadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta

Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta

Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta

Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg

Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg

Signals for Gene Finding
  • Regular gene structure
  • Exon/intron lengths
  • Codon composition
  • Motifs at the boundaries of exons, introns, etc.
    • Start codon, stop codon, splice sites
  • Patterns of conservation
  • Sequenced mRNAs
  • (PCR for verification)
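
The boundary motifs listed above can be located with a plain pattern scan. The sketch below is only a minimal illustration: it reports every candidate start codon, stop codon, GT donor dinucleotide, and AG acceptor dinucleotide, most of which are false positives in real DNA. The function name `find_signals` and the `SIGNALS` table are assumptions made for this example; the test sequence is the one shown on the HMM slide of this deck.

```python
import re

# Motif patterns (the minimal consensus only; real signal models are much richer).
SIGNALS = {
    "start_codon": "ATG",
    "stop_codon": "TAA|TAG|TGA",
    "donor_GT": "GT",
    "acceptor_AG": "AG",
}


def find_signals(seq):
    """Return 0-based start positions of every candidate signal in seq."""
    seq = seq.upper()
    hits = {}
    for name, pattern in SIGNALS.items():
        # A lookahead reports overlapping occurrences as well.
        hits[name] = [m.start() for m in re.finditer(f"(?={pattern})", seq)]
    return hits


seq = "GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA"
for name, positions in find_signals(seq).items():
    print(name, positions)
```

The number of hits for the two-letter motifs shows why the other signals in the list (lengths, composition, conservation) are needed to separate real sites from chance matches.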
[Figure: reading frame tracked across exon boundaries. Labels: "Next Exon: Frame 0", "Next Exon: Frame 1".]

Nucleotide Composition
  • Base composition in exons is characteristic due to the genetic code

Amino acid      SLC   DNA codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG

(SLC = single-letter code)
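
To make the table concrete, here is a small sketch that transcribes it into a lookup and translates a coding sequence. The three stop codons are not listed on the slide and are added from the standard genetic code; the names `TABLE`, `CODON_TO_AA`, and `translate` are chosen for this example only.

```python
# Codon table transcribed from the slide (single-letter code -> DNA codons).
TABLE = {
    "I": "ATT ATC ATA", "L": "CTT CTC CTA CTG TTA TTG", "V": "GTT GTC GTA GTG",
    "F": "TTT TTC", "M": "ATG", "C": "TGT TGC", "A": "GCT GCC GCA GCG",
    "G": "GGT GGC GGA GGG", "P": "CCT CCC CCA CCG", "T": "ACT ACC ACA ACG",
    "S": "TCT TCC TCA TCG AGT AGC", "Y": "TAT TAC", "W": "TGG", "Q": "CAA CAG",
    "N": "AAT AAC", "H": "CAT CAC", "E": "GAA GAG", "D": "GAT GAC",
    "K": "AAA AAG", "R": "CGT CGC CGA CGG AGA AGG",
}
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, not on the slide

# Invert the table: codon -> amino acid.
CODON_TO_AA = {codon: aa for aa, codon_list in TABLE.items() for codon in codon_list.split()}


def translate(cds):
    """Translate a coding sequence in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in STOP_CODONS:
            break
        protein.append(CODON_TO_AA[codon])
    return "".join(protein)


print(len(CODON_TO_AA))              # number of sense codons in the table
print(translate("ATGGCTTGGACCTAA"))  # toy coding sequence
```

Because amino acids have between one and six codons, the three codon positions have different base frequencies in real exons, which is the compositional signal the slide refers to.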

[Figure: example gene running from the start codon atg to the stop codon tga; boundary sequences shown in order: caggtg, ggtgag, cagatg, ggtgag, cagttg, ggtgag, caggcc, ggtgag.]

Splice Sites

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
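
A sequence logo summarizes per-position base frequencies at a site. One common way to turn such frequencies into a splice-site score is a position weight matrix (PWM) of log-odds values; the sketch below shows that idea under stated assumptions. The six training 9-mers are invented for illustration (3 exon bases followed by 6 intron bases), the background is assumed uniform, and `build_pwm` and `score` are names chosen here, not taken from the slides.

```python
import math


def build_pwm(sites, pseudocount=1.0):
    """Per-position log-odds scores versus a uniform background (0.25 per base)."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the log-odds of each base in a window of the same length as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


# Hypothetical donor-site examples: exon|intron boundary after the third base.
donor_sites = ["CAGGTGAGT", "AAGGTAAGT", "CAGGTAAGA",
               "AAGGTGAGT", "TGGGTAAGT", "CAGGTATGT"]
pwm = build_pwm(donor_sites)

print(score(pwm, "CAGGTAAGT"))  # window that looks like a donor site
print(score(pwm, "CATCCAAGT"))  # window without the GT dinucleotide
```

A PWM treats positions as independent; models that capture dependencies between positions generally do better but need more training data.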

HMMs for Gene Recognition

[Figure: an HMM labels each position of the sequence below with a state; regions are marked intergene, exon, and intron. States named on the slide: Intergene State, First Exon State, Intron State.]

GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
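
A minimal sketch of how such an HMM assigns states to a sequence is Viterbi decoding over three states. Everything numeric below is an assumption made for illustration (GC-rich exon emissions, AT-rich intron emissions, sticky self-transitions); these are not trained parameters, and a real gene finder has many more states (frames, splice sites, both strands).

```python
import math

STATES = ["intergene", "exon", "intron"]
NEG_INF = float("-inf")


def lg(p):
    """Log of a probability, with log(0) mapped to -inf."""
    return math.log(p) if p > 0 else NEG_INF


# Made-up illustration values, not trained parameters.
START = {"intergene": 1.0, "exon": 0.0, "intron": 0.0}
TRANS = {
    "intergene": {"intergene": 0.9, "exon": 0.1, "intron": 0.0},
    "exon": {"intergene": 0.05, "exon": 0.9, "intron": 0.05},
    "intron": {"intergene": 0.0, "exon": 0.1, "intron": 0.9},
}
EMIT = {
    "intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    score = [{s: lg(START[s]) + lg(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for x in seq[1:]:
        prev = score[-1]
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + lg(TRANS[r][s]))
            row[s] = prev[best] + lg(TRANS[best][s]) + lg(EMIT[s][x])
            ptr[s] = best
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


seq = "GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA"
path = viterbi(seq)
print("".join(state[0] if state != "intron" else "n" for state in path))
```

The printed string uses one letter per base (i = intergene, e = exon, n = intron), so it can be read directly under the sequence.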


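Duration HMMs for Gene Recognition

[Figure: a sequence segmented into Exon1, Exon2, Exon3; an exon of duration d is highlighted (labels: Duration d, j+2), with the scoring terms below attached to the corresponding regions.]

TAATATGTCCACGGGTATTGAGCATTGTACACGGGGTATTGAGCATGTAATGAA

Stop signal: $P_{\mathrm{STOP}}(x_{i-4} \ldots x_{i+3})$

Intron: $\prod_i P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$

5' splice site: $P_{5'\mathrm{SS}}(x_{i-3} \ldots x_{i+4})$

Exon of duration d: $P_{\mathrm{EXON\_DUR}}(d) \prod_i P_{\mathrm{EXON}((i-j+2) \bmod 3)}(x_i \mid x_{i-1} \ldots x_{i-w})$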
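
The exon term can be read as "probability of the length, times a frame-dependent probability of each base". The sketch below computes that term in log space under a simplifying assumption: emissions depend only on the frame, not on the preceding w bases as on the slide. The duration table and emission values are hypothetical, and `exon_segment_log_score` is a name chosen for this example.

```python
import math


def exon_segment_log_score(x, j, d, duration_prob, frame_emit):
    """log[ P_dur(d) * prod_i P_exon,frame(x_i) ] for the segment x[j : j + d].

    Simplification: order-0 emissions per frame (the slide conditions on w previous bases).
    """
    score = math.log(duration_prob[d])
    for i in range(j, j + d):
        frame = (i - j + 2) % 3  # frame index as written on the slide
        score += math.log(frame_emit[frame][x[i]])
    return score


# Hypothetical parameters for illustration only.
UNIFORM = {base: 0.25 for base in "ACGT"}
GC_RICH = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
duration_prob = {3: 0.2, 6: 0.5, 9: 0.3}
frame_emit = [UNIFORM, UNIFORM, GC_RICH]

x = "TAATATGTCCACGGGTATTGAGCATT"
print(exon_segment_log_score(x, j=4, d=6, duration_prob=duration_prob, frame_emit=frame_emit))
```

Modeling the length explicitly is what distinguishes a duration (generalized) HMM from a plain HMM, whose self-transitions imply geometrically distributed segment lengths.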

Genscan
  • Burge, 1997
  • First competitive HMM-based gene finder, huge accuracy jump
  • Only gene finder at the time to predict partial genes and genes on both strands
  • Features
    • Duration HMM
    • Four different parameter sets
      • Very low, low, medium, high GC content
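
Choosing a parameter set by GC content amounts to computing the G+C fraction of the input and looking up a bin. The sketch below illustrates this; the slide only states that there are four GC classes, so the cutoffs in `GC_BINS` (43%, 51%, 57%) are assumptions used for illustration, as are the function names.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Assumed upper bounds (exclusive) for each class; not taken from the slide.
GC_BINS = [(0.43, "very_low"), (0.51, "low"), (0.57, "medium"), (1.01, "high")]


def parameter_set(seq):
    """Pick the name of the parameter set whose GC range contains seq."""
    gc = gc_content(seq)
    for upper, name in GC_BINS:
        if gc < upper:
            return name


print(parameter_set("GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA"))
```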
Using Comparative Information
  • Hox cluster is an example where everything is conserved
Patterns of Conservation

              Genes    Intergenic   Separation
Mutations     30%      58%          2-fold
Gaps          1.3%     14%          10-fold
Frameshifts   0.14%    10.2%        75-fold
Comparison-based Gene Finders
  • Rosetta, 2000
  • CEM, 2000
    • First methods to apply comparative genomics (human-mouse) to improve gene prediction
  • Twinscan, 2001
    • First HMM for comparative gene prediction in two genomes
  • SLAM, 2002
    • Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes
  • N-SCAN, 2006
    • Best method to date, based on a phylo-HMM for multiple-genome gene prediction
Twinscan
  • Align the two sequences (e.g., from human and mouse)
  • Mark each human base as gap ( - ), mismatch ( : ), or match ( | )
  • New "alphabet": 4 x 3 = 12 letters
    • Σ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
  • Run Viterbi using emissions e_k(b), where b ∈ { A-, A:, A|, …, U| }
  • Emission distributions e_k(b) are estimated from real human/mouse genes
    • e_I(x|) < e_E(x|): matches favored in exons
    • e_I(x-) > e_E(x-): gaps (and mismatches) favored in introns

Example

Human: ACGGCGACGUGCACGU

Mouse: ACUGUGACGUGCACUU

Alignment: ||:|:|||||||||:|
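
The conservation marking in this example can be reproduced with a few lines of code. The sketch assumes the alignment has already been projected onto the human sequence (one informant symbol per human base, with "-" where mouse has a gap); the function names are chosen for this illustration.

```python
def conservation_sequence(target, informant):
    """Mark each target base as gap (-), mismatch (:), or match (|)."""
    symbols = []
    for t, q in zip(target, informant):
        if q == "-":
            symbols.append("-")
        elif t == q:
            symbols.append("|")
        else:
            symbols.append(":")
    return "".join(symbols)


def joint_symbols(target, informant):
    """Pair each target base with its conservation symbol (the 12-letter alphabet)."""
    return [t + c for t, c in zip(target, conservation_sequence(target, informant))]


human = "ACGGCGACGUGCACGU"
mouse = "ACUGUGACGUGCACUU"

print(conservation_sequence(human, mouse))
print(joint_symbols(human, mouse)[:5])
```

The list of joint symbols is what the HMM emits in place of plain nucleotides, so exon and intron states can have different emission probabilities for, say, "G|" and "G:".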

SLAM – Generalized Pair HMM

Exon GPHMM
  1. Choose exon lengths (d, e).
  2. Generate alignment of length d + e.

N-SCAN - Multiple Species Gene Prediction

  • GENSCAN: target sequence

Target       GGTGAGGTGACCAAGAACGTGTTGACAGTA

  • TWINSCAN: target sequence plus conservation sequence

Target       GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation |||:||:||:|||||:||||||||......

  • N-SCAN: target sequence plus informant sequences (vector); joint prediction (use phylo-HMM)

Target       GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1   GGTCAGC___CCAAGAACGTGTAG......
Informant2   GATCAGC___CCAAGAACGTGTAG......
Informant3   GGTGAGCTGACCAAGATCGTGTTGACACAA
...
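
One way to picture the "vector" of informant sequences is as alignment columns: each target position is paired with the symbols of all informants at that position. The sketch below builds those columns from the sequences on this slide. Reading "_" as an alignment gap and "." as an unaligned position is an assumption here, and the `agreement` helper is a made-up summary for illustration; a phylo-HMM scores each column with a phylogenetic model rather than a simple match count.

```python
target = "GGTGAGGTGACCAAGAACGTGTTGACAGTA"
informants = [
    "GGTCAGC___CCAAGAACGTGTAG......",
    "GATCAGC___CCAAGAACGTGTAG......",
    "GGTGAGCTGACCAAGATCGTGTTGACACAA",
]

# One tuple per alignment column: (target base, informant1, informant2, informant3).
columns = list(zip(target, *informants))


def agreement(column):
    """Fraction of informants showing the same base as the target in this column."""
    target_base, rest = column[0], column[1:]
    return sum(base == target_base for base in rest) / len(rest)


for position, column in enumerate(columns[:5]):
    print(position, column, round(agreement(column), 2))
```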

Performance Comparison

NSCAN human/mouse > Human/multiple informants

  • GENSCAN: generalized HMM; models human sequence
  • TWINSCAN: generalized HMM; models human/mouse alignments
  • N-SCAN: phylo-HMM; models multiple sequence evolution

CONTRAST
  • 2-level architecture
  • No Phylo-HMM that models alignments

Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta

Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta

Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca

Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca

Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta

Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta

Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta

Armadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta

Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta

Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta

Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg

Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg

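[Figure: two-level architecture. The alignment X is scored by SVM classifiers; their outputs, binned by thresholds a and b, feed a CRF that predicts the labeling Y.]
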
CONTRAST - Features
  • log P(y | x) ~ w^T F(x, y)
  • F(x, y) = Σ_i f(y_{i-1}, y_i, i, x)
  • f(y_{i-1}, y_i, i, x):
    • 1{y_{i-1} = INTRON, y_i = EXON_FRAME_1}
    • 1{y_{i-1} = EXON_FRAME_1, x_{human,i-2}, …, x_{human,i+3} = ACCGGT}
    • 1{y_{i-1} = EXON_FRAME_1, x_{human,i-1}, …, x_{dog,i+1} = ACC, AGC}
    • (1-c) 1{a < SVM_DONOR(i) < b}
    • (optional) 1{EXON_FRAME_1, EST_EVIDENCE}
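
The log-linear form above can be sketched as a sum of weighted indicator features over adjacent label pairs. The code below implements two features in the spirit of the first two items in the list; the weights, the toy sequence, and the function names are hypothetical, and the result is the unnormalized score w^T F(x, y) (the log-partition term that would make it a log-probability is omitted).

```python
def f_intron_to_exon(prev, cur, i, x):
    """Indicator for the label transition INTRON -> EXON_FRAME_1."""
    return 1.0 if (prev, cur) == ("INTRON", "EXON_FRAME_1") else 0.0


def f_exon_hexamer(prev, cur, i, x):
    """Indicator: previous label is EXON_FRAME_1 and x[i-2 .. i+3] equals ACCGGT."""
    return 1.0 if prev == "EXON_FRAME_1" and i >= 2 and x[i - 2:i + 4] == "ACCGGT" else 0.0


FEATURES = [f_intron_to_exon, f_exon_hexamer]
WEIGHTS = [1.5, 0.5]  # hypothetical weights; in a CRF these are learned


def score(x, y):
    """Unnormalized log-score: sum over positions i of w . f(y[i-1], y[i], i, x)."""
    total = 0.0
    for i in range(1, len(y)):
        for weight, feature in zip(WEIGHTS, FEATURES):
            total += weight * feature(y[i - 1], y[i], i, x)
    return total


x = "AGACCGGT"
y = ["INTRON"] * 3 + ["EXON_FRAME_1"] * 5  # simplified: frame cycling is ignored
print(score(x, y))
```

Because features may look at arbitrary parts of x (several species, SVM outputs, EST evidence), this discriminative form is more flexible than the generative emissions of an HMM.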
CONTRAST – SVM accuracies

[Figure: SVM sensitivity (SN) and specificity (SP) as informants are added.]

  • Accuracy increases as we add informants
  • Diminishing returns after ~5 informants
CONTRAST - Decoding

Viterbi Decoding:

maximize $P(y \mid x)$

Maximum Expected Boundary Accuracy Decoding:

maximize $\sum_{i,B} 1\{(y_{i-1}, y_i) \text{ is exon boundary } B\} \, \mathrm{Accuracy}(y_{i-1}, y_i, B \mid x)$

$\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) = P((y_{i-1}, y_i) \text{ is } B \mid x) - (1 - P((y_{i-1}, y_i) \text{ is } B \mid x))$
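
The accuracy term simplifies to 2P - 1, so a predicted boundary adds to the objective only when its posterior probability exceeds 0.5. The sketch below evaluates the objective for a candidate set of boundaries; the posterior values and boundary keys are hypothetical, and a real decoder would maximize this sum over valid gene structures with dynamic programming rather than scoring a fixed set.

```python
def boundary_gain(posterior):
    """Accuracy term from the slide: P - (1 - P), i.e. 2P - 1."""
    return posterior - (1.0 - posterior)


def expected_boundary_accuracy(predicted, posteriors):
    """Sum of gains over the boundaries a candidate labeling predicts."""
    return sum(boundary_gain(posteriors.get(b, 0.0)) for b in predicted)


# Hypothetical posterior probabilities for two candidate exon boundaries.
posteriors = {(120, "donor"): 0.9, (305, "acceptor"): 0.4}

both = expected_boundary_accuracy([(120, "donor"), (305, "acceptor")], posteriors)
first_only = expected_boundary_accuracy([(120, "donor")], posteriors)
print(both, first_only)
```

In this toy case, dropping the low-confidence boundary raises the objective, which is the behavior that separates this criterion from picking the single most probable labeling.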

CONTRAST - Training

Maximum Conditional Likelihood Training:

maximize $L(w) = P_w(y \mid x)$

Maximum Expected Boundary Accuracy Training:

$\mathrm{ExpectedBoundaryAccuracy}(w) = \sum_i \mathrm{Accuracy}_i$

$\mathrm{Accuracy}_i = \sum_B 1\{(y_{i-1}, y_i) \text{ is exon boundary } B\} \left[ P_w((y_{i-1}, y_i) \text{ is } B \mid x) - \sum_{B' \neq B} P_w((y_{i-1}, y_i) \text{ is exon boundary } B' \mid x) \right]$

Performance Comparison

[Figure: performance results. Species shown: Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken. Panels: De Novo, EST-assisted.]