Gene Prediction


Preliminary Results

Computational Genomics, February 20, 2012

ab initio Gene Prediction

Using Glimmer3, RAST, Prodigal and GeneMarkS

Prodigal
  • Low complexity (no hidden Markov model, no interpolated Markov model).
  • Based on dynamic programming.
  • Remains accurate on high-GC-content genomes.
  • Tends to predict longer genes rather than more genes.
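
As a rough illustration of what "based on dynamic programming" can mean for gene calling, the Python sketch below selects the highest-scoring set of non-overlapping candidate genes. It is not Prodigal's actual algorithm (which, among other things, scores and permits limited overlaps); the function name, coordinates and scores are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

# (start, end, score): 1-based inclusive coordinates plus a hypothetical coding score.
Candidate = Tuple[int, int, float]


def best_gene_set(candidates: List[Candidate]) -> Tuple[float, List[Candidate]]:
    """Return the best total score and the non-overlapping candidates achieving it."""
    cands = sorted(candidates, key=lambda c: c[1])  # order by end position
    ends = [c[1] for c in cands]
    # best[i] = (total score, chosen genes) using only the first i candidates
    best: List[Tuple[float, List[Candidate]]] = [(0.0, [])]
    for i, (start, end, score) in enumerate(cands):
        # j = number of earlier candidates that end before this one starts
        j = bisect_right(ends, start - 1, 0, i)
        take = best[j][0] + score
        if take > best[i][0]:
            best.append((take, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])  # skipping this candidate is at least as good
    return best[-1]


# Hypothetical candidates: two overlapping ORFs compete for the same region.
example = [(1, 300, 5.0), (250, 900, 9.0), (901, 1200, 4.0), (310, 600, 3.0)]
total, genes = best_gene_set(example)
```

Because each candidate is either taken (together with the best solution that ends before it starts) or skipped, the table is filled in a single pass after sorting.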

GeneMarkS

Gene prediction in prokaryotic genomes with unsupervised estimation of model parameters


Command line version

  • Syntax:
  • runGeneMarkS <input_file> <output_folder>
  • The output folder contains three types of files:
  • .out file: contains the default output
  • .faa file: contains the amino acid sequences of the corresponding ORFs in FASTA format
  • .fnn file: contains the nucleotide sequences of the corresponding ORFs in FASTA format
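
The .faa and .fnn outputs are ordinary FASTA files, so they can be loaded with a few lines of Python. The sketch below is a minimal reader under that assumption; the record names in the example are hypothetical and do not reflect the header layout GeneMarkS actually writes.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record name (first word after '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may be wrapped over several lines
    return records


# Hypothetical content of a .faa file.
example = [">gene_1 some description", "MKT", "LLV", ">gene_2", "MSA"]
proteins = read_fasta(example)
```

With a real file, `read_fasta(open(path))` gives the same dictionary, and `len(proteins)` is the number of predicted ORFs.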

Screenshot of the .out file

Strand: + = normal strand, - = reverse strand

Left end: begin position; Right end: end position

Glimmer3
  • A system for finding genes in microbial DNA
  • Works by building a variable-length Markov model from a training set of genes
  • Uses the model to identify all genes in a DNA sequence
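
To make the idea of scoring with a Markov model concrete, here is a minimal Python sketch. It is a simplification: it uses one fixed order k rather than Glimmer3's interpolated, variable-length contexts, and the function names and training sequences are invented for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable

Model = Dict[str, Dict[str, float]]


def train_markov(seqs: Iterable[str], k: int = 2) -> Model:
    """Estimate P(next base | previous k bases) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        seq = seq.upper()
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    model: Model = {}
    for context, nxt in counts.items():
        total = sum(nxt.get(b, 0) for b in "ACGT") + 4
        model[context] = {b: (nxt.get(b, 0) + 1) / total for b in "ACGT"}
    return model


def log_odds(seq: str, model: Model, k: int = 2) -> float:
    """Log-likelihood ratio of seq under the model versus a uniform background."""
    seq = seq.upper()
    score = 0.0
    for i in range(k, len(seq)):
        probs = model.get(seq[i - k:i])
        # Unseen contexts and ambiguous bases fall back to the background 0.25.
        p = probs.get(seq[i], 0.25) if probs else 0.25
        score += math.log(p / 0.25)
    return score


# Hypothetical training set; real training sets are whole genes.
model = train_markov(["ATGGCGGCGGCGGCGTAA"] * 3)
```

A positive score means the sequence looks more like the training genes than like random DNA; Glimmer3 refines this by combining several context lengths.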
Running Glimmer3
  • Two-step process
  • 1. Build a probability model of coding sequences, called an interpolated context model (ICM), from a set of training sequences, which can come from:
    • 1. genes identified by homology, or known genes
    • 2. long, non-overlapping ORFs
    • 3. genes from a highly similar species
  • 2. Run the program to analyze the sequences and make gene predictions
    • Best results require the largest possible training set of genes
Glimmer3 programs
  • Long-orfs uses an amino-acid distribution model to filter the set of orfs
  • Extract builds training set from long, nonoverlappingorfs
  • Build-icm build interpolated context model from training sequences
  • Glimmer3 analyze sequences and make predictions
RAST
  • RAST (Rapid Annotation using Subsystem Technology) is a system for annotating bacterial and archaeal genomes.
  • Pipeline: tRNAscan-SE, Glimmer2, and comparison against prokaryotic genes that are universal across species.

Homology-based Gene Prediction using BLAT

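[Workflow diagram]

  • Query: 1709 protein-coding genes from Haemophilus influenzae
  • Targets: Haemophilus haemolyticus genomes M19107.fasta, M19501.fasta, M21127.fasta, M21621.fasta, M21639.fasta, M21709.fasta
  • Values shown next to the targets on the slide (unlabeled in the transcript): 99, 17, 29, 24, 49, 31
  • BLAT (UCSC) aligns the query genes to each target and writes Output.pslx with the predicted genes
  • Frequency graphs of query coverage (%) are used to define the cutoff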
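
A query-coverage filter over BLAT output can be sketched in Python as follows. The column positions follow the standard PSL layout (qName, qSize, qStart, qEnd in columns 10 to 13); defining coverage as the aligned query span divided by the query length is an assumption, since the slides do not give the exact formula, and the gene names in the usage are hypothetical.

```python
from typing import Dict, Iterable, List


def query_coverage(psl_lines: Iterable[str]) -> Dict[str, float]:
    """Best query coverage (%) per query name from PSL or PSLX alignment lines."""
    best: Dict[str, float] = {}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Header lines are skipped: data rows start with an integer match count.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name = fields[9]
        q_size, q_start, q_end = int(fields[10]), int(fields[11]), int(fields[12])
        coverage = 100.0 * (q_end - q_start) / q_size
        best[q_name] = max(coverage, best.get(q_name, 0.0))
    return best


def genes_passing(coverage: Dict[str, float], cutoff: float = 90.0) -> List[str]:
    """Names of query genes whose best hit reaches the coverage cutoff."""
    return sorted(name for name, value in coverage.items() if value >= cutoff)
```

Running `genes_passing(query_coverage(open("Output.pslx")))` once per target genome would give one predicted-gene list per genome, and a histogram of the coverage values is what a frequency graph for choosing the cutoff would show.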

[Figure: histogram of frequency against query coverage (%), with the cut-off marked]


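Gene Calling Protocol

[Workflow diagram]

  • No. of predicted genes (≥ 90% query coverage), in the order listed on the slide: 787, 1063, 901, 970, 930, 1515
  • Genomes, in the order listed on the slide: M19107, M19501, M21709*, M21127, M21621, M21639
  • Gene scoring system: presence / absence, with score classes ≥ 4/5, = 3/5 and ≤ 2/5 (the slide also shows a "?")
  • Multiple alignment (Muscle)
  • Consensus sequence
  • Final set of homology-based predicted genes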
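
The presence/absence scoring can be sketched in Python as below. The slide gives only the three score classes, so what happens to each class afterwards is not modeled here; the genome and gene names in the example are hypothetical, and five genomes are assumed because the classes are expressed out of 5.

```python
from collections import Counter
from typing import Dict, Set


def presence_counts(calls: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count in how many genomes each gene was predicted."""
    counts: Counter = Counter()
    for genes in calls.values():
        counts.update(genes)  # each genome contributes at most once per gene
    return dict(counts)


def score_class(count: int) -> str:
    """Assign the score class used on the slide (out of five genomes)."""
    if count >= 4:
        return ">=4/5"
    if count == 3:
        return "=3/5"
    return "<=2/5"


# Hypothetical predictions for five genomes.
calls = {
    "g1": {"geneA", "geneB", "geneC"},
    "g2": {"geneA", "geneB"},
    "g3": {"geneA", "geneB", "geneC"},
    "g4": {"geneA"},
    "g5": {"geneA", "geneC"},
}
classes = {gene: score_class(n) for gene, n in presence_counts(calls).items()}
```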


tRNAscan-SE

  • First-pass filters identify "candidate" tRNA regions of the sequence.
  • tRNAscan and EufindtRNA
  • Further analysis confirms the initial tRNA predictions.
  • Cove

Parameters passed

tRNAscan-SE -B <inputfile> -o <outputfile1> -f <outputfile2> -m <outputfile3>

  • -B : search for bacterial tRNAs
  • This option selects the bacterial covariance model for tRNA analysis and loosens the search parameters for EufindtRNA to improve detection of bacterial tRNAs.
  • -o <file> : save final results in <file>
  • Specify this option to write results to <file>.
  • -f <file> : save results and tRNA secondary structures to <file>.
  • -m <file> : save statistics summary for the run
  • Contains the run options selected as well as statistics on the number of tRNAs detected at each phase of the search, search speed, and other statistics.
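
For scripted runs, the same call can be assembled in Python. This is a minimal sketch that uses only the four options described above; the file names are hypothetical, and the input FASTA is placed last, which is the usual position for the sequence file.

```python
import subprocess
from typing import List


def trnascan_argv(fasta: str, results: str, structures: str, stats: str) -> List[str]:
    """Build the tRNAscan-SE argument list for a bacterial search."""
    return [
        "tRNAscan-SE",
        "-B",              # bacterial search mode
        "-o", results,     # final results table
        "-f", structures,  # results plus secondary structures
        "-m", stats,       # run statistics summary
        fasta,
    ]


def run_trnascan(fasta: str, prefix: str) -> None:
    """Run tRNAscan-SE, writing <prefix>.out, <prefix>.struct and <prefix>.stats."""
    argv = trnascan_argv(fasta, f"{prefix}.out", f"{prefix}.struct", f"{prefix}.stats")
    subprocess.run(argv, check=True)
```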
Output using the "-o" parameter

Output using the "-f" parameter

Results

Output using the "-m" parameter


Working

  • RNAmmer works using two levels of hidden Markov models.
  • The spotter model is constructed from highly conserved loci within a structural alignment of known rRNA sequences.
  • Once the spotter model detects the approximate position of a gene, the flanking regions are extracted and passed to the full model, which matches the entire gene.
  • The two-level approach avoids running the full model over the entire genome sequence, which allows faster predictions.
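
The benefit of a two-level search can be illustrated with a toy Python sketch. It is not RNAmmer's implementation: an exact string match stands in for the spotter HMM, an arbitrary scoring function stands in for the full HMM, and all names and sequences are hypothetical.

```python
from typing import Callable, List, Tuple


def two_level_search(
    genome: str,
    spotter: str,
    flank: int,
    full_score: Callable[[str], float],
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Run the expensive scorer only on windows around cheap spotter hits."""
    hits = []
    pos = genome.find(spotter)
    while pos != -1:
        # Extract the spotter hit plus its flanking regions.
        start = max(0, pos - flank)
        end = min(len(genome), pos + len(spotter) + flank)
        score = full_score(genome[start:end])
        if score >= threshold:
            hits.append((start, end, score))
        pos = genome.find(spotter, pos + 1)
    return hits


def gc_fraction(window: str) -> float:
    """Stand-in for the full model: fraction of G and C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)


genome = "AT" * 10 + "GGCCGTACGGCC" + "AT" * 10
hits = two_level_search(genome, "GTAC", flank=4, full_score=gc_fraction, threshold=0.5)
```

Here the expensive scorer sees a single 12-base window rather than the whole 52-base sequence, which is the source of the speed-up.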

Command line options

  • rnammer -S (species) -m (molecules) -xml (xml file) -gff (gff file) -h (hmm report file) -f (fasta file)
  • -S : specifies the species to use. In our case, it will be bacterial.
  • -m : molecules to search for (i.e., large subunit or small subunit).

Results

##gff-version2
##source-version RNAmmer-1.2
##date 2012-02-19
##Type DNA
# seqname source feature start end score +/- frame attribute
# ---------------------------------------------------------------------------------------------------------
84 RNAmmer-1.2 rRNA 28110 31006 3556.4 + . 23s_rRNA
84 RNAmmer-1.2 rRNA 31127 31241 82.9 + . 5s_rRNA
1 RNAmmer-1.2 rRNA 116969 117083 82.9 - . 5s_rRNA
60 RNAmmer-1.2 rRNA 338 452 82.9 + . 5s_rRNA
29 RNAmmer-1.2 rRNA 198 312 82.9 + . 5s_rRNA
84 RNAmmer-1.2 rRNA 25977 27507 1872.9 + . 16s_rRNA
# ---------------------------------------------------------------------------------------------------------
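
A GFF report like this one is easy to tabulate. The Python sketch below is one possible reader: it splits on any whitespace because the transcript does not preserve tabs, and the class and function names are invented for illustration.

```python
from typing import Iterable, List, NamedTuple


class RRNAHit(NamedTuple):
    seqname: str
    start: int
    end: int
    score: float
    strand: str
    kind: str


def parse_rnammer_gff(lines: Iterable[str]) -> List[RRNAHit]:
    """Parse the data rows of an RNAmmer GFF report, skipping '#' comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 9:
            continue  # not a complete 9-column GFF row
        hits.append(
            RRNAHit(fields[0], int(fields[3]), int(fields[4]),
                    float(fields[5]), fields[6], fields[8])
        )
    return hits
```

Grouping the parsed hits by `kind` gives the number of 5S, 16S and 23S genes found on each contig.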

Rfam Database Homology Search
  • A collection of RNA families
    • Non-coding RNA genes
    • Structured cis-regulatory elements
    • Self-splicing RNAs
  • Searches with WU-BLAST and keeps hits with E-value < 1e-5
Rfam Preliminary Results

The output format is: <rfam acc> <rfam id> <seq id> <seq start> <seq end> <strand> <score>

Results:

84 Rfam similarity 25970 27512 1477.28 + . evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria
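
The last column of the result line is a semicolon-separated list of key=value pairs, so the E-value cutoff mentioned earlier can be applied with a small Python helper. This is a sketch with invented function names, assuming every hit carries an `evalue` key as in the line above.

```python
from typing import Dict


def parse_attributes(column: str) -> Dict[str, str]:
    """Split 'key=value;key=value' into a dictionary."""
    pairs = (item.split("=", 1) for item in column.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def passes_evalue(column: str, cutoff: float = 1e-5) -> bool:
    """True when the hit has an E-value strictly below the cutoff."""
    attrs = parse_attributes(column)
    return "evalue" in attrs and float(attrs["evalue"]) < cutoff


attributes = (
    "evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;"
    "model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria"
)
keep = passes_evalue(attributes)
```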


Things to be done

  • Get GenePRIMP to work: we are having problems with the installation, and the web server takes a long time to process.
  • Get the further information required to run other RNA prediction software.
  • Compare specific RNA prediction software with the Rfam predictions.
Leading Biocomputational Tools
  • eQRNA (Rivas and Eddy 2001)
  • RNAz (Washietl et al. 2005; Gruber et al. 2010)
  • sRNAPredict3/SIPHT (Livny et al. 2006, 2008)
  • NAPP (Marchais et al. 2009)

All four approaches use comparative genomics!!

Lu, X., H. Goodrich-Blair, et al. (2011). "Assessing computational tools for the discovery of small RNA genes in bacteria." RNA 17(9): 1635-1647.