Gene Prediction

Gene Prediction. Computational Genomics February 6, 2012. OUTLINE. Background - Gene prediction - Protein Coding Sequences - Gene structure and ORF - Prokaryotic Gene Model - Biology of Haemophilus haemolyticus 2. Gene Prediction Approaches -Ab Initio Gene Prediction

Presentation Transcript


  1. Gene Prediction. Computational Genomics, February 6, 2012

  2. OUTLINE • 1. Background • - Gene prediction • - Protein coding sequences • - Gene structure and ORF • - Prokaryotic gene model • - Biology of Haemophilus haemolyticus • 2. Gene Prediction Approaches • - Ab initio gene prediction • - Homology-based gene prediction • - RNA gene prediction • 3. Gene Prediction Improvement • 4. Strategy

  3. What is Gene Prediction? • Finding the DNA sequences that encode proteins (protein-coding genes), as well as RNA genes and other functional elements such as regulatory regions. • Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

  4. Why develop gene finders? • Technological improvements in high-throughput DNA sequencing are rapidly increasing the public availability of prokaryotic and eukaryotic genomes. • As of May 2010, GOLD reported 1,072 complete published bacterial genomes, and 4,289 bacterial genome projects were known to be ongoing (www.genomesonline.org).

  5. Almost 2,000 genomes had been completely sequenced by 2011. Sequencing projects are growing exponentially.

  6. Bacterial genomes are typically sequenced for one of two reasons: the organism is highly virulent to humans, animals, or plants, or it can be applied to bioremediation or bioenergy production.

  7. Extracting knowledge from data • The growing amount of nucleotide sequence data requires the concurrent development of adequate bioinformatics tools for a comprehensive understanding of the genetic information the sequences encode, as well as of the underlying biology.

  8. What is a Gene? • A gene is an elementary unit of heredity that is indivisible in the functional sense. • A gene codes for a discrete functional macromolecule (protein) or a functional RNA. • This definition does not work for alternatively processed transcription units. • A gene is a linear collection of exons that are incorporated into a specific mRNA.

  9. Prokaryotic Gene Model: ORF genes • Small genomes, high gene density • - H. influenzae genome is 85% genic • Operons • - One transcript, many genes • No introns • - One gene, one protein • Open Reading Frames • - One ORF per gene • - ORF with start and stop codons

  10. Prokaryotic Gene Structure Eukaryotic Gene Structure

  11. Haemophilus haemolyticus: what do we know about our target system? • Gram-negative bacterium • Facultative anaerobe • Shape: coccobacilli • Emerging pathogen closely related to H. influenzae

  12. H. haemolyticus is most closely related to H. influenzae • 16S rRNA gene • infB gene • Multilocus Sequence Analysis (MLSA)

  13. Why study Haemophilus haemolyticus? 1. Genetic diversity 2. Emerging pathogen 3. Intrinsic biological value

  14. How does gene prediction work?

  15. Gene Prediction Methods

  16. Open Reading Frames • ORF (Open Reading Frame): a sequence delimited by an in-frame AUG and a stop codon, which in turn defines a putative amino acid sequence. • A simple first step in gene finding. • Translate the genomic sequence in six frames and identify the stop codons in each frame. • Regions without stop codons are ORFs. • The longest ORF starting from a Met codon is a good prediction of a protein-coding sequence.
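
The six-frame scan described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular gene finder: it assumes a DNA string over A/C/G/T, uses ATG as the only start codon (the DNA form of AUG), and the function names and the `min_len` cutoff are hypothetical.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a production gene finder).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # DNA equivalent of the AUG start codon

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for each ATG...stop ORF in one reading frame.

    `end` is exclusive and includes the stop codon. Only the first ATG after
    the previous stop is used, which gives the longest ORF in that open region.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def find_orfs(seq, min_len=30):
    """Scan all six frames; return (strand, frame, start, end, orf), longest first.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if end - start >= min_len:
                    results.append((strand, frame, start, end, s[start:end]))
    return sorted(results, key=lambda r: r[3] - r[2], reverse=True)
```

For example, `find_orfs("CCATGAAATTTGGGTAACC", min_len=9)` reports a single ORF, `ATGAAATTTGGGTAA`, in frame 2 of the forward strand. Real prokaryotic genomes also use alternative start codons, which this sketch ignores.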

  17. ORF Scanning • Use only sequence information. • Identify coding exons. • Integrate coding statistics to differentiate between coding and non-coding regions (real exons are expected to show codon bias). • Calculate the likelihood that a triplet is in a coding region. • *Works relatively well for prokaryotic genomes, where the non-coding component is small and there are no introns.
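
One simple way to turn codon bias into a coding score is a log-odds ratio between codon frequencies observed in known genes and a background model. The sketch below is an assumption-laden illustration of that idea, not the statistic used by any specific tool: the function names, the add-one pseudocount, and the uniform background are all hypothetical choices.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]


def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame sequences, with smoothing."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_log_odds(seq, coding_freq, background_freq):
    """Sum log2(P_coding / P_background) over the in-frame codons of seq.

    Positive totals suggest the region looks more like the coding model.
    Codons containing characters outside A/C/G/T are skipped.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            score += math.log2(coding_freq[codon] / background_freq[codon])
    return score


# Toy usage: a tiny "training set" and a uniform background (both made up).
coding = codon_frequencies(["GCTGCTGCTGAA"])
background = {c: 1.0 / 64 for c in ALL_CODONS}
```

With these toy models, a region rich in GCT scores above zero and a region of codons never seen in training scores below zero. In practice the coding model would be estimated from many known genes of the target organism.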

  18. Predicting Prokaryotic Protein-Coding Genes • Gene prediction is easier and more accurate in prokaryotes than in eukaryotes because prokaryotic gene structure is much simpler. • The principal difficulties are: • detection of the initiation site (AUG) • alternative start codons • gene overlap • undetected small proteins • In spite of these difficulties, prokaryotic gene prediction can reach 99% accuracy.

  19. Protein Coding Methods

  20. Finding Genes in Prokaryotic DNA

  21. Ab initio methods • Intrinsic Gene Prediction Method. • Inspect the input sequence and search for traces of gene presence. • Extract information on gene locations using statistical patterns inside and outside gene regions as well as patterns typical of the gene boundaries. • ab initio algorithms implement intelligent methods to represent these patterns as a model of the gene structure in the organism.

  22. Markov model based tools • Several highly accurate prokaryotic gene-finding methods are based on Markov model algorithms.

  23. What are Hidden Markov Models? • Hidden Markov models (HMMs) are discrete Markov processes where every state generates an observation at each time step. • A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. [wiki]

  24. Markov Model (Discrete Markov Process) • A discrete Markov process is a sequence of random variables q1,…,qt that take values in a discrete set S={s1,…,sN} and for which the Markov property holds. • Markov property: P(q_t | q_1, …, q_{t-1}) = P(q_t | q_{t-1}) • Parameters • Initial state probabilities: πi = P(q_1 = si) • State transition probabilities: aij = P(q_t = sj | q_{t-1} = si)

  25. From Markov Model to HMM • HMMs are discrete Markov processes in which each state also emits an observation according to some probability distribution, so we need to augment our model with emission probabilities. • Parameters • Initial state probabilities: πi • State transition probabilities: aij • Emission probabilities: ei(k)
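
To make the three parameter sets concrete, the sketch below stores πi, aij, and ei(k) as plain dictionaries and computes the joint probability of a state path and an observation sequence. The two states and all numbers are made up for illustration; only the roles of `pi`, `a`, and `e` come from the slide.

```python
import math

# Hypothetical two-state model: "C" (coding) and "N" (non-coding).
pi = {"C": 0.5, "N": 0.5}                       # initial state probabilities
a = {"C": {"C": 0.9, "N": 0.1},                 # state transition probabilities
     "N": {"C": 0.1, "N": 0.9}}
e = {"C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},   # emission probabilities
     "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def joint_log_prob(states, observations, pi, a, e):
    """Return log P(Q, X) for a state path Q and observations X of equal length.

    P(Q, X) = pi[q1] * e[q1][x1] * prod over t>1 of a[q(t-1)][qt] * e[qt][xt]
    """
    logp = math.log(pi[states[0]]) + math.log(e[states[0]][observations[0]])
    for t in range(1, len(states)):
        logp += math.log(a[states[t - 1]][states[t]])
        logp += math.log(e[states[t]][observations[t]])
    return logp
```

For instance, the path C, C emitting G, C has probability 0.5 × 0.3 × 0.9 × 0.3 = 0.0405 under these toy parameters. Working in log space avoids underflow on long sequences.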

  26. HMM Example – Agnostic Drink Stand (1/2)

  27. HMM Example – Agnostic Drink Stand (2/2) • Suppose we observed the following sequence: Vodka, Vodka, Coke, Vodka, Vodka, Vodka, Water, Water, Water, Water, Coke, Water, Coke, Coke, Water, Coke, Coke, Water, Coke, Coke, Coke, Vodka, Coke, Water, Vodka, Coke • How might we infer the hidden states? • A possible labeling: Vodka, Vodka, Coke, Vodka, Vodka, Vodka, Water, Water, Water, Water, Coke, Water, Coke, Coke, Water, Coke, Coke, Water, Coke, Coke, Coke, Vodka, Coke, Water, Vodka, Coke

  28. HMM Example in Sequencing Analysis

  29. HMM and Observation Sequence are Known • Given HMM parameters θ and an observation sequence X1:T, which state sequence Q1:T best explains the observations? Q* = argmax_Q P(Q | X, θ) • Viterbi algorithm
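
The Viterbi algorithm answers this question by dynamic programming. The following is a minimal log-space sketch under stated assumptions: all probabilities are strictly positive (so the logarithms are defined), and the two-state toy model (`TOY_*` names, GC-rich "gene" state, AT-rich "intergenic" state) is invented for illustration and is not a trained gene model.

```python
import math


def viterbi(observations, states, pi, a, e):
    """Return (best_path, log_prob) maximizing P(Q, X | theta).

    Because P(X | theta) does not depend on Q, the same path also
    maximizes P(Q | X, theta).
    """
    # v[t][s]: best log-probability of any path that ends in state s at time t
    v = [{s: math.log(pi[s]) + math.log(e[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        v.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, v[t - 1][p] + math.log(a[p][s])) for p in states),
                key=lambda item: item[1],
            )
            v[t][s] = score + math.log(e[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, v[-1][last]


# Toy model (made-up numbers).
TOY_STATES = ["gene", "intergenic"]
TOY_PI = {"gene": 0.5, "intergenic": 0.5}
TOY_A = {"gene": {"gene": 0.9, "intergenic": 0.1},
         "intergenic": {"gene": 0.1, "intergenic": 0.9}}
TOY_E = {"gene": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
         "intergenic": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
```

Calling `viterbi("AATTGGCC", TOY_STATES, TOY_PI, TOY_A, TOY_E)` labels the first four positions as intergenic and the last four as gene: the single state switch costs less than emitting four unlikely symbols.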

  30. How Do We Get HMM Parameters? • Train the HMM from labeled sequences
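
When every position of the training sequences carries a known state label, the parameters can be estimated by simple counting (maximum likelihood, here with a pseudocount to avoid zero probabilities). The sketch below assumes that setting; the function name, the pseudocount, and the one-letter state labels in the usage note are hypothetical.

```python
def train_hmm(labeled_sequences, states, symbols, pseudocount=1.0):
    """Estimate (pi, a, e) from labeled data by counting.

    labeled_sequences: iterable of (observations, state_labels) pairs,
    where both items have the same length.
    """
    init = {s: pseudocount for s in states}
    trans = {s: {t: pseudocount for t in states} for s in states}
    emit = {s: {x: pseudocount for x in symbols} for s in states}

    for obs, labels in labeled_sequences:
        init[labels[0]] += 1
        for t, (x, q) in enumerate(zip(obs, labels)):
            emit[q][x] += 1
            if t > 0:
                trans[labels[t - 1]][q] += 1

    def normalize(row):
        total = sum(row.values())
        return {k: v / total for k, v in row.items()}

    pi = normalize(init)
    a = {s: normalize(trans[s]) for s in states}
    e = {s: normalize(emit[s]) for s in states}
    return pi, a, e
```

As a usage example, `train_hmm([("AAGG", "NNCC")], states="NC", symbols="ACGT")` treats "N" and "C" as non-coding and coding labels and returns smoothed estimates such as a 0.5 probability of emitting G from state C.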

  31. Design an HMM for Gene Prediction • The number of states in the model • Start codon • Stop codon • Intragenic codon • Intergenic region • The number of distinct observation symbols per state • State transition probability distribution • Observation symbol probability distribution • Initial state distribution • N-order Markov Model

  32. Ab Initio Gene Prediction Software • GeneMark.hmm

  33. Ab Initio Gene Prediction Software • GeneMarkS

  34. Ab Initio Gene Prediction Software • EasyGene

  35. Limitations of Current Methods • HMM has local averaging effect • Training process is slow and is case-sensitive • Algorithms are trained with sequences from known genes (overfitting problem) • MLE + Viterbi is not optimal (several tools have used the scaling factor to tweak the performance) • Overlapping genes

  36. Comparison of the Gene Finders

  37. Homology-based methods • Tools: • BLAST • SGP2 • BLAT • Advantages: • Simplest approach. • Characterized by high accuracy. • Helps find the gene loci and also annotates the region. • Disadvantages: • Requires huge amounts of extrinsic data and finds only about half of the genes; many genes still have no significant homology to known genes. • Steps • Similarity search against the database • Multiple sequence alignment

  38. Searching against the Database • Steps • Use a heuristic (approximate) algorithm to discard most irrelevant sequences. • Perform the exact algorithm (Smith-Waterman) on the small group of remaining sequences. • Representative algorithms • FASTA (Lipman & Pearson 1985): the first fast sequence searching algorithm for comparing a query sequence against a database • BLAST: Basic Local Alignment Search Tool (Altschul et al 1990) • Gapped BLAST (Altschul et al 1997)

  39. FASTA and BLAST • First, identify very short (almost) exact matches. • Next, the best short hits from the 1st step are extended to longer regions of similarity. • Finally, the best hits are optimized using the Smith-Waterman algorithm.
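
The final optimization step relies on Smith-Waterman local alignment. A compact sketch of that dynamic program is given below, assuming a simple match/mismatch score and a linear gap penalty; real searches use a substitution matrix and affine gaps, and the default scores here are arbitrary.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; cells are clamped at 0 so alignments can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + sub, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `smith_waterman("AAAGTWYCCC", "TTTGTWYGGG")` finds the shared segment `GTWY` with score 8. The full matrix costs time and memory proportional to the product of the two lengths, which is why heuristics are used to narrow the search first.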

  40. FASTA • Find runs of identities. • Score the runs and discard low-scoring ones. • Eliminate segments unlikely to be part of the alignment; apply banded Smith-Waterman to calculate the opt score.

  41. BLAST • As sensitive as FASTA but much faster • Confine attention to segment pairs that contain a word pair of length w with a score of at least T • Phase 1: Compile a list of word pairs above threshold • Phase 2: Scan the database for matching word hits • Phase 3: Extend the hits

  42. BLAST Phase 1: List of Word Pairs • Compile a list of word pairs (w=3) above threshold T = 15 • Example: a query sequence …FSGTWYA…
      Query words (w=3):    FSG  SGT  GTW  TWY  WYA
      Neighborhood words:   YSG  TGT  ATW  SWY  WFA
                            FTG  SVT  GSW  TWF  WYS
                                      NTW
      Scores for candidate neighbors of the query word GTW:
      Word   Residue scores   Total
      GTW    6, 5, 11         22     above threshold
      GSW    6, 1, 11         18     above threshold
      ATW    0, 5, 11         16     above threshold
      NTW    0, 5, 11         16     above threshold
      GTY    6, 5, 2          13     below threshold (T=15)
      GTM    6, 5, -1         10     below threshold (T=15)
      DAW    -1, 0, 11        10     below threshold (T=15)
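
The word-list step can be reproduced with a short script. The sketch below uses only the residue-pair scores that appear in this slide's worked example, so `SCORES` is a deliberately partial table rather than a full substitution matrix; the function names are hypothetical.

```python
# Residue-pair scores copied from the slide's example: (query residue, candidate residue).
# This is a partial table; a real search would use a complete substitution matrix.
SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0,
    ("W", "Y"): 2, ("W", "M"): -1, ("G", "D"): -1, ("T", "A"): 0,
}


def query_words(query, w=3):
    """Return all overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(query_word, candidate):
    """Sum the position-by-position substitution scores of two words."""
    return sum(SCORES[(q, c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep candidates whose score against the query word is at least the threshold."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]
```

Running `neighborhood("GTW", ["GTW", "GSW", "ATW", "NTW", "GTY", "GTM", "DAW"], 15)` keeps GTW (22), GSW (18), ATW (16), and NTW (16) and drops the three words that score below 15, matching the slide.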

  43. BLAST Phase 3: Extend the Hit • When you manage to find a hit (i.e. a match between a “word” and a database entry), extend the hit in both directions. • Keep track of the score (use a scoring matrix). Stop when the score drops below some cutoff value X. • The resulting alignments are High-scoring Segment Pairs (HSPs). • Example (the hit GTW is extended in both directions):
      Query sequence:        KENFDKARFSGTWYAMAKKDPEG
      Hit in the database:   MKGLDIQKVAGTWYSLAMAASD
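
A minimal sketch of ungapped hit extension follows. It makes two assumptions that are not stated on the slide: the cutoff X is interpreted as an "X-drop" rule (stop once the running score falls more than X below the best score seen so far), and scoring is a simple match/mismatch scheme instead of a substitution matrix. All names and default values are hypothetical.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=5, mismatch=-4):
    """Extend rightwards from aligned positions; return (best_gain, best_length)."""
    score = best = 0
    best_len = 0
    k = 0
    while q_start + k < len(query) and s_start + k < len(subject):
        score += match if query[q_start + k] == subject[s_start + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, word_len, x_drop=10, match=5, mismatch=-4):
    """Extend a word hit in both directions without gaps.

    Returns (score, q_start, s_start, length) of the resulting segment pair.
    """
    seed = sum(
        match if query[q_pos + i] == subject[s_pos + i] else mismatch
        for i in range(word_len)
    )
    right_gain, right_len = extend_right(
        query, subject, q_pos + word_len, s_pos + word_len, x_drop, match, mismatch
    )
    # Leftward extension is a rightward extension over the reversed prefixes.
    left_gain, left_len = extend_right(
        query[:q_pos][::-1], subject[:s_pos][::-1], 0, 0, x_drop, match, mismatch
    )
    return (
        seed + left_gain + right_gain,
        q_pos - left_len,
        s_pos - left_len,
        left_len + word_len + right_len,
    )
```

With the slide's two sequences and the word hit GTW at position 10 in both, `extend_hit(query, subject, 10, 10, 3)` grows the hit to the four-residue segment GTWY under this toy scoring. A substitution matrix that rewards similar residues would generally extend it further.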

  44. Gapped BLAST • Try to connect HSPs by aligning the sequences in between them • The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected together to one alignment THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD INVIEIAMDEADMEATTNAMHEW___ASNINETEEN

  45. How to Interpret BLAST Results • E-value: the expected number of alignments with score at least S, i.e. the number of database hits you expect to find by chance • E = K·m·n·e^(−λS) • m = length of query; n = length of database; S = score of the alignment • K, λ: statistical parameters dependent upon the scoring system and background residue frequencies • E increases linearly with the length of the query sequence and the size of the database • E decreases exponentially with the score of the alignment
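
The dependence of the E-value on query length, database size, and score can be checked numerically. The sketch below implements E = K·m·n·e^(−λS) directly; the values of `K` and `lam` in the usage note are placeholders chosen for illustration, not the parameters of any real scoring system.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`.

    m: query length, n: database length,
    K, lam: statistical parameters of the scoring system.
    """
    return K * m * n * math.exp(-lam * score)
```

For example, with the placeholder values `K=0.1` and `lam=0.3`, doubling the database length `n` doubles the E-value, while raising the score by `1 / lam` divides it by e.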

  46. From E-value to P-value • P-value: the probability of obtaining a score greater than a given score S at random: P(S' > S) = 1 − e^(−E), which is approximately E when E is small • Very small E-values are very similar to P-values. However, E-values of about 1 to 10 are far easier to interpret than the corresponding P-values.
      E-value   P-value
      10        0.99995460
      5         0.99326205
      2         0.86466472
      1         0.63212056
      0.1       0.09516258 (about 0.1)
      0.05      0.04877058 (about 0.05)
      0.001     0.00099950 (about 0.001)
      0.0001    0.0001000
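
The conversion can be verified with a one-line function. This is a small sketch; the function name is hypothetical, and `math.expm1` is used only because it stays accurate when E is very small.

```python
import math


def p_value_from_e(e):
    """Return P = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-e)


# Print the conversion for the E-values listed on the slide.
for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8} {p_value_from_e(e):.8f}")
```

The printed values show P approaching E as E shrinks, and saturating near 1 once E exceeds a few units.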

  47. BLAST and BLAST-like programs • Traditional BLAST (formerly blastall): nucleotide, protein, translations • blastn: nucleotide query vs. nucleotide database • blastp: protein query vs. protein database • blastx: nucleotide query vs. protein database • tblastn: protein query vs. translated nucleotide database • tblastx: translated query vs. translated database • Megablast: nucleotide only • Contiguous megablast: nearly identical sequences • Discontiguous megablast: cross-species comparison • Position-Specific BLAST programs: protein only • Position-Specific Iterative BLAST (PSI-BLAST): automatically generates a position-specific score matrix (PSSM) • Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs

  48. Multiple Sequence Alignment • Smith-Waterman algorithm
