Gene Finding



  1. Gene Finding Charles Yan

  2. Gene Finding Content sensors • Extrinsic content sensors • Intrinsic content sensors Signal sensors • Splice site prediction • Promoter prediction • Poly(A) sites prediction • Translation initiation codon prediction Combining the evidence to predict gene structures

  3. Combining Evidence Since 1990, programs are no longer limited to searching for independent exons, but try instead to identify the whole complex structure of a gene. • Given a sequence and using signal sensors, one can accumulate evidence on the occurrence of signals: translation starts and stops and splice sites are the most important ones, since they define the boundaries of coding regions.

  4. Combining Evidence In theory, each consistent pair of detected signals defines a potential gene region (intron, exon or coding part of an exon). If one considers that all these potential gene regions can be used to build a gene model, the number of potential gene models grows exponentially with the number of predicted exons.

  5. Combining Evidence In practice, this is slightly reduced by the fact that 'correct' gene structures must satisfy a set of properties: • There are no overlapping exons • Coding exons must be frame compatible • Merging two successive coding exons will not generate an in-frame stop at the junction
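The three properties above can be checked pairwise on candidate exons. The Python sketch below is one possible illustration, not the method of any particular program. The `Exon` tuple, its `phase` field and the `compatible` function are hypothetical names; coordinates are assumed to be 0-based and half-open, and the sequence is assumed to be on the forward strand.

```python
from typing import NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


class Exon(NamedTuple):
    """Hypothetical candidate coding exon (assumed representation)."""
    start: int  # 0-based, inclusive
    end: int    # exclusive
    phase: int  # bases at the exon start that complete a codon begun upstream


def leftover(exon: Exon) -> int:
    """Bases at the exon end that begin a codon completed downstream."""
    return (exon.end - exon.start - exon.phase) % 3


def compatible(seq: str, a: Exon, b: Exon) -> bool:
    """True if exon b may directly follow exon a in a gene model."""
    # Property 1: no overlapping exons.
    if a.end > b.start:
        return False
    # Property 2: frame compatibility across the junction.
    carry = leftover(a)
    if b.phase != (3 - carry) % 3:
        return False
    # Property 3: the codon spanning the junction must not be a stop.
    if carry:
        junction = seq[a.end - carry:a.end] + seq[b.start:b.start + b.phase]
        if junction in STOP_CODONS:
            return False
    return True


# Toy sequences: an 8 bp exon, an 11 bp intron, then a 6 bp exon.
seq_ok = "ATGGCCTG" + "GTAAGTTTCAG" + "CGCTTT"    # junction codon TGC
seq_stop = "ATGGCCTG" + "GTAAGTTTCAG" + "AGCTTT"  # junction codon TGA
first, second = Exon(0, 8, 0), Exon(19, 25, 1)
print(compatible(seq_ok, first, second))    # True
print(compatible(seq_stop, first, second))  # False
```

Because each test involves only two neighbouring exons, the checks can serve as the edge condition of an exon graph rather than being run on whole gene models.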

  6. Combining Evidence The number of candidates remains, however, exponential. In almost all existing approaches, such an exponential number is coped with in reasonable time by using dynamic programming techniques.

  7. Combining Evidence • Extrinsic approaches • The content of exon/intron regions is assessed using extrinsic sensors • Intrinsic approaches • The content of exon/intron regions is assessed using intrinsic sensors • Integrated approaches • Combine evidence coming from both intrinsic and extrinsic content sensors

  8. Extrinsic Approaches The principle of most of these programs is to combine similarity information with signal information obtained by signal sensors.

  9. Extrinsic Approaches Very briefly, all the programs in this class may be seen as sophistications of the traditional Smith-Waterman local alignment algorithm where the existence of a signal allows for the opening (donor) or closure (acceptor) of a gap with an essentially free extension cost. They are often referred to as 'spliced alignment' programs.

  10. Extrinsic Approaches • Existing software may be further divided according to the type of similarity exploited: genomic DNA/protein, genomic DNA/cDNA or genomic DNA/genomic DNA. • Some of these methods are able to deal with more than one type and to take into account possible frameshifts in the genomic DNA or cDNA sequences.

  11. Extrinsic Approaches Procrustes • Aligns a genomic sequence with a protein. • Considers all potential exons from the query DNA sequence, initially with the only constraint that they must be bordered by donor and acceptor sites. • All possible exon assemblies are explored by translating the exons and aligning them with the target protein. • Other programs performing the same task are GeneWise, PredictGenes, ORFgene and ALN.

  12. Extrinsic Approaches Some programs, like INFO and ICE, use a dictionary-based approach: they first create dictionaries of k-long segments from a protein or an EST database and then, using a look-up procedure, find all segments in the query DNA sequence having a match in the dictionary.
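The dictionary idea can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the actual implementation of INFO or ICE: the database is a small dict of nucleotide (EST-like) sequences, only exact matches are reported, and the names `build_dictionary` and `lookup` are hypothetical. A protein dictionary would additionally require translating the query in all reading frames, which is omitted here.

```python
from collections import defaultdict


def build_dictionary(database, k):
    """Map every k-long segment to the (entry name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def lookup(query, index, k):
    """Return (query offset, entry name, entry offset) for each exact k-mer match."""
    hits = []
    for i in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for name, j in index.get(query[i:i + k], ()):
            hits.append((i, name, j))
    return hits


# Toy data (invented for illustration only).
ests = {"est1": "ACGTACGGA", "est2": "TTGACGGAT"}
index = build_dictionary(ests, k=5)
print(lookup("CCACGGATT", index, k=5))
# [(2, 'est1', 4), (2, 'est2', 3), (3, 'est2', 4)]
```

Building the dictionary once makes each query look-up proportional to the query length, which is the appeal of this approach over aligning against every database entry.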

  13. Combining Evidence • Extrinsic approaches • The content of exon/intron regions is assessed using extrinsic sensors • Intrinsic approaches • The content of exon/intron regions is assessed using intrinsic sensors • Integrated approaches • Combine evidence coming from both intrinsic and extrinsic content sensors

  14. Intrinsic approaches • In the exon-based category, the gene assembly is separated from the coding segment prediction step. The goal is to find the highest scoring genes, the gene score being a simple function (usually the sum) of the scores of the assembled segments. The segment assembly process can be defined as the search for an optimal path in a directed acyclic graph where vertices represent exons and edges represent compatibility between exons. This is the approach adopted by the GeneId, GenView2, GAP3, FGENE and DAGGER programs.
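The optimal-path formulation can be illustrated with a short dynamic programming sketch in Python. It is an assumption-laden toy, not the algorithm of GeneId or any other listed program: exons are hypothetical `(start, end, score)` tuples, the gene score is the sum of exon scores, and the `compatible` predicate passed in here only requires that one exon ends before the next begins (real programs also check frame and signals).

```python
def best_assembly(exons, compatible):
    """Highest-scoring chain of pairwise-compatible exons.

    exons: list of (start, end, score) tuples.
    compatible(a, b): True if exon b may directly follow exon a; it is
    assumed to imply that a ends before b starts, so sorting by end
    coordinate yields a topological order of the exon graph.
    Returns (total score, list of exons on the best path).
    """
    order = sorted(exons, key=lambda e: e[1])
    best = []  # best[i] = (score of best path ending at order[i], predecessor index)
    for i, exon in enumerate(order):
        score, prev = exon[2], None
        for j in range(i):
            candidate = best[j][0] + exon[2]
            if compatible(order[j], exon) and candidate > score:
                score, prev = candidate, j
        best.append((score, prev))
    if not best:
        return 0.0, []
    i = max(range(len(best)), key=lambda n: best[n][0])
    total, path = best[i][0], []
    while i is not None:  # trace predecessors back to the first exon
        path.append(order[i])
        i = best[i][1]
    return total, path[::-1]


def no_overlap(a, b):
    """Toy compatibility rule: a must end strictly before b starts."""
    return a[1] < b[0]


candidates = [(0, 10, 2.0), (5, 20, 3.5), (12, 30, 2.5), (25, 40, 1.0), (32, 50, 3.0)]
print(best_assembly(candidates, no_overlap))
# (7.5, [(0, 10, 2.0), (12, 30, 2.5), (32, 50, 3.0)])
```

The double loop visits each pair of exons once, so the search is quadratic in the number of candidate exons even though the number of possible assemblies is exponential.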

  15. Intrinsic approaches • In the signal-based methods, the gene assembly is produced directly from the set of detected signals.

  16. Intrinsic approaches To efficiently deal with the exponential number of possible gene structures defined by potential signals, almost all intrinsic gene finders use dynamic programming (DP) to identify the most likely gene structures according to the evidence defined by both content and signal sensors.

  17. Integrated Approaches • Combine both intrinsic and extrinsic evidence. • Combine the predictions of several programs in order to obtain a sort of consensus.

  18. Gene Finding Content sensors • Extrinsic content sensors • Intrinsic content sensors Signal sensors • Splice site prediction • Promoter prediction • Poly(A) sites prediction • Translation initiation codon prediction Combining the evidence to predict gene structures

  19. Pitfalls and Issues Several issues make the problem of eukaryotic gene finding extremely difficult. • Very long genes: for example, the largest human gene, the dystrophin gene, is composed of 79 exons spanning nearly 2.3 Mb. • Very long introns: again, in the human dystrophin gene, some introns are >100 kb long and >99% of the gene is composed of introns.

  20. Pitfalls and Issues • Very conserved introns: this is particularly a problem when gene prediction is addressed through similarity searches.

  21. Pitfalls and Issues • Very short exons: some exons are only 3 bp long in Arabidopsis genes and probably even 1 bp for the coding part of exons at either end of the coding sequence, meaning that start or stop codons can be interrupted by an intron. Such small exons are easily missed by all content sensors, especially if bordered by large introns. The more difficult cases are those where the length of a coding exon is a multiple of three (typically 3, 6 or 9 bp long), because missing such exons will not cause a problem in the exon assembly as they do not introduce any change in the frame.
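The frame argument is simple modular arithmetic, shown here as a small Python sketch. The function names and exon lengths are invented for illustration: dropping an exon whose length is a multiple of three leaves the downstream frame unchanged, so the assembly step has no frame inconsistency to flag.

```python
def frame_after(exon_lengths):
    """Reading-frame offset (0, 1 or 2) reached after the given coding exons."""
    return sum(exon_lengths) % 3


def missing_exon_changes_frame(exon_lengths, index):
    """True if omitting the exon at `index` shifts the downstream frame."""
    without = exon_lengths[:index] + exon_lengths[index + 1:]
    return frame_after(exon_lengths) != frame_after(without)


# A 6 bp exon can be dropped silently; a 4 bp exon cannot.
print(missing_exon_changes_frame([120, 6, 87], 1))  # False
print(missing_exon_changes_frame([120, 4, 87], 1))  # True
```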

  22. Pitfalls and Issues • Overlapping genes: though very rare in eukaryotic genomes, there are some documented cases in animals as well as in plants. • Polycistronic gene arrangement: one gene, and one mRNA, but two or more proteins.

  23. Pitfalls and Issues • Frameshifts: some sequences stored in databases may contain errors (either sequencing errors or simply errors made when editing the sequence) resulting in the introduction of artificial frameshifts (deletion or insertion of one base). Such frameshifts greatly increase the difficulty of the computational gene finding problem by producing erroneous statistics and masking true solutions.
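The effect of a single-base deletion can be seen by translating a toy coding sequence before and after the error. The Python sketch below uses the standard genetic code; the sequence and the `translate` helper are made up for illustration and do not come from any database or library.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate in frame 0; a trailing incomplete codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = "ATGGCCAAGTTTGGATAA"
deleted = original[:4] + original[5:]  # artificial deletion of one base

print(translate(original))  # MAKFG*
print(translate(deleted))   # MASLD
```

After the deletion, every downstream codon changes and the stop codon is no longer in frame, which is why coding statistics computed on such a sequence become unreliable.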

  24. Pitfalls and Issues • Introns in non-coding regions: there are genes for which the genomic region corresponding to the 5'- and/or 3'-UTR in the mature mRNA is interrupted by one or more intron(s). • Alternative transcription start: e.g. three alternative promoters regulate the transcription of the 14 kb full-length dystrophin mRNAs and four 'intragenic' promoters control that of smaller isoforms.

  25. Pitfalls and Issues • Alternative splicing. • Alternative polyadenylation: 20% of human transcripts show evidence of alternative polyadenylation.

  26. Pitfalls and Issues • Alternative initiation of translation: finding the right AUG initiator is still a major concern for gene prediction methods. The rule stating that the first AUG in the mRNA is the initiator codon can be escaped through three mechanisms: context-dependent leaky scanning, re-initiation and direct internal initiation. • Non-AUG triplets can sometimes act as the functional codon for translation initiation, such as ACG in Arabidopsis or CUG in human sequences.
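To see why the "first AUG" rule is insufficient on its own, one can enumerate every candidate initiation codon in an mRNA together with the length of the open reading frame it opens. The Python sketch below is purely illustrative: `candidate_starts` and the toy mRNA are hypothetical, and the code does not model leaky scanning, re-initiation or internal initiation. It only lists the alternatives a predictor has to choose between.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def candidate_starts(mrna, start_codons=("AUG",)):
    """List (position, codon, ORF length in codons) for each candidate start.

    The ORF length counts codons from the start codon up to, but not
    including, the first in-frame stop (or the end of the sequence).
    """
    found = []
    for i in range(len(mrna) - 2):
        codon = mrna[i:i + 3]
        if codon not in start_codons:
            continue
        n = 0
        for j in range(i, len(mrna) - 2, 3):
            if mrna[j:j + 3] in STOP_CODONS:
                break
            n += 1
        found.append((i, codon, n))
    return found


toy_mrna = "GGAUGUAACCAUGGCUCUGAAAUAA"  # invented sequence
print(candidate_starts(toy_mrna))
# [(2, 'AUG', 1), (10, 'AUG', 4)]
print(candidate_starts(toy_mrna, start_codons=("AUG", "CUG")))
# [(2, 'AUG', 1), (10, 'AUG', 4), (16, 'CUG', 2)]
```

In this toy case the first AUG is followed immediately by a stop codon, so the longer reading frame starts at the second AUG; allowing non-AUG initiators such as CUG adds further candidates.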
