Lecture 12

Bioinformatics Lecture 12
• Splicing and gene prediction in eukaryotes
• Critical splice signals
• Coding statistics: DNA differences between exons and introns
• Discriminant function and combined approach

Splicing and gene prediction in eukaryotes
• Any type of gene prediction, and ab initio prediction in particular, is tremendously complicated in eukaryotes by splicing.
• The task is difficult: exon-intron boundaries must be predicted for eukaryotic genes with multiple introns, and the absence of introns must be predicted for intronless genes.
• Eukaryotic genomes differ significantly in a number of ways, which requires species-specific prediction programs.
• The major differences include: a) variation in GC content (e.g. mammalian genomes show large regional variation in GC content, referred to as isochores), b) variation in codon usage frequencies.
• All these factors, if not taken into consideration, diminish the quality of prediction.

AT/GC ratios in coding regions in some eukaryotes

The number of correct and incorrect (in parentheses) whole gene model predictions shared among the 3 programs, from a test set of 1783 genes. Programs: GeneMark.hmm (GM), Genscan+ (GS), GlimmerM (GA). An incorrect gene refers to cases in which all coding exons in the gene are in perfect agreement among the gene finders but not with the true gene.

mRNA splicing

Critical splice signals
[Figure: EXON 1, INTRON, EXON 2, with consensus nucleotides marked at the donor site (5’ splice junction), the branch site, and the acceptor site (3’ splice junction). Conservation values shown in the figure: 100%, 62–68%, 100%.]

Frequencies of nucleotides at the ends of exons
[Figure panels: the first 10 nucleotides of exons, 5’ end; the last 10 nucleotides of exons, 3’ end.]

Recognition of variable splice sites and gene prediction
• At least 3 critical signals/motifs (donor, acceptor and branch sites) should be recognised in order to predict the position of an intron and both splice junctions.
• Significant sequence variation in these sites between species and between genes lowers the quality of predictions.
• The best average error rate (false positives plus false negatives) for either donor or acceptor site prediction is about 5%. This may be acceptable if the search is restricted to a short region. However, searching a large region leads to an unacceptable number of false positives, because for every true site there are hundreds of pseudo-sites.
• For example, if a large region has 40 true sites and 4000 pseudo-sites, one true site would be missed (2.5% false negatives) and 100 pseudo-sites would be predicted as true sites (2.5% false positives)!
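The arithmetic in the last bullet can be checked with a few lines of Python. This is only a sketch: the function name and the extra "fraction of predictions that are real" figure are illustrative additions, computed from the slide's own numbers.

```python
def site_prediction_outcome(true_sites, pseudo_sites, fn_rate, fp_rate):
    """Return (missed true sites, pseudo-sites called true, fraction of calls that are real)."""
    missed = true_sites * fn_rate          # false negatives
    false_hits = pseudo_sites * fp_rate    # false positives
    true_hits = true_sites - missed
    fraction_real = true_hits / (true_hits + false_hits)
    return missed, false_hits, fraction_real

# Numbers from the slide: 40 true sites, 4000 pseudo-sites, 2.5% + 2.5% error.
missed, false_hits, fraction_real = site_prediction_outcome(40, 4000, 0.025, 0.025)
print(f"missed true sites:        {missed:.0f}")
print(f"pseudo-sites called true: {false_hits:.0f}")
print(f"fraction of calls real:   {fraction_real:.2f}")
```

With these inputs roughly 39 of 139 predicted sites are real, which is why a 5% error rate is not good enough when a large region is scanned.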

Recognition of variable splice sites and gene prediction
• Since an adjacent donor site and acceptor site are not independent, this correlation can be exploited to eliminate further false positives.
• For short introns, occurring mostly in lower eukaryotes, an intron is recognised by the interaction of splicing factors binding across the intron ends (hence the 5’ss – 3’ss correlation).
• In vertebrates, where exons are much shorter than introns, recognition of exons by the interaction of splicing factors binding across the exon ends (hence the 3’ss – 5’ss correlation) is the key.
• Therefore functional mammalian splice sites can only be identified effectively as pairs, through exon recognition.
• There are also several additional signals/motifs essential for correct splicing, which are recognised by certain proteins involved in splicing. Identifying such sites and using them in prediction programs should increase the quality of eukaryotic gene predictions.

Coding statistics: DNA differences between exons and introns
• Besides splicing signals and ORFs, several additional characteristics may help to discriminate between exons and introns.
• These features include DNA periodicity in exons, codon preference, hexamer usage, codon prototype, and compositional bias between codon positions.

DNA periodicity in exons

Periodic structure in DNA sequences. The absolute frequency of the pair A...A, with 0 to 5 nucleotides between the two A's, in the first 200 base pairs of the sequences in a set of 1761 human exons and 1753 human introns. A clear period-3 pattern appears in coding regions and is absent in non-coding regions. A similar periodic pattern appears in coding regions for the other fifteen possible pairs of nucleotides.
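The counting behind such a plot can be sketched in Python as follows. The function name and the toy sequence are illustrative assumptions; the toy string is contrived to be perfectly periodic and is not real exon data.

```python
def spaced_pair_counts(seq, first="A", second="A", max_gap=5):
    """Count occurrences of first...second with 0..max_gap bases in between."""
    seq = seq.upper()
    counts = []
    for gap in range(max_gap + 1):
        step = gap + 1  # distance between the two positions being compared
        n = sum(
            1
            for i in range(len(seq) - step)
            if seq[i] == first and seq[i + step] == second
        )
        counts.append(n)
    return counts

# Contrived, perfectly periodic toy sequence (not real data).
toy = "ACG" * 10
print(spaced_pair_counts(toy))  # counts for gaps 0, 1, 2, 3, 4, 5
```

On the toy sequence the counts are non-zero only for gaps 2 and 5, i.e. when the two A's are 3 or 6 positions apart. Real exons show a much weaker version of the same period-3 signal.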

Codon Preference
• A coding statistic was introduced to measure only the uneven usage of synonymous codons.
• From a codon usage table, we can compute the relative probability with which each synonymous codon is used to code for a given amino acid.
• For instance, GAG and GAA, the two codons coding for glutamic acid, are used in coding regions with probabilities 0.03882 and 0.02751, which gives relative probabilities of 0.59 and 0.41, respectively.
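The glutamic acid example amounts to normalising the codon usage values within one synonymous family. A minimal sketch, with an illustrative function name and the two frequencies taken from the slide:

```python
def relative_synonymous_probabilities(codon_freqs):
    """Normalise codon usage frequencies within one synonymous family."""
    total = sum(codon_freqs.values())
    return {codon: freq / total for codon, freq in codon_freqs.items()}

# Usage frequencies for the two glutamic acid codons, as given on the slide.
glu = {"GAG": 0.03882, "GAA": 0.02751}
rel = relative_synonymous_probabilities(glu)
for codon, p in rel.items():
    print(codon, round(p, 2))
```

This prints 0.59 for GAG and 0.41 for GAA, matching the values quoted above.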

Hexamer usage correlation
• Bias in the distribution of oligonucleotides longer than codons can also be used to discriminate between coding and non-coding regions. Bias in the usage of hexamers may be the most discriminating (probably because of dependence between adjacent amino acids in proteins).
• Bias in hexamer usage can be computed in the same way as bias in codon usage, since the background codon frequencies are known and the frequency of each of the $64^2 = 4096$ hexamers can be found.
• There are several ways to construct a frame-specific hexamer score, including the log-odds score $L_E(w,i) = \log [f_E(w,i) / f_I(w)]$ and the preference score $P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)]$, where $f_E(w,i)$ is the frequency of hexamer $w$ in frame $i$, calculated from known exon training data, and $f_I(w)$ is the frequency of $w$ in known introns.

Probabilities of the four nucleotides at the different codon positions, conditioned on the nucleotide in the preceding codon position, estimated from a set of human exon and intron sequences.
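The two frame-specific hexamer scores, the log-odds score and the preference score, can be sketched as below. Several things here are assumptions rather than part of the lecture: the function names, the natural logarithm, the add-one pseudocount (used only to avoid dividing by zero for unseen hexamers), the convention that training exons start in frame, and the tiny toy sequences, which are invented for illustration.

```python
import math
from collections import Counter

def hexamer_freq(seqs, frame=None, pseudocount=1.0):
    """Return a function w -> smoothed relative frequency of hexamer w.

    If frame is 0, 1 or 2, only hexamers starting at positions with
    pos % 3 == frame are counted (sequences are assumed to start in frame).
    """
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for pos in range(len(s) - 5):
            if frame is None or pos % 3 == frame:
                counts[s[pos:pos + 6]] += 1
    denom = sum(counts.values()) + pseudocount * 4096  # 4096 possible hexamers
    return lambda w: (counts[w] + pseudocount) / denom

def log_odds(w, i, f_exon, f_intron):
    """L_E(w, i) = log[f_E(w, i) / f_I(w)] (natural log assumed)."""
    return math.log(f_exon[i](w) / f_intron(w))

def preference(w, i, f_exon, f_intron):
    """P_E(w, i) = f_E(w, i) / [f_E(w, i) + f_I(w)]."""
    fe, fi = f_exon[i](w), f_intron(w)
    return fe / (fe + fi)

# Invented toy training data, far too small for real use.
toy_exons = ["ATGGCCGCCGCCGCCGCC"]
toy_introns = ["GTAAGTTTTTTTTTTTAG"]
f_exon = [hexamer_freq(toy_exons, frame=i) for i in range(3)]
f_intron = hexamer_freq(toy_introns)

print(log_odds("GCCGCC", 0, f_exon, f_intron))
print(preference("GCCGCC", 0, f_exon, f_intron))
```

A hexamer that is common in frame 0 of the exon set and absent from the intron set gets a positive log-odds score and a preference score above 0.5. A window would typically be scored by combining such values over its hexamers.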

Codon Prototype, Markov model measure and Average Mutual Information
• A measure can be introduced which shows how similar the observed distribution of base frequencies at the three codon positions in a sequence (exon or intron) is to the prototypical distribution (see the table).
• Dependencies between nucleotide positions in coding regions can be described explicitly by means of Markov models.
• Average Mutual Information quantifies the correlation between nucleotides i and j at a distance of k nucleotides, based on the probability of observing that pair in the sequence.

Nucleotide   Codon position
             1      2      3
A            0.27   0.31   0.18
C            0.24   0.24   0.31
G            0.32   0.20   0.29
T            0.17   0.26   0.22
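One simple way to turn the prototype table into a score is a log-likelihood ratio against a uniform background, evaluated in each of the three possible frames. This is a sketch under stated assumptions and not necessarily the codon prototype measure used in the lecture: the 0.25 background, the function name and the toy sequence are all illustrative.

```python
import math

# Prototype base frequencies at codon positions 1, 2, 3 (from the table).
PROTOTYPE = {
    "A": (0.27, 0.31, 0.18),
    "C": (0.24, 0.24, 0.31),
    "G": (0.32, 0.20, 0.29),
    "T": (0.17, 0.26, 0.22),
}

def prototype_score(seq, frame=0, background=0.25):
    """Sum of log(P(base | codon position) / background) over the sequence.

    frame is the index of the first base that sits in codon position 1.
    The uniform background is an assumption made for this sketch.
    """
    score = 0.0
    for i, base in enumerate(seq.upper()):
        if base not in PROTOTYPE:
            continue  # skip ambiguous symbols such as N
        codon_pos = (i - frame) % 3
        score += math.log(PROTOTYPE[base][codon_pos] / background)
    return score

# Invented toy sequence: repeated GAC codons.
toy = "GAC" * 10
scores = [prototype_score(toy, frame=f) for f in range(3)]
print(scores)
```

For the toy sequence the score is highest in frame 0, where G, A and C fall on the codon positions in which the table gives them their highest frequencies.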

Values of different coding statistics in the 223 bp long 2nd coding exon of the human β-globin gene, and in a 223 bp long sequence from the middle of the 2nd intron of the same gene

                              Exon sequence                        Intron sequence
Coding statistic              Coding frame   Non-coding frames     Frame 1    Frame 2    Frame 3
Codon Usage                   24.06          -16.13     -3.16      -14.36     -23.74     -19.67
Hexamer Usage                 27.62          -11.64     -6.51      -20.90     -27.56     -22.07
(row label missing)           39.98          -14.58     -8.46      -26.73     -27.81     -25.87
Codon Preference              15.97          -1.32      7.24       -7.96      -12.70     -14.93
Amino Acid Usage              8.17           -14.87     -10.17     -6.15      -10.69     -4.57
Codon Prototype               9.87           -11.23     -10.30     -11.45     -17.44     -14.49
Markov Model, order 1         29.92          -2.69      -3.31      -35.44     -42.40     -41.73
Markov Model, order 2         34.73          -18.26     -7.77      -29.61     -41.76     -40.05
Markov Model, order 5         72.69          -21.38     13.56      -37.63     -30.99     -36.40

Frame-independent statistic   Exon sequence   Intron sequence
Position Asymmetry            0.0957          0.0211
Periodic Asymmetry Index      1.159           1.009
Average Mutual Information    0.00681         0.000344
Fourier Spectrum              2.278           0.892

Pattern discriminant analysis
• A number of different pattern features of sequences are used to discriminate coding (exon) from non-coding sequences. Both a linear and a quadratic analysis are shown, with the latter being more efficient. EPS, the 6-mer exon preference score, and 3’SS, the 3’ splice site score, are examples of such features.

COMBINER: computational gene prediction using multiple sources of evidence
• The next generation of computational methods able to construct gene models is currently being developed. These methods take as input (combine) a genomic sequence and the locations of gene predictions from ab initio gene finders, protein sequence alignments, expressed sequence tag (EST) and cDNA alignments, splice site predictions, and other evidence.
• An example of such a program is COMBINER, which uses rigorous statistical assessment to evaluate candidate gene models and estimates probabilities using so-called decision trees.
