Progress report. Yiming Zhang, 02/10/2012.
All AS events in ASIP:
Intron retention
Exon skipping
Alternative acceptor site (AltA), including NAGNAG
Alternative donor site (AltD), including GYNGYN
Alternative both sites (AltP)
NAGNAG alternative splicing.
Figure 1. NAGNAG alternative splicing with E and I sites and isoforms.
NAGNAG alternative splicing can result in one of three outcomes (Figure 1): constitutive use of the first acceptor (the so-called exonic, or “E” variant), constitutive use of the second acceptor (the so-called intronic, or “I” variant), or use of both acceptors, that is, alternative splicing (the “EI” variant).
Sinha et al. 2010
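To make the E and I acceptors concrete, here is a minimal sketch that scans a sequence for NAGNAG motifs and reports where the downstream exon would start for each acceptor. The function names and the 0-based coordinate convention are assumptions for illustration; they are not taken from ASIP.

```python
import re

# A lookahead makes the search overlap-aware, so motifs that share bases
# (for example in CAGCAGCAG) are all reported.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")


def find_nagnag(seq):
    """Return (start, motif) for every NAGNAG occurrence, 0-based."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in NAGNAG.finditer(seq)]


def exon_starts(motif_start):
    """Exon start when the first (E) or second (I) AG is the acceptor.

    The E acceptor keeps the second NAG in the exon; the I acceptor
    leaves it in the intron, so the two isoforms differ by 3 nt.
    """
    return {"E": motif_start + 3, "I": motif_start + 6}


hits = find_nagnag("TTTTCAGCAGGTA")
for start, motif in hits:
    print(start, motif, exon_starts(start))
```

The 3-nucleotide offset between the two exon starts is what makes the E and I isoforms differ by a single codon.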
Figure 2. GYNGYN alternative splicing with e and i sites and isoforms.
Hiller et al. 2006
Table 1. Intron statistics from ASIP. Four species with only a small amount of data are not listed. All statistics are intron-based rather than event-based, which means redundancy has been removed. The most common alternative intron type is IntronR, and the second most common is ExonS. NAGNAG AS accounts for a much larger share of AltA than GYNGYN AS does of AltD.
NAGNAG alternative splicing, which can insert or delete a single amino acid in the protein, is very common and well studied in animals.
Iida et al. 2008; Sinha et al. 2010
Few studies of NAGNAG AS in plants exist so far (only three species: Arabidopsis, rice and Physcomitrella).
The state-of-the-art in silico prediction of NAGNAG splice sites comes from Sinha's group, who achieved high and balanced specificity and sensitivity for both human and plant species.
Sinha et al. 2009, 2010
Using Random Forest, I tried to predict NAGNAG events (that is, whether a site yields the EI, I or E isoform) on a dataset I generated from ASIP.
Figure 3. A total of 28 features, each representing a nucleotide and thus taking one of four possible values (A, C, G, T). U1, U2, U3 are the first three nucleotides in the upstream exon. D1, D2, D3 are the first three nucleotides in the downstream exon. A weak polypyrimidine tract (PPT) can contribute to AS, so P1-P20 cover the PPT upstream of the NAGNAG. Finally, intron length is used as an additional feature.
A Random Forest with 200 trees was used, and 5-fold cross-validation was applied.
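One way to turn these nucleotide features into numeric input is one-hot encoding. The sketch below is an assumption about the encoding, not a description of what was actually used. It covers only the U, P and D positions named in the caption plus intron length, and the function and argument names are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def one_hot(base):
    """Map one nucleotide to a 4-element indicator vector (all zeros if unknown)."""
    return [1 if base == n else 0 for n in NUCLEOTIDES]


def encode_site(upstream, ppt, downstream, intron_length):
    """Build a flat numeric feature vector for one NAGNAG site.

    upstream      : 3-nt string (U1-U3)
    ppt           : 20-nt string (P1-P20)
    downstream    : 3-nt string (D1-D3)
    intron_length : int, appended unchanged as the last feature
    """
    if (len(upstream), len(ppt), len(downstream)) != (3, 20, 3):
        raise ValueError("expected 3 + 20 + 3 nucleotides")
    vector = []
    for base in (upstream + ppt + downstream).upper():
        vector.extend(one_hot(base))
    vector.append(intron_length)
    return vector


features = encode_site("CAG", "TTTCTTTCTTTTCTCTTTTC", "GTA", 150)
print(len(features))
```

Tree-based learners can also work directly on categorical values, so one-hot encoding is only one reasonable choice here.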
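The 5-fold cross-validation step can be sketched as below. Stratifying the folds by class (E, I, EI) is an assumption, since the slide does not say how the folds were drawn, and all function names are hypothetical. A classifier such as the 200-tree Random Forest would be trained on each train split and scored on the matching test split.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, balancing class labels."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


labels = ["E"] * 10 + ["I"] * 10 + ["EI"] * 5
for train, test in train_test_splits(labels):
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every site is predicted once by a model that never saw it during training.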
The evaluation results agree closely with Sinha’s paper (for Physcomitrella), in which AUC = 0.96, 0.99 and 0.98 for the EI, E and I forms, respectively.
Figure 4. The EI class, or AS, is harder to predict (AUC = 0.967) than the two constitutive variants, E and I (AUC = 0.995 for both).
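Per-class AUC values like these are usually obtained one-vs-rest: the score for one class is compared against the labels "this class" versus "any other class". The sketch below computes AUC as the fraction of positive/negative pairs ranked correctly, which is a standard equivalent of the area under the ROC curve. The function names and the toy scores are made up for illustration.

```python
def auc(scores, positives):
    """AUC as the probability that a random positive outranks a random negative.

    Tied scores count as half a win.
    """
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(class_scores, labels, target):
    """AUC for one class, treating every other class as negative."""
    return auc(class_scores, [label == target for label in labels])


# Toy example: predicted probability of the EI class for four sites.
ei_scores = [0.9, 0.2, 0.8, 0.1]
labels = ["EI", "EI", "E", "I"]
print(one_vs_rest_auc(ei_scores, labels, "EI"))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but slow on large datasets; a rank-based formulation gives the same value faster.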
Figure 5. Most informative features according to information gain.
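Information gain for a categorical feature is the drop in class entropy after splitting the samples by that feature's value. A minimal sketch, with hypothetical function names and toy data:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Class entropy minus the weighted entropy within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy example: one nucleotide position across four sites.
position = ["A", "A", "G", "G"]
classes = ["E", "E", "I", "I"]
print(information_gain(position, classes))
```

A position whose nucleotide perfectly separates the classes gets the full class entropy as its gain, while a position that carries no class signal gets zero.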
Figure 6a-6d. Sequence logos of NAGNAG splice sites. 6a: E sites; 6b: I sites; 6c: EI sites; 6d: all splice sites. Positions 1-3 are U1-U3. Positions 4-24 are P20-P1. Positions 30-32 are D1-D3.