## ncRNA detection w/ multiple alignments

Vineet Bafna

**Comparative detection of ncRNA**

• Given a pairwise alignment, QRNA decides whether it is RNA, coding, or other.
• The key to detecting RNA is covarying mutations.
• A multiple alignment should provide more information on covarying mutations.

**RNAz**

• Computes the probability of ncRNA in a multiple alignment.
• RNAz computes two ‘novel’ statistics:
  • Minimum free energy of the sequences (MFE)
  • Conserved secondary structure (SCI)
• Train an SVM using the following features:
  • MFE
  • SCI
  • Mean pairwise identity
  • Number of sequences in the input

**SCI**

• Apply minimum-energy folding to a multiple alignment.
• The score of a pair of columns depends on base-pairing as well as compensatory mutations.
• Let $E_A$ denote the consensus fold energy.
• Let $E$ denote the average MFE of all sequences.
• $\mathrm{SCI} = E_A / E$
• Claim: low SCI is bad, high is good.
• Q: What is the SCI for diverged (random) sequences?
• What is the SCI for identical sequences?

**MFE**

• Compute a z-score for a sequence with MFE $= m$:
• $Z = (m - \mu)/\sigma$
• Instead of estimating $\mu$ and $\sigma$ by shuffling the sequence and refolding (slow),
• use regression to predict $\mu$ and $\sigma$ from sequence length and base composition.

**Non-linear classification**

• The z-statistic and SCI capture different properties.
• Green is good (native), red is bad (shuffled).
• Is SCI a good statistic, given different levels of sequence identity?

**Using RNAz to predict ncRNA**

• Applying RNAz to conserved regions results in the discovery of 30k putative RNAs.
• Is this list complete? Is it valid?
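
The two RNAz statistics from the slides above can be sketched in a few lines. This is only an illustration of the definitions, SCI as the consensus fold energy divided by the average single-sequence MFE, and the z-score as the MFE centered by a mean `mu` and scaled by a standard deviation `sigma` of a background (shuffled) distribution. The function names and all numeric values are hypothetical; the energies are assumed to come from an external folding program, and `mu` and `sigma` are assumed to be supplied (for example by the regression mentioned on the MFE slide).

```python
from statistics import mean


def structure_conservation_index(consensus_energy, single_mfes):
    """SCI = E_A / E, with E the mean MFE of the individual sequences."""
    avg_mfe = mean(single_mfes)
    if avg_mfe == 0:
        raise ValueError("average MFE is zero, SCI is undefined")
    return consensus_energy / avg_mfe


def mfe_z_score(m, mu, sigma):
    """Z = (m - mu) / sigma for a sequence whose MFE is m."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (m - mu) / sigma


# Made-up energies in kcal/mol, for illustration only.
single_mfes = [-30.0, -28.0, -32.0]   # MFE of each sequence folded alone
consensus_energy = -27.0              # energy of the consensus fold (E_A)

sci = structure_conservation_index(consensus_energy, single_mfes)
z = mfe_z_score(m=-30.0, mu=-20.0, sigma=4.0)
print(f"SCI = {sci:.2f}, z = {z:.2f}")
```

With identical sequences the consensus fold coincides with every single-sequence fold, so $E_A = E$ and the function returns 1. A strongly negative z-score means the native sequence folds more stably than its shuffled background.
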

**Structural Alignment**

    X07545    ..ACCCGGC.CAUA...GUGGCCG.GGCAA.CAC.CCGG.U.C..UCGUU
    M21086    ..ACCCGGC.CAUA...GCGGCCG.GGCAA.CAC.CCGG.A.C..UCAUG
    X05870    ..ACCCGGC.CACA...GUGAGCG.GGCAA.CAC.CCGG.A.C..UCAUU
    U05019    ..ACCCGGU.CAUA...GUGAGCG.GGUAA.CAC.CCGG.A.C..UCGUU
    M16530    ..ACCCGGC.AAUA...GGCGCCGGUGCUA.CGC.CCGG.U.C..UCUUC
    X01588    ..ACCCGGU.CACA...GUGAGCG.GGCAA.CAC.CCGG.A.C..UCAUU
    AF034619  ...GGCGGC.CACA...GCGGUGG.GGUUGCCUC.CCGU.A.C..CCAUC
    L27170    AGUGGUGGC.CAUA...UCGGCGG.GGUUC.CUCCCCGU.A.C..CCAUC
    X05532    AGGAACGGC.CAUA...CCACGUC.GAUCG.CAC.CACA.U.C..CCGUC
    #=GC      <<<<<<<<<........<<.<<<<.<...<.<...<<<<.<.<.......

Conserved sequences and conserved structure are more apparent in multiple alignments.

**RNA multiple alignments**

• Detection of RNA depends on reliable prediction of covarying mutations, as well as regions of conserved sequence.
• Precomputing multiple alignments based on sequence considerations alone is probably not sufficient (this should be tested).
• How can structural alignments be computed?

**Computing Structural Alignments**

• Analogy: in sequence alignment, the score for aligning a column is position independent.
• In profiles, or HMMs, position-specific scoring is used to distinguish conserved positions from non-conserved positions.
• Similar ideas can be used for RNA.

[Figure: alignment columns with position-specific probabilities, e.g. $\Pr(G \mid 1) = 0.8$]

**Covariance models = RNA profiles**

[Figure: covariance model drawn as a tree of nonterminals $S, W_1, W_2, W'_2, W_3, W_4, \ldots$ with terminals $a$, $b$]

Terminal symbols correspond to columns.

**Aligning a sequence to a covariance model**

• We align the sequence to each node of the covariance model (it is tree-like, but may be a graph).
• The alignment score follows the same recurrence as in Lecture 7, but with position-specific probabilities.
• Example:
  • $A[W_i,(i,j)] = -\log \Pr\big[W_i \to s[i]\, W_j\, s[j]\big] + A[W_j,(i+1,j-1)]$
• If we wish to compute the probability that a sequence belongs to a family, we compute the total likelihood (the sum over all parses).
• If we wish to compute the structure of an unknown sequence by comparison to a covariance model, we compute the maximum-likelihood parse in this graph.

**Covariance models and ncRNA discovery**

• Given a family of ncRNA sequences, scan a genomic sequence with a covariance model and retrieve all high-scoring subsequences.
• This is the most common method, but it is expensive.
• Assume the covariance model has $m$ states, the substring has at most $n$ symbols, and the database has $L$ symbols.
• Alignment cost $= O(n^2 m_1 + n^3 m_2)$
• Total time = ?

**Computing covariance models**

• If we are given a CM, a multiple structural alignment is ‘easy’:
  • In turn, align each sequence to the CM.
• If we are given a multiple alignment, computing the covariance model is easy.
• For simultaneous prediction, a Bayesian iterative approach is used:
  • Compute a seed alignment.
  • Use the alignment to compute a CM.
  • Use the CM to compute a new alignment.
  • Iterate.

**Open**

• Compute a structural multiple alignment.
• Existing methods do not work well without a good seed alignment, and require excessive hand curation.
• Here, we solve a simpler problem:
  • Predict conserved structure in unaligned sequences.

**Motivation to a new approach**

[Figure: example stack on the strands ACCUU and AAGGA; $p = (1/4)^5 < 0.001$]

• Base-pairs appear in ‘clusters’: we call them stacks, and they are energetically favorable.
• Most of the stability of the RNA secondary structure is determined by stacks.

**Statistics of the stacks in Rfam database**

• Most base-pairs are stacked up.

**Using stacks as anchors for predictions**

• The idea of anchors as constraints has been used in multiple genomic sequence alignment.
  • MAVID (Bray and Pachter, 2004)
  • TBA (Blanchette et al., 2004)
• Several heuristic methods have been developed that find anchored stacks:
  • Waterman (1989) used a statistical approach to choose conserved stacks within fixed-size windows.
  • Ji and Stormo (2004) and Perriquet et al. (2003) use primary sequence conservation of the stacks and the length of loop regions to reduce the search space.
• Stack anchors have low sequence similarity.
• It is hard to find correct anchors.

**Problem**

• Selecting one stack at a time may produce wrongly matched stacks.

**A global approach: configuration of stacks**

• RNA secondary structure can be viewed as stacks plus unpaired loops (no individual base-pairs).
• The energy of the structure is the sum of the energies of stacks and loops.
• Stack configuration:
  • Nested stacks
  • Parallel stacks
  • Crossing stacks (pseudoknots)
• More generalized stacks can include mismatches in the stacks.

**RNA Stack-based Consensus Folding (RNAscf) problem**

• Find conserved stack configurations for a set of unaligned RNA sequences.
• Optimize both the stability (free energy) of the structure and the sequence similarity, computed using these common stacks as anchors.

**RNA stack-based consensus folding for pairwise sequences**

**Matching stack configurations on two sequences**

[Figure: cost of a matching; labeled terms: sequence similarity of unpaired regions, sequence similarity of stacks, energy of the consensus structure, and weights of the different costs]

**RNA Stack-based Consensus Folding for multiple sequences**

**Cost function for multiple sequences**

[Figure: array of entries $A_{i,j}$, $i = 1, \ldots, s$, $j = 1, \ldots, k$]

**Compute an optimal stack configuration for two sequences**

• A dynamic programming algorithm is used to align the RNA sequences and find an optimal configuration at the same time.
• The algorithm is similar to prior work (Sankoff 1985, Bafna et al. 1995).
• Differences:
  • We use stacks as the basic structural elements.
  • Prior work used individual base pairs.
• The computational time is $O(n^4)$ ($n$ is the number of stacks).
• Sankoff’s algorithm is $O(m^6)$ ($m$ is the length of the sequences).
• The number of possible stacks (size >= 4) is much smaller than the length of the sequence.
• It is much faster.

**For any pair of stacks, there are three choices:**

• Hairpin loop
• Interior loop/bulge
• Multi-loop

[Figure: the three cases drawn for stacks $P_A$ and $P_B$ with their loops]

**The score of matching stacks:**

[Figure: stacks $P_A$, $P_B$]

**The score of matching hairpin loops:**

[Figure: $P_A$ with Loop($P_A$), $P_B$ with Loop($P_B$)]

**The score of matching interior loops or bulges:**

[Figure: $P_X$, $P_A$ with Loop($P_X$, $P_A$); $P_Y$, $P_B$ with Loop($P_Y$, $P_A$)]

**The score of matching two multi-loops:**

[Figure: $P_A$ with inner stacks $P_{1A}, \ldots, P_{iA}$ and Loop($P_i$, $P_A$); $P_B$ with inner stacks $P_{1B}, \ldots, P_{jB}$ and Loop($P_i$, $P_B$)]

**Consensus folding for multiple sequences**

• We use a heuristic method based on the notion of star-alignment.
• Compute an optimal configuration from a random seed pair.
• Align all individual sequences to this configuration.
• Choose the stack configuration that is conserved in all sequences.
• Allow some stacks to be partially conserved (appearing in at least a certain fraction of the sequences).

**Compute the stack configuration for multiple sequences:**

RNAscf(k, h, f)

**Iterative procedure for RNAscf**

• P = RNAscf(k, h, f).
• In each sequence, extract the unpaired regions according to the loop regions in P.
• Predict additional putative stacks that do not cross P, using smaller k’ and h’.
• Recompute the alignment with the additional putative stacks using RNAscf(k’, h’, f).

**Test dataset**

• We choose a set of 12 RNA families from the Rfam database:
  • 20 sequences with annotated structures are chosen from each family (10 sequences for CRE and glms).
  • There are 953 stacks.
• We compare RNAscf with 3 other programs that are available online for RNA folding:
  • RNAfold (energy-based minimization) (Hofacker 2003)
  • COVE (covariance model) (Eddy and Durbin 1994)
    • COVE needs a starting seed alignment, which is produced by ClustalW.
  • comRNA (computing anchors in multiple sequences) (Ji, Xu and Stormo 2004)
• Sensitivity: the fraction of true stacks that overlap with predicted stacks.
• Accuracy: the fraction of predicted stacks that overlap with true stacks.

**Test results**

**Performance improves when the number of sequences increases**

(Using the thiamine riboswitch subfamily (RF00059))

**RNAscf always finds the right consensus stack configuration.**

(SAM riboswitch (RF00162))

**Conclusion and future work**

• RNAscf is a valid approach to RNA consensus structure prediction.
• It uses stack configurations to represent RNA secondary structure.
• It proposes a dynamic programming algorithm to find an optimal stack configuration for pairwise sequences.
• It uses both primary sequence information and energy information.
• It uses a star-alignment-like heuristic to get the consensus structure for multiple sequences.

**Conclusion**

• There is a signal due to covarying mutations that is a good predictor of RNA structure.
• Can RNAscf scores be used as a statistic to discover ncRNA in ‘unaligned’ sequences?
• How good are sequence-based alignments? Do they preserve structure?
  • Not for diverged families
  • Possibly for orthologous regions

**ncRNA discovery for specific families**

**Case study: miRNA**

• dsRNA and siRNA can be used to silence genes in mammalian tissue culture.
• miRNA is a new member of this class of endogenous interfering RNA.
• RNA interference (RNAi) is a powerful new technique to study gene function.

**Case Study: miRNA**

• ncRNA ~22 nt in length
• Pairs to sites within the 3’ UTR, specifying translational repression.
• Similar to siRNA (involved in RNAi)
• Unlike siRNA, miRNA does not need perfect base complementarity.
• No computational techniques to predict miRNA
• Most predictions are based on cloning small RNAs from size-fractionated samples.

**miRNA (vs. siRNA)**

• Derived from transcripts that form local hairpin structures.
• The sequences of the precursor and of the processed miRNA are evolutionarily conserved.
• Usually distinct, and distant, from other genes
• siRNA (by contrast):
  • Not evolutionarily conserved
  • Corresponds to sequences of known or predicted mRNAs, transposons, or regions of heterochromatic DNA.

**MiRscan**

• Predicts miRNA
• Start with evolutionarily conserved regions. Ex: C. elegans and C. briggsae
• 36000 hairpins were found (including 50 of 53 known miRNAs).
• The 50 known miRNAs were used to train the scoring, which was then applied to the 36000 hairpins.

**Computational identification of miRNA**

• 7 features are scored:
  • miRNA base-pairing
  • Base-pairing of the rest of the fold-back
  • Stringent sequence conservation in the 5’ end of the fold-back
  • Sequence conservation in the 3’ end of the fold-back
  • Sequence bias in the first 5 bases of the miRNA
  • Tendency to form symmetric internal loops
  • Presence of 2-9 consensus base-pairs between the miRNA and the terminal loop region
• Red: conserved with C. briggsae
• Blue: varying residues that maintain their predicted paired or unpaired states

**MiRscan scoring**

• 35 previously unannotated hairpins exceeded the median score.

**Molecular identification of miRNA**

• Initial cloning and sequencing identified 300 clones representing 54 unique miRNAs.
• A 10-fold scale-up of the procedure identified 3423 clones as miRNA. These contain 77 distinct miRNA genes.
• $77 - 54 = 23$ novel miRNAs found
• 20 were scored by MiRscan (yellow). 10 were among the top 35.