Searching genomes for noncoding RNA

Presentation Transcript

  1. Searching genomes for noncoding RNA. CS374, Leticia Britos, 10/03/06

  2. DNA to RNA, and genes. DNA: ~3×10^9 bases long in humans; contains ~22,000 genes. RNA: carries the “message” for “translating”, or “expressing”, one gene. [Figure labels: transcription, translation, folding, “easy”]

  3. “Structural genes encode proteins and regulatory genes produce non-coding RNA” F. Jacob and J. Monod (1961)

  4. Gene Finding Where are the genes?

  5. Gene Finding. Where are the genes? In humans: ~22,000 genes, ~1.5% of human DNA. [Figure: genomic sequence annotated with the signals atg, ggtgag, cagatg, ggtgag, cagttg, ggtgag, caggcc, ggtgag, tga]

  6. An expanding universe of noncoding RNA
     • rRNA (structure/function of ribosomes)
     • tRNA (translation)
     • snRNA (RNA splicing, telomere maintenance)
     • snoRNA (chemical modification of rRNA)
     • miRNA (translational regulation)
     • gRNA (mRNA editing)
     • tmRNA (degradation of defective proteins)
     • riboswitches (translational and transcriptional regulation)
     • ribozymes (autocatalytic RNA)
     • RNAi (gene regulation by dsRNA)

  7. Exciting times for the RNA world (and for Stanford)

  8. How to find ncRNAs? [Figure: genomic sequence annotated with the signals atg, caggtg, ggtgag, cagatg, ggtgag, cagttg, ggtgag, caggcc, ggtgag, tga]

  9. How to find ncRNAs?

  10. Riboswitches. [Figure: gene diagrams labeled 5′, 3′, promoter, exons, introns, 5′ UTR, 3′ UTR, coding, and noncoding]

  11. Sequence conservation is not enough

  12. Secondary structure is not enough

  13. Noncoding RNA signals in the genome are not as strong as the signals for protein-coding genes. Look for structure in evolutionarily conserved sequences.

  14. Identify new instances of a given ncRNA family in a genome

  15. Existing algorithms • CMSearch • RSEARCH • ERPIN

  16. Example: finding 5S RNAs in a 1.6Mb genome • RSEARCH: 6.5 h • FastR: 103 s
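
For scale, the two running times on this slide correspond to roughly a 227-fold speedup. This is a back-of-the-envelope ratio computed from the numbers above, not a figure stated on the slide:

```latex
\frac{6.5\ \text{h}}{103\ \text{s}}
  = \frac{6.5 \times 3600\ \text{s}}{103\ \text{s}}
  = \frac{23\,400}{103}
  \approx 227
```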

  17. FastR

  18. What is a Database filter? A computational procedure that takes a DB as input and outputs a subset of it.
     • The object being searched for remains in the DB after filtering (sensitivity)
     • The filtered DB is significantly smaller
     • The filtering operation is fast (efficiency)
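
The first two requirements can be made measurable with a small sketch. The names `apply_filter`, `sensitivity`, and `reduction`, the toy DB, and the toy predicate are all illustrative assumptions; none of them come from FastR itself.

```python
from typing import Callable, List, Set


def apply_filter(db: List[str], keep: Callable[[str], bool]) -> List[str]:
    """Return the subset of DB entries that pass the predicate."""
    return [seq for seq in db if keep(seq)]


def sensitivity(filtered: List[str], true_hits: Set[str]) -> float:
    """Fraction of the true hits that are still present after filtering."""
    if not true_hits:
        return 1.0
    return len(true_hits & set(filtered)) / len(true_hits)


def reduction(db: List[str], filtered: List[str]) -> float:
    """Fraction of the DB that the filter eliminated."""
    return 1.0 - len(filtered) / len(db)


# Toy data (hypothetical): two entries are the ones we are looking for.
db = ["GGGAAACCC", "AAAAAAAAA", "UUUUUUUUU", "GGCAAAGCC"]
true_hits = {"GGGAAACCC", "GGCAAAGCC"}

# Toy predicate (hypothetical): keep entries containing both G and C.
filtered = apply_filter(db, lambda s: "G" in s and "C" in s)

print(sensitivity(filtered, true_hits))  # fraction of true hits kept
print(reduction(db, filtered))           # fraction of the DB removed
```

In this toy case the predicate keeps both true hits (sensitivity 1.0) while discarding half of the DB (reduction 0.5). A useful filter aims for sensitivity near 1 with a much larger reduction.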

  19. Problem • Given an RNA sequence with known structure, find homologous sequences in an RNA DB. Sequences shown (labeled DB and QUERY on the slide):
     AGAGCGUAUCGAUUUAGAGAGCUAUAGCUAGAGAGGAGA
     UUAUAGCGCGCAUAUAGGACAAACAGUCUCUAUGGGGAC
     AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA
     AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG
     AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG
     AUAUCGGGCUGUGGACACUAUACGAUCGAAUCUAGCUAU
     AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG
     AUAUCGGGCUGUGGACACUAUACGAUCGAAUCUAGCUAU

  20. Solution • Stage 1: filter the DB • Stage 2: align the selected sequences in the DB with the query and determine the best alignments

  21. Filtering • Sequence alone is not sufficient • Structure alone is not sufficient

  22. Filter using both sequence and structural features

  23. Structural features: (k,w)-stacks. AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA [Figure: substrings a and a’ marked on the sequence at positions 3 to 6 and 25 to 28]

  24. Definition of a (k,w)-stack: a pair of substrings of length at least k that are at most w bases apart. Example: AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA, with a and a’ separated by d = 18. Is a,a’ a (4,18)-stack? A (4,20)-stack? A (4,9)-stack? A (3,20)-stack?
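
The four questions on this slide can be checked mechanically. The sketch below assumes that a and a’ are the substrings at positions 3 to 6 and 25 to 28 (the numbers shown on the previous slide), that d counts the bases strictly between them (which reproduces d = 18), and that the two substrings must be reverse complements under Watson-Crick pairing only (taken from the later "reverse complement" step of the filtering algorithm). The function names are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(s: str) -> str:
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[c] for c in reversed(s))


def is_kw_stack(seq: str, start_a: int, start_b: int,
                length: int, k: int, w: int) -> bool:
    """Check whether two substrings form a (k,w)-stack.

    start_a and start_b are 1-based start positions, with a upstream of a'.
    """
    a = seq[start_a - 1:start_a - 1 + length]
    b = seq[start_b - 1:start_b - 1 + length]
    d = start_b - (start_a + length)  # bases strictly between a and a'
    return length >= k and 0 <= d <= w and b == reverse_complement(a)


seq = "AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA"
for k, w in [(4, 18), (4, 20), (4, 9), (3, 20)]:
    print((k, w), is_kw_stack(seq, 3, 25, 4, k, w))
# → (4, 18) True
# → (4, 20) True
# → (4, 9) False
# → (3, 20) True
```

Under these assumptions only the (4,9) case fails, because the separation d = 18 exceeds w = 9.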

  25. Use of (k,w)-stacks as filters in the search for ncRNAs. If we use a (7,70)-stack filter, we eliminate 90% of the DB from consideration.

  26. Structural features: nested (k,w,l)-stacks. AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA [Figure: positions 3, 6, 25, 28 and 12, 14, 16, 18 marked on the sequence]

  27. Structural features: parallel (k,w,l)-stacks. AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA [Figure: positions 3, 6, 25, 28; 12, 14, 16, 18; and 32, 34, 35, 36 marked on the sequence]

  28. Structural features: multiloop (k,w,l)-stacks

  29. Filtering criteria: nested stacks, parallel stacks, multiloop stacks

  30. Filtering algorithm. 1. Build a hash table of k-mers in the DB (k=4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1

  31. Filtering algorithm. 1. Build a hash table of k-mers in the DB (k=4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1; UUCC: 2

  32. Filtering algorithm. 1. Build a hash table of k-mers in the DB (k=4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1; UUCC: 2; UCCG: 3

  33. Filtering algorithm. 1. Build a hash table of k-mers in the DB (k=4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1; UUCC: 2; UCCG: 3; CCGG: 4

  34. Filtering algorithm. 1. Build a hash table of k-mers in the DB (k=4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1, 14; UUCC: 2; UCCG: 3; CCGG: 4
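
Step 1 can be sketched in a few lines. This is a minimal illustration using a Python dictionary as the hash table and 1-based start positions to match the slide. The function name `build_kmer_index` is an assumption, not FastR's actual code.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(seq: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in seq to the list of its 1-based start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)  # store 1-based position
    return index


seq = "AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA"
index = build_kmer_index(seq, 4)

for kmer in ["AUUC", "UUCC", "UCCG", "CCGG"]:
    print(kmer, index[kmer])
# → AUUC [1, 14]
# → UUCC [2]
# → UCCG [3]
# → CCGG [4]
```

Building the table takes one pass over the sequence, which is what keeps this stage fast compared with aligning every DB entry.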

  35. Filtering algorithm. 2. Identify (k,w)-stacks (k=4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1, 14; UUCC: 2; UCCG: 3; CCGG: 4; GAAU: n. GAAU is the reverse complement of AUUC.

  36. Filtering algorithm. 2. Identify (k,w)-stacks (k=4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1, 14; UUCC: 2; UCCG: 3; CCGG: 4; GAAU: n. GAAU is the reverse complement of AUUC. Check: is d ≤ w?
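
Step 2 can be sketched as a lookup of each k-mer's reverse complement in the hash table, followed by the d ≤ w test. The sketch assumes Watson-Crick pairs only (no G-U wobble), 1-based positions, and d counted as the number of bases strictly between the two k-mers. `find_kw_stacks` is a hypothetical name, not FastR's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(s: str) -> str:
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[c] for c in reversed(s))


def find_kw_stacks(seq: str, k: int, w: int) -> List[Tuple[int, int, int]]:
    """Return sorted (i, j, d) triples: a k-mer at i, its reverse complement at j."""
    # Step 1: hash table of k-mers with 1-based start positions.
    index: Dict[str, List[int]] = defaultdict(list)
    for p in range(len(seq) - k + 1):
        index[seq[p:p + k]].append(p + 1)

    # Step 2: look up each k-mer's reverse complement and test d <= w.
    stacks = []
    for kmer, starts in index.items():
        partner = reverse_complement(kmer)
        for i in starts:
            for j in index.get(partner, []):
                d = j - (i + k)  # bases strictly between the two k-mers
                if 0 <= d <= w:
                    stacks.append((i, j, d))
    return sorted(stacks)


seq = "AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA"
stacks = find_kw_stacks(seq, k=4, w=30)
print((1, 32, 27) in stacks)   # AUUC at 1 pairs with GAAU at 32
print((14, 32, 14) in stacks)  # AUUC at 14 pairs with GAAU at 32
```

With this sequence, GAAU starts at position 32, so both occurrences of AUUC form a stack with it when w = 30. Lowering w to 20 keeps only the occurrence at position 14.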

  37. Filtering algorithm. 3. Compute complex stacks using DP: nested, parallel, multiloop

  38. Result of stage 1 (filtering) AGAGCGUAUCGAUUUAGAGAGCUAUAGCUAGAGAGGAGA UUAUAGCGCGCAUAUAGGACAAACAGUCUCUAUGGGGAC AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCA AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG AUAUCGGGCUGUGGACACUAUACGAUCGAAUCUAGCUAU AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG AUAUCGGGCUGUGGACACUAUACGAUCGAAUCUAGCUAU

  39. Solution • Stage 1: filter the DB • Stage 2: align the selected sequences in the DB with the query and determine the best alignments

  40. Possible ways to align RNAs • sequence to sequence • structure to structure • sequence to structure

  41. RNA sequence-structure alignment. DB (filtered): target sequence t[1, …, n]. Query: sequence s[1, …, m] with structure S (set of base pairings). Sequences shown: AGAGCGUAUCGAUUUAGAGAGCUAUAGCUAGAGAGGAGA, UUAUAGCGCGCAUAUAGGACAAACAGUCUCUAUGGGGAC

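  42. The secondary structure of the query is represented by a binarized tree. Rule 1: when i and j are paired, the node for interval (i, j) is the pair i-j, with one child for the interval (i+1, j−1).

  43. The secondary structure of the query is represented by a binarized tree. Rule 2: when j is unpaired, the node for (i, j) has one child for (i, j−1).

  44. The secondary structure of the query is represented by a binarized tree. Rule 3: when j is paired, but not to the left-most base (it pairs with some k > i), the node for (i, j) has two children, for (i, k−1) and (k, j).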
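
The three rules can be read as a recursive construction over intervals (i, j) of the query. The sketch below is one such reading, reconstructed from the labels that survive in the transcript: Rule 1 recurses into (i+1, j−1), Rule 2 into (i, j−1), and Rule 3 splits into (i, k−1) and (k, j). The `Node` class, the rule labels, and the function names are assumptions for illustration. The structure is assumed to be non-crossing (no pseudoknots), and positions are 1-based.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Node:
    i: int
    j: int
    rule: str  # "pair" (Rule 1), "unpaired" (Rule 2), or "split" (Rule 3)
    children: List["Node"] = field(default_factory=list)


def partner_map(pairs: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map each paired position to its partner."""
    partner: Dict[int, int] = {}
    for a, b in pairs:
        partner[a] = b
        partner[b] = a
    return partner


def build_tree(i: int, j: int, partner: Dict[int, int]) -> Optional[Node]:
    """Build the tree for interval (i, j); an empty interval yields None."""
    if i > j:
        return None
    k = partner.get(j)
    if k is None:            # Rule 2: j is unpaired
        rule, spans = "unpaired", [(i, j - 1)]
    elif k == i:             # Rule 1: i and j are paired
        rule, spans = "pair", [(i + 1, j - 1)]
    else:                    # Rule 3: j pairs with k, where i < k < j
        rule, spans = "split", [(i, k - 1), (k, j)]
    node = Node(i, j, rule)
    for a, b in spans:
        child = build_tree(a, b, partner)
        if child is not None:
            node.children.append(child)
    return node


# Toy structure (hypothetical): two adjacent hairpins on 8 bases.
root = build_tree(1, 8, partner_map([(1, 4), (5, 8)]))
print(root.rule, [(c.i, c.j, c.rule) for c in root.children])
# → split [(1, 4, 'pair'), (5, 8, 'pair')]
```

Every node has at most two children, which is what makes the tree convenient for a dynamic-programming alignment over its nodes. The recursion depth grows with the query length, so a long query would need an iterative version.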

  45. The secondary structure of the query is represented by a binarized tree

  46. The secondary structure of the query is represented by a binarized tree

  47. The secondary structure of the query is represented by a binarized tree

  48. The secondary structure of the query is represented by a binarized tree

  49. The secondary structure of the query is represented by a binarized tree