Random Generation of Structured Genomic Sequences

5/9 4/9 cS aSbS 2/5 1/5 1/2 1/2 2/5 aaSbSbS acSbS abS caSbS ccS 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 accSbS acbS abaSbS abcS cacSbS cabS ccaSbS cccS aabb accb acbc abab abcc cacb cabc ccab cccc Biological sequences UUUUCUGUAAAUAAAAGAGUGCAAUAU CUGCCUCGGAAGAUUAAAGAGAAUGUA GUUUUAAAAUACUUAAUAAAAUAAUUC GUGUUUAAAAUAAAUUUCAUUCAGACU Model A,C,G,U A,C,G,U Grammatical inference A,U A U A A A Random generation Grammatical inference Random sequences GAAUAAAUAAAAGCUUUTGAGAGCUAA AUGCCAGGAAGCAAUAAAGGACAAUAA AAAAUUUCACUUAAUAAAAUAGUUAUU UAAAAAUAAAUGGUUGUCAAUUCUGCG Grammar: S aSbS | c S |  Non-uniform “controlled” random generation. We aim to model statistical parameters as well as structural ones, by adjusting the expected number of occurrences of each letter in the generated words. For this purpose, we give a weight (x) to each letter x, and define the weight of a word as the product of the weights of its letters. A slight modification of the generation algorithm allows to generate any word with a probability proportional to its weight. Given the weighted generating series of the language: the expected frequency of any letter xi is given by : where and The generation is uniform in the particular case where all weights are assigned to 1. With our grammar, this leads to a = b =c =1/3, when n is large. Assigning weight 1/2 to letter c gives a = b= 2/5 and c =1/5, leading to more paired nucleotides in the corresponding secondary structures. Length: n = 4 S c a a a a a c c c b b b a c c a a c c c c b b c c b c c b b S aSbS | c S |  Random Generation of Structured Genomic Sequences Objectives and Applications Take into account structural features, as well as statistical ones, in models of random sequences. UCCAACAUCACAGCCCAGCCCACCCACUGGGUAAUAAAAGUGGUUUGUGG AUCACUGUAAUUAUUAUUAUUUUCUACAAUAAAUGGGACCUGUGCACAGG GACCCAGAUGGGAUGUUCGGAUCGGUUUGUAAUUAAACCUGGGAAUGGCC GGAUACACAAAUAAGUCAGUUAAAAUACAUAAAUAAAAACAUAAACCUGC CAGGAGGGGAACGUGGUAAAACCCAAGACAUUAAAUCUGCCAUCUCAGGC UUUUUUGUUUCAGUACCAGAGGCACUGACUUCAAUAAAGUUUAUUUAUAC GAACUCUGCCCUGCCUGGGACUCUAUUUAUUCUGAUUAAAGGGGUUUUGC GAAAAUAAUGAAUGAAAAUCAACAGAUGAAUAAAUGGUUCUUUAUAAGUG AGGCCAGCCAGCUUGGGAGCAGCAGAGAAAUAAACAGCAUUUCUGAUGCC ACCGGGGAAGCCGUCAGCUGCUGUGACAAUAAAACCUGCCCCGUGUCUGG UUGAAAUUGAAAUGCUUUAUCUGUGUUUUCUGUAAAUAAAAGAGUGCAAU GGUCUGCUCCCCACCCCUGCCUCGGAAGAAUAAAGAGAAUGUAGUUCCCU UGUUAAGUAGUUGUUUUAAAAUACUUAAUAAAAUAAUUCUUUUCCUGUGG GAAAUGUAAAGAUUACUUGAGGUGUUUAAAAUAAAUUUUUCAUUCAGACU ACUACCAUCUCUCUCUUAAAACGAGAUCAGGUUAGCAAAUGAUGUAAAAG UUUAACCGUAUGUAAACUUGGUUUCUAAUUAAAAAAAAAUUUCUUUUUCC GUUAUUUUGUACUUGUCUUAAUACACUAAGUGUAAUAAAAACGGCUUGAG UUUUUUUCUUUUUUGCUACUGCAAACGAUGCUAUAAUAAAUGUCCUUAUC UCUAUUUUUUCUCUCCUUUUCUUUUCUUCAAUAAAAAGAAUUAAAAACCC GCUGGGGAGGGGGGAGGGGAACUUUGUUGGGAAUAAACUUCACUCUGUGG UGCAUCUUCCAAAGCUAUUUCGAAAUAAACACGAAAAUUUACAGUUUGCC AGAUUAUUUGUGAUCCCAUCCAUUCCCCAAUAAAGCAAGGCUUGUCCGAC Example 3: Validation of grammatical inference methods. A well-suited grammatical inference method is due to infer the same model, given the original set of biological sequences, or given a set of random sequences designed according to the model. Example 1: Take into account a known strong signal in order to study weaker signals. (Here : polyadenylation patterns in human genome. Source : Beaudoing & al. 2000.) Example 2: Thresholds for comparison of RNA secondary structures. As for sequences, scores of comparison of random secondary structures may help to define thresholds for homology and non-homology of RNAs. (Source of image : Dulucq & Tichit, 2001.) Context-free Languages and Non-uniform Random Generation Uniform random generation of words of a context-free language (see e.g. Flajolet & al. 1994). Given the desired length n and an unambiguous grammar of the language, start from the axiom S and, at each step, choose a rule with a suitable probability, so that all words of length n are uniformly distributed. (In the above example, any word of length n=4 occurs with probability 1/9.) Worst-case complexity: O(n logn) Encoding structural features by context-free languages. In the above example, secondary structures are encoded by words of a (very simple) context-free language. In this language, a’s and b’s code for paired bases, as c’s code for unpaired ones. The corresponding context-free grammar allows to generate words of the language. A weighted context-free grammar based model for RNA GenRGenS A general grammar for RNA. We use a hierarchical decomposition inspired by Waterman & al. 78. Different sub-structures (Loop, Hairpin, Ladder …) are generated using different vocabularies. Weighted generation of RNA structures. Assigning a weight to a symbol allows to control its expected number of occurrences. Our grammar contains special symbols, called markers. Each marker occurs only once per occurrence of the corresponding sub-structure (Loop, Hairpin, Ladder …). Therefore, assigning weights to these markers in addition to regular terminal symbols constrains the expected number and average length of the sub-structures. GenRGenS is a toolbox dedicated to random generation of genomic sequences. GenRGenS uses a cross-platform technology (Java) and is freely distributed under the terms of the Gnu Public Licence at: http://www.lri.fr/~denise/GenRGenS/ Weighted Random Generation Uniform Random Generation GenRGenS provides support for various random models, including homogeneous and heterogeneous Markovian models, regular and PROSITE patterns (soon available) and weighted context-free grammars (CFG). Drawings Thanks to RNAViz Uniformly generated structures lack of biological relevance !

Random Generation of Structured Genomic Sequences

Random Generation of Structured Genomic Sequences

Presentation Transcript

Random Number Generation

Random Number Generation

Random Number Generation

Spatially structured random effects

Algorithms for Alignment of Genomic Sequences

Random Number Generation

Finding Promoters other important genomic sequences

Random-Number Generation

Alignment of large genomic sequences

Random- Variate Generation

Random Variate Generation

Alignment of Genomic Sequences

Random-Number Generation

Locus Reference Genomic (LRG) Sequences

Random-Number Generation

Random-Number Generation

Inferring Genomic Sequences

High Performance Direct Pairwise Comparison of Genomic Sequences

Random-Variate Generation

Random Number Generation

Random Number Generation

Alignment of large genomic sequences