html5-img
1 / 1

Random Generation of Structured Genomic Sequences

5/9. 4/9. c S. a S b S. 2/5. 1/5. 1/2. 1/2. 2/5. aa S b S b S. a c S b S. ab S. c a S b S. cc S. 1/2. 1/2. 1/2. 1/2. 1/2. 1/2. 1/2. 1/2. a cc S b S. a c b S. aba S b S. ab c S. c a c S b S. c ab S. cc a S b S. ccc S. aabb. a cc b. a c b c. abab. ab cc. c a c b.

otylia
Download Presentation

Random Generation of Structured Genomic Sequences

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 5/9 4/9 cS aSbS 2/5 1/5 1/2 1/2 2/5 aaSbSbS acSbS abS caSbS ccS 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 accSbS acbS abaSbS abcS cacSbS cabS ccaSbS cccS aabb accb acbc abab abcc cacb cabc ccab cccc Biological sequences UUUUCUGUAAAUAAAAGAGUGCAAUAU CUGCCUCGGAAGAUUAAAGAGAAUGUA GUUUUAAAAUACUUAAUAAAAUAAUUC GUGUUUAAAAUAAAUUUCAUUCAGACU Model A,C,G,U A,C,G,U Grammatical inference A,U A U A A A Random generation Grammatical inference Random sequences GAAUAAAUAAAAGCUUUTGAGAGCUAA AUGCCAGGAAGCAAUAAAGGACAAUAA AAAAUUUCACUUAAUAAAAUAGUUAUU UAAAAAUAAAUGGUUGUCAAUUCUGCG Grammar: S aSbS | c S |  Non-uniform “controlled” random generation. We aim to model statistical parameters as well as structural ones, by adjusting the expected number of occurrences of each letter in the generated words. For this purpose, we give a weight (x) to each letter x, and define the weight of a word as the product of the weights of its letters. A slight modification of the generation algorithm allows to generate any word with a probability proportional to its weight. Given the weighted generating series of the language: the expected frequency of any letter xi is given by : where and The generation is uniform in the particular case where all weights are assigned to 1. With our grammar, this leads to a = b =c =1/3, when n is large. Assigning weight 1/2 to letter c gives a = b= 2/5 and c =1/5, leading to more paired nucleotides in the corresponding secondary structures. Length: n = 4 S c a a a a a c c c b b b a c c a a c c c c b b c c b c c b b S aSbS | c S |  Random Generation of Structured Genomic Sequences Objectives and Applications Take into account structural features, as well as statistical ones, in models of random sequences. UCCAACAUCACAGCCCAGCCCACCCACUGGGUAAUAAAAGUGGUUUGUGG AUCACUGUAAUUAUUAUUAUUUUCUACAAUAAAUGGGACCUGUGCACAGG GACCCAGAUGGGAUGUUCGGAUCGGUUUGUAAUUAAACCUGGGAAUGGCC GGAUACACAAAUAAGUCAGUUAAAAUACAUAAAUAAAAACAUAAACCUGC CAGGAGGGGAACGUGGUAAAACCCAAGACAUUAAAUCUGCCAUCUCAGGC UUUUUUGUUUCAGUACCAGAGGCACUGACUUCAAUAAAGUUUAUUUAUAC GAACUCUGCCCUGCCUGGGACUCUAUUUAUUCUGAUUAAAGGGGUUUUGC GAAAAUAAUGAAUGAAAAUCAACAGAUGAAUAAAUGGUUCUUUAUAAGUG AGGCCAGCCAGCUUGGGAGCAGCAGAGAAAUAAACAGCAUUUCUGAUGCC ACCGGGGAAGCCGUCAGCUGCUGUGACAAUAAAACCUGCCCCGUGUCUGG UUGAAAUUGAAAUGCUUUAUCUGUGUUUUCUGUAAAUAAAAGAGUGCAAU GGUCUGCUCCCCACCCCUGCCUCGGAAGAAUAAAGAGAAUGUAGUUCCCU UGUUAAGUAGUUGUUUUAAAAUACUUAAUAAAAUAAUUCUUUUCCUGUGG GAAAUGUAAAGAUUACUUGAGGUGUUUAAAAUAAAUUUUUCAUUCAGACU ACUACCAUCUCUCUCUUAAAACGAGAUCAGGUUAGCAAAUGAUGUAAAAG UUUAACCGUAUGUAAACUUGGUUUCUAAUUAAAAAAAAAUUUCUUUUUCC GUUAUUUUGUACUUGUCUUAAUACACUAAGUGUAAUAAAAACGGCUUGAG UUUUUUUCUUUUUUGCUACUGCAAACGAUGCUAUAAUAAAUGUCCUUAUC UCUAUUUUUUCUCUCCUUUUCUUUUCUUCAAUAAAAAGAAUUAAAAACCC GCUGGGGAGGGGGGAGGGGAACUUUGUUGGGAAUAAACUUCACUCUGUGG UGCAUCUUCCAAAGCUAUUUCGAAAUAAACACGAAAAUUUACAGUUUGCC AGAUUAUUUGUGAUCCCAUCCAUUCCCCAAUAAAGCAAGGCUUGUCCGAC Example 3: Validation of grammatical inference methods. A well-suited grammatical inference method is due to infer the same model, given the original set of biological sequences, or given a set of random sequences designed according to the model. Example 1: Take into account a known strong signal in order to study weaker signals. (Here : polyadenylation patterns in human genome. Source : Beaudoing & al. 2000.) Example 2: Thresholds for comparison of RNA secondary structures. As for sequences, scores of comparison of random secondary structures may help to define thresholds for homology and non-homology of RNAs. (Source of image : Dulucq & Tichit, 2001.) Context-free Languages and Non-uniform Random Generation Uniform random generation of words of a context-free language (see e.g. Flajolet & al. 1994). Given the desired length n and an unambiguous grammar of the language, start from the axiom S and, at each step, choose a rule with a suitable probability, so that all words of length n are uniformly distributed. (In the above example, any word of length n=4 occurs with probability 1/9.) Worst-case complexity: O(n logn) Encoding structural features by context-free languages. In the above example, secondary structures are encoded by words of a (very simple) context-free language. In this language, a’s and b’s code for paired bases, as c’s code for unpaired ones. The corresponding context-free grammar allows to generate words of the language. A weighted context-free grammar based model for RNA GenRGenS A general grammar for RNA. We use a hierarchical decomposition inspired by Waterman & al. 78. Different sub-structures (Loop, Hairpin, Ladder …) are generated using different vocabularies. Weighted generation of RNA structures. Assigning a weight to a symbol allows to control its expected number of occurrences. Our grammar contains special symbols, called markers. Each marker occurs only once per occurrence of the corresponding sub-structure (Loop, Hairpin, Ladder …). Therefore, assigning weights to these markers in addition to regular terminal symbols constrains the expected number and average length of the sub-structures. GenRGenS is a toolbox dedicated to random generation of genomic sequences. GenRGenS uses a cross-platform technology (Java) and is freely distributed under the terms of the Gnu Public Licence at: http://www.lri.fr/~denise/GenRGenS/ Weighted Random Generation Uniform Random Generation GenRGenS provides support for various random models, including homogeneous and heterogeneous Markovian models, regular and PROSITE patterns (soon available) and weighted context-free grammars (CFG). Drawings Thanks to RNAViz Uniformly generated structures lack of biological relevance !

More Related