
Speeding-up Parsing of Biological Context-Free Grammars

D.F. Fredouille, C.H. Bryant. School of Computing, The Robert Gordon University, Aberdeen, UK.


Presentation Transcript


  1. Speeding-up Parsing of Biological Context-Free Grammars
     D.F. Fredouille, C.H. Bryant
     School of Computing, The Robert Gordon University, Aberdeen, UK

  2. Definitions: Grammars
     • Alphabet: {a,c,g,t}
     • Sequence: actttgtcgtaaatgg
     • Language: {actttgtcgtaaatgg, agtaactttgtcg, ctttgtatgccaag, ... }
     • Context-Free Grammar (CFG): rewriting rules that represent a language
       { Start → Gap c t t t g t Gap, Gap → ε | X Gap, X → a | c | g | t }
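The example grammar on this slide can be tried out with a few lines of Python. The following is a minimal sketch of a depth-first, top-down recogniser, assuming a dictionary encoding of the rules; the names `GRAMMAR`, `derive` and `accepts` are illustrative and do not come from the presentation or its software.

```python
# Rules from the slide. An empty list stands for ε.
GRAMMAR = {
    "Start": [["Gap", "c", "t", "t", "t", "g", "t", "Gap"]],
    "Gap": [[], ["X", "Gap"]],
    "X": [["a"], ["c"], ["g"], ["t"]],
}


def derive(symbols, seq, pos):
    """Yield every end position reachable by matching `symbols` from `pos`."""
    if not symbols:
        yield pos
        return
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        # Non-terminal: try each alternative in order, backtracking on failure.
        for body in GRAMMAR[head]:
            yield from derive(body + rest, seq, pos)
    elif pos < len(seq) and seq[pos] == head:
        # Terminal: consume one letter.
        yield from derive(rest, seq, pos + 1)


def accepts(seq):
    """True if the whole sequence can be derived from Start."""
    return any(end == len(seq) for end in derive(["Start"], seq, 0))
```

Under this encoding, `accepts` should return `True` for the three sequences listed in the language above, since each contains `ctttgt`, and `False` for a sequence such as `acgtacgt` that does not.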

  3. Definitions: Bio. sequences
     • DNA: alphabet {a,c,t,g}, example: agtaactttgtcg
     • RNA: alphabet {a,c,u,g}, example: aguaacuuugucg
     • Protein: alphabet of 20 letters, example: pvypgdnaadssiekqvallk

  4. Motivation for Fast Parsing
     • Grammars are widely used as models for biological sequences:
       • Prosite motifs, SVG, BGG, …
     • Need fast parsers for molecular biology:
       • Many sequences to parse when searching for novel members of a biological family
       • Parsing in many grammars when annotating newly discovered sequences with their family

  5. Parsing in biological CFGs
     • Basically, two parsing algorithms for CFGs:
       • Depth-first, top-down parsing (DFTDP)
       • Chart parsing (CP)
       Many others exist but are restricted to subsets of CFGs.
     • Should we use DFTDP or CP?
     • Can we improve efficiency when dealing with biological grammars?

  6. Outline of Our Work
     • Preliminary experiments showed that parsing speed in biological CFGs is strongly dependent on gap rules.
     • Theoretical complexity study of the algorithms with respect to gap rules
     • Improved the algorithms’ management of gaps
     • Empirical comparison of the algorithms on biological sequences and grammars (which naturally contain gaps)

  7. Definitions: Gap rules
     • An unlimited gap is a non-terminal which can match any sequence.
       • Right-rec.: { GapR → ε, GapR → X GapR }
       • Left-rec.: { GapL → ε, GapL → GapL X }
     • A limited gap is a non-terminal which can match any sequence s with lo ≤ |s| ≤ up.
       • Form1: { Gap1 → X^lo Xe^(up-lo), Xe → X, Xe → ε }
       • Form2: { Gap2 → X^i : lo ≤ i ≤ up }
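The two forms of limited gap describe the same set of lengths with different rules. A minimal sketch that builds both rule sets and checks which lengths they derive is given below; the function names `gap_form1`, `gap_form2` and `lengths` are hypothetical, and `X` is treated as matching exactly one letter.

```python
def gap_form1(lo, up):
    """Gap1 → X^lo Xe^(up-lo), with Xe → X | ε."""
    return {"Gap1": [["X"] * lo + ["Xe"] * (up - lo)], "Xe": [["X"], []]}


def gap_form2(lo, up):
    """Gap2 → X^i, one alternative for each lo ≤ i ≤ up."""
    return {"Gap2": [["X"] * i for i in range(lo, up + 1)]}


def lengths(rules, start):
    """Set of string lengths derivable from `start` (rules must be non-recursive)."""
    def expand(symbol):
        if symbol == "X":
            return {1}  # X matches exactly one letter
        out = set()
        for body in rules[symbol]:
            totals = {0}
            for sym in body:
                totals = {t + n for t in totals for n in expand(sym)}
            out |= totals
        return out
    return expand(start)
```

For example, `lengths(gap_form1(2, 5), "Gap1")` and `lengths(gap_form2(2, 5), "Gap2")` both give `{2, 3, 4, 5}`: the forms are equivalent as languages and differ only in how a parser has to explore them.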

  8. Theoretical comparison
     • DFTDP
       • Unlimited gaps: GapL cannot be parsed with DFTDP; GapR: O(|s|)
       • Limited gaps: Form1: O(2^(up-lo)); Form2: O((up-lo)^2)
     • CP
       • Unlimited gaps: GapL: O(|s|^2), but O(|s|) under some reasonable hypotheses; GapR: O(|s|^2)
       • Limited gaps: Form1 > Form2 + O(up-lo)
     • Optimisations: O(|s|) and O(up-lo)
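One way to get a feel for the exponential term for Form1 is to count alternative derivations. This is only an illustrative sketch, not the authors' complexity analysis: it assumes that each of the (up-lo) `Xe` symbols independently rewrites to `X` or ε, so a gap of length lo+k can be derived in C(up-lo, k) ways, whereas Form2 has exactly one alternative per length. The function names are hypothetical.

```python
from math import comb


def form1_derivations(lo, up):
    """Number of distinct Form1 derivations for each gap length lo..up."""
    return {lo + k: comb(up - lo, k) for k in range(up - lo + 1)}


def form2_derivations(lo, up):
    """Form2 has a single alternative X^i for each gap length i."""
    return {i: 1 for i in range(lo, up + 1)}
```

Summing the counts for lo = 2 and up = 12 gives 2^10 = 1024 derivations for Form1 against 11 for Form2, which suggests why a backtracking parser may do far more work on Form1 as up-lo grows.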

  9. Empirical Comparison
     • Protein grammars and sequences
       • Sequences from the Uniprot database
       • Grammars from the Prosite database (“simple” grammars)
       • Motivation: largest DBs of protein sequences and hand-validated protein grammars
     • DNA grammars and sequences
       • Grammars from UTRsite (untranslated regions of RNA)
       • Sequences from the UTRdb
       • Motivation: one of the rare places where many DNA grammars are available

  10. Empirical Comparison - Proteins
     [Figure: parsing time in seconds against length of the parsed string. Curves: CP L+F1, CP L+F2, DFTD R+F1, DFTD R+F2, CP Opt., DFTD Opt.]
     • Unlimited gap: L = left recursive, R = right recursive
     • Limited gap: F1 = form1, F2 = form2

  11. Empirical Comparison - Proteins
     [Figure: parsing time in seconds against length of the parsed string. Curves: CP L+F1, CP L+F2, DFTD R+F1, DFTD R+F2, CP Opt., DFTD Opt. Label: Optimised]
     • Conclusion 1: If you program a parsing algorithm, creating special treatments for gaps can speed up parsing.
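What a "special treatment" for gaps might look like can be sketched as follows. This is an assumption for illustration only and is not claimed to be the optimisation used by the authors: instead of expanding gap rules symbol by symbol, the matcher treats a gap as a range of positions to skip. The pattern encoding (strings for literal blocks, `(lo, up)` tuples for gaps, `up=None` for an unlimited gap) and the name `match_from` are hypothetical.

```python
def match_from(pattern, seq, pos=0):
    """True if `pattern` matches seq[pos:] exactly."""
    if not pattern:
        return pos == len(seq)
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str):
        # Literal block: must appear at the current position.
        return seq.startswith(head, pos) and match_from(rest, seq, pos + len(head))
    # Gap: skip between lo and up letters (up=None means unlimited).
    lo, up = head
    remaining = len(seq) - pos
    hi = remaining if up is None else min(up, remaining)
    return any(match_from(rest, seq, pos + k) for k in range(lo, hi + 1))
```

With this encoding, `match_from([(0, None), "ctttgt", (0, None)], "actttgtcgtaaatgg")` plays the role of the Start rule from slide 2. Each gap is handled by trying at most up-lo+1 skip lengths (or at most |s|+1 for an unlimited gap) rather than by deriving every letter through grammar rules.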

  12. Empirical Comparison - Proteins
     [Figure: parsing time in seconds against length of the parsed string. Curves: CP L+F1, CP L+F2, DFTD R+F1, DFTD R+F2, CP Opt., DFTD Opt.]
     • Fact: Some curves are not plotted due to very large running times: CP R+F1 and CP R+F2; DFTDP R+F1 for one grammar.

  13. Empirical Comparison - Proteins
     [Figure: parsing time in seconds against length of the parsed string. Curves: CP L+F1, CP L+F2, DFTD R+F1, DFTD R+F2, CP Opt., DFTD Opt.]
     • Conclusion 2: When using classical CFG parsing algorithms, design the gap rules carefully.

  14. Empirical Comparison - DNA
     • Note: Missing the “0.” in Table 2
     • L = unlimited gap, left recursive
     • R = unlimited gap, right recursive
     • F1 = limited gap, form1
     • F2 = limited gap, form2
     • Conclusion 3: CP is significantly faster than DFTDP when grammars start to be “complex”.

  15. Conclusions
     Theoretical and empirical studies show that:
     • If you program a parsing algorithm, creating special treatments for gaps can speed up parsing.
     • When using classical CFG parsing algorithms, design the gap rules carefully.
     • DFTDP is faster for “simple” grammars, but CP is significantly faster when grammars start to be “complex”.

  16. Acknowledgements
     • Funding: EPSRC
     • Industrial Collaborator: GlaxoSmithKline
       • Simon Topp
       • Stephen Jupe
     Software and experiments material: http://www.comp.rgu.ac.uk/staff/chb/research/data_sets/cpm05
