
High-Performance XML Parsing and Validation with Permutation Phrase Grammar Parsers


Presentation Transcript


  1. High-Performance XML Parsing and Validation with Permutation Phrase Grammar Parsers
     Wei Zhang & Robert van Engelen
     Department of Computer Science, Florida State University
     IEEE ICWS 2008

  2. Presentation Overview
     • Schema-specific Parsers
     • Related Work
     • PTDX: Table-Driven XML Parser with Permutation Phrase Grammar
     • Performance
     • Conclusion

  3. Schema-specific Parsers
     • Compile-time vs. run-time parsers
        • Compile-time parsing and validation approaches use specialized compilation techniques to generate customized parsers from schemas
        • Run-time approaches use generic drivers (or engines) and a grammar-like representation of schemas
     • Blocking vs. non-blocking parsers
        • Blocking parsers may suspend the entire program until sufficient XML content has been received, e.g., recursive-descent parsers
        • Non-blocking parsers always return control to the program, and buffered data can be supplied incrementally
     • Time-efficient vs. space-efficient parsers
        • Time-efficient parsers encode many states
        • Space-efficient parsers rely on backtracking

  4. Related Work
     • [Van Engelen, 2001]
        • The earliest work on a schema-specific LL(1) recursive-descent parser with namespace support and validation
     • [Van Engelen, 2004]
        • Two-level DFA integrating parsing and validation
     • [Chiu et al., 2004]
        • Uses nondeterministic generalized automata to merge all aspects of low-level parsing and validation
     • [Reuter, 2003]
        • Uses a Cardinality-Constraint Automaton (CCA) to perform schema-aware validation

  5. Related Work (Cont’d)
     • [Kostoulas et al., 2006]
        • An efficient parser generator that translates an XML schema into a parser in either C or Java
     • [Matsa, 2007]
        • Schema-directed interpretive XML parser using special-purpose byte codes
     • [Zhang et al., 2006]
        • A table-driven approach that parses and validates in a single pass
        • A generator that translates schemas into C

  6. PTDX: Table-Driven XML Parser with Permutation Phrase
     • Table-driven grammar-based parser
        • Extended LL(1) grammar with permutation phrase support
        • Parsing table is constructed from the extended LL(1) permutation grammar
     • Run-time parser
        • Generic parsing engine (2-stack PDA)
     • Both time and space efficient
        • Predictive parsing
        • Integrates parsing and validation into a single pass
        • No buffering
        • Operates on tokens
        • Main stack size grows with the depth of the XML data
        • Auxiliary stack size grows with the number of <xs:all> member elements and <xs:attribute> declarations
     • Non-blocking parser
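The idea of a generic engine driven by a parsing table can be sketched as follows. This is a plain LL(1) predictive parser with a single stack; it does not include the permutation phrase extension or the auxiliary stack, and the function name `ll1_parse`, the toy grammar, and the table layout are illustrative assumptions rather than PTDX's actual tables.

```python
def ll1_parse(tokens, table, start, terminals):
    """Generic table-driven LL(1) recognizer.

    table maps (nonterminal, lookahead) to the production body (a list of symbols).
    """
    stack = ["$", start]            # "$" marks the bottom of the stack
    stream = list(tokens) + ["$"]   # "$" marks the end of the input
    pos = 0
    while stack:
        top = stack.pop()
        look = stream[pos]
        if top in terminals or top == "$":
            if top != look:         # terminal mismatch: reject
                return False
            pos += 1
        else:
            body = table.get((top, look))
            if body is None:        # no table entry: reject
                return False
            stack.extend(reversed(body))  # leftmost symbol ends up on top
    return pos == len(stream)


# Hypothetical grammar: S -> A B ; A -> bA CD eA ; B -> bB CD eB | ε
TERMINALS = {"bA", "CD", "eA", "bB", "eB"}
TABLE = {
    ("S", "bA"): ["A", "B"],
    ("A", "bA"): ["bA", "CD", "eA"],
    ("B", "bB"): ["bB", "CD", "eB"],
    ("B", "$"): [],                 # B is optional
}

print(ll1_parse(["bA", "CD", "eA", "bB", "CD", "eB"], TABLE, "S", TERMINALS))
```

Because the engine never inspects the grammar directly, swapping in a different table changes the accepted language without touching the parsing loop.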

  7. Constructing PTDX Tables
     [Diagram; labels: XML Schemas, Mapping Rules, Extended LL(1) Permutation Phrase Grammar, LL(1) Parsing Table, Token Table, Action Table]
     Note: Actions are generated from schemas to perform type-checking verification, although some validation constraints are incorporated in the grammar productions.

  8. Mapping Rules
     • Define the translation from schema components to LL(1) grammar productions
     • Preserve structural constraints
     • Map free-ordered schema components (<xs:all>, <xs:attribute>) to permutation grammar

  9. Mapping Example

     <complexType name="T">
       <all>
         <element name="a" type="string" minOccurs="0"/>
         <element name="b" type="string"/>
         <element name="c" type="string"/>
       </all>
     </complexType>

     T → << A || B || C >>
     A → bA CD eA
     A → ε
     B → bB CD eB
     C → bC CD eC

     Note: bA and eA represent the tokens for the start tag and end tag of element "a", respectively; CD represents the CDATA token.
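The mapping in this example can be reproduced with a small script. This is a minimal sketch under stated assumptions: it handles only a single `<complexType>` containing an `<all>` group of string elements, it uses unprefixed schema tags as on the slide, and the function name `map_all_to_productions` is hypothetical. It is not the generator described in the talk.

```python
import xml.etree.ElementTree as ET

SCHEMA = """
<complexType name="T">
  <all>
    <element name="a" type="string" minOccurs="0"/>
    <element name="b" type="string"/>
    <element name="c" type="string"/>
  </all>
</complexType>
"""


def map_all_to_productions(schema_text):
    """Translate one <complexType> holding an <all> group into productions."""
    ctype = ET.fromstring(schema_text)
    nonterminals = []
    productions = []
    for elem in ctype.find("all").findall("element"):
        nt = elem.get("name").upper()
        nonterminals.append(nt)
        # b<NT> and e<NT> are the start-tag and end-tag tokens; CD is character data.
        productions.append(f"{nt} -> b{nt} CD e{nt}")
        if elem.get("minOccurs") == "0":
            # An optional element also derives the empty string.
            productions.append(f"{nt} -> ε")
    head = f"{ctype.get('name')} -> << {' || '.join(nonterminals)} >>"
    return [head] + productions


for production in map_all_to_productions(SCHEMA):
    print(production)
```

The `minOccurs="0"` attribute is the only thing that produces an ε-production here, which mirrors how the optional element "a" is treated in the example.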

  10. Permutation Phrase
      A permutation phrase is a grammatical phrase that specifies a syntactic construct as any permutation of a set of constituent elements.
      E.g., the permutation phrase << a || b || c >> recognizes the language {abc, acb, bac, bca, cab, cba}
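The language of a permutation phrase can be enumerated directly, which also shows why spelling it out as ordinary alternatives does not scale: n constituents give n! orderings. The helper name `permutation_language` is an illustrative assumption.

```python
from itertools import permutations


def permutation_language(constituents):
    """Return every string recognized by << c1 || c2 || ... || cn >>."""
    return {"".join(order) for order in permutations(constituents)}


print(sorted(permutation_language(["a", "b", "c"])))
# ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
print(len(permutation_language(list("abcdefgh"))))
# 40320
```

With eight constituents, an equivalent grammar written without permutation phrases would already need 40320 alternatives.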

  11. Two-stack PDA for Parsing Permutation Phrase << a || b || c >>
      [Diagram: steps 1 to 3, each showing the input "b c a" together with the contents of the main stack and the auxiliary stack]

  12. Two-stack PDA for Parsing Permutation Phrase << a || b || c >> (Cont’d)
      [Diagram: steps 4 to 6, each showing the input "b c a" together with the contents of the main stack and the auxiliary stack]
      Note: All optional constituent elements are left on the auxiliary stack once all non-empty elements have been parsed.
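One way to read the two-stack scheme is sketched below: the auxiliary stack holds the constituents that have not been seen yet, and the lookahead token selects which of them is moved onto the main stack for ordinary predictive matching. This is an interpretation of the slides, not the authors' exact algorithm; the function name `parse_permutation`, the token names, and the restriction to one non-nested phrase with one production per constituent are all assumptions.

```python
def parse_permutation(tokens, constituents, optional=()):
    """Recognize << A || B || ... >> with a main stack and an auxiliary stack.

    constituents maps each nonterminal to its production body (a list of tokens).
    optional lists the nonterminals that may be absent from the input.
    """
    main = []                    # symbols still to be matched, top at the end
    aux = list(constituents)     # constituents of the phrase not yet parsed
    pos = 0
    while True:
        if main:
            expected = main.pop()
            if pos >= len(tokens) or tokens[pos] != expected:
                return False     # terminal mismatch
            pos += 1
            continue
        look = tokens[pos] if pos < len(tokens) else None
        # Pick the pending constituent whose first token matches the lookahead.
        chosen = next((nt for nt in aux if constituents[nt][0] == look), None)
        if chosen is None:
            break                # nothing pending can start here
        aux.remove(chosen)
        main.extend(reversed(constituents[chosen]))
    # Accept only if all input is consumed and what remains on aux is optional.
    return pos == len(tokens) and set(aux) <= set(optional)


PHRASE = {
    "A": ["bA", "CD", "eA"],
    "B": ["bB", "CD", "eB"],
    "C": ["bC", "CD", "eC"],
}
b, c, a = PHRASE["B"], PHRASE["C"], PHRASE["A"]
print(parse_permutation(b + c + a, PHRASE, optional={"A"}))
```

In this sketch a repeated element is rejected because its constituent has already left the auxiliary stack, and a missing required element is rejected because it is still on the auxiliary stack at the end.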

  13. PTDX Architecture
      [Diagram; label: Hot-swappable]

  14. Schema-directed Scanner
      • Optimized by schema
         • E.g., scanning for a specific tag name is more efficient than scanning a generic string and then doing a comparison
      • Tokenizer
         • Breaks the XML message into a token stream
      • Token
         • Defined by element names, attribute names, enumeration values
         • Classified as starting tags and closing tags
         • Normalized namespace binding
            • <namespace, tag_name>
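A scanner specialized to the schema's tag names can be sketched with a compiled pattern that only knows the declared elements. This is a simplified illustration: attributes, namespaces, entities, and enumeration values are ignored, and the names `build_scanner` and `scan` as well as the token spelling (`bA`, `eA`, `CD`) are assumptions chosen to match the grammar notation used earlier in the talk.

```python
import re


def build_scanner(element_names):
    """Build a tokenizer that recognizes only the tag names declared in the schema."""
    names = "|".join(re.escape(name) for name in sorted(element_names))
    pattern = re.compile(rf"<(?P<close>/?)(?P<name>{names})>|(?P<text>[^<]+)")

    def scan(message):
        tokens = []
        pos = 0
        while pos < len(message):
            match = pattern.match(message, pos)
            if match is None:
                # An undeclared tag is rejected during scanning.
                raise ValueError(f"unexpected input at offset {pos}")
            if match.group("text") is not None:
                if match.group("text").strip():   # skip whitespace-only text
                    tokens.append("CD")
            else:
                prefix = "e" if match.group("close") else "b"
                tokens.append(prefix + match.group("name").upper())
            pos = match.end()
        return tokens

    return scan


scan = build_scanner({"a", "b", "c"})
print(scan("<b>x</b><c>y</c><a>z</a>"))
# ['bB', 'CD', 'eB', 'bC', 'CD', 'eC', 'bA', 'CD', 'eA']
```

Because the set of tag names is fixed when the scanner is built, an unknown tag fails immediately instead of being read as a generic string and compared afterwards.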

  15. Experiment Settings
      • Test environment
         • 3.0 GHz, 2 GB RAM, Linux 2.6.20-1.2320, GCC 4.1.1 with option -O2
         • Memory-resident message
         • Randomly arranged free-ordered elements
      • Compared with
         • Validating parsers
            • gSOAP 2.7
            • Xerces 2.7.0
            • PTDX flex-based parser
         • Non-validating parsers
            • Expat 2.0.1
            • DFA-based parser

  16. Test Cases

  17. Performance: comparison of validating and non-validating parsers
      [Chart; annotation: "Better performance"]

  18. Performance: effect of the number of elements in <xs:all> on the PTDX parser
      [Chart; annotation: "Better performance"]

  19. Performance: run-time and compile-time memory usage comparison (32 <xs:all> elements)

  20. Conclusion
      • Free-ordered constraints can be parsed and validated efficiently using a 2-stack PDA
      • The table-driven permutation phrase grammar parsing technique is time and space optimal
      • The table-driven approach offers a flexible framework for dealing with schema evolution
