XML Data Compression. Greg Leighton, Jim Diamond, Tomasz Müldner. February 18, 2005. Overview: a (brief) introduction to data compression; XML lossless data compression; new XML compression programs (AXECHOP, TREECHOP).


XML Data Compression

Greg Leighton, Jim Diamond, Tomasz Müldner

February 18, 2005

Overview
  • A (brief) introduction to data compression
  • XML lossless data compression
  • New XML Compression Programs
    • AXECHOP
    • TREECHOP
XML Data Compression
  • A (brief) introduction to XML
  • Techniques for achieving XML compression
    • Traditional approaches – Huffman, LZ
    • Specialized approaches
  • XML Compression Programs
    • XMill
    • XGrind
    • XPRESS
eXtensible Markup Language
  • separate syntax from semantics
  • support semi-structured data
  • support internationalization and platform independence
  • is self-describing (through labeling of the tree)
eXtensible Markup Language : 2

XML is a framework for defining markup languages:

  • no fixed collection of markup tags
  • each XML language is specialized for its own application domain
  • a common set of generic tools supports processing documents

XML: textual convention to represent tagged trees

eXtensible Markup Language : 3

<?xml version="1.0" encoding="UTF-8"?>
<Employees>
  <Employee id="123456">
    <Name>Homer Simpson</Name>
    <Department>Sector 7-G</Department>
  </Employee>
  <Employee id="123457">
    <Name>Frank Grimes</Name>
    <Department>Sector 7-G</Department>
  </Employee>
</Employees>

Labelled on the slide: Element, Attribute, Data Value

eXtensible Markup Language : 4
  • Correctness of an XML document:
    • Well-formed: complies with XML syntax
    • Valid: well-formed and obeys the structure described in a grammar, such as an XML Schema document
  • Two kinds of XML parsers:
    • SAX
    • DOM
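
The difference between the two parser styles can be illustrated with Python's standard library; this is only a sketch, and the helper names `sax_names` and `dom_names` are made up for the illustration. SAX pushes events to a handler while it streams through the document, whereas DOM builds the whole tree in memory before it is queried.

```python
import xml.sax
from xml.dom import minidom

DOC = ('<Employees>'
       '<Employee id="123456"><Name>Homer Simpson</Name></Employee>'
       '<Employee id="123457"><Name>Frank Grimes</Name></Employee>'
       '</Employees>')


class NameCollector(xml.sax.ContentHandler):
    """SAX: react to start/characters/end events as they stream past."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def startElement(self, name, attrs):
        self.in_name = (name == "Name")
        if self.in_name:
            self.names.append("")

    def characters(self, content):
        if self.in_name:
            self.names[-1] += content  # text may arrive in several chunks

    def endElement(self, name):
        self.in_name = False


def sax_names(doc):
    handler = NameCollector()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.names


def dom_names(doc):
    """DOM: build the full tree first, then navigate it."""
    tree = minidom.parseString(doc)
    return [node.firstChild.data for node in tree.getElementsByTagName("Name")]
```

Both helpers return the same list of names; the SAX version never holds more than the current event in memory, which is why the streaming compressors later in the talk are built on SAX.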
Why Compress XML?

XML is verbose:

  • Each non-empty element tag must end with a matching closing tag -- <tag>data</tag>
  • Ordering of tags is often repeated in a document (e.g. multiple records)
  • Tag names are often long
XML Compressors
  • View XML as a tree
  • Separate the tree structure and what is stored in leaves
  • Save the tree structure so that it can be restored
  • The compressed file may or may not remember the tree structure

XMill: Liefke and Suciu
  • Tree structure:
    • Start tags and attribute names are dictionary-encoded (as T1, T2, etc.)
    • End tags replaced with ‘/’ token
    • Data values are replaced with their container number

Example document:

<Book><Title lang="English">Views</Title>
<Author>Miller</Author>
<Author>Tai</Author>
</Book>

Tag dictionary: Book = T1, Title = T2, @lang = T3, Author = T4
Data containers: C3 = English; C4 = Views; C5 = Miller, Tai

Encoded structure:

T1 T2 T3 C3 / C4 / T4 C5 / T4 C5 / /
XMill: Liefke and Suciu
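  • Data values are grouped into containers: there is a separate data container for each element tag or attribute
  • Different semantic compressors can be used for different containers, with gzip as the general-purpose compressor
  • The compressed file does not remember the original structure (structure and data are stored separately)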
XMill: Decompression
  • The decompressor loads and unzips each container, then parses the decompressed structure container
  • Whenever a data value token is found in the structure, the next value is pulled from the corresponding data container and the appropriate semantic decompressor (if applicable) is applied to recover the original data value
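
The separation of structure from data, and the way decompression pulls values back out of the containers, can be sketched as follows. This is not XMill's actual file format: the function names, the `C:<name>` container tokens, and the choice to key containers by tag name are assumptions for illustration, and mixed content (text after a child element) is ignored.

```python
import gzip
import xml.etree.ElementTree as ET

DOC = ('<Book><Title lang="English">Views</Title>'
       '<Author>Miller</Author><Author>Tai</Author></Book>')


def xmill_split(xml_text):
    """Split a document into a tag dictionary, a structure list, and data containers."""
    names, structure, containers = {}, [], {}

    def token(name):
        return names.setdefault(name, f"T{len(names) + 1}")

    def store(key, value):
        containers.setdefault(key, []).append(value)
        structure.append("C:" + key)          # reference to the container

    def walk(elem):
        structure.append(token(elem.tag))
        for attr, value in elem.attrib.items():
            structure.append(token("@" + attr))
            store("@" + attr, value)
            structure.append("/")
        if elem.text and elem.text.strip():
            store(elem.tag, elem.text)
        for child in elem:
            walk(child)
        structure.append("/")

    walk(ET.fromstring(xml_text))
    return names, structure, containers


def pack(containers):
    """Compress each container separately (gzip stands in for any back end)."""
    return {k: gzip.compress("\n".join(v).encode("utf-8")) for k, v in containers.items()}


def xmill_join(names, structure, containers):
    """Replay the structure, pulling the next value from each container as needed."""
    by_token = {t: n for n, t in names.items()}
    cursors = {k: iter(v) for k, v in containers.items()}
    root, stack, attr = None, [], None
    for tok in structure:
        if tok == "/":
            if attr is not None:
                attr = None                   # closes an attribute
            else:
                stack.pop()                   # closes an element
        elif tok.startswith("C:"):
            value = next(cursors[tok[2:]])
            if attr is not None:
                stack[-1].set(attr, value)
            else:
                stack[-1].text = value
        else:
            name = by_token[tok]
            if name.startswith("@"):
                attr = name[1:]
            else:
                elem = ET.SubElement(stack[-1], name) if stack else ET.Element(name)
                if root is None:
                    root = elem
                stack.append(elem)
    return ET.tostring(root, encoding="unicode")
```

For the Book document the structure list has the same shape as the slide's `T1 T2 T3 C3 / C4 / T4 C5 / T4 C5 / /`, with both Author values landing in one container so that similar data is compressed together.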
XGrind: Tolani & Haritsa
  • The structure of the original XML document is retained by the compression process: compression is applied at the granularity of individual element and attribute values.
XGrind
  • Operations on the compressed file
    • Querying the compressed document
      • Exact-match and prefix-match queries require no decompression
      • Range and partial-match queries require on-the-fly decompression of the element/attribute values that appear in the query
    • Updates
    • Testing Validity (against the compressed DTD)
XGrind: Implementation
  • A context-free compression scheme must be used (the code assigned to a string does not depend on the location of this string). There are several types of compressors:
    • Tags (as in XMill)
    • Enumerated values (simple compressor, uses DTD)
    • Element/attribute value compressor: non-adaptive Huffman coding scheme. A separate Huffman tree is calculated for every non-enumerated data element
XGrind
  • Tree structure:
    • Start tags and attribute names are dictionary-encoded (as T1, T2 for tags and A1, A2 for attributes)
    • End tags replaced with ‘/’ token

Example document:

<Book><Title lang="English">Views</Title>
<Author>Miller</Author>
<Author>Tai</Author>
</Book>

Dictionary: Book = T1, Title = T2, Author = T4, @lang = A1

Compressed document:

T1
T2 A1 nh(English) /
nh(Views) /
T4 nh(Miller) /
T4 nh(Tai) /
/

[ nh(s): output from Huffman compressor for s ]
XGrind: Querying
  • The query engine works on the compressed document; it consists of a lexer and a parser
  • The query (the path and the predicate) is compressed
  • The parser checks whether the current path matches the query path and whether the compressed data value satisfies the compressed predicate
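
Why a context-free code allows exact and prefix matching without decompression can be seen in a small sketch. It builds one static (non-adaptive) Huffman code from a set of values and compares bit strings directly; the function names are hypothetical, the code is represented as strings of '0'/'1' for readability, and the query is assumed to use only characters that occur in the training values.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(values):
    """Build a static Huffman code (char -> bit string) from all characters in `values`."""
    freq = Counter("".join(values))
    tie = count()                      # tie-breaker so dicts are never compared
    heap = [(n, next(tie), {ch: ""}) for ch, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                 # degenerate one-symbol alphabet
        return {ch: "0" for ch in heap[0][2]}
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def nh(code, s):
    """Context-free encoding: each character's bits do not depend on its position."""
    return "".join(code[ch] for ch in s)


def exact_match(code, compressed_values, query):
    target = nh(code, query)           # compress the query, not the data
    return [i for i, c in enumerate(compressed_values) if c == target]


def prefix_match(code, compressed_values, query):
    target = nh(code, query)
    return [i for i, c in enumerate(compressed_values) if c.startswith(target)]


authors = ["Miller", "Tai", "Milton"]
code = huffman_code(authors)
compressed = [nh(code, a) for a in authors]
```

With this setup, `exact_match(code, compressed, "Tai")` and `prefix_match(code, compressed, "Mil")` return positions in the compressed list without decoding any stored value. Range predicates do not survive this encoding, because Huffman codes do not preserve ordering.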
XMLPPM
  • A modification of the prediction-by-partial matching (PPM) text compression scheme.
  • To process XML data, the encoder chooses the appropriate PPM model from a set of several models depending on the current context supplied by the built-in SAX parser
AXECHOP and TREECHOP
  • AXECHOP: attempts to achieve highest-possible compression ratio by reordering original document
  • TREECHOP: willing to sacrifice a bit on compression ratio in order to preserve original XML structure and enable querying to be carried out on compressed document
AXECHOP: Key Features
  • Uses a grammar-based approach for compressing XML document structure
  • Outperforms general-purpose text compressors (e.g. gzip) by as much as 30% on XML
  • Operates offline (decompression can’t start until entire compressed file has been received)
    • Suited for XML data archiving, not for XML messaging applications (e.g. Web Services)
Grammar-based Compression
  • Achieves compression by producing a context-free grammar that uniquely derives the input sequence
  • Define a separate production for each repetition in the input
    • For second and subsequent occurrences, encode the LHS of the production rather than the pattern on the RHS
Grammar-based Compression: Example

Original Input:

abcdbcabc

Generated Grammar:

S → aAdAaA

A → bc
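
A short sketch can check that the grammar above derives the input, and show one simple way to generate such a grammar. The compressor below is a greedy digram-replacement scheme in the spirit of Re-Pair; it is not the MPM algorithm used by AXECHOP, and the rule names (`N1`, `N2`, `START`) are assumptions made for the illustration. Terminals are assumed to be single characters.

```python
from collections import Counter


def expand(grammar, symbol):
    """Derive the string generated by `symbol`; symbols without a rule are terminals."""
    if symbol not in grammar:
        return symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])


def count_digrams(seq):
    """Count adjacent pairs, skipping self-overlapping repeats such as 'aaa'."""
    counts, last_pos = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace(seq, pair, nonterminal):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(nonterminal)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def compress(text):
    """Repeatedly replace the most frequent repeated digram with a new rule."""
    seq, grammar = list(text), {}
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:
            break
        nonterminal = f"N{len(grammar) + 1}"
        grammar[nonterminal] = list(pair)
        seq = replace(seq, pair, nonterminal)
    grammar["START"] = seq
    return grammar


# The grammar from the example above, written as a dictionary of rules.
slide_grammar = {"S": list("aAdAaA"), "A": list("bc")}
```

Expanding `slide_grammar` from `S` reproduces `abcdbcabc`, and `expand(compress(text), "START")` returns `text` for any input. The greedy scheme may produce a different (sometimes smaller) grammar than the one shown, since several grammars can derive the same string.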

AXECHOP: Compression Strategy
  • Perform a re-ordering of the XML document during SAX parsing
    • Use a byte-based encoding scheme to record the structure of the document – the “structure string”
    • Place data values for each element and attribute in a separate container to localize repetitions
AXECHOP: Compression Strategy 3
  • Apply Multilevel Pattern Matching (MPM) algorithm to obtain grammar-based compression of the document structure
  • Compress the contents of each data value container using the Burrows-Wheeler block-sorting algorithm
  • Write compressed data to output file
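
The re-ordering step can be sketched with Python's SAX parser. The byte values below (`END`, `DATA`, and sequential name ids) are hypothetical and are not AXECHOP's actual encoding; `bz2` is used for the containers because it is a Burrows-Wheeler-based compressor, and the MPM step for the structure string is left out. The sketch assumes fewer than 255 distinct names and that each text node arrives in a single `characters` event.

```python
import bz2
import xml.sax

END, DATA = 0, 255   # hypothetical reserved values, not AXECHOP's real codes


class Reorderer(xml.sax.ContentHandler):
    """Collect a structure string and per-name data containers during SAX parsing."""

    def __init__(self):
        super().__init__()
        self.ids = {}          # element/attribute name -> small integer id
        self.structure = []    # the "structure string"
        self.containers = {}   # name -> list of data values
        self.stack = []        # currently open elements

    def _id(self, name):
        return self.ids.setdefault(name, len(self.ids) + 1)

    def startElement(self, name, attrs):
        self.structure.append(self._id(name))
        self.stack.append(name)
        for attr in attrs.getNames():
            key = "@" + attr
            self.structure += [self._id(key), DATA]
            self.containers.setdefault(key, []).append(attrs.getValue(attr))

    def characters(self, content):
        if content.strip():
            self.structure.append(DATA)
            self.containers.setdefault(self.stack[-1], []).append(content)

    def endElement(self, name):
        self.stack.pop()
        self.structure.append(END)


def reorder(xml_text):
    handler = Reorderer()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.structure, handler.containers


def compress_containers(containers):
    """Compress each container on its own so similar values sit next to each other."""
    return {name: bz2.compress("\0".join(values).encode("utf-8"))
            for name, values in containers.items()}


DOC = ('<Book><Title lang="English">Views</Title>'
       '<Author>Miller</Author><Author>Tai</Author></Book>')
structure, containers = reorder(DOC)
```

Grouping all values of one element or attribute in the same container is what "localizes repetitions": the block-sorting compressor sees runs of similar strings instead of values interleaved with markup.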
AXECHOP Compression: Example 3

Original Structure String:

1 132 2 128 130 3 4 133 128 130 4 133 128 130 130 130

MPM-Generated Grammar: (shown as a figure on the slide; not reproduced in the transcript)

AXECHOP: Decompression Strategy
  • Decompress the MPM code to obtain the document structure
  • Perform inverse BWT to get back the contents of each data value container
  • Perform a single pass through the reconstituted structure string
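
For readers unfamiliar with the transform behind the container compressor, here is a naive textbook version of the Burrows-Wheeler transform and its inverse. It is quadratic and meant only to show that the transform is reversible; production compressors use suffix sorting and follow the transform with move-to-front and entropy coding. The end-of-string marker `\0` is an assumption and must not occur in the input.

```python
def bwt(text, eos="\0"):
    """Forward transform: last column of the sorted rotations of text + eos."""
    text += eos
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, eos="\0"):
    """Inverse transform: rebuild the sorted rotation table one column at a time."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]
```

The transform itself does not shrink anything; it clusters identical characters so that a later coding stage can compress them well, which is why it suits containers full of similar data values.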
AXECHOP: Implementation
  • Written in C++
  • Designed to be modular
    • instead of using MPM as structural compressor, can insert a different compressor
    • BWT can be swapped with a different container compressor
AXECHOP: Conclusions
  • AXECHOP achieves 2nd best average compression rate over a varied corpus of XML files
  • Future work:
    • Speed up compression through code optimization
    • Use a form of PPM in place of BWT for dictionary compression (PPM often achieves a better compression rate but tends to be slow)
    • Define an XML-conscious grammar-based compression scheme, instead of using “general-purpose” MPM
TREECHOP: Key Features
  • Carries out an online compression of the XML document tree
  • Since original document structure is maintained throughout compression, querying can be carried out without requiring decompression
  • Intended for XML messaging scenarios, where documents are being transmitted over a network
  • Encoding and querying strategies are based on the XPath standard
XPath
  • A W3C standard for identifying particular nodes of an XML document tree
  • Syntax is similar to that used for pathnames in UNIX

<class>
  <students>
    <student>
  <instructor>

In XPath: /class/students/student

TREECHOP: Compression Strategy
  • Perform SAX parsing
  • Generate document tree
  • Assign a binary code word to each non-leaf tree node
  • Write tree encoding to compression stream
TREECHOP: Generating the Tree
  • Encoding has 3 important properties:
    • Each tree node inherits its parent’s code as a prefix
    • Two nodes share the same code word iff they have the same XPath location, as traced from tree root downwards
    • Maintains the structure of the original document throughout the compression process
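
The first two properties can be demonstrated with a small sketch. The suffix scheme used here (i−1 ones followed by a zero for the i-th distinct child) is a stand-in chosen because it is prefix-free; it is not TREECHOP's actual encoding and does not reproduce the code words shown on the example slides. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def suffix(i):
    """Hypothetical prefix-free suffix for the i-th distinct child (i >= 1)."""
    return "1" * (i - 1) + "0"


def assign_codes(root):
    """Map each location path (a tuple of names from the root) to a binary code word."""
    codes = {}
    distinct = {}   # parent path -> number of distinct children seen so far

    def code_for(path):
        if path not in codes:
            parent = path[:-1]
            distinct[parent] = distinct.get(parent, 0) + 1
            codes[path] = codes.get(parent, "") + suffix(distinct[parent])
        return codes[path]

    def walk(elem, path):
        code_for(path)
        for attr in elem.attrib:
            code_for(path + ("@" + attr,))
        for child in elem:
            walk(child, path + (child.tag,))

    walk(root, (root.tag,))
    return codes


DOC = ('<class name="COMP 5113"><instructor>Bob Smith</instructor><students>'
       '<student id="100000">Pete Wilson</student>'
       '<student id="100001">Lola Richardson</student>'
       '</students></class>')
codes = assign_codes(ET.fromstring(DOC))
```

Both `<student>` elements map to the single entry `("class", "students", "student")`, so they share one code word, and every code word begins with its parent's code word.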
TREECHOP: Tree Encoding Scheme
  • Given a non-leaf node N with parent node P, where N is the i-th distinct child node of P:
TREECHOP: Example Tree

<class>
  @name: “COMP 5113”
  <instructor>: “Bob Smith”
  <students>
    <student>: “Pete Wilson”
      @id: “100000”
    <student>: “Lola Richardson”
      @id: “100001”

TREECHOP: Example Tree

Code words assigned to the non-leaf nodes:

00          <class>
0000          @name: “COMP 5113”
00010         <instructor>: “Bob Smith”
00011         <students>
0001100         <student>: “Pete Wilson”
000110000         @id: “100000”
0001100         <student>: “Lola Richardson”
000110000         @id: “100001”

TREECHOP: Writing the Tree
  • Encoded tree is written to compression stream in depth-first order
  • Non-leaf nodes: written as 4-tuple (L, C, T, D)
    • L is a byte indicating bit length of code word
    • C is a sequence of bytes containing code word
    • T is a byte indicating node type (e.g. element)
    • D is textual data stored in the node (e.g. element name) - reserved byte values are used to signal beginning/end of stream of raw character data
TREECHOP: Writing the Tree 2
  • For 2nd and subsequent occurrences of a non-leaf node, only the 2-tuple (L, C) is transmitted – decoder can then infer T and D
  • Leaf nodes are written in the manner of D, above – as a stream of raw character data
TREECHOP: Decompression Strategy
  • A code table is used to keep track of code words processed thus far
  • Allows future occurrences of a particular (L, C) pair to be mapped to the proper data type & value
TREECHOP: Decompression Strategy 2
  • To maintain proper nesting, a stack is used
    • When a new tree node is processed, continue popping until the node on top of stack does not share a common code word prefix with current node
    • Last popped node is the parent of the current node
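
A decoder along these lines can be sketched as follows. The stream is modelled as a Python list rather than bytes: a 3-tuple `(code, type, name)` stands for the first occurrence of a node, a 1-tuple `(code,)` for later occurrences, and a plain string for a leaf. The code words are the ones from the example tree, and placing each student's text before its `@id` is an assumption. The stack rule used here is one possible formulation (pop until the code word on top of the stack is a proper prefix of the current code word, which makes that node the parent) and may differ in detail from the authors' implementation.

```python
import xml.etree.ElementTree as ET

STREAM = [
    ("00", "element", "class"),
    ("0000", "attribute", "name"), "COMP 5113",
    ("00010", "element", "instructor"), "Bob Smith",
    ("00011", "element", "students"),
    ("0001100", "element", "student"), "Pete Wilson",
    ("000110000", "attribute", "id"), "100000",
    ("0001100",), "Lola Richardson",      # repeat: only the code word is sent
    ("000110000",), "100001",
]


def decode(stream):
    table = {}            # code word -> (node type, name), filled on first occurrence
    stack = []            # (code word, element) for the currently open elements
    root = None
    pending_attr = None   # attribute waiting for its value
    for item in stream:
        if isinstance(item, str):                 # leaf: raw character data
            if pending_attr is not None:
                stack[-1][1].set(pending_attr, item)
                pending_attr = None
            else:
                stack[-1][1].text = item
            continue
        code = item[0]
        if len(item) == 3:
            table[code] = (item[1], item[2])
        kind, name = table[code]
        # Close elements until the top of the stack is an ancestor of this node.
        while stack and not (code != stack[-1][0] and code.startswith(stack[-1][0])):
            stack.pop()
        if kind == "attribute":
            pending_attr = name
        else:
            elem = ET.SubElement(stack[-1][1], name) if stack else ET.Element(name)
            if root is None:
                root = elem
            stack.append((code, elem))
    return ET.tostring(root, encoding="unicode")
```

Because a child's code word always extends its parent's, the decoder needs no explicit end tags: nesting is recovered from the code words alone.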
TREECHOP: Querying
  • Queries are expressed using XPath
  • Once the equivalent code word for the query predicate has been determined, query matches can be quickly located by searching through the compression stream for other occurrences of the code word
TREECHOP: Querying Example 2
Search for all occurrences of ‘/class/students/student’
  • Discover code word for ‘/class’ → 00
  • Discover code word for ‘/class/students’ → 00011
  • Discover code word for ‘/class/students/student’ → 0001100
  • Extract data contained in next occurring leaf node → “Pete Wilson”
  • Scan through remainder of compression stream, looking for occurrences of code word 0001100 – occurs once more and the associated data value (“Lola Richardson”) is extracted
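
The scan described above amounts to a linear search for one code word. In this sketch the compression stream is modelled as a list in which 1-tuples are code words and plain strings are leaf data; the path-to-code table uses the code words from the example, and the order of leaves within each student is an assumption.

```python
CODES = {
    "/class": "00",
    "/class/students": "00011",
    "/class/students/student": "0001100",
}

STREAM = [
    ("00",), ("0000",), "COMP 5113",
    ("00010",), "Bob Smith",
    ("00011",),
    ("0001100",), "Pete Wilson", ("000110000",), "100000",
    ("0001100",), "Lola Richardson", ("000110000",), "100001",
]


def query(stream, codes, path):
    """Return the leaf data that follows each occurrence of the path's code word."""
    target = codes[path]
    hits = []
    for i, item in enumerate(stream):
        if isinstance(item, tuple) and item[0] == target:
            # Take the next leaf after the matching code word, if any.
            leaf = next((x for x in stream[i + 1:] if isinstance(x, str)), None)
            hits.append(leaf)
    return hits
```

Here `query(STREAM, CODES, "/class/students/student")` visits the two occurrences of 0001100 and collects the student names without rebuilding the document tree.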
TREECHOP: Current State
  • Java-based implementation is partially completed
  • Modeled after existing java.net package (e.g. XMLSocket corresponds to Socket)
Future Work
  • Finish implementation of TREECHOP
  • Use TREECHOP to validate compressed document using compressed grammars
  • Applications for XML filtering
  • Compressed stylesheets