
XML Compression Techniques

Gregory Leighton

Web Data Management Lab

Department of Computer Science

University of Calgary

June 24, 2008

Outline
  • XML primer
  • XML-conscious compression schemes
    • Non-queryable schemes
    • Queryable schemes
  • Future directions

The Extensible Markup Language (XML)
  • A W3C-endorsed standard
  • Originally intended as a web document authoring format, has since become a popular method for encoding semi-structured data
    • Data integration, data exchange applications
  • Support for native (tree-based) storage of XML in most commercial and open source DBMSs
    • MySQL, DB2, Oracle, SQL Server
  • A markup language: text content (PCDATA) can be surrounded by descriptive markup (elements and attributes)

Example: An XML Document

<course>
  <name>CS 501</name>
  <instructor>Ron Charles</instructor>
  <students>
    <student name="Alice">
      <a1>78</a1>
      <a2>86</a2>
      <midterm>91</midterm>
      <project>87</project>
    </student>
    <student name="Bob">
      <a1>69</a1>
      <a2>71</a2>
      <midterm>82</midterm>
      <finalexam>60</finalexam>
    </student>
  </students>
</course>

XML Data Model
  • XML documents can be modeled as ordered, labeled trees
    • Preserves ancestor-descendant relationships and sibling ordering
  • Each node is assigned a unique ordinal value corresponding to its position in a pre-order traversal
  • Internal nodes represent elements or attributes
  • Leaf nodes represent text content (PCDATA) or attribute values
  • Label of incident edge stores the node’s class (for internal nodes) or text content (for leaf nodes)
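The pre-order numbering described above can be sketched in Python. This is a minimal illustration, assuming attributes are visited before child elements and ignoring mixed content (text that follows a child element); the function name `preorder_nodes` and the "o" prefix in the printout are choices made here for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<course>
  <name>CS 501</name>
  <instructor>Ron Charles</instructor>
  <students>
    <student name="Alice">
      <a1>78</a1><a2>86</a2><midterm>91</midterm><project>87</project>
    </student>
    <student name="Bob">
      <a1>69</a1><a2>71</a2><midterm>82</midterm><finalexam>60</finalexam>
    </student>
  </students>
</course>"""


def preorder_nodes(elem, out=None):
    """Return (ordinal, edge_label) pairs in pre-order.

    Assumption: attributes come before child elements, and each attribute
    value or non-blank text segment becomes a leaf node.
    """
    if out is None:
        out = []
    out.append((len(out) + 1, elem.tag))
    for attr, value in elem.attrib.items():
        out.append((len(out) + 1, "@" + attr))
        out.append((len(out) + 1, repr(value)))
    text = (elem.text or "").strip()
    if text:
        out.append((len(out) + 1, repr(text)))
    for child in elem:
        preorder_nodes(child, out)
    return out


nodes = preorder_nodes(ET.fromstring(DOC))
for ordinal, label in nodes:
    print(f"o{ordinal}: {label}")
```

Under these assumptions the course document yields 28 nodes, with the root element receiving ordinal 1.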

Example: XML Document Tree

Path expressions and the textual representation of the tree portion each one addresses:

  • /course
    • <course></course>
  • /course/name
    • <course><name></name></course>
  • /course/name/text()
    • <course><name>CS 501</name></course>

Document tree (edge label, followed by the ordinal of the node it leads to):

course (o1)
  name (o2)
    "CS 501" (o3)
  instructor (o4)
    "Ron Charles" (o5)
  students (o6)
    student (o7)
      @name (o8)
        "Alice" (o9)
      a1 (o10)
        "78" (o11)
      a2 (o12)
        "86" (o13)
      midterm (o14)
        "91" (o15)
      project (o16)
        "87" (o17)
    student (o18)
      @name (o19)
        "Bob" (o20)
      a1 (o21)
        "69" (o22)
      a2 (o23)
        "71" (o24)
      midterm (o25)
        "82" (o26)
      finalexam (o27)
        "60" (o28)

XML Technologies

Parsing APIs

  • Document Object Model (DOM): produces an in-memory representation of the document tree
    • Efficient traversal
    • Large memory consumption
  • Simple API for XML (SAX): event-based, depth-first parsing of the document
    • Much smaller memory consumption
    • Some navigation operations become expensive (allows serial access only)

Query Languages

  • XPath: allows tree nodes to be selected based on their position and/or value
    • /course/name/text()
    • //student[@name="Alice"]
  • XQuery: a higher-level, declarative query language that incorporates XPath
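The two access styles can be contrasted with Python's standard library. This is only a sketch: `xml.etree.ElementTree` stands in for a DOM-style in-memory tree (it is not the W3C DOM API) and supports a limited XPath subset, and the handler class `NameCollector` is a name invented for this example.

```python
import xml.sax
import xml.etree.ElementTree as ET

DOC = ('<course><name>CS 501</name><students>'
       '<student name="Alice"><a1>78</a1></student>'
       '<student name="Bob"><a1>69</a1></student>'
       '</students></course>')

# Tree-based access: the whole document is materialized in memory,
# after which path expressions can be evaluated repeatedly.
root = ET.fromstring(DOC)
course_name = root.findtext("name")                # /course/name/text()
alice = root.findall(".//student[@name='Alice']")  # //student[@name="Alice"]


# Event-based access: one serial pass, with a callback per parsing event.
class NameCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, tag, attrs):
        if tag == "student":
            self.names.append(attrs.get("name"))


handler = NameCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
```

The SAX handler never holds more than the current event, which is why memory use stays small but revisiting an earlier node requires re-parsing.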

XML-Conscious Compression Schemes

Taxonomy

Compressors

  • Generic text: bzip2, gzip, PPM variants, etc.
  • XML-conscious
    • Non-queryable: AXECHOP, DTDPPM, SCMPPM, XComp, XMill, XMLPPM, XWRT
    • Queryable
      • Sequential access: BPLEX, TREECHOP, XGRIND, XPRESS
      • Random access: XQueC, XSeq

Other Classification Criteria
  • Schema-aware or schema-oblivious?
    • Information in schema documents can allow document structure to be encoded more succinctly
    • Some schema languages (e.g., XML Schema) specify data types – this knowledge can be used to guide selection of compression schemes
    • Limited applicability: not all documents have an associated schema document
  • Online vs. offline operation
    • Can decompression be carried out incrementally?
  • Compression paradigm used
    • Homomorphic or permutation-based?

Schema-aware Compression: Example

<!ELEMENT course (name,instructor,students)>

Indicates that each <course> element must have <name>, <instructor>, and <students> elements as children (in that order): no unpredictability

If encoder and decoder both possess the DTD: 0 bits needed to represent structure of <course> elements!

<!ELEMENT student (a1,a2,midterm,(finalexam|project))>

Similarly, only 1 bit is needed to indicate whether <finalexam> or <project> appears as the fourth child of <student>
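The bit counts above can be made concrete with a small sketch. It assumes encoder and decoder share the two content models from the DTD; the function names and the choice of 0 for <finalexam> and 1 for <project> are illustrative, not taken from any particular compressor.

```python
# Content models known to both encoder and decoder (from the DTD above).
COURSE_CHILDREN = ["name", "instructor", "students"]
STUDENT_FIXED = ["a1", "a2", "midterm"]
STUDENT_CHOICE = ["finalexam", "project"]  # encoded as bit 0 / bit 1


def encode_course_structure(children):
    """A fixed child sequence carries no information, so it costs zero bits."""
    if children != COURSE_CHILDREN:
        raise ValueError("children do not match the DTD content model")
    return ""


def encode_student_structure(children):
    """Return the bits needed to describe a <student> element's child tags."""
    if len(children) != 4 or children[:3] != STUDENT_FIXED:
        raise ValueError("children do not match the DTD content model")
    return str(STUDENT_CHOICE.index(children[3]))


def decode_student_structure(bits):
    """Rebuild the child tag sequence from the single choice bit."""
    return STUDENT_FIXED + [STUDENT_CHOICE[int(bits)]]
```

For example, `encode_student_structure(["a1", "a2", "midterm", "project"])` yields a one-character bit string from which `decode_student_structure` recovers all four child tags.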

Other Classification Criteria: Intended Application Domains
  • Archiving: volumes of data are compressed to preserve disk space, not accessed frequently
    • E.g., Web server logs
    • Priorities: compression ratio, compression speed
  • Data exchange: for data transferred over a network, the key goals are to improve throughput and reduce bandwidth consumption
    • E.g., web services, instant messaging
    • Priorities: compression/decompression speed, online operation
  • Database/IR applications: declarative queries are issued over XML documents; compression can improve query performance by reducing number of I/O operations
    • E.g., scientific databases
    • Priorities: queryable compression w/ random access, decompression speed

Permutation-based Approaches

Document is rearranged to localize repetitions before passing to back-end compressor(s)

  • Data segments are grouped into different containers, typically based on the identity of parent element
  • Tag structure (“skeleton”) and data segments are compressed separately

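[Figure: the XML document enters a shredder, which outputs a skeleton and a set of data containers; the skeleton is passed to a structure compressor and the containers to a data compressor]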
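The shredding step can be sketched as follows. This is a simplified illustration, assuming containers are keyed by the full path of the enclosing element or attribute, mixed content is ignored, and `zlib` stands in for whatever back-end compressors a real tool would use. The token format of the skeleton ("<tag", "@attr", "#", ">") is invented here.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = ('<course><students>'
       '<student name="Alice"><a1>78</a1></student>'
       '<student name="Bob"><a1>69</a1></student>'
       '</students></course>')


def shred(xml_text):
    """Split a document into a tag skeleton and path-keyed data containers."""
    skeleton = []                   # sequence of structural tokens
    containers = defaultdict(list)  # path -> data segments in document order

    def visit(elem, path):
        path = f"{path}/{elem.tag}"
        skeleton.append(f"<{elem.tag}")
        for attr, value in elem.attrib.items():
            skeleton.append(f"@{attr}")
            containers[f"{path}/@{attr}"].append(value)
        text = (elem.text or "").strip()
        if text:
            skeleton.append("#")    # marks "next value of this path's container"
            containers[path].append(text)
        for child in elem:
            visit(child, path)
        skeleton.append(">")        # end tag

    visit(ET.fromstring(xml_text), "")
    return skeleton, dict(containers)


def compress(xml_text):
    """Compress the skeleton and each container separately."""
    skeleton, containers = shred(xml_text)
    packed_skeleton = zlib.compress(" ".join(skeleton).encode("utf-8"))
    packed_data = {path: zlib.compress("\0".join(values).encode("utf-8"))
                   for path, values in containers.items()}
    return packed_skeleton, packed_data


skeleton, containers = shred(DOC)
```

Grouping by path places similar values next to each other (all names together, all a1 marks together), which is what gives the back-end compressor more repetition to exploit.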

Homomorphic Approaches
  • Each XML token is compressed individually, “in-place”
  • Compression process maintains structure of original document
  • Poorer compression, but easier to query than permutation-based approaches (less fragmentation)

Non-Queryable Compressors

XMill (Liefke & Suciu, 2000)
  • Introduced idea of separately compressing document structure and data, container grouping of text segments
  • gzip is used as back-end compressor
  • Data-centric XML: often beats gzip’s compression of medium- to large-sized XML documents (> 20 KB) by 35%-60%
  • Document-centric XML: little to no improvement over gzip’s compression rate
  • Compression/decompression speed comparable to gzip’s

XMill

Compression strategy is based on 3 principles:

  • Separation of structure from data
    • Start tags, attributes are assigned a binary codeword
  • Container-based storage of data segments, using path-based partitioning by default
    • Custom partitioning policies can be defined using a container expression language
  • Semantic compressors may optionally be applied to each container
    • E.g., differential encoder for numeric data; specialized compressors for handling specific formats like dates and URLs
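A differential encoder for a numeric container might look like the sketch below. It is not XMill's implementation: the comma-separated text layout, the function names, and the use of `zlib` in place of gzip are all assumptions made for illustration.

```python
import zlib


def delta_encode(values):
    """Replace each integer with its difference from the previous one."""
    previous = 0
    deltas = []
    for v in values:
        deltas.append(v - previous)
        previous = v
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def pack_container(values):
    """Delta-encode a numeric container, then hand it to the back end."""
    text = ",".join(str(d) for d in delta_encode(values))
    return zlib.compress(text.encode("ascii"))


def unpack_container(blob):
    """Decompress a container and restore the original integers."""
    text = zlib.decompress(blob).decode("ascii")
    return delta_decode([int(t) for t in text.split(",")]) if text else []
```

On slowly increasing sequences such as timestamps or record identifiers, the deltas are small and repetitive, which is the case this kind of semantic compressor targets.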

XMill: Example

<a>
  <b>
    <c>text 1</c>
  </b>
  <b>
    <c>text 1</c>
  </b>
  <c>text 3</c>
  <d>text 4</d>
</a>

[Figure: the document is split into four streams, each compressed with gzip and written to the compressed file]

XMill: Related Approaches
  • XComp (Li, 2003): groups data values into containers based on <label, level, node_type>, then applies gzip to containers and structural summary; little (<2%) to no improvement over XMill’s compression rate or time
  • AXECHOP (Leighton, 2005a): applies grammar-based scheme (MPM) to compress structural summary, uses bzlib for container compression; outperforms XBMill on most documents

XMLPPM (Cheney, 2001)
  • Compression process is centered around two concepts:
    • Encoded SAX parsing (ESAX): each SAX event is replaced with a more succinct encoding
    • Multiplexed hierarchical modeling (MHM): separate PPM models are maintained for elements, attributes, textual content, and characters
      • Additional symbols are injected into models to preserve the context formed by original tag hierarchy
      • A single arithmetic coder is shared between all models
  • Often compresses 15-35% better than XMill; main drawback is slow operation

XMLPPM: Related Approaches
  • SCMPPM (Adiego et al, 2004): maintains a separate model for every distinct element/attribute; often achieves better compression than XMLPPM
  • DTDPPM (Cheney, 2005): consults DTD to increase accuracy of symbol prediction; only effective on small (no more than a few MBs) and highly-structured documents

XWRT (Skibiński et al, 2007)

  • The “XML Word-Replacing Transform”
  • Pre-processes document in hopes of boosting compression performance of selected back-end scheme (LZ77 or LZMA)
  • XWRT + LZMA often outperforms XMLPPM and SCMPPM in compression ratio, while offering faster performance!

[Figure: XML document → XWRT → pre-processed document → LZ77/LZMA → compressed file]

XWRT: Pre-processing Techniques
  • Dynamic dictionary of frequently used words
  • Grouping of data values into containers, based on name of encapsulating element/attribute
  • Use of additional containers to encode some types of data with a predictable format
    • Numbers, dates, times
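The word-replacement idea can be illustrated with a much-simplified sketch. It is not XWRT's actual transform: a static dictionary built in a first pass stands in for XWRT's dictionary handling, container grouping is omitted, and the code assignment (one Unicode Private Use Area character per word) is an assumption that only works if the input contains neither those characters nor NUL.

```python
import lzma
import re
from collections import Counter

BASE = 0xE000  # start of the Private Use Area; assumed absent from the input


def build_dictionary(text, min_count=2, max_words=200):
    """Pick the most frequent alphabetic words occurring at least min_count times."""
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    return [w for w, c in counts.most_common(max_words) if c >= min_count]


def encode(text, words):
    """Replace every dictionary word with a single code character."""
    index = {w: i for i, w in enumerate(words)}
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]", text)
    return "".join(chr(BASE + index[t]) if t in index else t for t in tokens)


def decode(encoded, words):
    """Map code characters back to their words; pass everything else through."""
    return "".join(
        words[ord(ch) - BASE] if BASE <= ord(ch) < BASE + len(words) else ch
        for ch in encoded
    )


def compress(text):
    """Pre-process, then hand the result to LZMA as the back-end scheme."""
    words = build_dictionary(text)
    payload = "\n".join(words) + "\0" + encode(text, words)
    return lzma.compress(payload.encode("utf-8"))


def decompress(blob):
    header, encoded = lzma.decompress(blob).decode("utf-8").split("\0", 1)
    words = header.split("\n") if header else []
    return decode(encoded, words)
```

The dictionary travels in the header so that the decoder can invert the transform; whether the pre-processing pays off depends on how repetitive the element names and words are.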

Non-Queryable Compressors: Summary

Queryable Compressors

XGRIND (Tolani & Haritsa, 2002)
  • Encodes elements and attributes using XMill’s approach
  • DTD-conscious: enumerated attributes with k possible values are encoded using a log₂ k-bit scheme
  • Data values are encoded using non-adaptive Huffman coding
    • Requires two passes over the input document
    • Separate statistical model for each element/attribute
  • Homomorphic compression: compressed document retains original structure
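Non-adaptive Huffman coding needs one pass to gather statistics and a second pass to encode, which is where the two passes come from. The sketch below assumes a character-level model per attribute and represents codewords as strings of "0" and "1" for readability; XGRIND's actual model granularity and bit layout may differ.

```python
import heapq
from collections import Counter
from itertools import count


def build_code(values):
    """First pass: derive a fixed Huffman code from character frequencies."""
    freq = Counter("".join(values))
    if not freq:
        return {}
    if len(freq) == 1:                      # degenerate one-symbol alphabet
        return {next(iter(freq)): "0"}
    tiebreak = count()                      # keeps heap entries comparable
    heap = [(n, next(tiebreak), {ch: ""}) for ch, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def encode(value, code):
    """Second pass: encode one value with the fixed code table."""
    return "".join(code[ch] for ch in value)


def decode(bits, code):
    """Walk the bit string, emitting a character whenever a codeword completes."""
    inverse = {c: ch for ch, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


# One model per element/attribute: here, the values of the name attribute.
names = ["Alice", "Bob"]
name_code = build_code(names)
```

Because the code table is fixed after the first pass, the same value always compresses to the same bit string, which is what later makes comparisons in the compressed domain possible.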

XGRIND

Original Fragment:

<student name="Alice">
  <a1>78</a1>
  <a2>86</a2>
  <midterm>91</midterm>
  <project>87</project>
</student>

Compressed Fragment:

T0 A0 nahuff(Alice)
T1 nahuff(78) /
T2 nahuff(86) /
T3 nahuff(91) /
T4 nahuff(87) /
/
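The symbolic form of the compressed fragment can be reproduced with a short sketch. It only emits the token names used on the slide (T for tags, A for attribute names, nahuff(...) for a Huffman-coded value, "/" for an end tag); assigning identifiers in order of first appearance is an assumption, and no actual bits are produced.

```python
import xml.etree.ElementTree as ET

FRAGMENT = ('<student name="Alice"><a1>78</a1><a2>86</a2>'
            '<midterm>91</midterm><project>87</project></student>')


def xgrind_tokens(xml_text):
    """Emit a symbolic homomorphic encoding of the document, token by token."""
    tag_ids, attr_ids, out = {}, {}, []

    def visit(elem):
        # Identifiers are handed out in order of first appearance.
        out.append(f"T{tag_ids.setdefault(elem.tag, len(tag_ids))}")
        for attr, value in elem.attrib.items():
            out.append(f"A{attr_ids.setdefault(attr, len(attr_ids))}")
            out.append(f"nahuff({value})")
        text = (elem.text or "").strip()
        if text:
            out.append(f"nahuff({text})")
        for child in elem:
            visit(child)
        out.append("/")             # end tag

    visit(ET.fromstring(xml_text))
    return out


tokens = xgrind_tokens(FRAGMENT)
print(" ".join(tokens))
```

The output keeps one token per original token and in the same order, which is the sense in which the compressed document retains the original structure.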

XGRIND
  • Many queries can be carried out entirely in compressed domain
    • Exact-match, prefix-match
  • Some others require only decompression of relevant values
    • Range, substring
  • Queryability comes at the expense of compression ratio: typically 65-75% of what XMill achieves
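Exact-match and prefix-match queries can stay in the compressed domain because the query constant can be encoded with the same fixed code and compared bit for bit. The sketch below uses a hypothetical prefix-free code table and three hypothetical stored values; bit strings are written as text for readability.

```python
# Hypothetical prefix-free code table for one attribute's character model.
CODE = {"A": "00", "l": "01", "i": "100", "c": "101",
        "e": "110", "B": "1110", "o": "11110", "b": "11111"}


def encode(value):
    """Encode a value with the fixed code table."""
    return "".join(CODE[ch] for ch in value)


# Stored values are kept only in compressed form.
COMPRESSED_NAMES = [encode(n) for n in ("Alice", "Bob", "Alec")]


def exact_match(constant):
    """Positions whose compressed value equals the compressed constant."""
    target = encode(constant)
    return [i for i, bits in enumerate(COMPRESSED_NAMES) if bits == target]


def prefix_match(prefix):
    """Positions whose compressed value starts with the compressed prefix."""
    target = encode(prefix)
    return [i for i, bits in enumerate(COMPRESSED_NAMES) if bits.startswith(target)]
```

Range and substring predicates do not map onto bit-string comparisons this way, since Huffman codes preserve neither ordering nor character alignment, so the relevant values have to be decompressed first.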

XPRESS (Min et al, 2003)
  • Like XGRIND, features a homomorphic compression process requiring two passes over the document
  • Defines a process called reverse arithmetic encoding for representing each distinct root-to-node label path as an interval in [0.0, 1.0)
    • For intervals I1, I2: if I1 contains I2, then path P1 is a suffix of path P2
    • This feature facilitates execution of ancestor-descendant queries in compressed domain
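One way to obtain intervals with the containment property above is sketched below. It is a reading of the idea, not a transcription of XPRESS: each tag gets a slice of [0.0, 1.0) proportional to its frequency, and the interval of a path is the last tag's slice narrowed by the interval of the path that precedes it. The tag frequencies and function names are assumptions; exact fractions avoid rounding issues.

```python
from fractions import Fraction


def tag_intervals(frequencies):
    """Partition [0, 1) among tags in proportion to their frequencies."""
    total = sum(frequencies.values())
    intervals, low = {}, Fraction(0)
    for tag in sorted(frequencies):
        high = low + Fraction(frequencies[tag], total)
        intervals[tag] = (low, high)
        low = high
    return intervals


def path_interval(path, intervals):
    """Interval for a label path given as a list of tags, outermost first.

    The last tag's interval is narrowed by the interval of the preceding
    path, so every path ending in a given suffix lands inside the interval
    of that suffix.
    """
    low, high = intervals[path[0]]
    for tag in path[1:]:
        t_low, t_high = intervals[tag]
        size = t_high - t_low
        low, high = t_low + size * low, t_low + size * high
    return low, high


def contains(outer, inner):
    """True if interval inner lies within interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Hypothetical tag frequencies for the course document.
INTERVALS = tag_intervals({"course": 1, "name": 1, "students": 1, "student": 2})
full = path_interval(["course", "students", "student"], INTERVALS)
```

A query such as //students/student then reduces to an interval test: any element whose path interval lies inside `path_interval(["students", "student"], INTERVALS)` matches, with no need to decode its label path.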

XPRESS
  • Attempts to automatically infer types of data values, and then applies a type-specific compression scheme
    • Numeric data: differential encoding scheme
    • Textual data: 1-byte dictionary encoding scheme (if < 128 distinct values), Huffman otherwise
  • Queries supported in compressed domain:
    • Numeric data: exact match, range queries
    • Textual data: exact match
  • Tends to compress better than XGRIND, while also running typical XPath queries 2-3 times faster
  • XMill’s compression rate tends to be ~20-25% better
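The type-specific encoders can be sketched as follows. The details are assumptions: numeric values are treated as integers and stored as offsets from the container minimum (one plausible form of differential encoding), and the Huffman fallback for text is left out. Function names are illustrative.

```python
def infer_type(values):
    """Classify a container as 'numeric' if every value parses as an integer."""
    try:
        for v in values:
            int(v)
        return "numeric"
    except ValueError:
        return "text"


def encode_numeric(values):
    """Store each integer as its offset from the container minimum."""
    numbers = [int(v) for v in values]
    base = min(numbers)
    return base, [n - base for n in numbers]


def encode_text(values):
    """Map each distinct string to a 1-byte code when fewer than 128 exist."""
    distinct = sorted(set(values))
    if len(distinct) >= 128:
        raise NotImplementedError("fall back to Huffman coding, not shown here")
    table = {v: i for i, v in enumerate(distinct)}
    return distinct, bytes(table[v] for v in values)


def range_query(base, offsets, low, high):
    """Positions whose value lies in [low, high], compared on the offsets."""
    lo, hi = low - base, high - base
    return [i for i, off in enumerate(offsets) if lo <= off <= hi]
```

Offsets preserve the ordering of the original numbers, so a range predicate can be rewritten once against the offsets and evaluated without decoding any value. The dictionary codes support exact match only, since code order is not tied to string contents in general.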

XSeq (Lin et al, 2005)
  • First phase of compression groups tags and data segments into structure and data value containers
  • The contents of each container are then compressed using a grammar-based scheme (SEQUITUR)
  • Several indices are constructed to enable efficient, random-access querying over compressed containers
  • Supports exact-match, range queries in compressed domain
  • Compression rate: comparable to gzip, as much as 30% better than XGRIND’s

XSeq: Indices
  • File header: consists of
    • a list of pointers to structure container index and to each data value container index
    • a table recording mappings between tags and substitution codes used in structure container
  • Structure container index and data value container indices: record information about the SEQUITUR grammar generated from document’s tag sequence (resp., from contents of individual data value container)
    • number of grammar rules
    • number of symbols in each rule’s RHS
    • for each rule, occurrence counts of terminal symbols in the RHS

XSeq: Query Processing

Example: /course/students/student[@name="Alice"]

  • Determine token sequences relevant for query by consulting mapping table stored in file header
  • Consult structure index to determine which grammar rules contain relevant tokens
  • Expand out relevant rules and extract data values from appropriate data value containers
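The role of the grammar index in these steps can be sketched with a hand-written grammar. The grammar below is hypothetical (it was not produced by SEQUITUR), terminals are written as tag names rather than substitution codes, and the index layout is an assumption based on the fields listed for the container indices.

```python
from collections import Counter

# Hypothetical grammar for a tag sequence: keys are rules, and any symbol
# that is not a key is a terminal.
GRAMMAR = {
    "S": ["course", "name", "instructor", "students", "A", "A"],
    "A": ["student", "@name", "B", "midterm"],
    "B": ["a1", "a2"],
}


def build_index(grammar):
    """Record rule count, RHS lengths and per-rule terminal occurrence counts."""
    return {
        "rule_count": len(grammar),
        "rhs_lengths": {rule: len(rhs) for rule, rhs in grammar.items()},
        "terminal_counts": {
            rule: Counter(sym for sym in rhs if sym not in grammar)
            for rule, rhs in grammar.items()
        },
    }


def rules_containing(index, terminal):
    """Rules whose right-hand side directly contains the given terminal."""
    return [rule for rule, counts in index["terminal_counts"].items()
            if counts[terminal] > 0]


def expand(grammar, symbol):
    """Fully expand a symbol into its terminal sequence."""
    if symbol not in grammar:
        return [symbol]
    out = []
    for sym in grammar[symbol]:
        out.extend(expand(grammar, sym))
    return out


INDEX = build_index(GRAMMAR)
```

For the example query, only the rule reported by `rules_containing(INDEX, "@name")` needs to be expanded; rules that contain no relevant token can be skipped without decompression.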

TREECHOP (Leighton et al, 2005b)
  • A queryable compressor intended mainly for data exchange applications
    • Recipient receives a stream of XML data, wants to selectively process certain nodes
    • Avoids necessity of decompressing entire data stream in memory to extract values
  • Single-pass compression process: tree nodes are encoded adaptively during depth-first traversal of document tree, and passed to a back-end gzip compressor
  • Achieves compression rates comparable to gzip, while allowing selection of random nodes; compression/decompression speed is slightly less than gzip’s

TREECHOP
  • Non-leaf nodes are encoded as a binary code with three constituent parts:
    • the codeword of the parent node p
    • a variable-length Golomb codeword recording the relative position of this node w.r.t. p
    • a fixed-length code indicating the node type (element, attribute, comment, CDATA, or processing instruction)
    • For first occurrence only, the node text (e.g., element name) is appended immediately after the codeword
  • Leaf nodes are written as raw text sequences
  • Queries supported in the compressed domain: exact match and range
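Golomb coding, used above for the relative position, can be sketched as follows. This is the textbook scheme (unary quotient followed by a truncated-binary remainder); the parameter m is a free choice here, and how TREECHOP selects it is not stated in this section. Bits are written as text for readability.

```python
def golomb_encode(n, m):
    """Golomb codeword for a non-negative integer n with parameter m >= 1."""
    q, r = divmod(n, m)
    bits = "1" * q + "0"            # quotient in unary, terminated by 0
    if m == 1:
        return bits                 # remainder is always 0
    b = (m - 1).bit_length()        # ceil(log2(m))
    cutoff = (1 << b) - m           # remainders below this use b - 1 bits
    if r < cutoff:
        bits += format(r, f"0{b - 1}b")
    else:
        bits += format(r + cutoff, f"0{b}b")
    return bits


def golomb_decode(bits, m):
    """Decode one codeword from the front of bits; return (n, bits_consumed)."""
    q = pos = 0
    while bits[pos] == "1":
        q += 1
        pos += 1
    pos += 1                        # skip the terminating 0
    if m == 1:
        return q, pos
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    r = int(bits[pos:pos + b - 1], 2) if b > 1 else 0
    if r < cutoff:
        pos += b - 1
    else:
        r = int(bits[pos:pos + b], 2) - cutoff
        pos += b
    return q * m + r, pos
```

Small values get short codewords, which suits relative positions among siblings when most elements have few children.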

XQueC (Arion et al, 2004)
  • Designed to support a large subset of XQuery in the compressed domain
  • Prototype uses either ALM (order-preserving) or Huffman (unordered) to individually compress data values
    • ALM: supports equality/inequality matching, doesn’t support prefix-matching
    • Huffman: supports equality and prefix-matching, doesn’t support inequality matching
  • A permutation-based strategy is used, in which data values are first assigned to containers according to their parent element’s path
    • Containers are later grouped together to share the same compression model if they exhibit high similarity and appear frequently together in query predicates

XQueC
  • Given a workload of typical queries, an attempt is made to determine the optimal container grouping, and assignment of the best available compression algorithm to apply to each group, in order to minimize the following costs
    • Decompression time
    • Storage costs for compressed data and source models
  • In addition to containers, the compression process creates two additional data structures:
    • Structure tree
    • Structure summary

XQueC (Arion et al, 2007): Structure Tree & Structure Summary

Structure tree (each node carries a pair of numeric labels):

course [1, 8]
  students [2, 7]
    student [3, 3]
      @name [4, 1]
      a1 [5, 2]
    student [6, 6]
      @name [7, 4]
      a1 [8, 5]

Structure summary:

course
  students
    student
      @name
      a1

[Figure: a container holding a header and the values “Alice” and “Bob”]

BPLEX (Busatto et al, 2005)
  • Focuses on improving the compression of XML structure, by searching bottom-up for repeated patterns in the document’s minimal DAG
    • A succinct pointer-based representation is used to represent subsequent occurrences of a repeated subgraph
    • DOM operations can be carried out on the compressed skeleton
  • Achieves smallest queryable representation of XML structure: averages 68% of XMill’s compressed size with gzip as back-end compressor! (Maneth et al, 2008)

BPLEX

Buneman et al (2003) demonstrate that Core XPath queries can be efficiently evaluated directly on the minimal DAG of a skeleton

[Figure: a skeleton with root a and repeated b subtrees with children c and d, shown next to its minimal DAG. DAG is 2/3 smaller than original skeleton.]

In theory, a DAG can be exponentially smaller than the original skeleton; “real world” XML DAGs are often less than 10% of the original document size
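Subtree sharing can be sketched by hash-consing: every distinct (label, children) combination is stored once and identical subtrees map to the same identifier. The tree below is a hypothetical skeleton chosen for illustration, and the tuple representation is an assumption of this sketch.

```python
def minimal_dag(tree):
    """Hash-cons a tree given as (label, [children]) tuples.

    Returns (root_id, nodes), where nodes maps id -> (label, child_ids)
    and identical subtrees share a single id.
    """
    ids, nodes = {}, {}

    def visit(node):
        label, children = node
        key = (label, tuple(visit(c) for c in children))
        if key not in ids:
            ids[key] = len(ids)
            nodes[ids[key]] = key
        return ids[key]

    return visit(tree), nodes


def tree_size(node):
    """Number of nodes in the unshared tree."""
    return 1 + sum(tree_size(c) for c in node[1])


# Hypothetical skeleton: root a with three identical b(c, d) subtrees.
b = ("b", [("c", []), ("d", [])])
skeleton = ("a", [b, b, b])
root, dag = minimal_dag(skeleton)
print(tree_size(skeleton), "tree nodes,", len(dag), "DAG nodes")
```

Children are visited before their parent, so two subtrees receive the same identifier exactly when their labels and all their descendants agree. Repetition that occurs only in the interior of the tree, above differing subtrees, is not captured by this scheme.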

BPLEX

DAGs are limited to subtree sharing, meaning they can miss out on repeated patterns occurring in the interior of the skeleton…

[Figure: an example skeleton over the labels a, b, c, d, e and f, and a grammar pattern with parameters y1 and y2]

In this example, the DAG is equivalent to the original skeleton: no compression

BPLEX generates a straight-line tree grammar (SLT grammar) that is capable of representing repeated, connected subgraphs

SLTs are frequently half the size of the equivalent minimal DAG

Queryable Compressors: Summary

Future Directions

XML Updates
  • Impetus for incremental update of existing XML data sets is increasing
    • XML-based office document standards: ODF, OOXML
    • Increased volume of persistent XML data
  • W3C has recently proposed an extension to XQuery for expressing node-level updates
  • So far, most approaches to XML compression have assumed a “read-only” model
    • How amenable are the existing schemes to updates?

Evaluation of XML-Conscious Techniques
  • Currently, it is difficult to make definitive statements about the relative effectiveness of different techniques…
    • lack of available implementations
    • no consistent benchmark
      • each approach tends to use its own corpus
      • many queryable compressors aren’t tested thoroughly on the full set of supported queries
    • most works provide no theoretical justification; instead rely on empirical results against a limited corpus


Thank you

References
  • J. Adiego, P. De la Fuente, and G. Navarro. Combining structural and textual contexts for compressing semistructured databases. ENC, 2005.
  • A. Arion, A. Bonifati, I. Manolescu, and A. Pugliese. XQueC: A query-conscious compressed XML database. ACM TOIT 7(2), 2007.
  • P. Buneman, M. Grohe, and C. Koch. Path queries on compressed XML. VLDB, 2003.
  • G. Busatto, M. Lohrey, and S. Maneth. Efficient memory representation of XML documents. DBPL, 2005.
  • J. Cheney. Compressing XML with multiplexed hierarchical PPM models. DCC, 2001.
  • J. Cheney. An empirical evaluation of simple DTD-conscious compression techniques. WebDB, 2005.
  • J. Cheney. Tradeoffs in XML database compression. DCC, 2006.
  • J. Cheng and W. Ng. XQzip: querying compressed XML using structural indexing. EDBT, 2004.
  • G. Leighton, J. Diamond, and T. Müldner. AXECHOP: a grammar-based compressor for XML. DCC, 2005.

References (cont.)
  • G. Leighton, T. Müldner, and J. Diamond. TREECHOP: a tree-based query-able compressor for XML. CWIT, 2005.
  • W. Li. XComp: An XML compression tool. Master's thesis, University of Waterloo, 2003.
  • H. Liefke and D. Suciu. XMill: an efficient compressor for XML data. SIGMOD, 2000.
  • S. Maneth, N. Mihaylov, and S. Sakr. XML tree structure compression. XANTEC, 2008.
  • J. Min, M. Park, and C. Chung. XPRESS: a queriable compression for XML data. SIGMOD, 2003.
  • P. Skibiński, Sz. Grabowski, and J. Swacha. Effective asymmetric XML compression. To appear in: Software: Practice and Experience, 2008.
  • P. Tolani and J. Haritsa. XGRIND: a query-friendly XML compressor. ICDE, 2002.
