
XPRESS: A Queriable Compression for XML Data

Jun-Ki Min

Myung-Jae Park

Chin-Wan Chung

By Erhan Durusüt and Burak Çetin



Outline

  • Motivation

  • Background on Compression Algorithms

  • Existing Compressors

  • Features of XPRESS

  • Compression Techniques in XPRESS

  • Experimental Results

  • Conclusions and Future Work


Motivation

  • XML data is irregular and verbose

    • To overcome the verbosity problem, research on compressors for XML data has been conducted

    • Some XML compressors do not support querying compressed data

    • Others support querying compressed data, but they blindly encode tags and data values using predefined encoding methods

    • So, direct and efficient evaluation of queries on compressed XML data is required


Background on Compression Algorithms


Compression Techniques

  • Purpose of Compression

    • Required disk space can be reduced significantly

    • Network bandwidth is saved

    • Overall performance of database systems improves

      • A buffer can hold more information

      • Number of disk I/Os is reduced


Classification of Compression Techniques

  • Lossless vs. lossy: lossy compression cannot be used, since XML is text data

  • Static: fixed statistics, or no statistics at all

  • Semi-adaptive: two scans, one to gather statistics and one to compress

  • Adaptive: statistics are gathered dynamically and updated during compression


Classification of Compression Techniques

  • Static Compression

    • Dictionary encoding – assigns an integer value to each new word

      • Example : “the classification of the data”

      • Encoded : 1 2 3 1 4

    • Binary encoding – special types of data can be encoded in binary

      • Example : “8627” in string

      • Encoded : 8627 in numeric

    • Differential encoding – replaces a data item with a code value that defines its relationship to a specific data item

      • Example : 1500, 1520, 1600, 1550

      • Encoded : 1500, 20, 100, 50 (each later value is stored as its offset from the first item, 1500)
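
The dictionary and differential examples above can be reproduced with a short sketch. The function names are illustrative only, and the differential encoder assumes that the "specific data item" is the first value in the list, which is what the slide's numbers suggest.

```python
def dictionary_encode(text):
    """Assign the next unused integer to each new word, in order of first appearance."""
    codes = {}
    encoded = []
    for word in text.split():
        if word not in codes:
            codes[word] = len(codes) + 1  # codes start at 1, as in the slide's example
        encoded.append(codes[word])
    return encoded, codes


def differential_encode(values):
    """Keep the first value and store every later value as its offset from it."""
    base = values[0]
    return [base] + [v - base for v in values[1:]]


def differential_decode(encoded):
    """Invert differential_encode by adding the base back to every offset."""
    base = encoded[0]
    return [base] + [base + d for d in encoded[1:]]


if __name__ == "__main__":
    print(dictionary_encode("the classification of the data")[0])  # [1, 2, 3, 1, 4]
    print(differential_encode([1500, 1520, 1600, 1550]))           # [1500, 20, 100, 50]
```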


Classification of Compression Techniques

  • Semi-adaptive Compression

    • Huffman encoding

      • Assigns shorter codes to more frequently appearing symbols

      • In the code tree, assigns 0 to each left edge and 1 to each right edge

      • Does not preserve order information


Classification of Compression Techniques

  • Semi-adaptive Compression

    • Arithmetic encoding

      • Symbols are assigned disjoint intervals according to their frequencies

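      • Successive symbols of a message narrow the interval of the first symbol in accordance with the frequencies of the symbols.

      • Example (figure): the interval [0, 1.0) is divided among the symbols “a”, “b” and “c”; the message “ab” is assigned a subinterval inside the interval of “a”.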
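
The interval narrowing can be sketched as follows. The frequencies in `FREQS` are hypothetical, since the figure's values are not in the transcript, and exact fractions are used so that the interval bounds can be checked without rounding error.

```python
from fractions import Fraction


def symbol_intervals(freqs):
    """Partition [0, 1) into disjoint intervals, one per symbol, sized by frequency."""
    intervals, low = {}, Fraction(0)
    for sym, f in freqs.items():
        intervals[sym] = (low, low + f)
        low += f
    return intervals


def arithmetic_interval(message, freqs):
    """Return the interval [low, high) that represents `message`."""
    intervals = symbol_intervals(freqs)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        width = high - low
        s_low, s_high = intervals[sym]
        # Narrow the current interval to the part that belongs to `sym`.
        low, high = low + width * s_low, low + width * s_high
    return low, high


# Hypothetical frequencies chosen for illustration only.
FREQS = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}

if __name__ == "__main__":
    print(arithmetic_interval("a", FREQS))   # interval of "a"
    print(arithmetic_interval("ab", FREQS))  # a subinterval of the interval of "a"
```

With these assumed frequencies, "a" maps to [0, 1/2) and "ab" maps to [1/4, 2/5), which lies inside the interval of "a".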


Existing Compressors

  • XMILL

    • Separates XML tags and attributes from their data values and groups semantically related data values into containers.

    • XML tags and attributes are compressed by the dictionary encoding method.

    • Choosing the compression algorithm for each container requires human interpretation.

    • Finally, the containers are compressed again by a built-in library called “zlib”


Existing Compressors

  • XGRIND

    • Supports querying compressed XML data

    • Data values are compressed by Huffman or dictionary encoding; tags are compressed by dictionary encoding

    • Uses the DTD to determine the encoder for data values

    • A path expression is evaluated by scanning the compressed file; whenever a new tag is found, the path of the current element is compared with the path expression of the query to decide whether it matches

    • To evaluate range queries, partial decompression of data values is always required


Features of XPRESS

  • Reverse arithmetic encoding

    • Existing XML compressors represent each tag by a unique identifier, which makes the handling of path expressions inefficient

    • Here, a label path is encoded as a distinct interval in [0.0, 1.0)

    • Path expressions are handled using the containment relationships among the intervals


Features of XPRESS

  • Automatic Type Inference

    • Some XML compressors use predefined encodings

      • E.g. Huffman, dictionary encoding

    • However, the efficiency of an encoding method depends on the data type

    • Some require manual interpretation

    • Hence a type inference engine is required


Features of XPRESS

  • Application of diverse encoding methods to different types

    • Inferred type – proper encoding methods

      • numeric: binary encoding

        Example: ‘120’, ‘150’, ‘100’, ‘130’

        Encoded as ‘20’, ‘50’, ‘0’, ‘30’ (offsets from the minimum value, 100)

      • textual: Huffman encoder

      • enumeration: dictionary encoder

    • High compression ratio

    • Less frequent partial decompression
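
The numeric example above can be sketched as offsets from the smallest value. The helper names are hypothetical, and choosing the byte width from the largest offset is an assumption of this sketch rather than something the slides state.

```python
def encode_numeric(values):
    """Encode numeric strings as integer offsets from the smallest value."""
    numbers = [int(v) for v in values]
    base = min(numbers)
    return base, [n - base for n in numbers]


def bytes_needed(offsets):
    """Smallest whole number of bytes that can hold the largest offset (assumed layout)."""
    return max(1, (max(offsets).bit_length() + 7) // 8)


if __name__ == "__main__":
    base, offsets = encode_numeric(["120", "150", "100", "130"])
    print(base, offsets)          # 100 [20, 50, 0, 30]
    print(bytes_needed(offsets))  # 1
```

Storing small offsets instead of digit strings is where the space saving comes from: each of the four values here fits in a single byte.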


Features of XPRESS

  • Semi-adaptive approach

    • Preliminary scan for statistics

    • Statistics not changed during compression

    • Encoding rules are independent of location


Features of XPRESS

  • Homomorphic Compression

    • Preserves the structure of XML data

    • Efficient extraction


Reverse Arithmetic Encoding

  • Simple Path: a sequence of one or more dot-separated tags t1.t2…tn.

    Example: the simple path of subsection is book.section.subsection

  • Label Path: if a1.a2…an is the simple path of e, then ak.ak+1…an is a label path of e, where 1<=k<=n.

    Example: section.subsection is a label path of subsection

  • Suffix: given two label paths P=pi…pn and Q=pj…pn of e, if i>=j, then P is a suffix of Q
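
These definitions can be illustrated with a few lines of code. The sketch assumes that k may range all the way up to n, as the later example with the single-tag label path subsection suggests; the function names are made up for illustration.

```python
def label_paths(simple_path):
    """All label paths a_k...a_n of an element, longest first."""
    tags = simple_path.split(".")
    return [".".join(tags[k:]) for k in range(len(tags))]


def is_suffix(p, q):
    """True if label path p is a suffix of label path q, compared tag by tag."""
    p_tags, q_tags = p.split("."), q.split(".")
    return len(p_tags) <= len(q_tags) and q_tags[len(q_tags) - len(p_tags):] == p_tags


if __name__ == "__main__":
    print(label_paths("book.section.subsection"))
    print(is_suffix("section.subsection", "book.section.subsection"))  # True
    print(is_suffix("section", "subsection"))                          # False
```

The last call shows why the comparison is done on whole tags: "section" is a character suffix of "subsection" but not a label-path suffix.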


Reverse Arithmetic Encoding

  • First, partitions the entire interval [0.0, 1.0) into subintervals, one for each distinct element. The size of each subinterval is proportional to the frequency of the element.

    Example: frequencies of elements={book, author, title, section, subsection, subtitle} are (0.1, 0.1, 0.1, 0.3, 0.3, 0.1)
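
The partitioning step for this example might be sketched as below. It assumes that subintervals are laid out in the order in which the elements are listed, which agrees with the interval [0.6, 0.9) quoted for subsection later in the deck; exact fractions avoid floating-point noise.

```python
from fractions import Fraction


def partition(freqs):
    """Split [0, 1) into consecutive subintervals proportional to each frequency."""
    total = sum(freqs.values())
    intervals, low = {}, Fraction(0)
    for tag, f in freqs.items():
        high = low + Fraction(f) / total
        intervals[tag] = (low, high)
        low = high
    return intervals


FREQS = {
    "book": Fraction(1, 10), "author": Fraction(1, 10), "title": Fraction(1, 10),
    "section": Fraction(3, 10), "subsection": Fraction(3, 10), "subtitle": Fraction(1, 10),
}

if __name__ == "__main__":
    for tag, (low, high) in partition(FREQS).items():
        print(f"{tag}: [{float(low)}, {float(high)})")
```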


Reverse Arithmetic Encoding

  • Next, encodes the simple path P=p1…pn of an element e into a subinterval [min_e, max_e)


Reverse Arithmetic Encoding

  • Property 1: Suppose that a simple path P is represented as the interval I. Then all intervals for suffixes of P contain I.

    Example: simple path book.section.subsection: interval [0.69, 0.699)

    label path section.subsection: interval [0.69, 0.78)

    label path subsection: interval [0.6, 0.9)

    Implication: the query processor selects the elements whose intervals lie within the interval of the query. For //section/subsection, it chooses the intervals within [0.69, 0.78)

  • Finally, the start tag of an element is replaced by the value of the subinterval.
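
A minimal sketch that reproduces the intervals in this example: it starts from the interval of the last tag and narrows it with each preceding tag, which is one reading of "reverse" that yields exactly the numbers quoted above. The names `INTERVALS`, `reverse_encode` and `matches` are illustrative, not taken from the paper.

```python
from fractions import Fraction as F

# Element intervals for the example frequencies (0.1, 0.1, 0.1, 0.3, 0.3, 0.1).
INTERVALS = {
    "book": (F(0), F(1, 10)), "author": (F(1, 10), F(2, 10)),
    "title": (F(2, 10), F(3, 10)), "section": (F(3, 10), F(6, 10)),
    "subsection": (F(6, 10), F(9, 10)), "subtitle": (F(9, 10), F(1)),
}


def reverse_encode(path):
    """Interval of a dot-separated path, narrowing from the last tag to the first."""
    tags = path.split(".")
    low, high = INTERVALS[tags[-1]]
    for tag in reversed(tags[:-1]):
        width = high - low
        t_low, t_high = INTERVALS[tag]
        low, high = low + width * t_low, low + width * t_high
    return low, high


def matches(element_path, query_path):
    """True if the element's interval lies within the query's interval."""
    e_low, e_high = reverse_encode(element_path)
    q_low, q_high = reverse_encode(query_path)
    return q_low <= e_low and e_high <= q_high


if __name__ == "__main__":
    print([float(x) for x in reverse_encode("book.section.subsection")])  # [0.69, 0.699]
    print(matches("book.section.subsection", "section.subsection"))       # True
    print(matches("book.subsection", "section.subsection"))               # False
```

Because every suffix of a path is encoded before the remaining tags narrow the interval further, a query such as //section/subsection reduces to a pair of numeric comparisons per element.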


Compression Techniques in XPRESS


Architecture of XPRESS


XML Analyzer

  • Parses each token in the XML file while keeping track of the current path

    • If the token is a tag: collects statistics

    • If the token is a data value: applies type inference
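
The analyzer pass described above could look roughly like this. It is a sketch built on Python's standard `xml.etree.ElementTree.iterparse`, not the XPRESS analyzer itself; the returned structures are assumptions, and values are only collected here, with type inference left to a later step.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def analyze(xml_text):
    """One pass over the document: count tags and group data values by simple path."""
    tag_counts = Counter()
    values_by_path = defaultdict(list)
    path = []  # tags from the root down to the current element
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            tag_counts[elem.tag] += 1  # tag: collect statistics
        else:
            text = (elem.text or "").strip()
            if text:  # data value: keep it for type inference
                values_by_path[".".join(path)].append(text)
            path.pop()
    return tag_counts, values_by_path


if __name__ == "__main__":
    doc = ("<book><title>XML</title><section>"
           "<subsection>a</subsection><subsection>b</subsection>"
           "</section></book>")
    counts, values = analyze(doc)
    print(dict(counts))
    print(dict(values))
```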


XML Analyzer Algorithm


Applying Arithmetic Encoder

  • Problem: simply counting the occurrences of each distinct element

    • Higher level tags appear rarely (e.g. root)

    • Intervals for long paths shrink too quickly

    • Requires use of high-precision numbers

  • Instead use: Path Tree (Weighted Frequency)


Weighted Frequency

  • Weighted frequency: the number of subnodes of a node plus one for the node itself

    • Can consume a lot of memory: O(E)

    • Not efficient to construct
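
As a rough illustration of "number of subnodes + itself", the sketch below builds a tree from a list of simple paths and counts, for each node, the nodes beneath it plus one. The transcript does not spell out exactly which tree the count runs over, so the nested-dict path tree and both function names are assumptions of this example.

```python
def build_path_tree(simple_paths):
    """Nested dict in which each distinct simple path becomes one node."""
    root = {}
    for path in simple_paths:
        node = root
        for tag in path.split("."):
            node = node.setdefault(tag, {})
    return root


def weighted_frequencies(tree, prefix=""):
    """Map each node's simple path to (number of nodes below it + 1)."""
    result = {}
    for tag, children in tree.items():
        path = f"{prefix}.{tag}" if prefix else tag
        below = weighted_frequencies(children, path)
        result.update(below)
        # Every node strictly below `path` has exactly one entry in `below`.
        result[path] = len(below) + 1
    return result


if __name__ == "__main__":
    paths = ["book", "book.title", "book.section",
             "book.section.subsection", "book.section.subsection.subtitle"]
    print(weighted_frequencies(build_path_tree(paths)))
```

In this toy tree the root tag gets the largest weight (5), which addresses the problem that higher-level tags would otherwise receive very small intervals.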


Adjusted Frequency

  • Add 1 to ancestors whenever a new node is met

  • Requires O(L) space, where L is the maximum length of a query

  • An efficient heuristic


Statistics Collector


Type Inferencing

  • Determine whether data is:

    • Integer

    • Floating point

    • Enumeration type

    • String


Type Inferencing

  • Engine keeps track of:

    • inferred_type

    • min,max

    • symhash

    • chars_frequency

  • Inferred type can change in the process:

    • from integer to string

    • from dictionary to string
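
A rough sketch of such an engine is given below. Only the four tracked fields come from the slide; the transition rules, the `enum_limit` threshold and the method names are assumptions made for illustration and may differ from what XPRESS actually does.

```python
class TypeInferrer:
    """Tracks the inferred type of the data values seen under one path."""

    def __init__(self, enum_limit=4):
        self.inferred_type = None
        self.min = self.max = None
        self.symhash = {}          # distinct value -> dictionary code
        self.chars_frequency = {}  # character -> count, e.g. for a Huffman encoder
        self.enum_limit = enum_limit  # hypothetical cap on distinct enumeration values

    @staticmethod
    def _kind(value):
        for parser, name in ((int, "integer"), (float, "float")):
            try:
                parser(value)
                return name
            except ValueError:
                pass
        return "enumeration"

    def feed(self, value):
        for ch in value:
            self.chars_frequency[ch] = self.chars_frequency.get(ch, 0) + 1
        self.symhash.setdefault(value, len(self.symhash))
        kind = self._kind(value)
        if self.inferred_type is None:
            self.inferred_type = kind
        elif self.inferred_type in ("integer", "float"):
            if kind == "enumeration":
                self.inferred_type = "string"   # e.g. from integer to string
            elif kind == "float":
                self.inferred_type = "float"
        if self.inferred_type == "enumeration" and len(self.symhash) > self.enum_limit:
            self.inferred_type = "string"       # from dictionary to string
        if self.inferred_type in ("integer", "float"):
            number = float(value)
            self.min = number if self.min is None else min(self.min, number)
            self.max = number if self.max is None else max(self.max, number)


if __name__ == "__main__":
    t = TypeInferrer()
    for v in ["120", "150", "100"]:
        t.feed(v)
    print(t.inferred_type, t.min, t.max)
    t.feed("n/a")
    print(t.inferred_type)
```

Keeping `chars_frequency` and `symhash` up to date for every value means that a later switch to a Huffman or dictionary encoder needs no second analysis pass.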


XML Encoder

  • The most significant bit (MSB) is 0 for data and 1 for structure
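
A reader of the compressed stream could use this flag to tell the two kinds of token apart. The sketch assumes the flag is the top bit of the first byte of each encoded token; the byte-level layout is not given in the transcript, and the names are hypothetical.

```python
STRUCTURE_FLAG = 0x80  # most significant bit of a byte


def is_structure(first_byte):
    """True if a token starting with this byte encodes structure, False for data."""
    return bool(first_byte & STRUCTURE_FLAG)


if __name__ == "__main__":
    print(is_structure(0b10110000))  # True: MSB is 1
    print(is_structure(0b00110000))  # False: MSB is 0
```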


ARAE

  • ARAE: Approximated Reverse Arithmetic Encoder

    • Ensures the MSB of the encoded value is 1

    • Truncates the last byte of the floating-point value

    • Truncation does not change the containment relationship

    • May incur inefficiency if too much is truncated


Encoder Algorithm


Query Processing

  • If a query is too long, its interval becomes too small

  • Split the query into intervals with sizes greater than 2^-15

  • Look for the sequence of split intervals

  • Generally, the sequence length is 1


Query Processing

  • Values in exact-match conditions are encoded and compared against the compressed data

  • Range queries on numeric values are evaluated directly

  • Partial decompression needed for range queries on strings

    • Huffman and Dictionary encoding do not preserve order information
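
The difference can be seen in a small sketch. It assumes numeric values are stored as offsets from the minimum, as in the earlier numeric example, and that dictionary codes are handed out in order of first appearance; both helpers and their names are illustrative only.

```python
def offset_codes(numbers):
    """Encode numbers as offsets from the minimum; numeric order is preserved."""
    base = min(numbers)
    return base, [n - base for n in numbers]


def range_query_on_codes(codes, base, low, high):
    """Indices of values in [low, high], comparing encoded offsets only."""
    lo_code, hi_code = low - base, high - base  # encode the query bounds once
    return [i for i, c in enumerate(codes) if lo_code <= c <= hi_code]


def first_seen_codes(words):
    """Dictionary codes in order of first appearance; string order is not preserved."""
    table = {}
    return [table.setdefault(w, len(table)) for w in words]


if __name__ == "__main__":
    base, codes = offset_codes([120, 150, 100, 130])
    print(range_query_on_codes(codes, base, 110, 135))  # [0, 3]
    print(first_seen_codes(["pear", "apple", "fig"]))   # [0, 1, 2]
```

In the second call "apple" sorts before "pear" as a string but receives the larger code, so a range condition on strings cannot be answered from the codes alone.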


Experimental Results


Experiments

  • Extensive experiments on real life data with different characteristics


Compression Ratios


Sample Queries

  • Different types of queries are run:


Query Evaluation Time


Conclusion and Future Work

  • Novel approach “Reverse Arithmetic Encoding” is successful

  • Superior to XGrind

  • Future support for complex data types

    • e.g. Uniform Resource Identifier (URI)
