XPRESS: A Queriable Compression for XML Data

Jun-Ki Min

Myung-Jae Park

Chin-Wan Chung

By Erhan Durusüt and Burak Çetin


Outline

  • Motivation

  • Background on Compression Algorithms

  • Existing Compressors

  • Features of XPRESS

  • Compression Techniques in XPRESS

  • Experimental Results

  • Conclusions and Future Work

XPRESS: A Queriable Compression for XML Data


Motivation

  • XML data is irregular and verbose

    • To overcome the verbosity problem, research on compressors for XML data has been conducted

    • Some XML compressors do not support querying compressed data

    • Others support querying compressed data, but they blindly encode tags and data values using predefined encoding methods

    • So, direct and efficient evaluation of queries on compressed XML data is required

Background on Compression Algorithms

Compression Techniques

  • Purpose of Compression

    • Reduces the required disk space significantly

    • Saves network bandwidth

    • Improves the overall performance of database systems

      • A buffer can hold more information

      • Number of disk I/Os is reduced

Classification of Compression Techniques

  • Lossless vs. lossy: lossy compression cannot be used because the data is text

  • Static: fixed statistics or no statistics at all

  • Semi-adaptive: two scans, one to gather statistics and one to compress

  • Adaptive: statistics are gathered dynamically and updated during compression

Classification of Compression Techniques

  • Static Compression

    • Dictionary encoding – assigns an integer value to each new word

      • Example : “the classification of the data”

      • Encoded : 1 2 3 1 4

    • Binary encoding – special types of data can be encoded in binary

      • Example : “8627” stored as a string

      • Encoded : 8627 stored as a number

    • Differential encoding – replaces a data item with a code value that defines its relationship to a specific data item

      • Example : 1500, 1520, 1600, 1550

      • Encoded : 1500, 20, 100, 50 (offsets from the first value)
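
The three static encodings above can be reproduced with a few lines of Python. This is a minimal sketch, not XPRESS code: the function names are made up for illustration, and the differential encoder assumes the "specific data item" is the first value, which is what the slide's numbers suggest.

```python
def dictionary_encode(text):
    """Assign the next integer code to each word the first time it appears."""
    codes = {}
    encoded = []
    for word in text.split():
        if word not in codes:
            codes[word] = len(codes) + 1  # codes start at 1, as on the slide
        encoded.append(codes[word])
    return encoded, codes


def differential_encode(values):
    """Keep the first value and store every later value as an offset from it."""
    base = values[0]
    return [base] + [v - base for v in values[1:]]


print(dictionary_encode("the classification of the data")[0])  # [1, 2, 3, 1, 4]
print(int("8627"))  # binary encoding: the digits become the number 8627
print(differential_encode([1500, 1520, 1600, 1550]))  # [1500, 20, 100, 50]
```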

Classification of Compression Techniques

  • Semi-adaptive Compression

    • Huffman encoding

      • Assigns shorter codes to more frequently appearing symbols

      • In the code tree, the left edge is labeled 0 and the right edge 1

      • Does not preserve order information
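
A short sketch of Huffman code construction may make these points concrete. It is a generic textbook construction using Python's `heapq`, not the encoder used in XPRESS; the function name `huffman_codes` and the sample string are illustrative assumptions.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a Huffman code table for the symbols in text."""
    freq = Counter(text)
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w_left, _, left = heapq.heappop(heap)    # lightest subtree
        w_right, _, right = heapq.heappop(heap)  # second lightest subtree
        merged = {s: "0" + c for s, c in left.items()}         # left edge -> 0
        merged.update({s: "1" + c for s, c in right.items()})  # right edge -> 1
        heapq.heappush(heap, (w_left + w_right, counter, merged))
        counter += 1
    return heap[0][2]


codes = huffman_codes("aaaaabbc")
for symbol in sorted(codes):
    print(symbol, codes[symbol])
```

In this sample the most frequent symbol gets the shortest code. The codes are assigned by frequency alone, so comparing two code words says nothing about the order of the original symbols, which is why range queries on Huffman-encoded values need decompression.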

Classification of Compression Techniques

  • Semi-adaptive Compression

    • Arithmetic encoding

      • Symbols are assigned disjoint intervals according to their frequencies

      • Each successive symbol of a message narrows the interval of the first symbol in proportion to that symbol's frequency.

      • Example (figure): the interval from 0 to 1.0 partitioned among “a”, “b” and “c”, with the message “ab” marked
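
The interval narrowing for a message such as “ab” can be sketched as follows. The symbol frequencies are hypothetical, since the slide's figure gives no numeric values, and the function names are illustrative. Exact fractions are used so the interval bounds carry no rounding error.

```python
from fractions import Fraction

# Hypothetical frequencies; the slide's figure does not state them.
FREQ = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}


def symbol_intervals(freq):
    """Give each symbol a disjoint slice of [0, 1) sized by its frequency."""
    intervals, low = {}, Fraction(0)
    for sym, f in freq.items():
        intervals[sym] = (low, low + f)
        low += f
    return intervals


def arithmetic_interval(message, freq):
    """Narrow [0, 1) once per symbol and return the final interval."""
    intervals = symbol_intervals(freq)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        s_low, s_high = intervals[sym]
        width = high - low
        low, high = low + width * s_low, low + width * s_high
    return low, high


print(arithmetic_interval("a", FREQ))   # (Fraction(0, 1), Fraction(1, 2))
print(arithmetic_interval("ab", FREQ))  # (Fraction(1, 4), Fraction(2, 5))
```

With these assumed frequencies, the interval for “ab” lies inside the interval for “a”: every message that starts with “a” stays within the slice of the first symbol.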

Existing Compressors

  • XMILL

    • Separates XML tags and attributes from their data values and groups semantically related data values into containers.

    • XML tags and attributes are compressed by the dictionary encoding method.

    • Choosing the compression algorithm for each container requires human interpretation.

    • Finally, the containers are compressed again with the built-in “zlib” library

Existing Compressors

  • XGRIND

    • Supports querying compressed XML data

    • Data values are compressed by Huffman or dictionary encoding; tags are compressed by dictionary encoding

    • Uses the DTD to determine the encoder for data values

    • A path expression is evaluated by scanning the compressed file; whenever a new tag is found, the current path is compared with the query path expression to decide whether it matches

    • To evaluate range queries, partial decompression of data values is always required

Features of XPRESS

  • Reverse arithmetic encoding

    • Existing XML compressors represent each tag by a unique identifier, which makes handling path expressions inefficient

    • XPRESS represents a label path as a distinct interval in [0.0, 1.0)

    • Path expressions are handled through containment relationships among intervals

Features of XPRESS

  • Automatic Type Inference

    • Some XML compressors use predefined encodings

      • E.g. Huffman, dictionary encoding

    • However, the efficiency of an encoding depends on the data type

    • Some compressors require manual interpretation

    • Hence, a type inference engine is required

Features of XPRESS

  • Application of diverse encoding methods to different types

    • Each inferred type gets a suitable encoding method

      • numeric: binary encoding

        Example: ‘120’, ‘150’, ‘100’, ‘130’

        Encoded as offsets from the minimum value 100: ‘20’, ‘50’, ‘0’, ‘30’

      • textual: Huffman encoder

      • enumeration: dictionary encoder

    • High compression ratio

    • Less frequent partial decompression

Features of XPRESS

  • Semi-adaptive approach

    • Preliminary scan for statistics

    • Statistics are not changed during compression

    • Encoding rules are independent of location

Features of XPRESS

  • Homomorphic Compression

    • Preserves the structure of XML data

    • Efficient extraction

Reverse Arithmetic Encoding

  • Simple Path: a sequence of one or more dot-separated tags t1.t2…tn.

    Example: the simple path of subsection is book.section.subsection

  • Label Path: if a1.a2…an is the simple path of an element e, then ak.ak+1…an is a label path of e, where 1 <= k <= n.

    Example: section.subsection is a label path of subsection

  • Suffix: for two label paths of e, P = pi…pn and Q = pj…pn, if i >= j then P is a suffix of Q

Reverse Arithmetic Encoding

  • First, the entire interval [0.0, 1.0) is partitioned into subintervals, one for each distinct element; the size of each subinterval is proportional to the element's frequency.

    Example: the elements {book, author, title, section, subsection, subtitle} have frequencies (0.1, 0.1, 0.1, 0.3, 0.3, 0.1)

Reverse Arithmetic Encoding

  • Next, the simple path P = p1…pn of an element e is encoded into a subinterval [min_e, max_e)

Reverse Arithmetic Encoding

  • Property 1: If a simple path P is represented as the interval I, then the intervals for all suffixes of P contain I.

    Example: simple path book.section.subsection has interval [0.69, 0.699)

    label path section.subsection has interval [0.69, 0.78)

    label path subsection has interval [0.6, 0.9)

    Implication: the query processor selects the elements whose intervals lie within the interval of the query. For //section/subsection, it chooses the intervals within [0.69, 0.78).

  • Finally, the start tag of an element is replaced by the value of the subinterval.
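
The intervals in this example can be reproduced with a short sketch. The narrowing rule below is inferred from the example numbers (the interval of the path prefix is scaled into the interval of the next tag) and is not copied from the XPRESS paper; the function names are illustrative. Frequencies are assigned to intervals in the order the slides list the elements.

```python
from fractions import Fraction

# Element frequencies from the slides' example, in the order listed there.
FREQUENCIES = [
    ("book", Fraction(1, 10)), ("author", Fraction(1, 10)),
    ("title", Fraction(1, 10)), ("section", Fraction(3, 10)),
    ("subsection", Fraction(3, 10)), ("subtitle", Fraction(1, 10)),
]


def element_intervals(frequencies):
    """Partition [0, 1) into one subinterval per element, sized by frequency."""
    intervals, low = {}, Fraction(0)
    for tag, freq in frequencies:
        intervals[tag] = (low, low + freq)
        low += freq
    return intervals


def encode_path(path, intervals):
    """Interval of a dot-separated path; the last tag fixes the coarsest range."""
    tags = path.split(".")
    low, high = intervals[tags[0]]
    for tag in tags[1:]:
        t_low, t_high = intervals[tag]
        width = t_high - t_low
        # Scale the interval of the prefix into the interval of the next tag.
        low, high = t_low + width * low, t_low + width * high
    return low, high


def contains(outer, inner):
    """True if the interval inner lies within the interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


INTERVALS = element_intervals(FREQUENCIES)
element = encode_path("book.section.subsection", INTERVALS)  # [0.69, 0.699)
query = encode_path("section.subsection", INTERVALS)         # [0.69, 0.78)
print(contains(query, element))                              # True
```

Because the last tag is applied last, all paths ending in the same tags share a common enclosing interval. This is what lets a query such as //section/subsection be answered by a containment test on the encoded values.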

Compression Techniques in XPRESS

Architecture of XPRESS

XML Analyzer

  • Parses each token in the XML file while keeping track of the current path

    • If the token is a tag: collects statistics

    • If the token is a data value: applies type inference

XML Analyzer Algorithm

Applying Arithmetic Encoder

  • Simply counting the appearances of each distinct element is problematic

    • Higher-level tags appear rarely (e.g. the root)

    • Intervals for long paths shrink too quickly

    • High-precision numbers are required

  • Instead, use a Path Tree (weighted frequency)

Weighted Frequency

  • Weighted frequency of a node: the number of its subnodes plus one for the node itself

    • Can consume a lot of memory: O(E)

    • Not efficient to construct

Adjusted Frequency

  • Add 1 to ancestors whenever a new node is met

  • Requires O(L) space, where L is the maximum length of a query

  • An efficient heuristic
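
One literal reading of the rule "add 1 to ancestors whenever a new node is met" is sketched below. It aggregates the counts per tag name and keeps only the stack of currently open tags. The XPRESS authors' exact heuristic may differ, and the function name and sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def adjusted_frequencies(xml_text):
    """Count, per tag, the node itself plus one for every node met below it."""
    counts = Counter()
    open_tags = []  # tags of the elements currently open (the current path)
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            counts[elem.tag] += 1       # the new node itself
            for ancestor in open_tags:  # add 1 to every ancestor
                counts[ancestor] += 1
            open_tags.append(elem.tag)
        else:
            open_tags.pop()
            elem.clear()                # release the element's content once closed
    return counts


doc = "<book><section><subsection/></section><author/></book>"
print(dict(adjusted_frequencies(doc)))
# {'book': 4, 'section': 2, 'subsection': 1, 'author': 1}
```

The only per-document state besides the counters is the stack of open tags, whose size is bounded by the depth of the deepest path.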

Statistics Collector

Type Inferencing

  • Determine whether data is:

    • Integer

    • Floating point

    • Enumeration type

    • String

Type Inferencing

  • Engine keeps track of:

    • inferred_type

    • min,max

    • symhash

    • chars_frequency

  • The inferred type can change during the scan:

    • from integer to string

    • from enumeration (dictionary-encoded) to string
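
A rough sketch of such an engine is shown below. The field names follow the slide (`inferred_type`, `min`, `max`, `symhash`, `chars_frequency`); everything else is an assumption for illustration, in particular the `ENUM_LIMIT` threshold, the integer-to-float widening step, and the rule that a value counts as numeric when Python's `int` or `float` can parse it.

```python
from collections import Counter

ENUM_LIMIT = 3  # hypothetical cap on distinct values for an enumeration


def classify(value):
    """Return 'integer', 'float' or 'text' for a single data value."""
    for kind, parse in (("integer", int), ("float", float)):
        try:
            parse(value)
            return kind
        except ValueError:
            pass
    return "text"


class TypeInferencer:
    def __init__(self):
        self.inferred_type = None        # undecided until the first value
        self.min = None
        self.max = None
        self.symhash = set()             # distinct values seen so far
        self.chars_frequency = Counter() # character counts for a text encoder

    def observe(self, value):
        self.symhash.add(value)
        self.chars_frequency.update(value)
        kind = classify(value)
        if kind != "text":
            number = float(value)
            self.min = number if self.min is None else min(self.min, number)
            self.max = number if self.max is None else max(self.max, number)
        if self.inferred_type is None:
            self.inferred_type = "enumeration" if kind == "text" else kind
        elif self.inferred_type == "integer" and kind == "float":
            self.inferred_type = "float"   # assumed widening step
        elif self.inferred_type in ("integer", "float") and kind == "text":
            self.inferred_type = "string"  # integer -> string
        if self.inferred_type == "enumeration" and len(self.symhash) > ENUM_LIMIT:
            self.inferred_type = "string"  # enumeration -> string


def infer(values):
    """Feed all values of one element to a fresh engine and return it."""
    engine = TypeInferencer()
    for value in values:
        engine.observe(value)
    return engine


print(infer(["120", "150", "100", "130"]).inferred_type)  # integer
print(infer(["red", "green", "red"]).inferred_type)       # enumeration
print(infer(["12", "n/a"]).inferred_type)                 # string
```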

XML Encoder

  • The most significant bit (MSB) of an encoded value is 0 for data and 1 for structure

ARAE

  • ARAE: Approximated Reverse Arithmetic Encoder

    • Ensures the MSB of encoded value is 1

    • Truncates the last byte of the floating-point value

    • Truncation does not change the containment relationship

    • May incur inefficiency if too much is truncated

Encoder Algorithm

Query Processing

  • If the query is too long, its interval becomes too small

  • Split the query into intervals with sizes greater than 2^-15

  • Look for the sequence of split intervals

  • Generally, the sequence length is 1

Query Processing

  • Exact matching conditions are encoded

  • Range queries on numerical values are evaluated directly on the compressed data

  • Partial decompression is needed for range queries on strings

    • Huffman and Dictionary encoding do not preserve order information
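
A small sketch shows why numeric range queries need no decompression. It assumes numbers are stored as offsets from the minimum value, as in the ‘120’, ‘150’, ‘100’, ‘130’ example; the function names are illustrative. Subtracting a constant preserves order, so the query bounds can be translated once and compared against the encoded values.

```python
def encode_numeric(values):
    """Store each number as an offset from the minimum value."""
    base = min(values)
    return base, [v - base for v in values]


def range_query(base, encoded, low, high):
    """Positions whose original value lies in [low, high], without decoding."""
    enc_low, enc_high = low - base, high - base  # translate the bounds once
    return [i for i, code in enumerate(encoded) if enc_low <= code <= enc_high]


base, encoded = encode_numeric([120, 150, 100, 130])
print(encoded)                               # [20, 50, 0, 30]
print(range_query(base, encoded, 110, 135))  # [0, 3]
```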

Experimental Results

Experiments

  • Extensive experiments on real-life data with different characteristics

Compression Ratios

Sample Queries

  • Different types of queries are run:

Query Evaluation Time

Conclusion and Future Work

  • The novel approach, “Reverse Arithmetic Encoding”, is successful

  • Superior to XGRIND

  • Future work: support for complex data types

    • e.g. Uniform Resource Identifier (URI)
