XML Compression

Aslam Tajwala

Kalyan Chakravorty


Overview

  • Motivation for XML Compression

  • Techniques for achieving XML compression

  • XMill

  • XMill Architecture


Why Compress XML?

  • The structured nature of XML makes it understandable to humans

  • Downside: XML is verbose

    • Each non-empty element must end with a closing tag that matches its start tag, e.g. <tag>data</tag>

    • Ordering of tags is often repeated in a document (e.g. multiple records)


Why Compress XML?: 2

  • XML documents are text-based: well-known compression schemes such as Huffman and LZ can be easily applied

  • Compression can yield significant savings because of the highly structured nature of XML

  • XML is being used more frequently in real-time applications (e.g. web service-based e-commerce applications), so there is increasing interest in finding ways to reduce the overall size of XML documents


Using Huffman/LZ

  • There is usually some degree of repetition in an XML document (multiple occurrences of tags, attributes, or data values)

  • Compression schemes like Huffman and LZ can use this repetition to achieve some degree of compression


Using Huffman/LZ: 2

  • Many existing (and efficient) implementations of these algorithms are readily available (e.g. gzip)

  • The downside is that these general-purpose techniques cannot fully exploit the structure of XML to achieve greater compression
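
As a rough illustration of applying a general-purpose compressor to XML, the sketch below runs Python's zlib module (the DEFLATE library that gzip is built on) over a small, made-up document of repeated records. The record layout and the count of 200 are assumptions chosen only for illustration; actual savings depend on the document.

```python
import zlib

# Hypothetical document: 200 records that all share the same tag layout.
records = "".join(
    f'<Employee id="{i}"><Name>Employee {i}</Name></Employee>'
    for i in range(200)
)
xml_bytes = f"<Employees>{records}</Employees>".encode("utf-8")

# Compress the whole document as one flat byte stream (level 9 = best ratio).
compressed = zlib.compress(xml_bytes, 9)

print(len(xml_bytes), "bytes before,", len(compressed), "bytes after")

# Decompression restores the exact original bytes.
restored = zlib.decompress(compressed)
```

Note that zlib sees only bytes here: it benefits from the repeated tags but has no notion of elements, attributes, or data types.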


Huffman Encoding Example

  • ACDABA

  • Since these are 6 characters, this text is 6 bytes or 48 bits long

  • A tree is built that replaces the symbols with shorter bit sequences. In this particular case, the algorithm would use the following substitution table: A=0, B=10, C=110, D=111

  • 01101110100 (ACDABA = 11 bits)
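
The arithmetic on this slide can be checked with a short sketch. The table below is the one from the slide; the function names (`huffman_code_lengths`, `encode`, `decode`) are hypothetical helpers introduced here, and the tree builder is a minimal version that returns only code lengths, not a production Huffman coder.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code of text."""
    # Heap entries: (weight, tie_breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one bit deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Substitution table taken from the slide.
TABLE = {"A": "0", "B": "10", "C": "110", "D": "111"}


def encode(text, table):
    return "".join(table[s] for s in text)


def decode(bits, table):
    inverse = {code: s for s, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free code: first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "ACDABA"
bits = encode(text, TABLE)
lengths = huffman_code_lengths(text)
total_bits = sum(lengths[s] for s in text)
print(bits, len(bits), total_bits)
```

Because A occurs three times and the other symbols once each, A receives the shortest code, which is where the saving over 48 bits comes from.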


LZ77 Example (Dictionary-Based Compressors)

  • Lempel-Ziv 77 algorithm

  • The dictionary is a portion of the previously encoded sequence

  • The encoder examines the input sequence through a sliding window

  • The window consists of two parts:

    • a search buffer that contains a portion of the recently encoded sequence, and

    • a look-ahead buffer that contains the next portion of the sequence to be encoded.
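
The sliding window described above can be sketched as follows. This is a minimal, unoptimized illustration, not the encoder used by any particular tool: the buffer sizes and the (offset, length, next character) output format are assumptions, since the slide does not specify them.

```python
def lz77_encode(data, search_size=16, lookahead_size=8):
    """Encode data as a list of (offset, length, next_char) triples."""
    triples = []
    pos = 0
    while pos < len(data):
        best_offset, best_length = 0, 0
        start = max(0, pos - search_size)  # left edge of the search buffer
        # Keep one character in reserve so next_char always exists.
        max_length = min(lookahead_size, len(data) - pos - 1)
        for candidate in range(start, pos):
            length = 0
            while (length < max_length
                   and data[candidate + length] == data[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = pos - candidate, length
        next_char = data[pos + best_length]
        triples.append((best_offset, best_length, next_char))
        pos += best_length + 1
    return triples


def lz77_decode(triples):
    out = []
    for offset, length, next_char in triples:
        for _ in range(length):
            out.append(out[-offset])  # copy from `offset` characters back
        out.append(next_char)
    return "".join(out)


sample = "<tag>data</tag><tag>data</tag>"
encoded = lz77_encode(sample)
print(encoded)
```

Each triple says "go back `offset` characters, copy `length` characters, then emit `next_char`"; a triple with length 0 carries a literal character only.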


XMill (Liefke and Suciu, 2000)

  • Relies heavily on zlib, the compression library used in gzip

  • Also defines a few data type specific compressors; user-defined compressors can be added using SCAPI (Semantic Compressor API)

  • During compression, each XML tag is examined to see which compression technique(s) should be applied


XML Compression

  • View XML as a tree

  • Separate the tree structure and what is stored in leaves

  • Save the tree structure so that it can be restored

  • The compressed file may or may not remember the tree structure


XMill: Compression Strategy

  • XMill applies 3 principles during compression:

    • Separate structure (element tags and attribute names) from data

    • Group related data items in a single container; compress each container separately

    • Apply appropriate semantic compressors to each container


XMill – Separating Structure From Content

  • Start tags and attribute names are dictionary-encoded (as T1, T2, etc.)

  • End tags replaced with ‘/’ token

  • Data values replaced with their container number


XMill – Separating Structure From Content 2

<Employees>
  <Employee id="1">Homer Simpson</Employee>
  <Employee id="2">Frank Grimes</Employee>
</Employees>

Structure Container: T1 T2 T3 C3 / C4 / T2 T3 C3 / C4 / /

Dictionary: T1 => Employees, T2 => Employee, T3 => @id

Container C3: 1, 2

Container C4: Homer Simpson, Frank Grimes
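
The separation shown in the Employees example can be reproduced with a short sketch. This is an illustration of the idea only, not XMill's code: the function name `separate` is hypothetical, containers are grouped simply by tag or attribute name, and container numbering starts at 3 purely so that the output matches the slide.

```python
import xml.etree.ElementTree as ET


def separate(xml_text, first_container=3):
    """Split a document into structure tokens, a tag dictionary, and containers."""
    tags = {}        # name -> "T<n>" dictionary token
    containers = {}  # name -> ("C<n>", [values])
    structure = []

    def tag_token(name):
        if name not in tags:
            tags[name] = f"T{len(tags) + 1}"
        return tags[name]

    def store(name, value):
        if name not in containers:
            containers[name] = (f"C{first_container + len(containers)}", [])
        token, values = containers[name]
        values.append(value)
        return token

    def walk(elem):
        structure.append(tag_token(elem.tag))
        for attr, value in elem.attrib.items():
            structure.append(tag_token("@" + attr))
            structure.append(store("@" + attr, value))
            structure.append("/")          # closes the attribute
        text = (elem.text or "").strip()
        if text:
            structure.append(store(elem.tag, text))
        for child in elem:
            walk(child)
        structure.append("/")              # closes the element

    walk(ET.fromstring(xml_text))
    return structure, tags, containers


doc = """<Employees>
<Employee id="1">Homer Simpson</Employee>
<Employee id="2">Frank Grimes</Employee>
</Employees>"""

structure, tags, containers = separate(doc)
print(" ".join(structure))
print(tags)
print(containers)
```

The sketch ignores text that follows a child element (mixed content) and whitespace-only text, which is enough for record-style documents like this one.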


XMill: Container Expressions

  • Users can override default settings using the container expression language

    • Specify container membership

    • Specify which semantic compressor(s) are applied for each container

  • E.g. to indicate all ‘Name’ and ‘Location’ tags should be grouped in the same container:

    xmill -p //(Name | Location) employees.xml


XMill: Semantic Compressors


XMill: Semantic Compressors 2


XMill: Semantic Compressors 3

  • Text compressor is applied to each element by default

  • User can add other instructions via command line:

    xmill -p //price=>i file.xml

    Applies the integer compressor to each occurrence of the ‘price’ element in file.xml
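
To give a feel for why a type-specific compressor helps, here is a minimal sketch of a binary integer encoding. It is not XMill's actual integer compressor, whose byte format is not described in these slides; the variable-length scheme and the function names `encode_integers` and `decode_integers` are assumptions made for illustration.

```python
def encode_integers(values):
    """Pack decimal strings of non-negative integers into variable-length bytes."""
    out = bytearray()
    for text in values:
        n = int(text)
        if n < 0:
            raise ValueError("this sketch handles non-negative integers only")
        while True:
            byte = n & 0x7F          # low 7 bits of the remaining value
            n >>= 7
            if n:
                out.append(byte | 0x80)  # high bit set: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)


def decode_integers(blob):
    values, n, shift = [], 0, 0
    for byte in blob:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(str(n))
            n, shift = 0, 0
    return values


# Hypothetical contents of a 'price' container.
prices = ["5", "120", "4999", "300"]
packed = encode_integers(prices)
as_text = "\n".join(prices).encode("utf-8")
print(len(as_text), "bytes as text,", len(packed), "bytes packed")
```

A value with leading zeros (e.g. "007") would not round-trip through this sketch, which is one reason a real semantic compressor has to check that a value really fits the declared type.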


XMill Architecture (1/3)


XMill Architecture (2/3)

  • SAX Parser

    • sends tokens to the path processor.

  • Path Processor

    • determines how to map data values to containers.

  • Semantic Compressors

    • compress the input and copy it to the container in the memory window.

    • E.g. binary encoding of integers, differential compressors.

  • When the window is filled, all containers are gzipped and stored on disk, and compression resumes.
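
The flow from parser to containers to gzip can be sketched with Python's standard xml.sax and zlib modules. This is a simplified illustration under stated assumptions, not XMill's implementation: the class name `ContainerHandler`, the 1024-byte window, and the choice to key containers by full element path are all hypothetical, and the structure container and semantic compressors are left out.

```python
import xml.sax
import zlib
from collections import defaultdict


class ContainerHandler(xml.sax.ContentHandler):
    """SAX handler that routes data values into per-path containers."""

    def __init__(self, window_bytes=1024):
        super().__init__()
        self.window_bytes = window_bytes  # hypothetical memory window size
        self.path = []                    # current element path
        self.containers = defaultdict(list)
        self.used = 0                     # bytes currently held in the window
        self.flushed = []                 # stands in for blocks written to disk

    def startElement(self, name, attrs):
        self.path.append(name)
        for attr, value in attrs.items():
            self._store("/".join(self.path) + "/@" + attr, value)

    def endElement(self, name):
        self.path.pop()

    def characters(self, content):
        text = content.strip()
        if text:
            self._store("/".join(self.path), text)

    def _store(self, container, value):
        # Path processor step: the container is chosen from the element path.
        self.containers[container].append(value)
        self.used += len(value.encode("utf-8"))
        if self.used >= self.window_bytes:
            self.flush()

    def flush(self):
        # Window full (or end of input): compress each container separately.
        if not self.containers:
            return
        block = {
            name: zlib.compress("\n".join(values).encode("utf-8"))
            for name, values in self.containers.items()
        }
        self.flushed.append(block)
        self.containers.clear()
        self.used = 0


def compress(xml_text, window_bytes=1024):
    handler = ContainerHandler(window_bytes)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    handler.flush()  # write out whatever is left in the window
    return handler.flushed


doc = ('<Employees><Employee id="1">Homer Simpson</Employee>'
       '<Employee id="2">Frank Grimes</Employee></Employees>')
blocks = compress(doc)
print(sorted(blocks[0]))
```

Passing a small `window_bytes` value forces several flushes, which mimics how a large document would be written out as a sequence of independently compressed blocks.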


Performance Evaluation (1/2)


Performance Evaluation (2/2)


References

  • XMill: An Efficient Compressor for XML Data

  • XGrind: A Query-Friendly XML Compressor

  • www.cs.washington.edu/homes/suciu/COURSES/590DS/19compression.ppt


XML Compression

  • Questions?
