Managing XML and Semistructured Data

Managing XML and Semistructured Data. Lecture 19: Compressing XML Data. Prof. Dan Suciu. Spring 2001. In this lecture: XML compression (motivation; XMill approach and results). Resources: "XMILL: An Efficient Compressor for XML Data" by Liefke and Suciu, in SIGMOD 2001.


Managing XML and Semistructured Data

Lecture 19: Compressing XML Data

Prof. Dan Suciu

Spring 2001


In this lecture

  • XML Compression

    • Motivation

    • XMill approach and results

  • Resources

    • XMILL: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2001


Compression: The Problem

  • XML is used for data exchange, where size matters (space or time)

  • but XML is verbose

  • users prefer application specific formats:

    • Web Server Logs

    • EMBL

    • G2

  • is XML doomed to fail?


An Example: Web Server Logs

ASCII file: 15.9 MB (gzipped: 1.6 MB):

202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)

XML-ized, the file inflates to 24.2 MB (gzipped: 2.1 MB):

<apache:entry>
  <apache:host>202.239.238.16</apache:host>
  <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
  <apache:contentType>text/html</apache:contentType>
  <apache:statusCode>200</apache:statusCode>
  <apache:date>1997/10/01-00:00:02</apache:date>
  <apache:byteCount>4478</apache:byteCount>
  <apache:referer>http://www.net.jp/</apache:referer>
  <apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
</apache:entry>
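The mapping from a pipe-delimited log line to the XML form can be sketched in a few lines of Python. This is only an illustration: the column-to-element mapping in `FIELDS` is an assumption read off the slide's example (the "-" columns are dropped, as in the slide's XML), and `log_line_to_xml` is a hypothetical helper, not part of any tool mentioned in the lecture.

```python
from xml.sax.saxutils import escape

# Element names come from the slide's XML. None marks a "-" column that the
# slide's XML omits. The exact column mapping is an assumption.
FIELDS = ["host", "requestLine", "contentType", "statusCode", "date",
          None, "byteCount", None, None, "referer", "userAgent"]

def log_line_to_xml(line: str) -> str:
    """Wrap each named field of one log line in an <apache:...> element."""
    values = line.rstrip("\n").split("|")
    parts = ["<apache:entry>"]
    for name, value in zip(FIELDS, values):
        if name is None:
            continue  # skip placeholder columns
        parts.append(f"<apache:{name}>{escape(value)}</apache:{name}>")
    parts.append("</apache:entry>")
    return "\n".join(parts)

line = ("202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02"
        "|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)")
xml_entry = log_line_to_xml(line)
print(xml_entry)
print(f"log line: {len(line)} chars, XML entry: {len(xml_entry)} chars")
```

Every value is now surrounded by an opening and a closing tag, which is where the inflation of the XML-ized file comes from.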


XMill

  • specialized compressor for XML data

  • makes XML look “small”

  • Download:

    • Now: www.research.att.com/sw/tools/xmill

    • Soon: www.cs.washington.edu/homes/suciu/XMILL


How XMill Works: Three Ideas

Compress the structure separately from the data:

gzip Structure + gzip Data = 1.75MB

Structure:

<apache:entry>
<apache:host> </apache:host>
. . .
</apache:entry>

Data:

202.239.238.16
GET / HTTP/1.0
text/html
200
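A minimal sketch of this first idea, assuming a very simplified XML subset (no attributes, comments, or self-closing tags). The regex tokenizer, the `#` placeholder, and the synthetic entries are all assumptions for illustration; XMill's actual structure encoding is more compact than this. `zlib` is used because it implements the same DEFLATE algorithm as gzip.

```python
import re
import zlib

def split_structure_and_data(xml_text: str):
    """Split XML text into a structure stream (tags) and a data stream (values)."""
    structure, data = [], []
    for token in re.split(r"(<[^>]+>)", xml_text):
        if not token.strip():
            continue                    # skip empty/whitespace pieces between tags
        if token.startswith("<"):
            structure.append(token)     # tags go to the structure stream
        else:
            structure.append("#")       # placeholder marks where a value was
            data.append(token.strip())  # the value goes to the data stream
    return "".join(structure), "\n".join(data)

# Synthetic entries modeled on the slide's example (the values are made up).
entries = "".join(
    f"<apache:entry><apache:host>202.239.238.{i % 256}</apache:host>"
    f"<apache:requestLine>GET /page{i} HTTP/1.0</apache:requestLine>"
    f"<apache:statusCode>200</apache:statusCode></apache:entry>"
    for i in range(1000)
)

structure, data = split_structure_and_data(entries)
whole = len(zlib.compress(entries.encode()))
separate = (len(zlib.compress(structure.encode()))
            + len(zlib.compress(data.encode())))
print(f"compressed whole: {whole} bytes")
print(f"compressed structure + compressed data: {separate} bytes")
```

The structure stream is highly repetitive once the values are pulled out, so a general-purpose compressor can handle it very cheaply; how much is gained overall depends on the data.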


How XMill Works: Three Ideas

Group the data values according to their types:

gzip Structure + gzip Data1 + gzip Data2 = 1.33MB

Structure:

<apache:entry>
. . .
</apache:entry>

Data1:

202.23.23.16
224.42.24.55

Data2:

GET / HTTP/1.0
GET / HTTP/1.1
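The grouping step can be sketched as follows, assuming that a value's "type" is simply the name of its enclosing element and that the input has no attributes, comments, or self-closing tags. The function name `split_into_containers` and the `#` placeholder are hypothetical choices for this illustration.

```python
import re
import zlib
from collections import defaultdict

def split_into_containers(xml_text: str):
    """Route each text value to a container keyed by its enclosing tag name."""
    structure = []
    containers = defaultdict(list)
    open_tags = []                          # stack of currently open elements
    for token in re.split(r"(<[^>]+>)", xml_text):
        if not token.strip():
            continue
        if token.startswith("</"):
            open_tags.pop()                 # closing tag: leave the element
            structure.append(token)
        elif token.startswith("<"):
            open_tags.append(token[1:-1])   # opening tag: remember its name
            structure.append(token)
        else:
            structure.append("#")           # placeholder for the value
            containers[open_tags[-1]].append(token.strip())
    return "".join(structure), dict(containers)

doc = (
    "<apache:entry><apache:host>202.23.23.16</apache:host>"
    "<apache:requestLine>GET / HTTP/1.0</apache:requestLine></apache:entry>"
    "<apache:entry><apache:host>224.42.24.55</apache:host>"
    "<apache:requestLine>GET / HTTP/1.1</apache:requestLine></apache:entry>"
)
structure, containers = split_into_containers(doc)

# Each container is compressed on its own, so similar values sit next to each other.
sizes = {tag: len(zlib.compress("\n".join(values).encode()))
         for tag, values in containers.items()}
print(containers)
print(sizes)
```

Keeping all hosts in one container and all request lines in another places similar strings side by side, which is what lets the general-purpose compressor find more redundancy.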


How XMill Works: Three Ideas

Apply semantic (specialized) compressors:

gzip Structure + gzip c1(Data1) + gzip c2(Data2) + ... = 0.82MB

  • Examples:

    • 8, 16, 32-bit integer encoding (signed/unsigned)

    • differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)

    • compress lists, records (e.g. 104.32.23.1 → 4 bytes)

  • Need user input to select the semantic compressor
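The three kinds of semantic compressor listed above can be illustrated with small Python functions. These are sketches of the general techniques, not XMill's own encoders; all function names are hypothetical, and the choice of a big-endian unsigned 16-bit layout is an assumption.

```python
import ipaddress
import struct

def encode_ip(text: str) -> bytes:
    """Record compressor: a dotted IPv4 address becomes 4 bytes."""
    return ipaddress.IPv4Address(text).packed

def encode_uint16(text: str) -> bytes:
    """Integer compressor: a decimal string becomes an unsigned 16-bit value."""
    return struct.pack(">H", int(text))

def delta_encode(values):
    """Differential compressor: store each value as the difference from the previous one."""
    deltas, previous = [], 0
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas

def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values

ip_bytes = encode_ip("104.32.23.1")
status_bytes = encode_uint16("200")
years = [1999, 1995, 2001, 2000, 1995]
deltas = delta_encode(years)
print(len(ip_bytes), len(status_bytes), deltas)
```

Each encoder only works for values of the right shape (for example, `encode_uint16` fails on anything outside 0 to 65535), which is one reason the user has to say which compressor applies to which container.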


XML Compression


Compression Tradeoff


Summary of XML Data Management

  • XML =

    • old data type (trees)

    • with new interpretation (data)

  • We discussed traditional management techniques for XML:

    • Data model

    • Query language

    • Optimizations

    • ...

  • Many traditional problems still unsolved (storage, processing, optimization, ...)


Summary of XML Data Management

  • More interesting question:

    • what are the novel applications enabled by XML?

  • Some ideas:

    • Approximate queries over unfamiliar data instances

      • “Search the database for a pattern similar to this one”

      • Rank results based on their similarity to the pattern

      • What is an appropriate query language for that?

    • Linking independent databases

      • We have XLink; how do we use it?

