Managing XML and Semistructured Data
Managing XML and Semistructured Data. Lecture 19: Compressing XML Data. Prof. Dan Suciu. Spring 2001. In this lecture: XML compression (motivation; XMill approach and results). Resources: XMILL: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2001.
Managing XML and Semistructured Data

Lecture 19: Compressing XML Data

Prof. Dan Suciu

Spring 2001

In this lecture
  • XML Compression
    • Motivation
    • XMill approach and results

Resources

  • XMILL: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2001
Compression: The Problem
  • XML is meant for data exchange, where size matters (space or time)
  • but XML is verbose
  • so users prefer application-specific formats:
    • Web server logs
    • EMBL
    • G2
  • Is XML doomed to fail?
An Example: Web Server Logs

ASCII file, 15.9 MB (1.6 MB gzipped):

202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)

Converted to XML, the same log inflates to 24.2 MB (2.1 MB gzipped):

<apache:entry>
  <apache:host>202.239.238.16</apache:host>
  <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
  <apache:contentType>text/html</apache:contentType>
  <apache:statusCode>200</apache:statusCode>
  <apache:date>1997/10/01-00:00:02</apache:date>
  <apache:byteCount>4478</apache:byteCount>
  <apache:referer>http://www.net.jp/</apache:referer>
  <apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
</apache:entry>
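The conversion in this example can be sketched in a few lines of Python. This is only an illustration of why the XML form is larger: the field order is read off the sample log line, the three fields that appear as "-" have no name on the slide and are left unnamed here, and `log_line_to_xml` is a hypothetical helper, not part of any tool mentioned in the lecture.

```python
from xml.sax.saxutils import escape

# Field order inferred from the sample line; None marks fields the slide never names.
FIELDS = ["host", "requestLine", "contentType", "statusCode", "date",
          None, "byteCount", None, None, "referer", "userAgent"]

def log_line_to_xml(line: str) -> str:
    """Wrap each named, non-empty field of a pipe-delimited log line in an apache:* element."""
    parts = ["<apache:entry>"]
    for name, value in zip(FIELDS, line.split("|")):
        if name is None or value == "-":
            continue  # unnamed or empty fields are omitted, as in the slide's XML
        parts.append(f"<apache:{name}>{escape(value)}</apache:{name}>")
    parts.append("</apache:entry>")
    return "\n".join(parts)

line = ("202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02"
        "|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)")
xml = log_line_to_xml(line)
print(len(line), len(xml))  # the XML entry is several times longer than the raw line
```

Every value now carries its element name twice (opening and closing tag), which is where the extra megabytes come from.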

XMill
  • specialized compressor for XML data
  • makes XML look “small”
  • Download:
    • Now: www.research.att.com/sw/tools/xmill
    • Soon: www.cs.washington.edu/homes/suciu/XMILL
How XMill Works: Three Ideas

Compress the structure separately from the data:

gzip Structure + gzip Data = 1.75 MB

  • Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
  • Data: 202.239.238.16, GET / HTTP/1.0, text/html, 200
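A minimal sketch of the first idea, assuming a toy document and a deliberately naive tokenizer. It is not XMill's actual container format: the `#` placeholder, the function name `split_structure_and_data`, and the synthetic `doc` are all made up for illustration. The point is only that tags and values end up in two separate streams, each compressed on its own.

```python
import gzip
import re

def split_structure_and_data(xml: str):
    """Return (structure, data): the tag skeleton and the text values, both in document order."""
    tokens = re.findall(r"<[^>]+>|[^<]+", xml)
    structure, data = [], []
    for tok in tokens:
        if tok.startswith("<"):
            structure.append(tok)
        elif tok.strip():
            structure.append("#")        # placeholder marking where a value was removed
            data.append(tok.strip())
    return "".join(structure), "\n".join(data)

# Synthetic log-like document (hypothetical data, not the lecture's file).
doc = "".join(
    f"<entry><host>10.0.0.{i % 250}</host><status>200</status></entry>"
    for i in range(1000)
)
structure, data = split_structure_and_data(doc)

whole = len(gzip.compress(doc.encode()))
split = len(gzip.compress(structure.encode())) + len(gzip.compress(data.encode()))
print(whole, split)  # compare one gzip stream against structure + data streams
```

The actual sizes depend on the input; the figures on the slide come from the lecture's web-log file, not from this toy data.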

How XMill Works: Three Ideas

Group the data values according to their types:

gzip Structure + gzip Data1 + gzip Data2 = 1.33 MB

  • Structure: <apache:entry> . . . </apache:entry>
  • Data1: 202.23.23.16, 224.42.24.55
  • Data2: GET / HTTP/1.0, GET / HTTP/1.1
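The grouping step might look like the following sketch, which uses the element name as a stand-in for the value's "type". That choice, the helper name `group_values_by_tag`, and the two-entry sample document are assumptions for illustration; the `apache:` prefix is dropped so the standard parser accepts the snippet without a namespace declaration.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict

def group_values_by_tag(xml_text: str) -> dict:
    """Collect text values into one container (list) per element name."""
    root = ET.fromstring(xml_text)
    containers = defaultdict(list)
    for elem in root.iter():
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
    return dict(containers)

doc = (
    "<log>"
    "<entry><host>202.23.23.16</host><requestLine>GET / HTTP/1.0</requestLine></entry>"
    "<entry><host>224.42.24.55</host><requestLine>GET / HTTP/1.1</requestLine></entry>"
    "</log>"
)
containers = group_values_by_tag(doc)

# Each container is compressed independently, so similar values sit next to each other.
compressed = {tag: gzip.compress("\n".join(vals).encode())
              for tag, vals in containers.items()}
print(containers)
```

Keeping all host addresses in one stream and all request lines in another gives a general-purpose compressor longer runs of similar text to work with.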

How XMill Works: Three Ideas

Apply semantic (specialized) compressors:

gzip Structure + gzip c1(Data1) + gzip c2(Data2) + ... = 0.82 MB

  • Examples:
    • 8, 16, 32-bit integer encoding (signed/unsigned)
    • differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)
    • compress lists, records (e.g. 104.32.23.1 → 4 bytes)
  • Need user input to select the semantic compressor
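The three kinds of semantic compressor listed above can be illustrated with small stand-ins. These are hypothetical functions written for this transcript (`pack_ipv4`, `pack_uint16`, `delta_encode`, `delta_decode`); they show the idea of a type-aware encoding applied before gzip, not XMill's own compressors or how a user selects them.

```python
import struct

def pack_ipv4(addr: str) -> bytes:
    """Record compressor: a dotted-quad IP address becomes 4 bytes."""
    parts = [int(p) for p in addr.split(".")]
    if len(parts) != 4:
        raise ValueError(f"not a dotted-quad address: {addr!r}")
    return bytes(parts)  # bytes() rejects values outside 0..255

def pack_uint16(value: str) -> bytes:
    """Integer compressor: a decimal string becomes a 16-bit unsigned integer."""
    return struct.pack("<H", int(value))

def delta_encode(values):
    """Differential compressor: store each value as the difference from its predecessor."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Inverse of delta_encode."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

print(len(pack_ipv4("104.32.23.1")))                   # 4
print(len(pack_uint16("4478")))                        # 2
print(delta_encode([1999, 1995, 2001, 2000, 1995]))    # [1999, -4, 6, -1, -5]
```

The choice of encoder depends on knowing what a container holds, which is why the slide notes that user input is needed to select one.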
Summary of XML Data Management
  • XML =
    • old data type (trees)
    • with new interpretation (data)
  • We discussed traditional management techniques for XML:
    • Data model
    • Query language
    • Optimizations
    • ...
  • Many traditional problems still unsolved (storage, processing, optimization, ...)
Summary of XML Data Management
  • More interesting question:
    • What are the novel applications enabled by XML?

Some ideas:

  • Approximate queries over unfamiliar data instances
    • “Search the database for a pattern similar to this one”
    • Rank results based on their similarity to the pattern
    • What is an appropriate query language for that?
  • Linking independent databases
    • We have XLink; how do we use it?