
Optimizing XML Compression

Gregory Leighton and Denilson Barbosa, University of Alberta. August 24, 2009.


Presentation Transcript


  1. Optimizing XML Compression. Gregory Leighton, Denilson Barbosa. University of Alberta. August 24, 2009

  2. Motivation
  • XML’s verbose nature (repeated subtrees, lengthy tag names, …) can greatly inflate document size
  • Compression is an obvious solution
  • “Generic” algorithms (gzip, bzip2) often yield good compression rates for XML, yet hinder query processing since they don’t preserve the XML structure of the original file → one must decompress the entire document before accessing individual nodes in the XML tree
  • “XML-conscious” compression schemes have been developed which are capable of preserving document structure
  • For the latter schemes, choosing the “best” compression strategy is very difficult
  • Compressing similar data values together can reduce storage costs, but tends to increase time costs for compression and decompression
  • Our goal: to “optimize” the performance of XML-conscious compression schemes by suggesting a good trade-off between space savings and time costs

  3. XML-Conscious Compression Schemes
  Permutation-based approaches: the document is rearranged to localize repetitions before being passed to back-end compressor(s)
  • Data segments are grouped into different containers, typically based on the parent element’s type
  • Tag structure (“skeleton”) and data segments are compressed separately
  [Diagram: an XML document is fed to a shredder, which sends the skeleton to a structure compressor and the data containers to separate data compressors]

  4. [Example XML tree: a users document containing user elements, each with an @id attribute (“u-8026”, “u-9125”), a prestige value (“4.7”, “3.9”), and favorites grouping movies and music; movie elements carry title/year pairs (“Smart People”/“2008”, “Scary Movie 2”/“2001”, “Smoke”/“1995”), and a song carries title/artist (“A New Career in a New Town”/“David Bowie”)]

  5. Path-Based Partitioning Strategy

  6. Improving Permutation-Based XML Compression
  • Selecting a different container partitioning strategy
    • Consider additional factors besides the identity of the parent element (e.g., the data type of the content)
    • Grouping together containers with high pairwise similarity allows them to share the same compression model, reducing storage costs
    • Can improve random-access performance: group together values that frequently appear in the same queries
  • Choosing “better” compression algorithms for container subsets
    • Choose an algorithm best suited to the data type
    • Certain algorithms support operations over compressed data (e.g., the Huffman algorithm supports equality comparisons)
  • The goal is to find an optimal compression configuration specifying a partitioning strategy and an algorithm assignment for each container subset such that overall compression gain is maximized

  7. An Example Compression Configuration
  Terminology:
  • Container subset = a set of one or more containers
  • Container grouping = a set of one or more container subsets
  • Partitioning strategy = a container grouping that covers all containers
  Example: Containers B, E, and H store real numbers, so it may be beneficial to group them within a single container subset. A second container subset consists of containers C, F, and G, based on the intuition that these containers possess “similar” data (English text); the contents of these three containers are concatenated and compressed using a single run of the LZ77 algorithm. The remaining subsets in the diagram are assigned Huffman coding.
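In code, a compression configuration can be represented as a mapping from container subsets to algorithm names. The sketch below mirrors the subsets {B, E, H} and {C, F, G} named on this slide; the container universe A–H, the singleton subsets, and their Huffman assignment are illustrative assumptions, since the slide's diagram labels are not fully recoverable.

```python
# A compression configuration: a partitioning strategy (subsets covering all
# containers) plus one algorithm per subset. Subsets {B,E,H} and {C,F,G}
# follow the slide; the remaining singletons and their Huffman assignment
# are assumptions for illustration.
configuration = {
    frozenset("BEH"): "LZ77",     # real-numbered containers grouped together
    frozenset("CFG"): "LZ77",     # similar English text, one LZ77 run
    frozenset("A"): "Huffman",
    frozenset("D"): "Huffman",
}

# A partitioning strategy must cover every container exactly once.
covered = sorted(c for subset in configuration for c in subset)
assert covered == sorted("ABCDEFGH")
```

The `frozenset` keys make subsets hashable and order-independent, which matches the definition of a container subset as a set.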

  8. Evaluating Compression Configurations
  Three relevant measures:
  • Storage gain: measures the relative amount of space saved by applying a specific compression algorithm a to a container subset S, denoted gain(S, a)
    • Considers not just the compressed size of S, but also the space requirements of the data structures constructed by the compression source model (e.g., the Huffman tree)
  • Compression and decompression time costs: measure how long it takes to apply/reverse the compression process of a on S, denoted comp(S, a) and decomp(S, a)

  9. Discovering an Optimal Compression Configuration
  Problem: supplied with an algorithm set A, a set of data containers C, and upper bounds Tc and Td on compression/decompression time costs, find the compression configuration that maximizes the total storage gain while observing the bounds defined by Tc and Td.
  Theorem: selecting an optimal compression configuration is NP-hard.
  The “hardness” is due to the inherent difficulty of selecting an optimal partitioning strategy; algorithm assignment requires only polynomial time (w.r.t. |A| and |C|).

  10. An Approximation Algorithm for Compression Configuration Selection
  • Stage 1: use a branch-and-bound procedure to discover a set of candidate partitioning strategies
  • Stage 2: test the available compression algorithms against the set of candidate partitioning strategies, returning the resulting compression configuration that maximizes compression gain while observing the specified bounds on compression/decompression time costs

  11. Branch-and-Bound Procedure: Estimating Compressibility of Container Subsets
  Shannon’s entropy rate indicates the best compression any lossless scheme can achieve, but it is impractical to compute, so we instead turn to an approximate measure of compressibility.
  LZ76: Lempel and Ziv (1976) proposed a measure of finite string complexity that asymptotically approaches the entropy rate.
  Idea: parse the input string x from left to right, recursively building a set of phrases Px. The complexity CLZ(x) is given by the ratio of phrases per character: CLZ(x) = |Px| / |x|
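The left-to-right parsing can be sketched in Python as follows. This is a minimal reconstruction of the dictionary-growth rule described here (each new phrase extends the longest phrase already seen by one fresh character), checked against the slide-15 example; it is not the authors' implementation.

```python
def lz76_phrases(x: str) -> list[str]:
    """Parse x left to right into phrases; each new phrase extends the
    longest matching phrase already in the dictionary by one character."""
    seen: set[str] = set()
    phrases: list[str] = []
    i = 0
    while i < len(x):
        j = i + 1
        # grow the candidate phrase while its prefix is a known phrase
        while j < len(x) and x[i:j] in seen:
            j += 1
        seen.add(x[i:j])
        phrases.append(x[i:j])
        i = j
    return phrases

def c_lz(x: str) -> float:
    """Complexity C_LZ(x) = |P_x| / |x|: phrases per character."""
    return len(lz76_phrases(x)) / len(x)

# The slide-15 example: "aaabc" parses into <a> <aa> <b> <c>.
assert lz76_phrases("aaabc") == ["a", "aa", "b", "c"]
assert c_lz("aaabc") == 0.8
```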

  12. Branch-and-Bound Procedure: Estimating Storage Costs for a Container Subset
  The storage cost (in bits) associated with a container subset S is computed as
  storageCost(S) = t · (8 + log2(t))
  where t is the number of phrases in the LZ76 dictionary for S. The 8-bit term represents the cost of encoding the “innovative” character each time a new dictionary entry is created. Using a fixed-length binary encoding, each dictionary phrase reference requires log2(t) bits, or t · log2(t) bits across all t phrases.
  The storage cost for a container grouping is the sum of the storage costs of each container subset in that grouping.
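The cost formula translates directly into code; the function below is a one-line transcription of storageCost, checked against the 4-phrase dictionary of the example on slide 15.

```python
from math import log2

def storage_cost(t: int) -> float:
    """storageCost(S) = t * (8 + log2(t)) bits for a t-phrase LZ76
    dictionary: 8 bits for each phrase's innovative character plus a
    fixed-length log2(t)-bit reference per phrase."""
    return t * (8 + log2(t))

# Slide-15 example: 4 phrases -> 4 * (8 + 2) = 40 bits.
assert storage_cost(4) == 40.0
```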

  13. Branch-and-Bound Procedure: Modeling Compression Gain
  Local compression gain (localGain): indicates the overall gain achieved by a grouping G; computed as the sum of the localGain values of each container subset S in G.
  The localGain for each subset S is computed as
  localGain(S) = max{0, 8 · |S| − (CLZ(S) · |S| + storageCost(S))}
  where 8 · |S| is the uncompressed size of S in bits, CLZ(S) · |S| is the estimated size of the compressed S, storageCost(S) is the size of the LZ76 dictionary for S, and |S| denotes the total byte length of the contents of S.
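A sketch of this gain estimate in Python, with the formula reconstructed from the arithmetic of the worked example on slide 15 (the function takes the LZ76 phrase count t as an input rather than recomputing the parse):

```python
from math import log2

def local_gain(s: str, t: int) -> float:
    """localGain(S) = max{0, 8|S| - (C_LZ(S)*|S| + storageCost(S))},
    where t is the number of LZ76 phrases for S; all quantities in bits."""
    c = t / len(s)                    # C_LZ(S), phrases per character
    cost = t * (8 + log2(t))          # storageCost(S)
    return max(0.0, 8 * len(s) - (c * len(s) + cost))

# Slide-15 example: 5*8 - (0.8*5 + 40) is negative, so the gain is
# clamped to 0 and the container is best left uncompressed.
assert local_gain("aaabc", 4) == 0.0
```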

  14. Branch-and-Bound Procedure: Modeling Compression Gain
  Maximum potential compression gain (mpGain): indicates an upper bound on the achievable localGain for any partitioning strategy that uses the current grouping G as a “starting point” (i.e., one that preserves the subset placement of all containers present in G).
  • The gain is maximized for a container subset when CLZ(S) and storageCost(S) are minimized; this happens when the longest applicable phrase is created at every step during the LZ76 parsing of S, by appending a new character to the end of the longest phrase currently in the dictionary
  • The mpGain calculation consists of building successively longer phrases, until all remaining characters in the original container set C have been processed
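The phrase-extension loop behind mpGain can be sketched as below. The inputs mirror the worked example on the next slide (the phrase lengths of the current dictionary, the characters already parsed, and the characters still to come); the function itself is an illustrative reconstruction, not the authors' code.

```python
from math import log2

def mp_gain(phrase_lengths, n_processed, n_remaining):
    """Upper-bound gain: repeatedly extend the longest dictionary phrase
    by one character until the remaining characters are exhausted, then
    apply the localGain formula to the resulting optimistic dictionary."""
    total = n_processed + n_remaining
    lengths = list(phrase_lengths)
    while n_remaining > 0:
        longest = max(lengths)
        if longest < n_remaining:
            lengths.append(longest + 1)   # new phrase of length L + 1
            n_remaining -= longest + 1
        else:
            break  # cover the rest with an existing phrase; no new entry
    t = len(lengths)                      # optimistic phrase count
    c = t / total                         # optimistic C_LZ
    cost = t * (8 + log2(t))              # optimistic storageCost
    return max(0.0, 8 * total - (c * total + cost))

# Slide-15 example: dictionary <a> <aa> <b> <c> (lengths 1,2,1,1), 5 parsed
# characters, 5 more to come -> one new phrase <aaa>, ~23.39 bits of gain.
assert abs(mp_gain([1, 2, 1, 1], 5, 5) - 23.3904) < 1e-3
```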

  15. Example Gain Calculation
  Assume there are two containers, C1 = {aaabc} and C2, containing an additional 5 characters.
  The localGain of C1 is computed by performing an LZ76 parsing of C1, which yields the dictionary <a> <aa> <b> <c>:
  • CLZ(C1) = 4 phrases / 5 characters = 0.8
  • storageCost(C1) = 4 · (8 + log2(4)) = 40 bits
  • localGain(C1) = max{0, 5 · 8 − (0.8 · 5 + 4 · (8 + log2(4)))} = 0, so C1 should be left uncompressed
  To compute mpGain(C1), we continue performing the following steps until the 5 characters from C2 have been processed (nUnprocessedChars: 5 → 2 → 0):
  • Select the longest phrase P in the dictionary, having length L
  • If L < nUnprocessedChars, add to the dictionary a new phrase of length L + 1 by appending a new character to P, and subtract L + 1 from nUnprocessedChars
  • Otherwise, choose an existing phrase of length nUnprocessedChars and stop
  Here this adds the single new phrase <aaa>, giving:
  • Updated CLZ = 5 phrases / 10 characters = 0.5
  • Updated storageCost = 5 · (8 + log2(5)) ≈ 51.6096 bits
  • mpGain ≈ 10 · 8 − (0.5 · 10 + 51.6096) ≈ 23.3904 bits

  16. Branch-and-Bound Procedure: Search Tree
  • Each tree node corresponds to a container grouping
  • Each node stores local compression gain (localGain) and maximum potential compression gain (mpGain) values for its container grouping
  • Nodes at depth i in the tree represent all possibilities for assigning container i to a container grouping (e.g., {C1} branches into {C1,C2} and {C1},{C2})
  • The remaining nodes at the bottom level of the search tree represent the set of candidate partitioning strategies
  • It is crucial to avoid enumerating the entire search tree
  Bounding criterion: kill each subtree rooted at a node n for which mpGain(n) < optGain − δ, where optGain is the best local gain witnessed so far and δ is a threshold value

  17. Example: C = {C1, C2, C3}, where C1 = {aaabcaaabcaaabcabcab}, C2 = {15720653197608243849}, C3 = {abcababcbaaaabcabcab}, and δ = 15.0 bits
  • The root {C1} has localGain = 50.4707 and mpGain = 300.6970; its children are {C1,C2} (localGain = 0, mpGain = 108.6180) and {C1},{C2} (localGain = 50.4707, mpGain = 168.9804)
  • {C1},{C2} has the highest mpGain, so we explore it first; its children are {C1,C3},{C2} (localGain = mpGain = 126.3966), {C1},{C2,C3} (localGain = mpGain = 50.4707), and {C1},{C2},{C3} (localGain = mpGain = 100.9413)
  • The mpGains of the latter two children are less than 126.3966 − 15.0 = 111.3966, so both are killed
  • The mpGain of {C1,C2} is also less than 111.3966, so that subtree is killed without being explored
  • {C1,C3},{C2} is the only candidate partitioning strategy
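The search over partitions with the mpGain bound can be sketched as follows. `local_gain_of` and `mp_gain_of` are assumed to be supplied scoring callables (e.g., built from the LZ76-based estimates of the earlier slides); the depth-wise child generation and pruning follow the slides, but the data layout is an illustrative assumption, not the authors' implementation.

```python
def candidate_strategies(containers, delta, local_gain_of, mp_gain_of):
    """Branch-and-bound sketch: nodes at depth i place container i into an
    existing subset or a new singleton subset; subtrees (and leaves) whose
    mpGain falls below the best localGain seen so far minus delta are killed."""
    best = 0.0       # best local gain witnessed so far (optGain)
    leaves = []      # complete partitioning strategies reached

    def expand(grouping, i):
        nonlocal best
        best = max(best, local_gain_of(grouping))
        if i == len(containers):
            leaves.append(grouping)
            return
        children = []
        for k in range(len(grouping)):            # join an existing subset
            child = [list(s) for s in grouping]
            child[k].append(containers[i])
            children.append(child)
        children.append([list(s) for s in grouping] + [[containers[i]]])
        children.sort(key=mp_gain_of, reverse=True)   # most promising first
        for child in children:
            if mp_gain_of(child) >= best - delta:     # bounding criterion
                expand(child, i + 1)

    expand([[containers[0]]], 1)
    # keep only the bottom-level nodes that survive the final bound
    return [g for g in leaves if mp_gain_of(g) >= best - delta]
```

With flat (all-zero) scorers every partition survives; with real localGain/mpGain estimates the pruning behaves as in the {C1, C2, C3} example above.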

  18. Determining an Optimal Compression Configuration
  For each candidate partitioning strategy G returned by the branch-and-bound procedure:
  • For each container subset S in G, assign the compression algorithm that achieves the best compression gain while obeying the specified time bounds on compression/decompression.
  • Compute the overall compression gain for G as the sum of the gains for each S in G.
  Choose the G with the highest overall compression gain, together with the corresponding algorithm assignment, as the compression configuration.
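The per-subset assignment step can be sketched as below. `gain`, `comp`, and `decomp` are assumed callables matching the measures gain(S, a), comp(S, a), and decomp(S, a) defined on slide 8; applying the time bounds to each subset individually (rather than in aggregate) is a simplifying assumption of this sketch.

```python
def assign_algorithms(strategy, algorithms, gain, comp, decomp, t_c, t_d):
    """For each container subset, pick the algorithm with the best gain
    whose compression/decompression time costs respect the bounds;
    returns the assignment and the overall gain of the strategy."""
    assignment = {}
    total_gain = 0.0
    for subset in strategy:
        feasible = [a for a in algorithms
                    if comp(subset, a) <= t_c and decomp(subset, a) <= t_d]
        if not feasible:
            continue                      # leave this subset uncompressed
        a_best = max(feasible, key=lambda a: gain(subset, a))
        assignment[subset] = a_best
        total_gain += gain(subset, a_best)
    return assignment, total_gain
```

Running this for every candidate strategy and keeping the one with the highest returned `total_gain` yields the final compression configuration.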

  19. Early Results
  • δ = 120 bits
  • Additional bounding criterion used: only the nodes having the top 60 mpGain scores are explored at each tree level
  • Test system: Intel Core 2 Duo 3.16 GHz, 4 GB RAM, Ubuntu 9.04 Desktop
  [Results table/chart not preserved in the transcript]

  20. Future Work
  • Conduct more experiments involving the proposed approximation algorithm in concert with various permutation-based XML compressors (e.g., AXECHOP, XMill)
  • Seek improvements to the branch-and-bound procedure
    • Starting off with a “sprint” phase that greedily searches for the best localGain would allow large portions of the search tree to be killed at an earlier stage
    • Adding a parameter that explores only the top k nodes per tree level can greatly reduce memory and time costs
  • Improve the efficiency of the existing implementation
  • Investigate alternative approximation algorithms for selecting a partitioning strategy

  21. Final Slide
  Thank you. Questions?
