Gzip Compression and Decompression
  • 1. Gzip file format
  • 2. Gzip compression algorithm
      • LZ77 algorithm
      • Dynamic Huffman coding algorithm
  • 3. Gzip decompression algorithm
  • 4. Other methods of data compression and open questions

Gzip file format
  • A gzip file consists of a series of “members”. The members simply appear one after another in the file, with no additional information before, between, or after them.
  • Member format

Each member has the following format:

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more->)
+---+---+---+---+---+---+---+---+---+---+

if FLG.FEXTRA set

+---+---+=================================+
| XLEN  |...XLEN bytes of “extra field”...| (more->)
+---+---+=================================+

if FLG.FNAME set

+=========================================+
|...original file name, zero-terminated...| (more->)
+=========================================+

if FLG.FCOMMENT set

+===================================+
|...file comment, zero-terminated...| (more->)
+===================================+

if FLG.FHCRC set

+---+---+
| CRC16 |
+---+---+

+=======================+
|...compressed blocks...| (more->)
+=======================+

+---+---+---+---+---+---+---+---+
|     CRC32     |     ISIZE     |
+---+---+---+---+---+---+---+---+

ID1 = 31, ID2 = 139: these two bytes identify the file as being in gzip format.

CM (compression method): this identifies the compression method used in the file. CM = 0-7 are reserved. CM = 8 denotes the “deflate” compression method, which is the one customarily used by gzip and which is documented elsewhere.

FLG (flags):
bit 0 FTEXT      bit 1 FHCRC
bit 2 FEXTRA     bit 3 FNAME
bit 4 FCOMMENT   others reserved.

CRC32: CRC-32 checksum of the uncompressed data.

ISIZE: size of the original (uncompressed) data mod 2^32.
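The member layout above can be read back with a few lines of Python. This is a minimal sketch, assuming a blob that holds exactly one member; the function name `parse_member` and the keys of the returned dictionary are illustrative choices, not part of the gzip format. Multi-byte fields are read least-significant byte first.

```python
import gzip
import struct
import zlib

# Bit masks for the FLG byte (bit 0 .. bit 4).
FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16

def parse_member(blob):
    """Parse header and trailer of a blob assumed to hold one gzip member."""
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack_from("<BBBBIBB", blob, 0)
    if (id1, id2) != (31, 139):
        raise ValueError("not a gzip member")
    info = {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_code}
    pos = 10  # the fixed part of the header is 10 bytes
    if flg & FEXTRA:
        (xlen,) = struct.unpack_from("<H", blob, pos)
        pos += 2 + xlen
    if flg & FNAME:
        end = blob.index(b"\x00", pos)
        info["name"] = blob[pos:end].decode("latin-1")
        pos = end + 1
    if flg & FCOMMENT:
        end = blob.index(b"\x00", pos)
        info["comment"] = blob[pos:end].decode("latin-1")
        pos = end + 1
    if flg & FHCRC:
        pos += 2  # skip CRC16
    info["data_offset"] = pos  # where the compressed blocks start
    # The last 8 bytes of the member are CRC32 and ISIZE.
    info["crc32"], info["isize"] = struct.unpack_from("<II", blob, len(blob) - 8)
    return info

data = b"hello gzip " * 3
info = parse_member(gzip.compress(data))
print(info["cm"], info["isize"], info["crc32"] == zlib.crc32(data))
```

The trailer check works because CRC32 and ISIZE are both computed over the uncompressed data.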

2. Gzip compression algorithm

Introduction

Gzip combines the LZ77 algorithm and the dynamic Huffman algorithm to compress data. Gzip first compresses the data with the LZ77 algorithm, then compresses the result with the dynamic Huffman algorithm.
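The contribution of each stage can be observed with Python's `zlib` module, which exposes a `Z_HUFFMAN_ONLY` strategy that turns string matching off. This is only a rough illustration on made-up, highly repetitive input; the helper name `gzip_bytes` is hypothetical.

```python
import zlib

def gzip_bytes(data, strategy=zlib.Z_DEFAULT_STRATEGY):
    """Compress data into a gzip member (wbits=31 selects the gzip container)."""
    comp = zlib.compressobj(level=9, wbits=31, strategy=strategy)
    return comp.compress(data) + comp.flush()

data = b"abcdefgh" * 500
both_stages = gzip_bytes(data)                        # string matching + Huffman coding
huffman_only = gzip_bytes(data, zlib.Z_HUFFMAN_ONLY)  # Huffman coding only

# Original size, Huffman-only size, size with both stages.
print(len(data), len(huffman_only), len(both_stages))
```

On input like this, where long substrings repeat, most of the saving comes from the string-matching stage.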

2.1 LZ77 compression algorithm

Terms used in the algorithm:
  • Input stream: the sequence of characters to be compressed.
  • Character: the basic element of the input stream.
  • Coding position: the position in the input stream of the character currently being coded (the beginning of the lookahead buffer).
  • Lookahead buffer: the character sequence from the coding position to the end of the input stream.

  • Window: the window of size w contains the w characters before the coding position, i.e. the last w characters processed.
  • Pointer: a pointer points to the match in the window and also specifies its length.

The principle of encoding

The algorithm searches the window for the longest match with the beginning of the lookahead buffer and outputs a pointer to that match. When a match is found, the data pair <offset, length> takes the place of the matched characters.

Offset: the distance from the window’s left bound to the beginning of the match. (Deflate itself instead records the distance back from the coding position to the beginning of the match.)

Length: the length of the match.

The encoding algorithm

Step 1: set the coding position to the beginning of the input stream.

Step 2: if the coding position is not at the end of the input stream, search the window for the longest match with the lookahead buffer; otherwise the algorithm terminates.

Step 3: if a match is found, output (offset, length, c), where c is the character following the match; the coding position and the window move length+1 bytes forward; go to step 2. Otherwise go to step 4.

Step 4: output the current character at the coding position; the coding position and the window move 1 byte forward; go to step 2.
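The four steps can be sketched in Python as follows. This is an illustrative sketch, not gzip's implementation: the name `lz77_encode` is hypothetical, offsets are counted from the window's left bound, matches are confined to the window (they do not run on into the lookahead buffer), ties go to the leftmost match, and a match always leaves one character over to serve as c.

```python
def lz77_encode(window, data, w=10):
    """Encode data given the initial window contents.

    Returns a list of tokens: (offset, length, c) triples for matches and
    single characters where no match exists.
    """
    out = []
    history = window[-w:]   # the last w characters processed
    pos = 0                 # step 1: coding position at the start
    while pos < len(data):  # step 2: stop at the end of the input
        best_off, best_len = 0, 0
        max_len = len(data) - pos - 1  # keep one character for c
        for off in range(len(history)):
            length = 0
            while (length < max_len
                   and off + length < len(history)
                   and history[off + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        if best_len > 0:
            # step 3: emit the pointer plus the following character
            out.append((best_off, best_len, data[pos + best_len]))
            step = best_len + 1
        else:
            # step 4: emit the character itself
            out.append(data[pos])
            step = 1
        history = (history + data[pos:pos + step])[-w:]
        pos += step
    return out

print(lz77_encode("abcdbbccaa", "abaeaaabaee"))
# → [(0, 2, 'a'), 'e', (4, 6, 'e')]
```

The linear scan over every window offset is chosen for clarity only; a practical encoder would index the window to find candidates faster.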

The following example illustrates the algorithm. Assume the window size is 10, the window content is “abcdbbccaa”, and the string to be coded is “abaeaaabaee”. The encoding proceeds as follows:

Step 1: the longest match between the string and the window is “ab”, at offset 0. Output (0,2,a); the window and the coding position then move forward 3 bytes.

Step 2: the character at the current coding position is ‘e’. The content of the window is “dbbccaaaba”, which contains no match for ‘e’, so output ‘e’. The window and the coding position move 1 byte forward.

Step 3: the content of the window is “bbccaaabae” and the lookahead buffer is “aaabaee”. The longest match is “aaabae”, at offset 4 in the window, and the character following it is ‘e’. Output (4,6,e).
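The output of the example can be checked by decoding it again. The sketch below assumes the same conventions as the example (offsets counted from the window's left bound, window size 10); the name `lz77_decode` and the token representation are illustrative.

```python
def lz77_decode(window, tokens, w=10):
    """Rebuild the coded string from the initial window and the token list."""
    history = window[-w:]
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length, c = tok
            # Copy the match out of the window, then append the literal c.
            piece = history[off:off + length] + c
        else:
            piece = tok  # a single unmatched character
        out.append(piece)
        history = (history + piece)[-w:]  # slide the window forward
    return "".join(out)

print(lz77_decode("abcdbbccaa", [(0, 2, "a"), "e", (4, 6, "e")]))
# → abaeaaabaee
```

The decoder needs no searching at all: it only has to keep its window in step with the encoder's.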

Many other issues need to be considered in practice; refer to the gzip source code and documentation.

Dynamic Huffman Coding

Static Huffman coding algorithm:

Assume that we are given a set of characters and their frequencies. Then we can use the Huffman algorithm to build a code for these characters.
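As a point of comparison for the dynamic process, static Huffman coding with known frequencies can be sketched with a priority queue. The function name `huffman_code` is illustrative, and the exact bit patterns depend on how ties are broken; only the total encoded length is determined by the frequencies.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return {symbol: bit string} for a static Huffman code over text."""
    freq = Counter(text)
    # Heap entries: (weight, tie_breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees under a new parent.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_code("TENNESSEE")
total_bits = sum(len(codes[ch]) for ch in "TENNESSEE")
print(codes, total_bits)
```

For TENNESSEE (E: 4, N: 2, S: 2, T: 1) the most frequent character, E, receives a one-bit code.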

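Dynamic Huffman coding builds the Huffman tree dynamically, as the input is read. We do not know the characters and their frequencies in advance. (The procedure shown here is the classical adaptive Huffman scheme; in the deflate format, a “dynamic Huffman” block instead carries a code table computed for that block.) The following example introduces the process of the dynamic Huffman algorithm:

String: TENNESSEE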

During the dynamic process of building the Huffman tree, we must obey one rule: maintain the sibling property. A tree has the sibling property if each node (except the root) has a sibling and if the nodes can be numbered in order of nondecreasing weight with each node adjacent to its sibling. Moreover, the parent of a node is higher in the numbering.
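The rule can be turned into a small check. The sketch below assumes a tree given as a list of (weight, parent index) pairs in numbering order, lowest number first and root last; this representation and the name `has_sibling_property` are illustrative choices. It also verifies that each parent's weight is the sum of its children's weights.

```python
def has_sibling_property(nodes):
    """nodes[i] = (weight, parent_index), in numbering order; root is last."""
    n = len(nodes)
    if n % 2 == 0 or nodes[-1][1] is not None:
        return False  # a full binary tree has an odd node count and one root
    # Weights must be nondecreasing along the numbering.
    for i in range(n - 1):
        if nodes[i][0] > nodes[i + 1][0]:
            return False
    # Nodes 2k-1 and 2k must be siblings whose parent is numbered higher.
    for i in range(0, n - 1, 2):
        (w1, p1), (w2, p2) = nodes[i], nodes[i + 1]
        if p1 is None or p1 != p2 or p1 <= i + 1:
            return False
        if nodes[p1][0] != w1 + w2:
            return False
    return True

# Weights 0,1,1,2,2,3,4,5,9: a valid numbering.
valid = [(0, 2), (1, 2), (1, 5), (2, 5), (2, 7), (3, 7), (4, 8), (5, 8), (9, None)]
# Weights 0,2,2,1,1,3,4: a weight-2 node is numbered below weight-1 nodes.
broken = [(0, 2), (2, 2), (2, 5), (1, 5), (1, 6), (3, 6), (4, None)]
print(has_sibling_property(valid), has_sibling_property(broken))
# → True False
```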

T
  • Stage 1 (First occurrence of t)

        r 9
       /   \
    7 0     t(1) 8

  • Order: 0,t(1)

* r represents the root
* 0 represents the null node
* t(1) denotes the occurrence of T with a frequency of 1
* the number beside each node (7, 8, 9) is its position in the node numbering

TE
  • Stage 2 (First occurrence of e)

            r 9
           /   \
        7 1     t(1) 8
         /   \
      5 0     e(1) 6

  • Order: 0,e(1),1,t(1)
TEN
  • Stage 3 (First occurrence of n)

                  r 9
                 /   \
              7 2     t(1) 8
               /   \
            5 1     e(1) 6
             /   \
          3 0     n(1) 4

  • Order: 0,n(1),1,e(1),2,t(1)
  • This is not a Huffman tree (node 7 has weight 2, but the higher-numbered node 8, t(1), has weight 1), so we need to adjust it.
Reorder: TEN

            r 9
           /   \
     7 t(1)     2 8
               /   \
            5 1     e(1) 6
             /   \
          3 0     n(1) 4

  • Order: 0,n(1),1,e(1),t(1),2
TENN
  • Stage 4 (Repetition of n)

            r 9
           /   \
     7 t(1)     3 8
               /   \
            5 2     e(1) 6
             /   \
          3 0     n(2) 4

  • Order: 0,n(2),2,e(1),t(1),3
  • The sibling property is no longer valid, so the tree must be rebuilt.
  • Swap the updated node with the node whose number is the biggest in its block.
  • Block: a set of nodes whose weights are the same (here, the nodes that had weight 1 before n was incremented).
  • To maintain the sibling property, we swap node (n) with node (t). If a swapped node has a subtree, the subtree is swapped together with it.
Reorder: TENN

            r 9
           /   \
     7 n(2)     2 8
               /   \
            5 1     e(1) 6
             /   \
          3 0     t(1) 4

  • Order: 0,t(1),1,e(1),n(2),2
  • t(1) and n(2) are swapped
TENNE
  • Stage 5 (Repetition of e)

            r 9
           /   \
     7 n(2)     3 8
               /   \
            5 1     e(2) 6
             /   \
          3 0     t(1) 4

  • Order: 0,t(1),1,e(2),n(2),3
TENNES
  • Stage 6 (First occurrence of s)

            r 9
           /   \
     7 n(2)     4 8
               /   \
            5 2     e(2) 6
             /   \
          3 1     t(1) 4
           /   \
        1 0     s(1) 2

  • Order: 0,s(1),1,t(1),2,e(2),n(2),4
TENNESS
  • Stage 7 (Repetition of s)

            r 9
           /   \
     7 n(2)     5 8
               /   \
            5 3     e(2) 6
             /   \
          3 2     t(1) 4
           /   \
        1 0     s(2) 2

  • Order: 0,s(2),2,t(1),3,e(2),n(2),5
  • The sibling property is not valid. Adjust the tree to maintain the sibling property.
Reorder: TENNESS

                    r 9
                 /       \
            7 3             4 8
           /   \           /   \
        3 1     s(2) 4  5 n(2)  e(2) 6
         /   \
      1 0     t(1) 2

  • s(2) and t(1) are swapped
  • The internal node of weight 3 and n(2) also need to be swapped, which gives the tree shown
TENNESSE
  • Stage 8 (Second repetition of e)

                    r 9
                 /       \
            7 3             5 8
           /   \           /   \
        3 1     s(2) 4  5 n(2)  e(3) 6
         /   \
      1 0     t(1) 2

  • Order: 0,t(1),1,s(2),n(2),e(3),3,5
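TENNESSEE
  • Stage 9 (Third repetition of e), before reordering

                    r 9
                 /       \
            7 3             6 8
           /   \           /   \
        3 1     s(2) 4  5 n(2)  e(4) 6
         /   \
      1 0     t(1) 2

  • The sibling property is not valid (e(4) at node 6 now outweighs node 7, which has weight 3), so the Huffman tree needs to be rebuilt.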

Reorder: TENNESSEE
  • Stage 9 (Third repetition of e), after reordering

            r 9
           /   \
     7 e(4)     5 8
               /   \
         5 n(2)     3 6
                   /   \
                3 1     s(2) 4
                 /   \
              1 0     t(1) 2

Adaptive Huffman decoding is the inverse procedure of encoding.
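The tree update applied in each stage of the TENNESSEE walk-through, which encoder and decoder both perform after every character, can be sketched as below. This is an illustrative sketch of the swap-then-increment rule, not code from gzip: the class and method names are hypothetical, and `order` lists the nodes from the highest number (the root) down to the lowest (the null node).

```python
class Node:
    def __init__(self, weight=0, symbol=None, parent=None):
        self.weight, self.symbol, self.parent = weight, symbol, parent
        self.left = self.right = None

class AdaptiveHuffman:
    def __init__(self):
        self.root = self.null = Node()  # the tree starts as a lone null node
        self.order = [self.root]        # index 0 holds the highest number
        self.leaves = {}

    def _swap(self, a, b):
        """Exchange two nodes (with their subtrees) in the tree and numbering."""
        ia, ib = self.order.index(a), self.order.index(b)
        self.order[ia], self.order[ib] = b, a
        pa, pb = a.parent, b.parent
        a_left, b_left = pa.left is a, pb.left is b
        if a_left:
            pa.left = b
        else:
            pa.right = b
        if b_left:
            pb.left = a
        else:
            pb.right = a
        a.parent, b.parent = pb, pa

    def update(self, symbol):
        node = self.leaves.get(symbol)
        if node is None:
            # First occurrence: the null node becomes an internal node with
            # a new null node (left) and a new leaf of weight 1 (right).
            old = self.null
            leaf = Node(1, symbol, old)
            self.null = Node(0, None, old)
            old.left, old.right = self.null, leaf
            self.order += [leaf, self.null]
            self.leaves[symbol] = leaf
            old.weight += 1
            node = old.parent
        while node is not None:
            # Highest-numbered node in the block of equal weight.
            leader = next(n for n in self.order if n.weight == node.weight)
            if leader is not node and leader is not node.parent:
                self._swap(node, leader)
            node.weight += 1
            node = node.parent

    def depth(self, symbol):
        """Code length of a symbol = depth of its leaf."""
        node, d = self.leaves[symbol], 0
        while node.parent is not None:
            node, d = node.parent, d + 1
        return d

tree = AdaptiveHuffman()
for ch in "TENNESSEE":
    tree.update(ch)
print([n.weight for n in tree.order])           # weights for nodes 9 down to 1
print({s: tree.depth(s) for s in "ENST"})
```

After the last character the weights, read from node 9 down to node 1, are 9, 5, 4, 3, 2, 2, 1, 1, 0, and E, N, S, T sit at depths 1, 2, 3, 4. Looking nodes up with `list.index` keeps the sketch short at the cost of speed.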
