## PowerPoint Slideshow about ' Gzip Compression and Decompression' - cullen-randall

Gzip Compression and Decompression

- 1. Gzip file format
- 2. Gzip compression algorithm
  - LZ77 algorithm
  - Dynamic Huffman coding algorithm
- 3. Gzip decompression algorithm
- 4. Other methods of data compression and open questions

Gzip file format

- A gzip file consists of a series of “members”. The members simply appear one after another in the file, with no additional information before, between, or after them.
- Member format

Each member has the following format:

    +---+---+---+---+---+---+---+---+---+---+
    |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more->)
    +---+---+---+---+---+---+---+---+---+---+

    (if FLG.FEXTRA set)

    +---+---+=================================+
    | XLEN  |...XLEN bytes of “extra field”...| (more->)
    +---+---+=================================+

    (if FLG.FNAME set)

    +=========================================+
    |...original file name, zero-terminated...| (more->)
    +=========================================+

    (if FLG.FCOMMENT set)

    +===================================+
    |...file comment, zero-terminated...| (more->)
    +===================================+

    (if FLG.FHCRC set)

    +---+---+
    | CRC16 |
    +---+---+

    +=======================+
    |...compressed blocks...| (more->)
    +=======================+

    +---+---+---+---+---+---+---+---+
    |     CRC32     |     ISIZE     |
    +---+---+---+---+---+---+---+---+

ID1 = 31 (0x1f) and ID2 = 139 (0x8b). Together they identify the file as being in gzip format.

CM (compression method)

This identifies the compression method used in the file. CM = 0-7 are reserved. CM = 8 denotes the “deflate” compression method, which is the one customarily used by gzip and which is documented elsewhere.

FLG (flags)

- bit 0: FTEXT
- bit 1: FHCRC
- bit 2: FEXTRA
- bit 3: FNAME
- bit 4: FCOMMENT
- bits 5-7: reserved

CRC32: cyclic redundancy check (CRC-32) of the uncompressed data.

ISIZE: size of the original (uncompressed) data modulo 2^32.
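As a rough illustration of the member layout above, the following Python sketch parses the fixed 10-byte header, the optional fields, and the 8-byte trailer of a single member. The helper names `parse_member_header` and `parse_member_trailer` are hypothetical, chosen only for this example. The sample data comes from Python's standard `gzip` module, and multi-byte fields are read least-significant byte first.

```python
import gzip
import struct
import zlib

# FLG bit masks: bit 0 FTEXT, bit 1 FHCRC, bit 2 FEXTRA, bit 3 FNAME, bit 4 FCOMMENT.
FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16


def parse_member_header(buf):
    """Parse the header of the first member in buf.

    Returns (fields, offset), where offset is the index of the first
    byte of the compressed blocks.
    """
    id1, id2, cm, flg, mtime, xfl, os_ = struct.unpack_from("<BBBBIBB", buf, 0)
    if (id1, id2) != (31, 139):
        raise ValueError("not a gzip member")
    fields = {"CM": cm, "FLG": flg, "MTIME": mtime, "XFL": xfl, "OS": os_}
    pos = 10
    if flg & FEXTRA:
        (xlen,) = struct.unpack_from("<H", buf, pos)
        fields["EXTRA"] = buf[pos + 2:pos + 2 + xlen]
        pos += 2 + xlen
    if flg & FNAME:
        end = buf.index(b"\x00", pos)      # zero-terminated file name
        fields["NAME"] = buf[pos:end]
        pos = end + 1
    if flg & FCOMMENT:
        end = buf.index(b"\x00", pos)      # zero-terminated comment
        fields["COMMENT"] = buf[pos:end]
        pos = end + 1
    if flg & FHCRC:
        (fields["CRC16"],) = struct.unpack_from("<H", buf, pos)
        pos += 2
    return fields, pos


def parse_member_trailer(buf):
    """Read CRC32 and ISIZE from the last 8 bytes (single-member input assumed)."""
    return struct.unpack("<II", buf[-8:])


data = b"hello, gzip"
blob = gzip.compress(data)
fields, start = parse_member_header(blob)
crc32, isize = parse_member_trailer(blob)
print(fields["CM"], isize == len(data), crc32 == zlib.crc32(data))
```

The bytes between `start` and the trailer are the compressed blocks; under the same single-member assumption they can be inflated with `zlib.decompress(blob[start:-8], wbits=-15)`.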

2. Gzip compression algorithm

Introduction

Gzip combines the LZ77 algorithm and dynamic Huffman coding to compress data. It first compresses the data with LZ77, then compresses the LZ77 output with dynamic Huffman coding.

2.1 LZ77 compression algorithm

Terms used in the algorithm:

- Input stream: the sequence of characters to be compressed.
- Character: the basic element of the input stream.
- Coding position: the position in the input stream currently being coded (the beginning of the lookahead buffer).
- Lookahead buffer: the character sequence from the coding position to the end of the input stream.
- Window: the w characters immediately before the coding position, i.e. the last w characters processed.
- Pointer: a reference to a match in the window, which also specifies the length of the match.

The principle of encoding

The algorithm searches the window for the longest match with the beginning of the lookahead buffer and outputs a pointer to that match. When a match is found, the data pair <offset, length> takes the place of the matched characters.

- Offset: the distance from the window's left bound to the beginning of the match.
- Length: the length of the match.

The encoding algorithm

Step 1: set the coding position to the beginning of the input stream.

Step 2: if the coding position is not at the end of the input stream, search the window for the longest match with the lookahead buffer; otherwise the algorithm terminates.

Step 3: if a match is found, output (offset, length, c), where c is the character following the match, move the coding position and the window length+1 bytes forward, and go to step 2; otherwise go to step 4.

Step 4: output the character at the coding position, move the coding position and the window 1 byte forward, and go to step 2.

The following example illustrates the algorithm. Assume the window size is 10, the window content is “abcdbbccaa”, and the string to be coded is “abaeaaabaee”. The encoding steps are as follows:

Step 1: the longest match between the string and the window is “ab”, so output (0,2,a). The window and the coding position then move forward 3 bytes.

Step 2: the character at the current coding position is ‘e’. The window content is “dbbccaaaba”, which contains no match for ‘e’, so output ‘e’. The window and the coding position move 1 byte forward.

Step 3: the window content is “bbccaaabae” and the lookahead buffer is “aaabaee”. The longest match is “aaabae”, which starts at offset 4 in the window, and the character following it is ‘e’, so output (4,6,e).
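The steps and the worked example above can be sketched in Python as follows. This is a minimal illustration under the conventions of the example (offset counted from the window's left bound, matches confined to the window, window pre-filled before coding starts), not the matcher that gzip itself uses. The function names are assumptions made for this sketch.

```python
def lz77_encode(window, data, w=10):
    """Encode data, given a window of already processed characters."""
    out = []
    pos = 0
    while pos < len(data):
        lookahead = data[pos:]
        best_off, best_len = 0, 0
        # Try every start position in the window and keep the longest match.
        for off in range(len(window)):
            n = 0
            # Stay inside the window and leave one character to emit as c.
            while (off + n < len(window) and n < len(lookahead) - 1
                   and window[off + n] == lookahead[n]):
                n += 1
            if n > best_len:
                best_off, best_len = off, n
        if best_len > 0:
            out.append((best_off, best_len, lookahead[best_len]))
            step = best_len + 1
        else:
            out.append(lookahead[0])   # no match: emit the literal character
            step = 1
        # Slide the window and the coding position forward by `step`.
        window = (window + data[pos:pos + step])[-w:]
        pos += step
    return out


def lz77_decode(window, tokens, w=10):
    """Inverse of lz77_encode, starting from the same initial window."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length, c = tok
            piece = window[off:off + length] + c
        else:
            piece = tok
        out.append(piece)
        window = (window + piece)[-w:]
    return "".join(out)


tokens = lz77_encode("abcdbbccaa", "abaeaaabaee")
print(tokens)  # [(0, 2, 'a'), 'e', (4, 6, 'e')]
print(lz77_decode("abcdbbccaa", tokens))  # abaeaaabaee
```

The decoder only needs the same initial window and the token list, which is why the output of the three steps is enough to restore the original string.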

Many other issues need to be considered in practice; refer to the gzip source code and documentation.

Dynamic Huffman Coding

Static Huffman coding algorithm:

Assume that we are given a set of characters and their frequencies. We can then use the Huffman algorithm to build a code for these characters.
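For comparison with the dynamic variant, here is a short Python sketch of static Huffman coding: the frequencies are counted up front and the two lightest subtrees are merged repeatedly. The function name `huffman_codes` and the tie-breaking counter are assumptions of this sketch; different tie-breaking can yield different codes of the same total length.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return {symbol: bit string} for a static Huffman code built from text."""
    freq = Counter(text)
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # a single symbol still needs 1 bit
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


codes = huffman_codes("TENNESSEE")
encoded = "".join(codes[ch] for ch in "TENNESSEE")
print(len(encoded))  # 17
```

With frequencies E=4, N=2, S=2, T=1 the most frequent character gets a 1-bit code, and the nine characters take 17 bits in total.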

Dynamic Huffman coding builds the Huffman tree incrementally while the data is being processed: the characters and their frequencies are not known in advance. The following example illustrates the process:

String: TENNESSEE

While the tree is being built, one rule must be obeyed: maintain the sibling property. A tree has the sibling property if each node (except the root) has a sibling and the nodes can be numbered in order of nondecreasing weight with each node adjacent to its sibling. Moreover, the parent of a node is higher in the numbering than the node itself.

T

- Stage 1 (First occurrence of t)

          r (#9)
         /      \
    0 (#7)      t(1) (#8)

- Order: 0,t(1)

* r represents the root
* 0 represents the null node
* t(1) denotes the occurrence of T with a frequency of 1
* #n is the number of the node in the ordering

TEN

- Stage 3 (First occurrence of n)

                      r (#9)
                     /      \
                2 (#7)      t(1) (#8)
               /      \
          1 (#5)      e(1) (#6)
         /      \
    0 (#3)      n(1) (#4)

- Order: 0,n(1),1,e(1),2,t(1)
- This is not a Huffman tree: the node of weight 2 (#7) is numbered below t(1) (#8), so the tree must be adjusted.

TENN

- Stage 4 (Repetition of n)

              r (#9)
             /      \
    t(1) (#7)       3 (#8)
                   /      \
              2 (#5)      e(1) (#6)
             /      \
        0 (#3)      n(2) (#4)

- Order: 0,n(2),2,e(1),t(1),3
- The sibling property no longer holds, so the tree must be rebuilt.
- Swap the node being updated with the highest-numbered node in its block.
- Block: a set of nodes whose weights are the same.
- To maintain the sibling property, node n is swapped with node t. If a swapped node has a subtree, the subtree is moved together with it.
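The two checks used in this stage can be sketched in a few lines of Python. The list holds the node weights in numbering order (lowest number first); `has_sibling_property` and `block_leader` are hypothetical helper names, and the state "before the second n is counted" is inferred from the Stage 4 slide rather than shown on it. A full implementation would also have to avoid swapping a node with its own parent, which this sketch does not model.

```python
def has_sibling_property(weights):
    """True if the weights, listed in numbering order, never decrease."""
    return all(a <= b for a, b in zip(weights, weights[1:]))


def block_leader(weights, i):
    """Index of the highest-numbered node in the block of node i.

    Assumes weights is currently sorted, i.e. the tree is valid
    before the update is applied.
    """
    j = i
    while j + 1 < len(weights) and weights[j + 1] == weights[i]:
        j += 1
    return j


# TENN just before the second n is counted; internal nodes are
# labelled by their weight.
labels = ["0", "n", "1", "e", "t", "2"]
weights = [0, 1, 1, 1, 1, 2]
leader = block_leader(weights, labels.index("n"))
print(labels[leader])  # t

# Stage 4 order 0,n(2),2,e(1),t(1),3 versus the reordered 0,t(1),1,e(1),n(2),2.
print(has_sibling_property([0, 2, 2, 1, 1, 3]))  # False
print(has_sibling_property([0, 1, 1, 1, 2, 2]))  # True
```

The leader of n's block is t, which matches the swap of n and t described above.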

Reorder: TENN

              r (#9)
             /      \
    n(2) (#7)       2 (#8)
                   /      \
              1 (#5)      e(1) (#6)
             /      \
        0 (#3)      t(1) (#4)

- Order: 0,t(1),1,e(1),n(2),2
- t(1) and n(2) are swapped

TENNE

- Stage 5 (Repetition of e)

              r (#9)
             /      \
    n(2) (#7)       3 (#8)
                   /      \
              1 (#5)      e(2) (#6)
             /      \
        0 (#3)      t(1) (#4)

- Order: 0,t(1),1,e(2),n(2),3

TENNES

- Stage 6 (First occurrence of s)

                r (#9)
               /      \
      n(2) (#7)       4 (#8)
                     /      \
                2 (#5)      e(2) (#6)
               /      \
          1 (#3)      t(1) (#4)
         /      \
    0 (#1)      s(1) (#2)

- Order: 0,s(1),1,t(1),2,e(2),n(2),4

TENNESS

- Stage 7 (Repetition of s)

                r (#9)
               /      \
      n(2) (#7)       5 (#8)
                     /      \
                3 (#5)      e(2) (#6)
               /      \
          2 (#3)      t(1) (#4)
         /      \
    0 (#1)      s(2) (#2)

- Order: 0,s(2),2,t(1),3,e(2),n(2),5
- The sibling property no longer holds. Adjust the tree to restore it.

Reorder: TENNESS

                              r (#9)
                        /                \
                3 (#7)                      4 (#8)
               /      \                    /      \
          1 (#3)      s(2) (#4)   n(2) (#5)       e(2) (#6)
         /      \
    0 (#1)      t(1) (#2)

- s(2) and t(1) are swapped
- n(2) and the internal node of weight 3 also need to be swapped

TENNESSE

- Stage 8 (Second repetition of e)

                              r (#9)
                        /                \
                3 (#7)                      5 (#8)
               /      \                    /      \
          1 (#3)      s(2) (#4)   n(2) (#5)       e(3) (#6)
         /      \
    0 (#1)      t(1) (#2)

- Order: 0,t(1),1,s(2),n(2),e(3),3,5

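TENNESSEE

- Stage 9 (Third repetition of e)

                              r (#9)
                        /                \
                3 (#7)                      6 (#8)
               /      \                    /      \
          1 (#3)      s(2) (#4)   n(2) (#5)       e(4) (#6)
         /      \
    0 (#1)      t(1) (#2)

- The sibling property no longer holds: e(4) (#6) now outweighs the node of weight 3 (#7). The Huffman tree needs to be rebuilt.

Reorder: TENNESSEE

              r (#9)
             /      \
    e(4) (#7)       5 (#8)
                   /      \
          n(2) (#5)       3 (#6)
                         /      \
                    1 (#3)      s(2) (#4)
                   /      \
              0 (#1)      t(1) (#2)

- e and the internal node of weight 3 are swapped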

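Adaptive Huffman decoding is the inverse of the encoding procedure: the decoder applies the same tree updates after each decoded character, so its tree stays identical to the encoder's.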
