Information and Coding Theory

Transmission over lossless channels. Entropy. Compression codes -

Shannon code, Huffman code, arithmetic code.

Juris Viksna, 2014

We will focus on the compression and decompression parts, assuming that there are no losses during transmission.

[Adapted from D.MacKay]

How many bits do we need to transfer a particular piece of information?

All possible n-bit messages, each with probability $1/2^n$, sent to a receiver over a noiseless channel.

Obviously n bits will be sufficient.

Also, it is not hard to see that n bits are necessary to distinguish between all $2^n$ possible messages.

All possible n-bit messages, sent to a receiver over a noiseless channel:

Msg.        Prob.
000000...   ½
111111...   ½
other       0

n bits will still be sufficient.

However, we can do quite nicely with just 1 bit!

All possible n-bit messages (here n = 2), sent to a receiver over a noiseless channel:

Msg.    Prob.
00      ¼
01      ¼
10      ½
other   0

Try to use 2 bits for “00” and “01” and 1 bit for “10”:

Msg.    Code word
00      00
01      01
10      1
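As a quick check of this assignment, the following minimal sketch computes the expected number of bits per message. The dictionary names are illustrative only; the probabilities and code words are the ones listed above.

```python
# Messages with their probabilities and code words, taken from the example above.
probabilities = {"00": 0.25, "01": 0.25, "10": 0.5}
code = {"00": "00", "01": "01", "10": "1"}

# Expected number of transmitted bits per message.
average_length = sum(p * len(code[m]) for m, p in probabilities.items())

# No code word is a prefix of another, so a bit stream can be split unambiguously.
words = list(code.values())
prefix_free = all(
    not a.startswith(b) for a in words for b in words if a != b
)

print(average_length, prefix_free)  # 1.5 True
```

So this code needs 1.5 bits per message on average, compared with 2 bits for the fixed-length representation.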

All possible n-bit messages, the probability of message i being $p_i$, sent to a receiver over a noiseless channel.

We can try to generalize this by defining entropy (the minimal average number of bits we need to distinguish between messages) in the following way:
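The formula from the slide is not preserved in this transcript. The standard Shannon definition, which agrees with the two-term expression H(p,q) used in the proof sketch later in these notes, is assumed to be the intended one:

```latex
H(X) = -\sum_{i} p_i \log_2 p_i ,
\qquad \text{with the convention } 0 \log_2 0 = 0 .
```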

The word is derived from the Greek εντροπία "a turning towards" (εν- "in" + τροπή "a turning").

The entropy, H, of a discrete random variable X is a measure of the

amount of uncertainty associated with the value of X.
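A minimal sketch of this measure, assuming the standard definition H = -Σ p_i log2 p_i, applied to the three message distributions discussed earlier. The function name `entropy` and the choice n = 8 are illustrative assumptions.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

n = 8  # hypothetical message length, chosen only for illustration
uniform = [1 / 2**n] * 2**n          # all n-bit messages equally likely
two_messages = [0.5, 0.5]            # only 000000... and 111111... occur
three_messages = [0.25, 0.25, 0.5]   # messages 00, 01, 10

h_uniform = entropy(uniform)
h_two = entropy(two_messages)
h_three = entropy(three_messages)
print(h_uniform, h_two, h_three)
```

The three values are n, 1 and 1.5 bits, matching the number of bits found to be needed in each of the earlier examples.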

[Adapted from T.Mitchell]

Example

[Adapted from D.MacKay]

NB!!!

If not explicitly stated otherwise, in this course (as well as in computer science in general) the expression log x denotes the logarithm to base 2 (i.e. $\log_2 x$).

[Adapted from D.MacKay]

Entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function, $H_b(p)$.

The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.
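The shape of this curve can be checked numerically. The sketch below assumes the usual closed form $H_b(p) = -p \log_2 p - (1-p) \log_2 (1-p)$; the function name and the grid of test points are illustrative.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli trial with success probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Evaluate on a grid 0.00, 0.01, ..., 1.00 and locate the maximum.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=binary_entropy)

print(best_p, binary_entropy(best_p))  # 0.5 1.0
```

The function is symmetric about p = 0.5 and drops to 0 at both ends of the interval.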

[Adapted from www.wikipedia.org]

[Adapted from D.MacKay]

Entropy is maximized if the probability distribution is uniform, i.e. all probabilities $p_i$ are equal.

Sketch of proof:

Assume two of the probabilities are p and q. Replacing both of them by (p+q)/2 does not decrease the entropy. Only the terms that involve p and q change:

$H(p,q) = -(p \log p + q \log q)$

$H((p+q)/2, (p+q)/2) = -2 \cdot \frac{p+q}{2} \log \frac{p+q}{2} = -(p+q) \log \frac{p+q}{2}$

The function $f(x) = x \log x$ is convex, so $f((p+q)/2) \le (f(p) + f(q))/2$, that is, $(p+q) \log \frac{p+q}{2} \le p \log p + q \log q$. Hence

$H((p+q)/2, (p+q)/2) - H(p,q) = -(p+q) \log \frac{p+q}{2} + (p \log p + q \log q) \ge 0$,

with equality only when p = q.

In addition we also need some smoothness assumptions about H.
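The averaging step can be sanity-checked numerically. This is only a sketch that samples random pairs (p, q) with p + q ≤ 1; it illustrates the claim and does not replace the proof. The helper names are hypothetical.

```python
import math
import random

def partial_entropy(*ps):
    """Contribution of the given probabilities to the entropy, in bits."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def averaging_gain(p, q):
    """Change in entropy when p and q are both replaced by (p + q) / 2."""
    m = (p + q) / 2
    return partial_entropy(m, m) - partial_entropy(p, q)

random.seed(0)  # fixed seed so the run is reproducible
gains = []
for _ in range(1000):
    p = random.random()
    q = random.random() * (1 - p)  # keeps p + q <= 1
    gains.append(averaging_gain(p, q))

min_gain = min(gains)
print(min_gain >= -1e-12)  # True: averaging never lowered the entropy
```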

Assume that we have a set of symbols $\Sigma$ with known frequencies of symbol occurrences. We have assumed that on average we will need $H(\Sigma)$ bits to distinguish between symbols.

What about sequences of length n of symbols from $\Sigma$ (assuming independent occurrence of each symbol with the given frequency)?

For the entropy of $\Sigma^n$ it turns out that $H(\Sigma^n) = nH(\Sigma)$.

Later we will show that (assuming some restrictions) encodings that use $nH(\Sigma)$ bits on average are the best we can get.
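The additivity of entropy over independent symbols can be verified by brute force on a small case. The alphabet, its probabilities and n = 3 below are hypothetical values chosen for illustration.

```python
import itertools
import math

def entropy(probabilities):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical three-symbol alphabet with known symbol probabilities.
symbol_probs = {"a": 0.5, "b": 0.25, "c": 0.25}
n = 3

# Probability of each length-n sequence under independent symbol occurrence.
sequence_probs = [
    math.prod(symbol_probs[s] for s in seq)
    for seq in itertools.product(symbol_probs, repeat=n)
]

h_single = entropy(symbol_probs.values())
h_sequences = entropy(sequence_probs)
print(h_single, h_sequences)  # 1.5 4.5
```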

The joint entropy of two discrete random variables X and Y is

merely the entropy of their pairing: (X,Y). This implies that if

X and Y are independent, then their joint entropy is the sum of their

individual entropies.

[Adapted from D.MacKay]

The conditional entropy of X given random variable Y (also called

the equivocation of X about Y) is the average conditional entropy

over Y:
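The formula itself is missing from this transcript. The standard definition, assumed here to be the one intended, reads:

```latex
H(X \mid Y) = \sum_{y} p(y)\, H(X \mid Y = y)
            = -\sum_{x,y} p(x,y) \log_2 p(x \mid y)
```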

[Adapted from D.MacKay]

Mutual information measures the amount of information that can

be obtained about one random variable by observing another.

Mutual information is symmetric:
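The symmetry statement is not preserved in this transcript. In the standard formulation, assumed here, it is:

```latex
I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y;X)
```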

[Adapted from D.MacKay]

Relations between entropies, conditional entropies, joint entropy and mutual information.
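These relations can be checked on a small joint distribution. The sketch below assumes the standard definitions $H(X \mid Y) = -\sum p(x,y) \log_2 p(x \mid y)$ and $I(X;Y) = H(X) - H(X \mid Y)$; the joint distribution and all names are hypothetical.

```python
import math
from collections import defaultdict

def entropy(probabilities):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical joint distribution p(x, y); pairs not listed have probability 0.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

# Marginal distributions of X and Y.
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

def conditional_entropy(joint, marginal, given_index):
    """H(A | B), where B sits at position given_index of each (x, y) pair."""
    return -sum(
        p * math.log2(p / marginal[pair[given_index]])
        for pair, p in joint.items() if p > 0
    )

h_x, h_y = entropy(px.values()), entropy(py.values())
h_xy = entropy(joint.values())
h_x_given_y = conditional_entropy(joint, py, 1)
h_y_given_x = conditional_entropy(joint, px, 0)
mutual_information = h_x - h_x_given_y

print(h_xy, h_x_given_y, mutual_information)
```

For this distribution the chain rule H(X,Y) = H(Y) + H(X|Y) holds, and the mutual information is the same whichever variable is observed.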

[Adapted from D.MacKay]

Straightforward approach: use 3 bits to encode each character (e.g. '000' for a, '001' for b, '010' for c, '011' for d, '100' for e, '101' for f).

The length of the data file will then be 300 000 bits.
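A minimal sketch of the fixed-length scheme. The file size of 100 000 characters is an assumption inferred from 300 000 bits at 3 bits per character; the slide that stated the problem is not preserved in this transcript.

```python
import math

alphabet = "abcdef"

# Smallest number of bits that gives every character its own code word.
bits_per_char = math.ceil(math.log2(len(alphabet)))

# Assign consecutive binary numbers: a -> 000, b -> 001, ..., f -> 101.
fixed_code = {ch: format(i, f"0{bits_per_char}b") for i, ch in enumerate(alphabet)}

file_length_chars = 100_000  # assumption: inferred from 300 000 / 3
total_bits = bits_per_char * file_length_chars

print(fixed_code, total_bits)
```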

Can we do better?

[Adapted from S.Cheng]

Is this prefix code optimal?

[Adapted from S.Cheng]

[Adapted from M.Brookes]

Construct a Huffman code for symbols with frequencies:

Symbol   Frequency
A        15
D        6
F        6
H        3
I        1
M        2
N        2
U        2
V        2
#        7
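One way to work through this exercise is sketched below, using a priority queue to merge the two lightest subtrees repeatedly. The function name and the tie-breaking counter are implementation choices of this sketch; different tie-breaking can give different code words, but the total encoded length is the same.

```python
import heapq
from itertools import count

def huffman_code(frequencies):
    """Return {symbol: bit string}. A single-symbol alphabet gets the empty string."""
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(w, next(tiebreak), {s: ""}) for s, w in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # lightest subtree
        w2, _, right = heapq.heappop(heap)  # second lightest subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

frequencies = {"A": 15, "D": 6, "F": 6, "H": 3, "I": 1,
               "M": 2, "N": 2, "U": 2, "V": 2, "#": 7}
code = huffman_code(frequencies)

# Total number of bits needed to encode a text with these symbol counts.
total_bits = sum(frequencies[s] * len(code[s]) for s in frequencies)
print(total_bits)  # 135
```

With 46 symbol occurrences in total, this is about 2.93 bits per symbol, compared with 4 bits per symbol for a fixed-length code over 10 symbols.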

[Adapted from H.Lewis, L.Denenberg]

[Adapted from S.Cheng]

Huffman codes are optimal!

[Adapted from S.Cheng]

[Adapted from H.Lewis and L.Denenberg]

- Proof by induction:
- n = 1: OK
- assume T is obtained by the Huffman algorithm and X is an optimal tree.
- Construct T’ and X’ as described by the lemma. Then:
- $w(T') \le w(X')$
- $w(T) = w(T') + C(n_1) + C(n_2)$
- $w(X) \ge w(X') + C(n_1) + C(n_2)$
- $w(T) \le w(X)$

[Adapted from H.Lewis and L.Denenberg]

- $W(\Sigma)$ - average number of bits used by Huffman code
- $H(\Sigma)$ - entropy
- Then $H(\Sigma) \le W(\Sigma) < H(\Sigma) + 1$.
- Assume all probabilities are of the form $1/2^k$.
- Then we can prove by induction that $H(\Sigma) = W(\Sigma)$ (we can state that a symbol with probability $1/2^k$ will always be at depth k)
- obvious if $|\Sigma| = 1$ or $|\Sigma| = 2$
- otherwise there will always be two symbols having the smallest probabilities, both equal to $1/2^k$
- these will be joined by the Huffman algorithm, thus we have reduced the problem to an alphabet containing one symbol less.
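The claim for probabilities of the form $1/2^k$ can be illustrated on a small example. The distribution below is hypothetical, and `huffman_depths` is a sketch that tracks only the depth of each symbol in the Huffman tree.

```python
import heapq
import math
from itertools import count

def huffman_depths(probs):
    """Depth of each symbol in a Huffman tree built for the given probabilities."""
    tiebreak = count()  # unique counter so heap entries never compare the lists
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    depth = [0] * len(probs)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for i in group1 + group2:
            depth[i] += 1  # every symbol under the new node moves one level down
        heapq.heappush(heap, (p1 + p2, next(tiebreak), group1 + group2))
    return depth

dyadic = [1/2, 1/4, 1/8, 1/16, 1/16]  # hypothetical probabilities, all 1/2^k
depths = huffman_depths(dyadic)

average_bits = sum(p * d for p, d in zip(dyadic, depths))
entropy_bits = -sum(p * math.log2(p) for p in dyadic)
print(depths, average_bits, entropy_bits)  # [1, 2, 3, 4, 4] 1.875 1.875
```

Each symbol with probability $1/2^k$ ends up at depth k, so the average code length equals the entropy.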

- $W(\Sigma)$ - average number of bits used by Huffman code
- $H(\Sigma)$ - entropy
- Then $W(\Sigma) < H(\Sigma) + 1$.
- Consider symbols a with probabilities $1/2^{k+1} \le p(a) < 1/2^k$
- modify the alphabet: for each a reduce its probability to $1/2^{k+1}$
- add extra symbols with probabilities of the form $1/2^k$ (so that all powers for these are different)
- construct the Huffman encoding tree
- the depth of the initial symbols will be k+1, thus $W(\Sigma) < H(\Sigma) + 1$
- we can prune the tree by deleting the extra symbols; this will only decrease $W(\Sigma)$
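The depth k+1 assigned to a symbol with $1/2^{k+1} \le p(a) < 1/2^k$ equals $\lceil \log_2 (1/p(a)) \rceil$, which is the code word length used by the Shannon code. The sketch below checks, for a hypothetical distribution, that these lengths fit into a binary tree (Kraft inequality) and stay below H + 1 on average.

```python
import math

def shannon_lengths(probs):
    """Code word length ceil(log2(1/p)) for each symbol probability p."""
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical symbol probabilities
lengths = shannon_lengths(probs)

# Kraft sum <= 1 means a prefix code with these lengths exists.
kraft_sum = sum(2.0 ** -length for length in lengths)

average_bits = sum(p * length for p, length in zip(probs, lengths))
entropy_bits = -sum(p * math.log2(p) for p in probs)

print(lengths, kraft_sum)  # [2, 2, 3, 4] 0.6875
print(entropy_bits <= average_bits < entropy_bits + 1)  # True
```

The Kraft sum is below 1 here, which reflects the pruning step: the tree has unused leaves, so the Huffman code for the same distribution can only be shorter.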

Can we claim that $H(\Sigma) \le W(\Sigma) < H(\Sigma) + 1$?

In the general case a symbol with probability $1/2^k$ can be at a depth other than k:

Consider two symbols with probabilities $1/2^k$ and $1 - 1/2^k$; both of them will be at depth 1. However, changing both probabilities to ½ will only increase the entropy.

By induction we can show that all symbol probabilities can be changed to the form $1/2^k$ in such a way that the entropy does not decrease and the Huffman tree does not change its structure.

Thus we will always have $H(\Sigma) \le W(\Sigma) < H(\Sigma) + 1$.

Unlike the variable-length codes described previously, arithmetic coding generates non-block codes. In arithmetic coding, a one-to-one correspondence between source symbols and code words does not exist. Instead, an entire sequence of source symbols (or message) is assigned a single arithmetic code word.

The code word itself defines an interval of real numbers between 0 and 1. As the number of symbols in the message increases, the interval used to represent it becomes smaller and the number of information units (say, bits) required to represent the interval becomes larger. Each symbol of the message reduces the size of the interval in accordance with its probability of occurrence. The number of bits per symbol is expected to approach the limit set by the entropy.

Let the message to be encoded be a1a2a3a3a4

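[Figure: successive subdivision of the interval [0, 1) during encoding. The symbols occupy the subintervals a1: [0, 0.2), a2: [0.2, 0.4), a3: [0.4, 0.8), a4: [0.8, 1.0). The message narrows the interval to [0, 0.2), [0.04, 0.08), [0.056, 0.072), [0.0624, 0.0688) and finally [0.06752, 0.0688).]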

- So, any number in the interval [0.06752, 0.0688), for example 0.068, can be used to represent the message.
- Here 3 decimal digits are used to represent the 5-symbol source message. This translates into 3/5 or 0.6 decimal digits per source symbol and compares favourably with the entropy of $-(3 \times 0.2 \log_{10} 0.2 + 0.4 \log_{10} 0.4) = 0.5786$ digits per symbol.
- As the length of the sequence increases, the resulting arithmetic code approaches the bound set by entropy.
- In practice, the length fails to reach the lower bound, because of:
  - the addition of the end-of-message indicator that is needed to separate one message from another
  - the use of finite-precision arithmetic
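The interval narrowing for this message can be reproduced with exact rational arithmetic. This is a minimal sketch, not a practical coder: the symbol subintervals a1: [0, 0.2), a2: [0.2, 0.4), a3: [0.4, 0.8), a4: [0.8, 1) are read off the example, and the names `model` and `encode_interval` are illustrative.

```python
from fractions import Fraction

# Subinterval [low, high) of [0, 1) assigned to each symbol.
model = {
    "a1": (Fraction(0), Fraction(2, 10)),
    "a2": (Fraction(2, 10), Fraction(4, 10)),
    "a3": (Fraction(4, 10), Fraction(8, 10)),
    "a4": (Fraction(8, 10), Fraction(1)),
}

def encode_interval(message, model):
    """Return the final interval [low, high) that represents the message."""
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        sym_low, sym_high = model[symbol]
        # Keep only the part of the current interval that belongs to this symbol.
        low, width = low + width * sym_low, width * (sym_high - sym_low)
    return low, low + width

low, high = encode_interval(["a1", "a2", "a3", "a3", "a4"], model)
print(float(low), float(high))  # 0.06752 0.0688
```

Using `Fraction` sidesteps the finite-precision issue mentioned above; real coders work with fixed-width integers and rescaling instead.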

Decoding:

Decode 0.572.

Since 0.8 > code word > 0.4, the first symbol should be a3.

[Figure: successive subdivision of the interval [0, 1) during decoding. The code word 0.572 falls in turn into [0.4, 0.8), [0.56, 0.72), [0.56, 0.592), [0.5664, 0.5728) and [0.57152, 0.5728).]

Therefore, the message is:

a3a3a1a2a4
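The decoding steps can be sketched in the same way. The decoder below assumes the symbol subintervals a1: [0, 0.2), a2: [0.2, 0.4), a3: [0.4, 0.8), a4: [0.8, 1) and is told the message length in advance; that is a simplification of this sketch, standing in for an end-of-message indicator.

```python
from fractions import Fraction

# Subinterval [low, high) of [0, 1) assigned to each symbol.
model = {
    "a1": (Fraction(0), Fraction(2, 10)),
    "a2": (Fraction(2, 10), Fraction(4, 10)),
    "a3": (Fraction(4, 10), Fraction(8, 10)),
    "a4": (Fraction(8, 10), Fraction(1)),
}

def decode(code_word, length, model):
    """Recover `length` symbols from a code word in [0, 1)."""
    message = []
    value = code_word
    for _ in range(length):
        for symbol, (sym_low, sym_high) in model.items():
            if sym_low <= value < sym_high:
                message.append(symbol)
                # Rescale so the symbol's subinterval becomes [0, 1) again.
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return message

decoded = decode(Fraction(572, 1000), 5, model)
print(decoded)  # ['a3', 'a3', 'a1', 'a2', 'a4']
```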