
Information Theory



  1. Information Theory Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas)

  2. What is Information? The fundamental paradigm of information theory is a simple diagram: Sender → Channel → Receiver. A sender conveys information to a receiver across a communication channel. Information is any content to be conveyed. It is sent in the form of one or more messages. Information theory attempts to quantify the content of messages and the capacity of channels on which information may flow.

  3. Some Questions Information theory asks questions such as the following: • How much information is encoded in a particular message? • How efficiently can a given alphabet/language be transmitted? • What is the maximum capacity of a given transmission medium? • How is that capacity or efficiency reduced by interference/noise? These questions all have very precise and deep answers that are beyond our scope here. We'll only scratch the surface.

  4. Why Do We Care? Information theory is very important in computer science. It affects all communication, hardware design, protocol design, cryptography, fault-tolerance, etc. For example, in our current context it is useful to know how much information can be transmitted over a specific covert channel. This is the “bandwidth” of the channel. A useful way of measuring bandwidth is bits per second.

  5. Quantifying Information: Thought Experiment How do you quantify the information content of a response to a “yes” or “no” question? What does that even mean? Maybe it means the following: • Sender has either a “yes” or a “no.” • Receiver knows that Sender has one of those two possibilities, but not which. • Sender wants to transmit enough data, but no more, to resolve entirely Receiver’s uncertainty. • How much data must Sender transmit? In those terms the answer is pretty clear: One bit of data. But what must be true for Receiver to interpret the answer?

  6. Thought Experiment Lessons This suggests: • In some cases, it is possible to quantify the information content of a message. • Maybe an appropriate unit of information content is bits. • Sender and receiver must have some shared knowledge, including an agreed encoding scheme.

  7. Lessons • Information is any content that can be conveyed from a sender to a receiver across a communication channel. • Information theory attempts to quantify the amount of information in a message and the capacity of the channel. • Communication requires some shared knowledge.

  8. Information Content Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas)

  9. Quantifying Information How much information is contained in each of the following messages? How would you decide? • An n-bit binary number. • A single decimal digit. • A two-digit decimal number. • “The attack is at dawn.”

  10. Quantifying Information Suppose Sender and Receiver have agreed that Sender will transmit one of exactly 16 messages. How many bits of information must be transmitted to convey the message? Why?

  11. Binary Tree Representation • Suppose we represent all possible choices in any decision as the leaves of a “complete” binary tree. The longest path is log₂ k, if there are k leaves on the tree. • Each choice (bit) reduces the search space by half.

  12. Encoding That suggests that the information content of any message from a space of k messages should be something like log₂ k bits. But to get a very efficient transmission: • Sender and Receiver must know in advance the space of possible transmissions. • They must have agreed on an encoding.
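
As a quick illustrative sketch in Python (not part of the original slides): the number of bits needed to select one of k equally likely messages with a fixed-length binary code is the ceiling of log₂ k.

    import math

    def bits_needed(k: int) -> int:
        """Bits required to distinguish one of k equally likely messages
        with a fixed-length binary code: ceiling of log2(k)."""
        return math.ceil(math.log2(k))

    for k in (2, 16, 1000):
        print(k, "messages ->", bits_needed(k), "bits")
    # 2 messages -> 1 bits, 16 messages -> 4 bits, 1000 messages -> 10 bits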

  13. How Much Information? How much information is contained in the message: “The attack is at dawn”? Answer: It depends on the Receiver’s level of uncertainty. • If the only uncertainty were whether at dawn or dusk: one bit. • If the attack could have come anytime during the day: ? bits. • If the day was uncertain...: ? bits. “The word ‘information’ relates not so much to what you do say, as to what you could say. ... That is, information is the measure of your freedom of choice when you select a message.” –Warren Weaver

  14. Lessons • The information content of a message is the amount of uncertainty it resolves. • In an ideal situation, each bit transmitted can reduce the uncertainty by half. • Very few circumstances are ideal in the required sense.

  15. Exploring Encodings Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas)

  16. Quantifying Information Information content is commonly measured in terms of bits. “Bit” has two connotations, which are not the same: bit₁: a binary digit (discrete); bit₂: a quantity of information (continuous). The information content of a message is measured in bit₂s and the capacity of a channel in bit₂s per second (bps). In general, the best way of transmitting (encoding) a series of messages is the one that minimizes the number of bit₂s required, on average.

  17. Finding an Encoding Four bits are adequate to encode 16 possible messages M0, …, M15 (e.g., M0 → 0000, M1 → 0001, …, M15 → 1111). Call this the naïve encoding. • Can we do better for one message? What would that mean? • How about transmitting n messages, each of which is one of 16 possible values?

  18. Probability and Encoding Suppose you need to send 1000 messages, each of which can be one of 16 possibilities. But on average 99.5% will be message 10. Question: Does it still require 4 × 1000 = 4000 bits to send your 1000 messages? Answer: It is possible to come up with an encoding that will do better on average than the naïve encoding. Note, when we talk about sending 1000 messages, we’ve gone from talking about the information content of a message to talking about that of a language.

  19. A Better Encoding • Use an encoding in which message 10 is sent as a single bit (say “0”), and each of the other 15 messages is sent as 5 bits (say “1” followed by its 4-bit naïve code). • Given 1000 messages, on average 995 of them will be message 10, and 5 will be other messages. This encoding takes 995 × 1 + 5 × 5 = 1020 bits, or 1.02 bits per message.
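
A small Python sketch of the arithmetic above; the 1-bit / 5-bit code lengths are the assumption stated in the example, not a prescribed scheme.

    # 16 possible messages; message 10 occurs with probability 0.995.
    # Assumed code lengths: 1 bit for message 10, 5 bits for each other message.
    p_common = 0.995
    naive_bits_per_msg = 4                                    # fixed 4-bit code
    better_bits_per_msg = p_common * 1 + (1 - p_common) * 5   # weighted average length

    print(f"naive encoding : {naive_bits_per_msg:.2f} bits/message")
    print(f"better encoding: {better_bits_per_msg:.2f} bits/message")  # 1.02
    print(f"1000 messages  : {1000 * better_bits_per_msg:.0f} bits")   # 1020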

  20. Some Observations • Our encoding is pretty good, but can we do even better? Is there a limit to how well we can do? • Computing the number of bits per message depends on knowing the prior probabilities—how often each message appears in an arbitrarily long sequence of messages. • The “on average” part is important; some sequences would be less efficient under our encoding. • We used the “naïve encoding” as our benchmark, but there are much worse encodings. • Is it possible to find an optimal encoding? What would that mean?

  21. Lessons • “Bit” has two distinct meanings that are easily confused. • For any language, one can find a naïve encoding that will work, but it’s often possible to do better. • “Doing better” means using fewer bits, on average, to transmit messages in the language.

  22. Languages & Encodings Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas)

  23. Languages A language may refer to a natural language such as English or Japanese. But it may also refer to any set of symbols used to form “strings.” Often, we’ll use “language” to refer to the results of a series of experiments. Example 1: Assume a two-sided coin. Symbols in the language are “H” (heads) and “T” (tails). A string describes the result of repeatedly flipping the coin. E.g., “HHTHTTTHTHTTHTT...” Example 2: Assume a six-sided die. Symbols in the language are “1”, “2”, “3”, “4”, “5”, “6”. A string describes the result of repeatedly rolling the die. E.g., “425146261435261...”

  24. Encoding Properties For any language, our goal is a binary encoding, with the following properties: lossless: it must be possible to recover the entire original sequence of symbols from the transmission; uniquely decodable: for any encoded string, there must be only one possible decoding; streaming: there should be no breaks in the encoding.

  25. Unique Decodability Suppose you roll a 6-sided die repeatedly. The following are some possible codes. Are they lossless, uniquely decodable, streaming? Which is “best”? Sufficient (but not necessary) for unique decodability is the property of being prefix-free: That is, the string representing any symbol cannot be an initial prefix of the string representing any other symbol.
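
The slide's table of candidate die codes isn't reproduced here, but the prefix-free test is easy to express in code. The die codes below are hypothetical stand-ins, not the slide's originals.

    def is_prefix_free(code: dict) -> bool:
        """True if no codeword is a proper prefix of another
        (sufficient, but not necessary, for unique decodability)."""
        words = list(code.values())
        return not any(a != b and b.startswith(a) for a in words for b in words)

    # Hypothetical codes for the six die faces:
    fixed  = {s: format(i, "03b") for i, s in enumerate("123456")}   # 3 bits per face
    prefix = {"1": "00", "2": "01", "3": "100", "4": "101", "5": "110", "6": "111"}
    broken = {"1": "0", "2": "00", "3": "01", "4": "10", "5": "11", "6": "110"}

    print(is_prefix_free(fixed), is_prefix_free(prefix), is_prefix_free(broken))
    # True True False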

  26. Unique Decodability Imagine a language containing the symbols A, B, C. What’s wrong with the following encoding? A clue is that it is not prefix-free. An encoding can fail to have the prefix-free property, but still be uniquely decodable. Parsing such a language may require arbitrary “look-ahead.”

  27. Finding a Coding How do you come up with an efficient encoding? Use fewer bits for the symbols that occur more frequently. Samuel Morse knew this instinctively. But Morse code doesn’t satisfy our criteria for encodings. Why not? Huffman coding is guaranteed to find an efficient code for a given language if you know the probabilities of symbols in the language.
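
A minimal Huffman-coding sketch in Python, assuming the symbol probabilities are known; this is an illustration, not the slides' own implementation.

    import heapq
    from itertools import count

    def huffman_code(probs: dict) -> dict:
        """Build a Huffman code from {symbol: probability}."""
        tiebreak = count()  # keeps heap entries comparable when probabilities tie
        heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)   # two least probable subtrees
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
        return heap[0][2]

    # Skewed distribution from slide 18: message 10 has probability 0.995,
    # the other 15 messages share the remaining 0.005 equally.
    probs = {10: 0.995, **{m: 0.005 / 15 for m in range(16) if m != 10}}
    code = huffman_code(probs)
    print(code[10], len(code[0]))   # message 10 gets a very short codeword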

  28. Lessons • We use “language” for any scheme to generate a series of symbols. • For any language, we want an efficient binary encoding that is lossless, uniquely decodable and streaming. • Huffman encoding will provide such an encoding assuming we know the probabilities of symbols in the language.

  29. Entropy Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas)

  30. An Encoding Example Recall the following language: Example 1: Assume a two-sided coin. Symbols in the language are “H” (heads) and “T” (tails). A string describes the result of repeatedly flipping the coin. E.g., “HHTHTTTHTHTTHTT...” Suppose you know further that it’s a fair coin. The naïve encoding for this language uses one bit per symbol (e.g., H → 0, T → 1). Is there a better encoding? How would you know?

  31. Entropy Claim: If you know the probabilities of symbols in the language, you can compute the average information content of a symbol in the language. For our fair coin example, the computed answer is 1 bit per symbol. Definition: The entropy of a language is a measure of the information content of an average symbol in the language.

  32. Computing Entropy Entropy H is computed as follows. If pᵢ is the probability of the i-th symbol in the language, then H = −Σᵢ pᵢ log pᵢ. From now on, all logs are base 2. Example: Consider our fair coin example. Find the entropy. Solution: There are two symbols, each with probability 1/2, so: h = −(1/2 × log 1/2 + 1/2 × log 1/2) = 1
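
A direct transcription of the formula into Python (a sketch; probabilities are passed in as a list):

    from math import log2

    def entropy(probs) -> float:
        """H = -sum(p_i * log2(p_i)) over the symbol probabilities."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))      # fair coin:        1.0 bit/symbol
    print(entropy([1/6] * 6))       # fair die:        ~2.585 bits/symbol
    print(entropy([0.75, 0.25]))    # unbalanced coin: ~0.811 bits/symbol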

  33. What Does It Mean? What does it mean to say that the entropy of this language is 1? • On average, there is one bit of information in each symbol of the language. • It is impossible to find an encoding that uses less than one bit per symbol, on average. • Any encoding that uses one bit per symbol, on average, is optimal. Therefore, for this example, the naïve encoding is the optimal encoding.

  34. Entropy Special Case Whenever you have n symbols, all equally probable, the probability of any of them is 1/n. The language has entropy: h = −(log 1/n) = log n For example, a fair die with six sides has entropy: h = −(log 1/6) = log 6 ≈ 2.58 Hence, it requires 2.58 bits, on average, to transmit the result of a roll.

  35. Another Example Suppose we have an unbalanced coin that is three times more likely to yield a head than a tail. What is the entropy of this language? Solution: Again there are two possible outcomes, with P(heads) = 3/4 and P(tails) = 1/4. The entropy is computed as follows: h = −(3/4 × log 3/4 + 1/4 × log 1/4) ≈ 0.811 This says that it’s theoretically impossible to encode this language using less than 0.811 bits (on average) to transmit the result of each toss of the coin.

  36. Lessons • Given the probabilities of symbols in a language, you can compute its entropy. • Entropy measures the average information content of symbols in the language. • Entropy sets a lower limit on encoding efficiency.

  37. Example Revisited Given: an unbalanced coin that is three times more likely to yield a head than a tail. Solution: There are two possible outcomes, with P(heads) = 3/4 and P(tails) = 1/4. The entropy is computed as follows: h = −(3/4 × log 3/4 + 1/4 × log 1/4) ≈ 0.811 Upshot: It’s theoretically impossible to encode this language using less than 0.811 bits per symbol (on average). But how would you ever do better than 1 bit / symbol?

  38. Example Revisited Instead of taking single flips as our “experiments,” let’s take pairs (call them “2flips”) and code them with shorter codewords for likelier pairs, e.g.: HH → 0, HT → 10, TH → 110, TT → 111. Suppose we flip the coin 32 times; that’s 16 2flips. In an average run, we’d expect: HH to appear 9 times, HT and TH to each appear 3 times, and TT to appear once. Why?

  39. Example Revisited Given 32 flips (16 2flips), we could expect: HH: 9 × 1 bit = 9 bits; HT: 3 × 2 bits = 6 bits; TH: 3 × 3 bits = 9 bits; TT: 1 × 3 bits = 3 bits. Total: 27 bits. For the naïve encoding, using 1 bit / flip, we’d expect to use 32 bits. Our efficiency is 27/32 ≈ 0.844 bits per flip, which is not a bad approximation of the entropy (0.811). Could we do better? Sure, just use 3flips, 4flips, etc. The entropy is the limit of this process.
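
A quick Python check of the averages above, using the pair code lengths assumed on slide 38 (1, 2, 3, and 3 bits):

    from math import log2

    p    = {"HH": 9/16, "HT": 3/16, "TH": 3/16, "TT": 1/16}   # pair probabilities
    bits = {"HH": 1,    "HT": 2,    "TH": 3,    "TT": 3}      # assumed code lengths

    avg_bits_per_pair = sum(p[s] * bits[s] for s in p)         # 27/16
    avg_bits_per_flip = avg_bits_per_pair / 2                  # ~0.844
    coin_entropy = -(0.75 * log2(0.75) + 0.25 * log2(0.25))    # ~0.811

    print(f"{avg_bits_per_flip:.3f} bits/flip vs entropy {coin_entropy:.3f} bits/flip")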

  40. Test Your Understanding Suppose you have a six-sided die that is unbalanced such that 1 and 2 are equally likely; 3 and 4 are equally likely; and 5 and 6 are equally likely. However, the die rolls 1 twice as often as 3, and rolls 3 three times as often as 5. • What is the “naïve” encoding for this language? • What is the entropy of this language? • Find an encoding that is more efficient than the naïve encoding. • Give a convincing argument that your encoding is more efficient than the naïve encoding. Hint: There’s no need to encode sequences of rolls.

  41. Lessons • Computing the entropy of a language provides a bound on the efficiency of any encoding. • But, finding an efficient encoding requires ingenuity.

  42. Fundamental Theorems Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas)

  43. Source Entropy The entropy of a language is a measure of the most efficient possible encoding of the language. The entropy of a message source is the amount of information content that the source can produce in a given period. This is measured in bits per second. Any channel can transmit an arbitrary amount of information, given enough time and unbounded buffering capacity. But can a given channel transmit the information in real time?

  44. Channel Capacity The capacity of a channel is the number of bits that can be sent per second over the channel. This is a property of the communication medium. Fundamental Theorem of the Noiseless Channel (Shannon): If a language has entropy h (bits per symbol) and a channel can transmit C bits per second, then it is possible to encode the signal in such a way as to transmit at an average rate of (C/h) − ε symbols per second, where ε can be made arbitrarily small. It is impossible to transmit at an average rate greater than C/h.

  45. What the Theorem Means Suppose a channel can handle 100 bits / second and your language has entropy 5 (bits per symbol). Given a perfect encoding and a noiseless channel, you’d expect to be able to transmit 20 symbols / second through the channel, on average. Right? But you may not have a perfect encoding. Doesn’t matter. You can always find a better encoding to get within ε of that limit.

  46. Noisy Channels If the channel is noisy, the capacity is reduced by the noise. But the following is true: Fundamental Theorem of a Noisy Channel (Shannon): Let a discrete channel have capacity C and a discrete source entropy h (bits per second). If h < C there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors. If the channel (with noise factored in) can physically handle the message traffic, then it is possible to transmit with an arbitrarily small error rate.

  47. What this Theorem Means The upshot of this is that a message can be transmitted reliably over even a very noisy channel by increasing the redundancy of the coding scheme. For example, covert channels in the system cannot be dismissed with the argument that they are noisy and hence useless. You can always get the message through by finding a more redundant encoding.

  48. Lessons • Entropy provides a bound on coding efficiency. • But Shannon’s theorems show that it is always possible to approach that limit arbitrarily closely.

  49. Entropy of English Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas)

  50. Entropy vs. Redundancy Intuitively, the difference between the efficiency of the encoding and the entropy is a measure of the redundancy in the encoding. If you find an encoding with efficiency matching the entropy, there is no redundancy. The standard encoding for English contains a lot of redundancy. Fr xmpl, y cn prbbly gss wht ths sntnc sys, vn wth ll f th vwls mssng. Tht ndcts tht th nfrmtn cntnt cn b xtrctd frm th rmnng smbls.
