# A Bit of Information Theory


### A Bit of Information Theory

Unsupervised Learning Working Group

Assaf Oron, Oct. 15 2003

Based mostly upon:

Cover & Thomas, “Elements of Inf. Theory”, 1991

Contents
• Coding and Transmitting Information
• Entropy etc.
• Information Theory and Statistics
• Information Theory and “Machine Learning”
What is Coding? (1)
• We keep coding all the time
• Crucial requirement for coding: “source” and “receiver” agree on the key.
• Modern coding: telegraph → radio → …
• Practical problems: How efficient can we make it? Tackled from the 1920s on.
• 1940s: Claude Shannon
What is Coding? (2)
• Shannon’s greatness: finding a solution to the “specific” problem by working on the “general” problem.
• Namely: how does one quantify information, its coding and its transmission?
• ANY type of information
Information Complexity of Some Coded Messages
• Let’s think written numbers:
• k digits → 10^k possible messages
• How about written English?
• k letters → 26^k possible messages
• k words → D^k possible messages, where D is the size of the English dictionary

∴ Length ~ log(complexity)
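
A minimal sketch of this scaling (mine, not from the slides; the helper symbols_needed is just for illustration), counting how many symbols a given alphabet needs to distinguish a fixed number of messages:

```python
import math

def symbols_needed(num_messages, alphabet_size):
    """Symbols required to give each of num_messages a distinct code word."""
    return math.ceil(math.log(num_messages, alphabet_size))

# 1,000,000 equally likely messages:
print(symbols_needed(10**6, 10))   # 6 decimal digits
print(symbols_needed(10**6, 2))    # 20 bits    (2**20 = 1,048,576)
print(symbols_needed(10**6, 26))   # 5 letters  (26**5 = 11,881,376)
```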

Information Entropy
• The expected length (in bits) of an optimally coded binary message conveying the value of X (standard definition reproduced below)
• other common descriptions: “code complexity”, “uncertainty”, “missing/required information”, “expected surprise”, “information content” (BAD), etc.
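
The slide’s formula did not survive the transcript; the standard definition (Cover & Thomas, Ch. 2) for a discrete RV X with pmf p(x) is

$$
H(X) \;=\; -\sum_{x} p(x)\,\log_2 p(x) \;=\; E\!\left[\log_2 \frac{1}{p(X)}\right].
$$

For example, a fair coin has H = 1 bit, while a coin with P(heads) = 0.9 has H ≈ 0.47 bits.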
Why “Entropy”?
• Thermodynamics (mid-19th century): “amount of unusable heat in the system”
• Statistical physics (late 19th century): “log(complexity of the current system state)”
• ⇉ amount of “mess” in the system
• The two were proven to be equivalent
• Statistical entropy is proportional to information entropy if p(x) is uniform
• 2nd Law of Thermodynamics…
• Entropy never decreases (more later)
Kullback-Leibler Divergence (“Relative Entropy”)
• In words: “the excess message length incurred when a code optimized for q(x) is used for messages actually drawn from p(x)”
• Properties, Relation to H:
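
The slide’s formulas are missing from the transcript; in the standard notation (Cover & Thomas),

$$
D(p\,\|\,q) \;=\; \sum_{x} p(x)\,\log \frac{p(x)}{q(x)} \;\ge\; 0, \qquad D(p\,\|\,q) = 0 \iff p = q,
$$

and the relation to H is that coding X ~ p(x) with the code built for q(x) costs H(p) + D(p‖q) bits on average. Note that D is not symmetric and does not satisfy the triangle inequality, so it is not a true distance.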
Mutual Information
• Relationship to D, H (hint: conditional probability):
• Properties, Examples:
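
Filling in the missing formulas (standard identities, not visible in the transcript):

$$
I(X;Y) \;=\; D\big(p(x,y)\,\|\,p(x)\,p(y)\big) \;=\; H(X) - H(X \mid Y) \;=\; H(X) + H(Y) - H(X,Y).
$$

It is symmetric and non-negative, and equals zero exactly when X and Y are independent; at the other extreme, if Y = X then I(X;Y) = H(X).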
Entropy for Continuous RV’s
• “Little” h, Defined in the “natural” way
• However it is not the same measure:
• h of a discrete RV is −∞ (in the limit), and H of a continuous RV is infinite (measure theory…)
• For many continuous distributions, h is ½ log(variance) plus a constant (see the Gaussian example below)
• Why?
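
For reference (my addition; the slide’s formula is not in the transcript), the differential entropy of a density f is h(X) = −∫ f(x) log f(x) dx, and for a Gaussian

$$
h(X) \;=\; \tfrac{1}{2}\log\!\big(2\pi e\,\sigma^{2}\big) \;=\; \tfrac{1}{2}\log \sigma^{2} + \text{const},
$$

which is the “½ log(variance) plus a constant” pattern mentioned above.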
The Statistical Connection (1)
• K-L D ⇔ likelihood ratio (spelled out below)
• Law of large numbers can be rephrased as a limit on D
• For dist.’s with same variance, normal is the one with maximum h.
• (2nd law of thermodynamics revisited)
• h is an average quantity. Is the CLT, then, a “law of nature”?… (I think: “YES”!)
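
Spelling out the first two bullets (my gloss, not on the slide): for i.i.d. X₁, …, Xₙ ~ p, the law of large numbers gives

$$
\frac{1}{n}\sum_{i=1}^{n} \log \frac{p(X_i)}{q(X_i)} \;\longrightarrow\; E_p\!\left[\log \frac{p(X)}{q(X)}\right] \;=\; D(p\,\|\,q),
$$

so the normalized log-likelihood ratio between the true and an alternative distribution converges to their K-L divergence.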
The Statistical Connection (2)
• Mutual information is very useful
• Certainly for discrete RVs (a plug-in estimator is sketched after this list)
• Also for continuous RVs (no distributional assumptions!)
• A lot of implications for stochastic processes, as well
• I just don’t quite understand them
• English?
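
As a concrete illustration of mutual information for discrete RVs (my sketch, not from the slides; the helper mutual_information is hypothetical, not a library function), a plug-in estimate from paired samples:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # empirical joint distribution
    px = Counter(xs)             # empirical marginal of X
    py = Counter(ys)             # empirical marginal of Y
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# Y is a noisy copy of X, so the estimate should be positive (here about 0.19 bits).
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 1, 1, 1, 1, 0]
print(mutual_information(xs, ys))
```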
Machine Learning? (1)
• So far, we haven’t mentioned noise
• In inf. theory, noise exists in the channel
• Channel capacity: max(mutual information) between “source” and “receiver” (formula after this list)
• Noise directly decreases the capacity
• Shannon’s “Biggest” result: this can be (almost) achieved with (almost) zero error
• Known as the “Channel Coding Theorem”
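
For reference (standard statements; the slide’s formulas are not in the transcript), the capacity in the theorem is

$$
C \;=\; \max_{p(x)} I(X;Y),
$$

every rate below C is achievable with error probability tending to zero, and rates above C are not. For a binary symmetric channel that flips each bit with probability p, C = 1 − H(p) with H(p) = −p log₂ p − (1 − p) log₂(1 − p), so more noise (p closer to ½) directly lowers the capacity.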
Machine Learning? (2)
• The CCT inspired practical developments
• Now it all depends on code and channel!
• Smarter, “error-correcting” codes
• Tech developments focus on channel capacity
Machine Learning? (3)
• Can you find an analogy between coding and classification/clustering? (Can it be useful?)
Machine Learning? (4)
• Inf. theory tells us that:
• We CAN find a nearly optimal classification or clustering rule (“coding”)
• We CAN find a nearly optimal parameterization+classification combo
• Perhaps the newer wave of successful but statistically “intractable” methods (boosting etc.) works by increasing channel capacity (i.e., high-dimensional parameterization)?