
A Bit of Information Theory

Unsupervised Learning Working Group

Assaf Oron, Oct. 15 2003

Based mostly upon:

Cover & Thomas, “Elements of Inf. Theory”, 1991

Contents

  • Coding and Transmitting Information
  • Entropy etc.
  • Information Theory and Statistics
  • Information Theory and “Machine Learning”
What is Coding? (1)
  • We keep coding all the time
  • Crucial requirement for coding: “source” and “receiver” agree on the key.
  • Modern coding: telegraph → radio → …
    • Practical problems: how efficient can we make it? Tackled from the 1920s on.
    • 1940s: Claude Shannon
What is Coding? (2)
  • Shannon’s greatness: solving the “specific” problem by working on the “general” problem.
  • Namely: how does one quantify information, its coding and its transmission?
    • ANY type of information
Information Complexity of Some Coded Messages
  • Let’s think written numbers:
    • k digits → 10^k possible messages
  • How about written English?
    • k letters → 26^k possible messages
    • k words → D^k possible messages, where D is the English dictionary size

∴ Length ~ log(complexity)
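The length ~ log(complexity) relation can be checked with a few lines of plain Python (a sketch; the function name `symbols_needed` and the example counts are mine, not the slides’):

```python
# How many symbols from an alphabet of size D are needed to express
# `num_messages` distinct messages?  D**k messages fit in k symbols,
# so the answer is ceil(log_D(num_messages)); integer arithmetic
# avoids floating-point edge cases at exact powers.
def symbols_needed(num_messages, alphabet_size):
    k, capacity = 0, 1
    while capacity < num_messages:
        capacity *= alphabet_size
        k += 1
    return k

# One million distinct messages:
print(symbols_needed(10**6, 10))   # 6 decimal digits
print(symbols_needed(10**6, 26))   # 5 letters (26**5 = 11,881,376)
print(symbols_needed(10**6, 2))    # 20 bits   (2**20 = 1,048,576)
```

In every base the required length grows like the logarithm of the number of possible messages; only the constant in front changes.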

Information Entropy
  • H(X) = −Σ_x p(x)·log₂ p(x): the expected length (bits) of a binary message conveying X-type information
    • Other common descriptions: “code complexity”, “uncertainty”, “missing/required information”, “expected surprise”, “information content” (BAD), etc.
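As a quick sanity check on the definition (a plain-Python sketch; the function name `entropy` and the example distributions are mine):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fair coin: one bit of "expected surprise" per toss.
print(entropy([0.5, 0.5]))     # 1.0
# Biased coin: less uncertainty, so fewer bits.
print(entropy([0.9, 0.1]))     # ~0.469
# Uniform over n outcomes: H = log2(n), the statistical-physics case
# where H is proportional to log(number of states).
print(entropy([0.25] * 4))     # 2.0
```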
Why “Entropy”?
  • Thermodynamics (mid-19th century): “amount of unusable heat in a system”
  • Statistical physics (late 19th century): “log(complexity of the current system state)”
    • ≈ the amount of “mess” in the system
    • The two were proven to be equivalent
    • Statistical entropy is proportional to information entropy when p(x) is uniform
  • 2nd Law of Thermodynamics…
    • Entropy never decreases (more later)
Kullback-Leibler Divergence (“Relative Entropy”)
  • D(p‖q) = Σ_x p(x)·log₂(p(x)/q(x))
  • In words: “the excess message length incurred by using a q(x)-optimized code for messages actually drawn from p(x)”
  • Properties, relation to H: D(p‖q) ≥ 0, with equality iff p = q; it is not symmetric; and E_p[−log₂ q(X)] = H(p) + D(p‖q)
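The key properties are easy to verify numerically (a sketch; `kl_divergence` and the example distributions are my own illustration):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum p(x) * log2(p(x)/q(x)): the extra bits per symbol
    paid for coding samples from p with a code optimized for q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))   # 0.0 -- D(p||p) = 0
print(kl_divergence(p, q))   # > 0, and ...
print(kl_divergence(q, p))   # ... different: D is not symmetric
```

Note the asymmetry: a code built for a biased coin wastes more bits on fair-coin data than vice versa, which is why D is a divergence and not a distance.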
Mutual Information
  • I(X;Y) = D( p(x,y) ‖ p(x)·p(y) )
  • Relationship to D, H (hint: conditional probability): I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
  • Properties, examples: I(X;Y) ≥ 0; I(X;Y) = 0 iff X and Y are independent; I(X;X) = H(X)
Entropy for Continuous RV’s
  • “Little” h, defined in the “natural” way: h(X) = −∫ f(x)·log f(x) dx
  • However, it is not the same measure:
    • h of a discrete RV is −∞, and H of a continuous RV is infinite (measure theory…)
  • For many continuous distributions, h is ½·log(variance) plus some constant
    • Why?
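The normal distribution is the cleanest example of the “½·log(variance) plus a constant” pattern (a sketch using the standard closed form h = ½·log₂(2πeσ²); the function name is mine):

```python
import math

def gaussian_h(sigma):
    """Differential entropy of N(mu, sigma^2) in bits:
    h = 0.5 * log2(2 * pi * e * sigma^2)
    -- half the log-variance plus a constant, independent of the mean."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

print(gaussian_h(1.0))                      # ~2.047 bits
# Doubling sigma quadruples the variance and adds exactly one bit:
print(gaussian_h(2.0) - gaussian_h(1.0))    # 1.0
```

This is why h behaves like a log-scale measure of spread rather than an absolute “amount of information”.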
The Statistical Connection (1)
  • K-L divergence ⇔ likelihood ratio: E_p[ log( p(X)/q(X) ) ] = D(p‖q)
  • Law of large numbers can be rephrased as a limit on D
  • For dist.’s with same variance, normal is the one with maximum h.
    • (2nd law of thermodynamics revisited)
    • h is an average quantity. Is the CLT, then, a “law of nature”?… (I think: “YES”!)
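The likelihood-ratio connection can be illustrated by simulation (a sketch; the two distributions `p`, `q` and the sample size are arbitrary choices of mine): the average per-observation log-likelihood ratio under p converges to D(p‖q).

```python
import math
import random

p = {"a": 0.7, "b": 0.3}
q = {"a": 0.4, "b": 0.6}

# Analytic value: D(p||q) = sum p(x) log2(p(x)/q(x)).
d_analytic = sum(pi * math.log2(pi / q[x]) for x, pi in p.items())

# Monte Carlo: sample from p, average the log-likelihood ratio.
random.seed(0)
n = 200_000
outcomes = random.choices(list(p), weights=list(p.values()), k=n)
d_estimate = sum(math.log2(p[x] / q[x]) for x in outcomes) / n

print(d_analytic)    # ~0.265 bits
print(d_estimate)    # close to the analytic value (law of large numbers)
```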
The Statistical Connection (2)
  • Mutual information is very useful
    • Certainly for discrete RV’s
    • Also for continuous (no dist. assumptions!)
  • A lot of implications for stochastic processes, as well
    • I just don’t quite understand them
    • English?
Machine Learning? (1)
  • So far, we haven’t mentioned noise
    • In information theory, noise lives in the channel
    • Channel capacity: the maximum mutual information between “source” and “receiver”
    • Noise directly decreases the capacity
  • Shannon’s “biggest” result: capacity can be (almost) achieved with (almost) zero error
    • Known as the “Channel Coding Theorem”
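For the binary symmetric channel (each bit flipped independently with some probability) the capacity has a closed form, C = 1 − H(flip probability); a small sketch (function names are mine):

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob):
    """Capacity of a binary symmetric channel: C = 1 - H(flip_prob).
    Noise eats directly into the 1 bit/use a clean channel carries."""
    return 1.0 - h2(flip_prob)

print(bsc_capacity(0.0))     # 1.0  -- noiseless channel
print(bsc_capacity(0.11))    # ~0.5 -- half a bit per use survives
print(bsc_capacity(0.5))     # 0.0  -- pure noise, nothing gets through
```

The channel coding theorem says rates up to C are achievable with vanishing error probability; above C, reliable transmission is impossible.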
Machine Learning? (2)
  • The CCT inspired practical developments
    • Now it all depends on code and channel!
    • Smarter, “error-correcting” codes
    • Tech developments focus on channel capacity
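The simplest error-correcting code is repetition with majority-vote decoding (a toy sketch, not one of the “smarter” codes the slide has in mind; it corrects any single flip per triple at the cost of a rate of 1/3):

```python
from collections import Counter

def encode(bits):
    """Repeat each bit three times."""
    return [b for b in bits for _ in range(3)]

def decode(coded):
    """Majority vote within each triple recovers the original bit
    as long as at most one of the three copies was flipped."""
    out = []
    for i in range(0, len(coded), 3):
        triple = coded[i:i + 3]
        out.append(Counter(triple).most_common(1)[0][0])
    return out

msg = [1, 0, 1, 1]
sent = encode(msg)
sent[4] ^= 1             # channel noise flips one bit
print(decode(sent))      # [1, 0, 1, 1] -- the flip is corrected
```

Practical codes (Hamming, Reed-Solomon, turbo, LDPC) achieve far better rate/reliability trade-offs, which is what lets real systems approach channel capacity.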
Machine Learning? (3)
  • Can you find an analogy between coding and classification/clustering? (Can it be useful??)
Machine Learning? (4)
  • Inf. Theory tells us that:
    • We CAN find a nearly optimal classification or clustering rule (“coding”)
    • We CAN find a nearly optimal parameterization+classification combo
    • Perhaps the newer wave of successful, but statistically “intractable” methods (boosting etc.) works by increasing channel capacity (i.e., high-dimensional parameterization)?