
PMAI 2001/02 Targil 1: Basic Probability and Information theory


Presentation Transcript


  1. PMAI 2001/02 Targil 1: Basic Probability and Information theory (see further details in handouts)

  2. Random Variables and Probability A random variable X describes a value that is the result of a process whose outcome cannot be determined in advance. The shape of the card we draw from a deck is a random variable. The sample space S of a random variable is the set of all possible values of the variable. An event is a subset of S. A random variable is not necessarily random in the true sense of the word. Usually we deal with random variables which obey some law of nature, or a background probability. Still, we cannot determine its value, only the chance of encountering a certain value. The probability function p(x) defines this chance.
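
As a small illustration (a hypothetical sketch, not part of the original slides), the card-shape example can be written as a probability function over a finite sample space:

```python
import random

# Sample space S: the four shapes (suits) of a card drawn from a standard deck.
# The probability function p(x) assigns each elementary event its chance.
p = {"hearts": 0.25, "diamonds": 0.25, "clubs": 0.25, "spades": 0.25}

assert abs(sum(p.values()) - 1.0) < 1e-12  # p must sum to 1 over S

# An event is a subset of S, e.g. "the card is red":
red = {"hearts", "diamonds"}
print("P(red) =", sum(p[x] for x in red))  # 0.5

# We cannot determine the value in advance, only sample according to p(x):
print(random.choices(list(p), weights=list(p.values()), k=5))
```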

  3. Random Variables and Probability (2) Probability is our way of formally describing our world and evaluating uncertainties. We define our world Ω as a set of random variables and divide it into elementary events, or states. The states are mutually exclusive. A probability space satisfies the following properties: p(x) ≥ 0 for every state x, and Σx p(x) = 1. For a continuous random variable, p(x) is the distribution density function: P(a ≤ X ≤ b) = ∫ from a to b of p(x) dx, with the integral over the whole range equal to 1. Note: p(x) can be greater than 1 because it is not a probability value.
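
A quick numeric check of the note above (a sketch assuming a uniform density on [0, 0.5], not taken from the slides): the density value exceeds 1, yet the total probability still integrates to 1.

```python
# Uniform density on [0, 0.5]: p(x) = 2 inside the interval, 0 outside.
def p(x):
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

# Numerically integrate p over [0, 1] with the midpoint rule.
n = 100000
dx = 1.0 / n
total = sum(p((i + 0.5) * dx) * dx for i in range(n))
print("p(0.25) =", p(0.25))                 # 2.0 -> a density can exceed 1
print("integral of p =", round(total, 6))   # ~1.0 -> total probability is still 1
```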

  4. Expectation, Variance and Deviation The moments of a random variable define its important characteristics. The first moment is the expectation: <X> = E[X] = Σx x·p(x). Note: the expectation has a misleading name and is not always the value we expect to see most. In the case of the number on a die the expectation is 3.5, which is not a value we will see at all! The expectation is a weighted average. The variance is defined by Var[X] = <X^2> - <X>^2 = M_2 - M_1^2. The standard deviation σ = Var[X]^(1/2) evaluates the "spread" of X in relation to the mean.
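
A quick check of the die example (an illustrative sketch, not from the original slides):

```python
# Fair six-sided die: each face has probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
p = 1.0 / 6.0

mean = sum(x * p for x in faces)                 # first moment M1 = <X>
second_moment = sum(x * x * p for x in faces)    # second moment M2 = <X^2>
var = second_moment - mean ** 2                  # Var[X] = <X^2> - <X>^2
std = var ** 0.5                                 # standard deviation

print(mean)  # 3.5 -- never actually observed on a die
print(var)   # ~2.9167
print(std)   # ~1.7078
```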

  5. Conditional probability and Bayes' Rule The knowledge that a certain event occurred can change the probability that another event will occur. P(x|y) denotes the conditional probability that x will happen given that y has happened. Bayes' rule states that: P(x|y) = P(y|x)P(x) / P(y). The complete probability formula states that P(A) = P(A|B)P(B) + P(A|¬B)P(¬B), or in the more general case P(A) = Σi P(A|Bi)P(Bi) for a partition {Bi}. Note: P(A) = αP(A|B) + (1-α)P(A|¬B), with α = P(B), mirrors our intuition that the unconditional probability of an event is somewhere between its conditional probabilities based on two opposing assumptions.
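
A small numeric sketch of the complete probability formula and Bayes' rule (the numbers here are invented for illustration and are not from the slides):

```python
# Two opposing assumptions B and not-B, with assumed illustrative numbers.
p_B = 0.3
p_A_given_B = 0.9
p_A_given_notB = 0.2

# Complete probability formula: P(A) = P(A|B)P(B) + P(A|~B)P(~B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)
print("P(A) =", p_A)  # 0.41, between 0.2 and 0.9 as intuition suggests

# Bayes' rule: P(B|A) = P(A|B)P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print("P(B|A) =", round(p_B_given_A, 4))  # ~0.6585
```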

  6. The importance of conditional probability: Simpson's Paradox The following table describes the effectiveness of a certain drug on a population: The ratio of recovery for the whole population increases from 40/50 to 105/90. Surprisingly, the ratio of recovery decreased for males as well as for females. How can this be? The paradox lies in ignoring the context, or the condition in which the results were given. If we had written the facts in terms of correct conditional probabilities (in a 50% male population) instead of absolute numbers we would get: P(Recovery|Drug) = 1/2·15/55 + 1/2·90/140 ≈ 0.4578
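
The weighted conditional probability quoted above can be checked directly (the per-group counts 15/55 and 90/140 are taken from the slide text; the full table itself is not reproduced here):

```python
# Recovery rates per group as quoted on the slide, weighted by a 50% male population.
p_recovery_given_drug = 0.5 * (15 / 55) + 0.5 * (90 / 140)
print(round(p_recovery_given_drug, 4))  # 0.4578
```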

  7. Independence between random variables Two random variables X and Y are independent if knowledge about X does not change the uncertainty about Y and vice versa. Formally, for a probability distribution P: Ind(X;Y) ⇔ P(X|Y) = P(X) (and symmetrically). • If Ind(X;Y) holds then P(X,Y) = P(X)P(Y) • If Ind(X;Y,Z) holds then P(X,Y,Z) = P(X)P(Y,Z) • Ind(X;Y,Z) implies Ind(X;Y) and Ind(X;Z), but not the other way around! • If Ind(X;Y|Z) holds then P(X,Y|Z) = P(X|Z)P(Y|Z), or equivalently P(X|Y,Z) = P(X|Z)
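
A minimal sketch (with an invented joint distribution) of checking Ind(X;Y) by testing whether P(X,Y) = P(X)P(Y) for every pair of values:

```python
from itertools import product

# Invented joint distribution over X in {0,1} and Y in {0,1} that factorizes:
# P(X=1) = 0.3, P(Y=1) = 0.6, and P(X,Y) = P(X)P(Y).
joint = {(0, 0): 0.28, (0, 1): 0.42, (1, 0): 0.12, (1, 1): 0.18}

def marginal(joint, axis):
    m = {}
    for xy, p in joint.items():
        m[xy[axis]] = m.get(xy[axis], 0.0) + p
    return m

px, py = marginal(joint, 0), marginal(joint, 1)
independent = all(abs(joint[(x, y)] - px[x] * py[y]) < 1e-12
                  for x, y in product(px, py))
print("Ind(X;Y) holds:", independent)  # True
```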

  8. Stating your facts carefully: The Three Prisoners Paradox Three prisoners A, B and C have been tried for murder and are expecting their sentences in the morning. They are told that only one of them will be convicted and that the others will go free. They also know that the guard knows which one of them is to be convicted. At night, prisoner A calls the guard and asks him to pass a letter to one of his innocent friends (there is at least one, so the guard's action should not change anything). After a while, A asks the guard to tell him the identity of the prisoner who received the letter. Again, the guard agrees and tells him it is B, as it does not appear to change anything. A, in his bunk, thinks: "Before I talked to the guard my chances of being executed were 1/3. Now that I know that B is innocent my chances of dying have gone from 33% to 50%. I made sure I did not ask anything about my own fate. What did I do wrong?"

  9. Stating your facts carefully: The Three Prisoners Paradox (2) Let us try and work out the problem in probabilistic terms. Let GA denote the event of A being guilty and IB of B being innocent. P(IB|GA) = 1 and P(IB) = 2/3, so Bayes' rule gives P(GA|IB) = P(IB|GA)P(GA) / P(IB) = (1 · 1/3) / (2/3) = 1/2: worse than that, we would have gotten the same answer if the guard had said that C is innocent. How can this be? The paradox comes from not defining the context correctly. IB was concluded from IB' = "The guard will declare B innocent". P(IB') = 1/2 and not 2/3. Rewriting the equations we get: P(GA|IB') = P(IB'|GA)P(GA) / P(IB') = (1/2 · 1/3) / (1/2) = 1/3.
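
A Monte Carlo sketch of the paradox (not part of the original slides; it assumes the guard names B or C at random when both are innocent):

```python
import random

trials = 200_000
guilty_given_guard_says_B = 0
guard_says_B = 0

for _ in range(trials):
    condemned = random.choice("ABC")
    # The guard names an innocent prisoner other than A who received the letter.
    if condemned == "A":
        named = random.choice("BC")   # both B and C are innocent; pick one
    elif condemned == "B":
        named = "C"
    else:
        named = "B"
    if named == "B":
        guard_says_B += 1
        if condemned == "A":
            guilty_given_guard_says_B += 1

# A's chance of being the condemned one, given the guard named B, stays ~1/3.
print(guilty_given_guard_says_B / guard_says_B)
```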

  10. Bayes' rule and inference Bayes' rule is fundamental in learning when viewed in terms of evidence and hypothesis: P(hypothesis|evidence) = P(evidence|hypothesis)·P(hypothesis) / P(evidence), where P(evidence|hypothesis) is the likelihood, P(hypothesis) is the prior probability and P(hypothesis|evidence) is the posterior probability. This allows us to update our belief in a hypothesis based on prior belief and in response to evidence.
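
A short sketch of such an update (the hypothesis space and numbers are invented for illustration):

```python
# Two competing hypotheses with prior beliefs, and the likelihood of the
# observed evidence under each one.
priors = {"h1": 0.5, "h2": 0.5}
likelihoods = {"h1": 0.8, "h2": 0.3}   # P(evidence | hypothesis)

# P(evidence) by the complete probability formula, then Bayes' rule per hypothesis.
p_evidence = sum(likelihoods[h] * priors[h] for h in priors)
posteriors = {h: likelihoods[h] * priors[h] / p_evidence for h in priors}
print(posteriors)  # h1 ~0.727, h2 ~0.273 -- belief shifts toward h1
```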

  11. Information Theory: Entropy We want a fundamental measure that will allow us to gauge the amount of missing information in a distribution, or a set of possible states. A measure which gives us such an estimate is the entropy: H[P] = -Σx p(x) log p(x). • H[P] ≥ 0 (and is 0 when there is one possible outcome) • H[P] is concave: H[λP1 + (1-λ)P2] ≥ λH[P1] + (1-λ)H[P2] • The entropy gets its maximal value when there is the most uncertainty. This intuitively occurs when p(x) is uniform (we do not prefer any state over another).
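
A small sketch computing the entropy of a few distributions over three states (illustrative, not from the slides), showing that the uniform distribution maximizes it:

```python
from math import log2

def entropy(p):
    """H[P] = -sum p(x) * log2 p(x), with 0 * log 0 taken as 0."""
    return sum(px * log2(1 / px) for px in p if px > 0)

print(entropy([1.0, 0.0, 0.0]))   # 0.0    -- one possible outcome
print(entropy([0.7, 0.2, 0.1]))   # ~1.157 bits
print(entropy([1/3, 1/3, 1/3]))   # ~1.585 bits = log2(3), the maximum
```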

  12. Entropy (2) The joint entropy is defined by: H(X,Y) = -Σx,y p(x,y) log p(x,y). The conditional entropy is defined by: H(Y|X) = -Σx,y p(x,y) log p(y|x). The relative entropy (or the "distance" between two distributions) is: D(p||q) = Σx p(x) log( p(x)/q(x) ), and it equals 0 only if p(x) and q(x) are the same.
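
A sketch of the relative entropy between two invented distributions (not from the slides), checking that it vanishes only when they coincide:

```python
from math import log2

def relative_entropy(p, q):
    """D(p||q) = sum p(x) * log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(relative_entropy(p, p))  # 0.0 -- identical distributions
print(relative_entropy(p, q))  # > 0
print(relative_entropy(q, p))  # > 0, but a different value: D is not symmetric
```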

  13. Information and Intuition The mutual information between two random variables is: I(X;Y) = Σx,y p(x,y) log( p(x,y) / (p(x)p(y)) ). • The information of X on Y is the same as of Y on X: I(X;Y) = I(Y;X). • I(X;Y) = 0 if X and Y are independent. • The mutual missing information H(X,Y) is the missing information of X plus the conditional missing information of Y given that X is known: H(X,Y) = H(X) + H(Y|X). • The information of Y on X is the improvement in missing information about X once Y is known: I(X;Y) = H(X) - H(X|Y).
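
A sketch computing the mutual information of a small invented joint distribution and verifying I(X;Y) = H(X) - H(X|Y):

```python
from math import log2

# Invented joint distribution over X, Y in {0, 1} (clearly not independent).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x)p(y)) )
mi = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

# The same value via I(X;Y) = H(X) - H(X|Y).
hx = -sum(p * log2(p) for p in px.values())
hx_given_y = -sum(p * log2(p / py[y]) for (x, y), p in joint.items())
print(round(mi, 6), round(hx - hx_given_y, 6))  # equal, ~0.278 bits
```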

  14. Binomial and Poisson distributions Assume we perform n independent experiments (Bernoulli trials) with a probability p for "success" and 1-p for "failure". The distribution of the number of successes k, called the binomial distribution, is given by: p_n(k) = C(n,k) p^k (1-p)^(n-k). In the limit where n → ∞, p → 0 and np = λ (constant), we approximate p_n(k) by the Poisson distribution: p(k) = e^(-λ) λ^k / k!. The Poisson distribution is especially important in physical processes in time (or space) such as the firing rate of a neuron, etc.
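
A quick numeric sketch (with illustrative parameters) comparing the binomial distribution with its Poisson approximation when n is large and p is small:

```python
from math import comb, exp, factorial

n, p = 1000, 0.003           # large n, small p
lam = n * p                  # lambda = np = 3

def binom(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson(k):
    return exp(-lam) * lam**k / factorial(k)

for k in range(6):
    print(k, round(binom(k), 5), round(poisson(k), 5))  # the two columns nearly agree
```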

  15. The Normal distribution The one-dimensional normal distribution, characterized by the expectation μ and standard deviation σ, is: p(x) = 1/(σ√(2π)) · e^( -(x-μ)^2 / (2σ^2) ). This distribution is a good approximation for the binomial distribution where n is large (and np >> 0), in the "vicinity" of the mean μ. The normal distribution has desirable computational properties which we will meet later in the course. For example, the distribution of a sum of Gaussians (variables that distribute normally) is itself a normal distribution with simply calculated expectation and variance.
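
A sampling sketch (with invented parameters) of the sum-of-Gaussians property: the empirical mean and variance of X + Y come out close to μ1 + μ2 and σ1^2 + σ2^2:

```python
import random

n = 200_000
mu1, sigma1 = 1.0, 2.0
mu2, sigma2 = -3.0, 1.5

# Sum of two independent normal variables.
samples = [random.gauss(mu1, sigma1) + random.gauss(mu2, sigma2) for _ in range(n)]

mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
print(round(mean, 2))  # ~ -2.0  = mu1 + mu2
print(round(var, 2))   # ~  6.25 = sigma1**2 + sigma2**2
```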

  16. Important inequalities Markov's inequality: let X be a positive random variable with an expectation μ = E(X). Then for every ε > 0: P(X ≥ ε) ≤ μ/ε. The intuition is that as ε gets further away from the expectation, the probability of a sample with X ≥ ε decreases significantly. Substituting Y = (X - μ)^2 we get Chebyshev's inequality: for every ε > 0, P(|X - μ| ≥ ε) ≤ Var[X]/ε^2.
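
A numeric sketch checking both bounds on a simple distribution (the fair die is used here for illustration):

```python
# Fair die: X in {1,...,6}, mu = 3.5, Var = 35/12.
faces = range(1, 7)
mu = sum(x / 6 for x in faces)
var = sum((x - mu) ** 2 / 6 for x in faces)

# Markov: P(X >= 5) <= mu / 5
p_ge_5 = sum(1 / 6 for x in faces if x >= 5)
print(p_ge_5, "<=", mu / 5)                 # 0.333... <= 0.7

# Chebyshev: P(|X - mu| >= 2) <= Var / 2**2
p_dev_ge_2 = sum(1 / 6 for x in faces if abs(x - mu) >= 2)
print(p_dev_ge_2, "<=", var / 4)            # 0.333... <= ~0.729
```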

  17. Important inequalities Substituting the average of m independent samples (with expectation μ and variance σ^2) into Chebyshev's inequality we get: P(|S/m - μ| ≥ ε) ≤ σ^2/(m·ε^2). Does it remind you of anything? Chernoff's inequality is used to bound deviations from the expectation for a sum of Bernoulli variables: S = X1 + … + Xm where p(Xi = 1) = p. The additive bound (Hoeffding) is: P(S/m ≥ p + ε) ≤ e^(-2ε^2·m), and the multiplicative bounds: P(S ≥ (1+δ)pm) ≤ e^(-pmδ^2/3) and P(S ≤ (1-δ)pm) ≤ e^(-pmδ^2/2).
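
A simulation sketch (with illustrative parameters) comparing the empirical deviation probability of a Bernoulli sum with the additive Hoeffding bound quoted above:

```python
import random
from math import exp

m, p, eps = 100, 0.5, 0.1
trials = 20_000

# Empirical probability that the mean of m Bernoulli(p) variables exceeds p + eps.
exceed = 0
for _ in range(trials):
    s = sum(1 for _ in range(m) if random.random() < p)
    if s / m >= p + eps:
        exceed += 1

print("empirical:", exceed / trials)             # ~0.028 for these parameters
print("Hoeffding bound:", exp(-2 * eps**2 * m))  # ~0.135, a valid upper bound
```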
