1 / 27

# Clustering and Probability (Chap 7) - PowerPoint PPT Presentation

Clustering and Probability (Chap 7). Review from Last Lecture. Defined the K-means problem for formalizing the notion of clustering. Discussed the K-means algorithm. Noted that the K-means algorithm was “quite good” in discovering “concepts” from data (based on features).

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' Clustering and Probability (Chap 7)' - gwyn

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Clustering and Probability (Chap 7)

Defined the K-means problem for formalizing the notion of clustering.

Discussed the K-means algorithm.

Noted that the K-means algorithm was “quite good” in discovering “concepts” from data (based on features).

Noted the important distinction between “attributes” and “features”.

Let initial centroids be C1 = (1,1) and C2 = (2,1)

C1: = (1,1); C2 = ((2+3+4)/3, (1+4+5)/3)) = (3,3.33)

C1: = ((1+2)/2, (1+1)/2) = (1.5,1); C2 = ((3 +4)/2), (4+5)/2) = (3.5, 4.5)

C1: = ((1+2)/2, (1+1)/2) = (1.5,1); C2 = ((3 +4)/2), (4+5)/2) = (3.5, 4.5)

A(-1,2)

B(1,2)

c

c

4

(0,0)

C(-1,-2)

D(1,-2)

c

c

2

K-means Problem: Solution is (0,2) and (0,-2) and the clusters are {A,B} and

{C,D}

K-means Algorithm: Suppose the initial centroids are (-1,0) and (1,0) then

{A,C} and {B,D} end up as the two clusters.

How do you select the initial centroids?

How do you select the right number of clusters ?

How do you deal with non-Euclidean distance/similarity measures ?

Other approaches (like hierarchical, spectral etc.)

Curse of high-dimensionality.

What should the “prediction” be for the flower ?

• When we make predictions we should assign “probabilities” with the prediction.

• Examples:

• 20% chance it will rain tomorrow.

• 50% chance that the tumor is malignant.

• 60% chance that the stock market will fall by the end of the week.

• 30% that the next president of the United States will be a Democrat.

• 0.1% chance that the user will click on a banner-ad.

• How do we assign probabilities to complex events.. using smart data algorithms…and counting.

• Probability is a deep topic…..but for most cases the rules are straightforward to apply..

• Terminology

• Experiment

• Sample Space

• Events

• Probability

• Rules of probability

• Conditional Probability

• Bayes Rule

• Consider an experimentand let S be the space of possible outcomes.

• Example:

• Experiment is tossing a coin; S={h,t}

• Experiment is rolling a pair of dice: S={(1,1),(1,2),…(6,6)}

• Experiment is a race consisting of three cars: 1,2 and 3. The sample space is {(1,2,3),(1,3,2),(2,1,3),(2,3,1),(3,1,2),(3,2,1)}

Let Sample Space S = {1,2,…m}

Consider numbers

pi is the probability that the outcome of the experiment is i.

Suppose we toss a fair coin. Sample space is S={h,t}. Then ph = 0.5 and pt = 0.5.

• Experiment: Will it rain or not in Sydney : S = {rain, no-rain}

• Prain = 138/365 =0.38; Pno-rain = 227/365

• Assigning (or rather how to) probabilities is a deep philosophical problem.

• What is the probability that the “green object standing outside my house is a burglar dressed in green.”

• An Event A is a set of possible outcomes of the experiment. Thus A is a subset of S.

• Let A be the event of getting a seven when we roll a pair of dice.

• A = {(1,6),(6,1),(2,5),(5,2),(4,3),(3,4) }

• P(A) = 6/36 = 1/6

• In general

• The sample space S and events are “sets”.

• P(S) = 1;

• P(Φ) = 0

• Often

• Complement:

• Suppose the probability of raining today is 0.4 and tomorrow is also 0.4 and on both days is 0.1. What is the probability it does not rain on either day.

• S={(R,N), (R,R),(N,N),(N,R)}

• Let A be the event that it will rain today and B it will rain tomorrow. Then

• A ={(R,N), (R,R)} ; B={(N,R),(R,R)}

• Rain at least today or tomorrow:

• Will not rain on either day: 1 – 0.7 = 0.3

• One of the most important concepts in all of Data Mining and Machine Learning

• P(A|B) = P(AB)/P(B) ..assuming P(B) not equal 0.

• Conditional probability of A given B has occurred.

• Probability it will rain tomorrow given it has rained today.

• P(A|B) = P(AB)/(B) = 0.1/0.4 = ¼ = 0.25

• In general P(A|B) is not equal to P(B|A)

What should the “prediction” be for the flower ?

• P(A|B) = P(AB)/P(B); P(B|A) = P(BA)|P(A)

• Now P(AB) = P(BA)

• Thus P(A|B)P(B) = P(B|A)P(A)

• Thus P(A|B) = [P(B|A)P(A)]/[P(B)]

• This is called Bayes Rule

• Basis of almost all prediction

• Latest theories hypothesize that human memory and action is Bayes rule in action.

Prior

Posterior

The ASX market goes up 60% of the days of a year. 40% of the time it stays the same or goes down. The day the ASX is up, there is a 50% chance that the Shanghai Index is up. On other days there is 30% chance that Shanghai goes up. Suppose The Shanghai market is up. What is the probability that ASX was up.

Define A1 as “ASX is up”; A2 is “ASX is not up”

Define S1 as “Shanghai is up”; S2 is “Shanghai is not up”

We want to calculate P(A1|S1) ?

P(A1) = 0.6; P(A2) = 0.4;

P(S1|A1) = 0.5; P(S1|A2) = 0.3

P(S2|A1) = 1 – P(S1|A1) = 0.5;

P(S2|A2) = 1 –P(S1|A2) = 0.7;

We want to calculate P(A1|S1) ?

P(A1) = 0.6; P(A2) = 0.4;

P(S1|A1) = 0.5; P(S1|A2) = 0.3

P(S2|A1) = 1 – P(S1|A1) = 0.5;

P(S2|A2) = 1 –P(S1|A2) = 0.7;

P(A1|S1) = P(S1|A1)P(A1)/(P(S1))

How do we calculate P(S1) ?

P(S1) = P(S1,A1) + P(S1,A2) [Key Step]

= P(S1|A1)P(A1) + P(S1|A2)P(A2)

= 0.5 x 0.6 + 0.3 x 0.4

= 0.42

Finally,

P(A1|S1) = P(S1|A1)P(A1)/P(S1)

= (0.5 x 0.6)/0.42 = 0.71

F=Flower; SL=Sepal Length; SW = Sepal Width;

PL=Petal Length; PW =Petal Width

Data

choose the

maximum

• So how do we compute: P(Data|F=A) ?

• This is a a non-trivial question…[subject to much research]

• How many times does “Data” appear in the “database” when F=A.

• In this case “Data” is a 4-dimensional “data vector.” Each component takes 3 values (small, medium, large). Thus number of combinations 3^4 = 81.

• Conditional Independence

• P(Data|F=A) = P(SL=Large,SW=Small,PL=Medium,PW=Small|F=A)

• ~= P(SL=Large|F=A)P(SW=Small|F=A)P(PL=Medium|A)P(PW=Small|A)

• The above is an assumption to make the “computation easier.”

• Surprisingly evidence suggest that it works reasonably well in practice.

• This prediction method (which exploits conditional independence) is called “Naïve Bayes Classifier.”