Clustering and Probability (Chap 7)

1 / 27

# Clustering and Probability (Chap 7) - PowerPoint PPT Presentation

Clustering and Probability (Chap 7). Review from Last Lecture. Defined the K-means problem for formalizing the notion of clustering. Discussed the K-means algorithm. Noted that the K-means algorithm was “quite good” in discovering “concepts” from data (based on features).

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' Clustering and Probability (Chap 7)' - gwyn

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Clustering and Probability (Chap 7)

Review from Last Lecture

Defined the K-means problem for formalizing the notion of clustering.

Discussed the K-means algorithm.

Noted that the K-means algorithm was “quite good” in discovering “concepts” from data (based on features).

Noted the important distinction between “attributes” and “features”.

Example of K-means -1

Let initial centroids be C1 = (1,1) and C2 = (2,1)

Example of K-means-2

C1: = (1,1); C2 = ((2+3+4)/3, (1+4+5)/3)) = (3,3.33)

Example of K-means-3

C1: = ((1+2)/2, (1+1)/2) = (1.5,1); C2 = ((3 +4)/2), (4+5)/2) = (3.5, 4.5)

Example of K-means-4

C1: = ((1+2)/2, (1+1)/2) = (1.5,1); C2 = ((3 +4)/2), (4+5)/2) = (3.5, 4.5)

Example: 2 Clusters

A(-1,2)

B(1,2)

c

c

4

(0,0)

C(-1,-2)

D(1,-2)

c

c

2

K-means Problem: Solution is (0,2) and (0,-2) and the clusters are {A,B} and

{C,D}

K-means Algorithm: Suppose the initial centroids are (-1,0) and (1,0) then

{A,C} and {B,D} end up as the two clusters.

Several other issues regarding clustering

How do you select the initial centroids?

How do you select the right number of clusters ?

How do you deal with non-Euclidean distance/similarity measures ?

Other approaches (like hierarchical, spectral etc.)

Curse of high-dimensionality.

Question

What should the “prediction” be for the flower ?

Prediction and Probability
• When we make predictions we should assign “probabilities” with the prediction.
• Examples:
• 20% chance it will rain tomorrow.
• 50% chance that the tumor is malignant.
• 60% chance that the stock market will fall by the end of the week.
• 30% that the next president of the United States will be a Democrat.
• 0.1% chance that the user will click on a banner-ad.
• How do we assign probabilities to complex events.. using smart data algorithms…and counting.
Probability Basics
• Probability is a deep topic…..but for most cases the rules are straightforward to apply..
• Terminology
• Experiment
• Sample Space
• Events
• Probability
• Rules of probability
• Conditional Probability
• Bayes Rule
Probability: Sample Space
• Consider an experimentand let S be the space of possible outcomes.
• Example:
• Experiment is tossing a coin; S={h,t}
• Experiment is rolling a pair of dice: S={(1,1),(1,2),…(6,6)}
• Experiment is a race consisting of three cars: 1,2 and 3. The sample space is {(1,2,3),(1,3,2),(2,1,3),(2,3,1),(3,1,2),(3,2,1)}
Probabilities

Let Sample Space S = {1,2,…m}

Consider numbers

pi is the probability that the outcome of the experiment is i.

Suppose we toss a fair coin. Sample space is S={h,t}. Then ph = 0.5 and pt = 0.5.

Probability
• Experiment: Will it rain or not in Sydney : S = {rain, no-rain}
• Prain = 138/365 =0.38; Pno-rain = 227/365
• Assigning (or rather how to) probabilities is a deep philosophical problem.
• What is the probability that the “green object standing outside my house is a burglar dressed in green.”
Probability
• An Event A is a set of possible outcomes of the experiment. Thus A is a subset of S.
• Let A be the event of getting a seven when we roll a pair of dice.
• A = {(1,6),(6,1),(2,5),(5,2),(4,3),(3,4) }
• P(A) = 6/36 = 1/6
• In general
Probability
• The sample space S and events are “sets”.
• P(S) = 1;
• P(Φ) = 0
• Often
• Complement:
Example
• Suppose the probability of raining today is 0.4 and tomorrow is also 0.4 and on both days is 0.1. What is the probability it does not rain on either day.
• S={(R,N), (R,R),(N,N),(N,R)}
• Let A be the event that it will rain today and B it will rain tomorrow. Then
• A ={(R,N), (R,R)} ; B={(N,R),(R,R)}
• Rain at least today or tomorrow:
• Will not rain on either day: 1 – 0.7 = 0.3
Conditional Probability
• One of the most important concepts in all of Data Mining and Machine Learning
• P(A|B) = P(AB)/P(B) ..assuming P(B) not equal 0.
• Conditional probability of A given B has occurred.
• Probability it will rain tomorrow given it has rained today.
• P(A|B) = P(AB)/(B) = 0.1/0.4 = ¼ = 0.25
• In general P(A|B) is not equal to P(B|A)
We need conditional probability to answer….

What should the “prediction” be for the flower ?

Bayes Rule
• P(A|B) = P(AB)/P(B); P(B|A) = P(BA)|P(A)
• Now P(AB) = P(BA)
• Thus P(A|B)P(B) = P(B|A)P(A)
• Thus P(A|B) = [P(B|A)P(A)]/[P(B)]
• This is called Bayes Rule
• Basis of almost all prediction
• Latest theories hypothesize that human memory and action is Bayes rule in action.
Bayes Rule

Prior

Posterior

Bayes Rule: Example

The ASX market goes up 60% of the days of a year. 40% of the time it stays the same or goes down. The day the ASX is up, there is a 50% chance that the Shanghai Index is up. On other days there is 30% chance that Shanghai goes up. Suppose The Shanghai market is up. What is the probability that ASX was up.

Define A1 as “ASX is up”; A2 is “ASX is not up”

Define S1 as “Shanghai is up”; S2 is “Shanghai is not up”

We want to calculate P(A1|S1) ?

P(A1) = 0.6; P(A2) = 0.4;

P(S1|A1) = 0.5; P(S1|A2) = 0.3

P(S2|A1) = 1 – P(S1|A1) = 0.5;

P(S2|A2) = 1 –P(S1|A2) = 0.7;

Bayes Rule: Example

We want to calculate P(A1|S1) ?

P(A1) = 0.6; P(A2) = 0.4;

P(S1|A1) = 0.5; P(S1|A2) = 0.3

P(S2|A1) = 1 – P(S1|A1) = 0.5;

P(S2|A2) = 1 –P(S1|A2) = 0.7;

P(A1|S1) = P(S1|A1)P(A1)/(P(S1))

How do we calculate P(S1) ?

Bayes Rule: Example

P(S1) = P(S1,A1) + P(S1,A2) [Key Step]

= P(S1|A1)P(A1) + P(S1|A2)P(A2)

= 0.5 x 0.6 + 0.3 x 0.4

= 0.42

Finally,

P(A1|S1) = P(S1|A1)P(A1)/P(S1)

= (0.5 x 0.6)/0.42 = 0.71

Example: Iris Flower

F=Flower; SL=Sepal Length; SW = Sepal Width;

PL=Petal Length; PW =Petal Width

Data

choose the

maximum

Example: Iris Flower
• So how do we compute: P(Data|F=A) ?
• This is a a non-trivial question…[subject to much research]
• How many times does “Data” appear in the “database” when F=A.
• In this case “Data” is a 4-dimensional “data vector.” Each component takes 3 values (small, medium, large). Thus number of combinations 3^4 = 81.
Example: Iris Flower
• Conditional Independence
• P(Data|F=A) = P(SL=Large,SW=Small,PL=Medium,PW=Small|F=A)
• ~= P(SL=Large|F=A)P(SW=Small|F=A)P(PL=Medium|A)P(PW=Small|A)
• The above is an assumption to make the “computation easier.”
• Surprisingly evidence suggest that it works reasonably well in practice.
• This prediction method (which exploits conditional independence) is called “Naïve Bayes Classifier.”