Distribution Definitions • Discrete Probability Distribution • Continuous Probability Distribution • Cumulative Distribution Function
Discrete Distribution • A r.v. X is discrete if it takes countably many values {x1, x2, …} • The probability function or probability mass function for X is given by fX(x) = P(X = x) • A small worked example follows
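A minimal sketch of a probability mass function, using a fair six-sided die as a stand-in example (the die is an assumption, not the slide's original example). In Matlab:

x  = 1:6;            % the countable set of values {x1, ..., x6}
fX = ones(1, 6) / 6; % probability mass function: fX(x) = P(X = x) = 1/6
sum(fX)              % the probabilities sum to 1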
Continuous Distributions • A r.v. X is continuous if there exists a function fX ≥ 0, the probability density function (pdf), such that P(a ≤ X ≤ b) = ∫ab fX(x) dx for every a ≤ b, and fX integrates to 1 over the whole real line
Example: Continuous Distribution • Suppose X has the pdf fX(x) = 1 for 0 ≤ x ≤ 1 and fX(x) = 0 otherwise • This is the Uniform (0,1) distribution
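A quick numerical check of this density, assuming the Statistics Toolbox (the same toolbox the later binopdf/normcdf calls rely on). In Matlab:

unifpdf(0.3, 0, 1)                      % returns 1: inside the support [0, 1]
unifpdf(1.5, 0, 1)                      % returns 0: outside the support
integral(@(x) unifpdf(x, 0, 1), 0, 1)   % the density integrates to 1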
Binomial Distribution • A coin comes up Heads with probability p. Flip it n times and let X be the number of Heads. Assume the flips are independent. • Let f(x) = P(X = x); then f(x) = C(n, x) p^x (1 − p)^(n − x) for x = 0, 1, …, n, and f(x) = 0 otherwise, where C(n, x) is the binomial coefficient
Binomial Example • Let p = 0.5 and n = 5; then P(X = 4) = C(5, 4) (0.5)^4 (0.5)^1 = 0.15625 • In Matlab >> binopdf(4,5,0.5)
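The same number computed both ways, straight from the formula and via the toolbox function used on the slide:

p = 0.5; n = 5; x = 4;
nchoosek(n, x) * p^x * (1 - p)^(n - x)   % 0.1563, directly from the binomial formula
binopdf(x, n, p)                         % same value via the Statistics Toolbox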
Normal Distribution • X has a Normal (Gaussian) distribution with parameters μ and σ if its pdf is fX(x) = 1/(σ√(2π)) · exp(−(x − μ)²/(2σ²)) • X is standard Normal if μ = 0 and σ = 1. It is denoted by Z. • If X ~ N(μ, σ²) then Z = (X − μ)/σ ~ N(0, 1)
Normal Example • The number of spam emails received by an email server in a day follows a Normal distribution N(1000, 500) (mean 1000, standard deviation 500). What is the probability of receiving 2000 spam emails in a day? • Let X be the number of spam emails received in a day. We want P(X = 2000). • The answer is P(X = 2000) = 0, since X is continuous • It is more meaningful to ask for P(X ≥ 2000)
Normal Example • This is P(X ≥ 2000) = 1 − P(X < 2000) = 1 − F(2000) • In Matlab: >> 1 - normcdf(2000,1000,500) • The answer is 1 − 0.9772 = 0.0228, or 2.28% • This type of analysis is so common that there is a special name for the function F: the cumulative distribution function
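The same probability computed directly and after standardizing with Z = (X − 1000)/500:

1 - normcdf(2000, 1000, 500)   % P(X >= 2000) = 1 - F(2000) = 0.0228
1 - normcdf(2)                 % identical answer for the standardized P(Z >= 2)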
Conditional Independence • If A and B are independent then P(A|B) = P(A) • Multiplication rule: P(AB) = P(A|B)P(B) • Law of Total Probability: if A1, …, Ak partition the sample space, then P(B) = Σi P(B|Ai) P(Ai) (a small numerical sketch follows)
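A tiny numerical sketch of the law of total probability; the partition and the probabilities below are made-up numbers, not from the slides:

P_A         = [0.5 0.3 0.2];    % priors P(Ai) for a partition {A1, A2, A3}; they sum to 1
P_B_given_A = [0.9 0.4 0.1];    % conditional probabilities P(B | Ai)
P_B = sum(P_B_given_A .* P_A)   % law of total probability: P(B) = 0.59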
Question 1 • Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Answer to Question 1 • The quantity we want is the conditional probability P(G = F | D = Y). But what do G = F and D = Y mean? We have not even formally defined them.
Types of Clusterings • A clustering is a set of clusters • Important distinction between hierarchical and partitional sets of clusters • Partitional Clustering • A division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering • A set of nested clusters organized as a hierarchical tree
A Partitional Clustering • Figure: the original points and a partitional clustering of them
Hierarchical Clustering • Figures: a traditional hierarchical clustering with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram
K-means Clustering • Partitional clustering approach • Each cluster is associated with a centroid (center point) • Each point is assigned to the cluster with the closest centroid • Number of clusters, K, must be specified • The basic algorithm is very simple
K-means Clustering – Details • Initial centroids are often chosen randomly. • Clusters produced vary from one run to another. • The centroid is (typically) the mean of the points in the cluster. • ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. • K-means will converge for common similarity measures mentioned above. • Most of the convergence happens in the first few iterations. • Often the stopping condition is changed to ‘Until relatively few points change clusters’ • Complexity is O( n * K * I * d ) • n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
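A minimal MATLAB sketch of the basic algorithm under the choices above (random initial centroids, Euclidean closeness); the function and variable names are illustrative, and in practice the Statistics Toolbox functions kmeans and pdist2 (used below) would be the usual route:

function [labels, centroids] = simple_kmeans(X, K, maxIter)
% X is an n-by-d data matrix; K (the number of clusters) must be specified
n = size(X, 1);
centroids = X(randperm(n, K), :);        % initial centroids chosen randomly
for iter = 1:maxIter
    D = pdist2(X, centroids);            % n-by-K Euclidean distances
    [~, labels] = min(D, [], 2);         % assign each point to the closest centroid
    newCentroids = centroids;
    for k = 1:K
        if any(labels == k)
            newCentroids(k, :) = mean(X(labels == k, :), 1);   % centroid = mean of its points
        end
    end
    if isequal(newCentroids, centroids)  % stop when no centroid moves
        break
    end
    centroids = newCentroids;
end
end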
Evaluating K-means Clusters • Most common measure is Sum of Squared Error (SSE) • For each point, the error is the distance to the nearest cluster centroid • To get SSE, we square these errors and sum them: SSE = Σi Σ x∈Ci dist(mi, x)², where x is a data point in cluster Ci and mi is the representative point for cluster Ci • Can show that mi corresponds to the center (mean) of the cluster • Given two sets of clusters, we can choose the one with the smallest error • One easy way to reduce SSE is to increase K, the number of clusters • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
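Given the labels and centroids returned by the sketch above, SSE can be computed in a few lines (the subtraction relies on MATLAB's implicit expansion, R2016b or later):

SSE = 0;
for k = 1:K
    diffs = X(labels == k, :) - centroids(k, :);   % offsets from the cluster mean mi
    SSE = SSE + sum(sum(diffs .^ 2));              % add the squared distances for cluster Ci
end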
Hierarchical Clustering • Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram • A tree-like diagram that records the sequence of merges or splits
Strengths of Hierarchical Clustering • Do not have to assume any particular number of clusters • Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level • They may correspond to meaningful taxonomies • Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering • Two main types of hierarchical clustering • Agglomerative: • Start with the points as individual clusters • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left • Divisive: • Start with one, all-inclusive cluster • At each step, split a cluster until each cluster contains a single point (or there are k clusters) • Traditional hierarchical algorithms use a similarity or distance matrix • Merge or split one cluster at a time
Agglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm is straightforward • Compute the proximity matrix • Let each data point be a cluster • Repeat • Merge the two closest clusters • Update the proximity matrix • Until only a single cluster remains • Key operation is the computation of the proximity of two clusters • Different approaches to defining the distance between clusters distinguish the different algorithms
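With the Statistics Toolbox, this loop is what pdist/linkage/dendrogram implement; a minimal sketch on toy data (the 20 random 2-D points are just an illustration):

X = rand(20, 2);                      % toy data: 20 points in 2-D
D = pdist(X);                         % proximity (distance) matrix, condensed form
Z = linkage(D, 'single');             % agglomerative merges, single-link cluster proximity
dendrogram(Z);                        % the resulting tree of merges
labels = cluster(Z, 'maxclust', 3);   % 'cut' the dendrogram to obtain 3 clusters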
Missing Data • We think of clustering as a problem of estimating missing data. • The missing data are the cluster labels. • Clustering is only one example of a missing data problem. Several other problems can be formulated as missing data problems.
Missing Data Problem • Let D = {x(1),x(2),…x(n)} be a set of n observations. • Let H = {z(1),z(2),..z(n)} be a set of n values of a hidden variable Z. • z(i) corresponds to x(i) • Assume Z is discrete.
EM Algorithm • The log-likelihood of the observed data is l(θ) = log P(D | θ) = Σi log Σz P(x(i), z(i) = z | θ) • Not only do we have to estimate θ but also H • Let Q(H) be a probability distribution on the missing data.
EM Algorithm • l(θ) = log ΣH P(D, H | θ) = log ΣH Q(H) P(D, H | θ)/Q(H) ≥ ΣH Q(H) log [P(D, H | θ)/Q(H)] = F(Q, θ) • The inequality holds because of Jensen's inequality. This means that F(Q, θ) is a lower bound on l(θ) • Notice that the log of a sum has become a sum of logs
EM Algorithm • The EM Algorithm alternates between maximizing F with respect to Q with θ fixed (the E-step) and then maximizing F with respect to θ with Q fixed (the M-step).
EM Algorithm • It turns out that the E-step is just Q(H) = P(H | D, θ) • And, furthermore, with this choice the bound is tight: F(P(H | D, θ), θ) = l(θ) • Just plug in the current value of θ
EM Algorithm • Writing F(Q, θ) = ΣH Q(H) log P(D, H | θ) − ΣH Q(H) log Q(H), the M-step reduces to maximizing the first term with respect to θ, as there is no θ in the second term.
EM Algorithm for Mixture of Normals • Slide shows the mixture-of-Normals model together with its E-step (posterior responsibilities) and M-step (parameter updates)
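A minimal sketch of those two steps for a one-dimensional mixture of two Normals; the responsibilities R play the role of Q(H), normpdf assumes the Statistics Toolbox, and the toy data and initial values are arbitrary:

x = [randn(1, 100) - 2, randn(1, 100) + 2];      % toy data drawn from two components
K = 2;  n = numel(x);
mu = [-1 1];  sigma = [1 1];  p = [0.5 0.5];     % initial parameters theta
for iter = 1:100
    % E-step: responsibilities Q(z(i) = k) = P(z(i) = k | x(i), theta)
    R = zeros(K, n);
    for k = 1:K
        R(k, :) = p(k) * normpdf(x, mu(k), sigma(k));
    end
    R = R ./ sum(R, 1);                          % normalize over the K components (implicit expansion)
    % M-step: maximize the expected complete-data log-likelihood
    Nk = sum(R, 2)';                             % effective number of points per component
    for k = 1:K
        mu(k)    = sum(R(k, :) .* x) / Nk(k);
        sigma(k) = sqrt(sum(R(k, :) .* (x - mu(k)).^2) / Nk(k));
    end
    p = Nk / n;                                  % mixing proportions
end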
What is Association Rule Mining? • Association rule mining finds • combinations of items that typically occur together in a database (market-basket analysis) • Sequences of items that occur frequently (sequential analysis) in a database • Originally introduced for Market-basket analysis -- useful for analysing purchasing behaviour of customers.
Market-Basket Analysis – Examples • Where should strawberries be placed to maximize their sale? • Services purchased together by telecommunication customers (e.g. broadband Internet, call forwarding, etc.) help determine how to bundle these services together to maximize revenue • Unusual combinations of insurance claims can be a sign of fraud • Medical histories can give indications of complications based on combinations of treatments • Sport: analyzing game statistics (shots blocked, assists, and fouls) to gain a competitive advantage • “When player X is on the floor, player Y’s shot accuracy decreases from 75% to 30%” • Bhandari et al. (1997). Advanced Scout: data mining and knowledge discovery in NBA data, Data Mining and Knowledge Discovery, 1(1), pp. 121-125
Support and Confidence - Example • What is the support and confidence of the following rules? • {Beer} → {Bread} • {Bread, PeanutButter} → {Jelly} • support(X → Y) = support(X ∪ Y) • confidence(X → Y) = support(X ∪ Y)/support(X)
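The transaction table from the slide is not reproduced here, so the sketch below computes the two measures on a small made-up binary transaction matrix (rows = transactions, columns = items); the numbers are illustrative only:

%             Beer  Bread  Milk  PeanutButter  Jelly
T = logical([  1     1      0     0             0
               0     1      1     1             1
               1     1      1     0             0
               0     1      0     1             0 ]);
suppBeer      = mean(T(:, 1));            % support({Beer}) = fraction of transactions containing Beer
suppBeerBread = mean(T(:, 1) & T(:, 2));  % support({Beer} -> {Bread}) = support({Beer} u {Bread})
confBeerBread = suppBeerBread / suppBeer  % confidence({Beer} -> {Bread})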
Association Rule Mining Problem Definition • Given a set of transactions T = {t1, t2, …, tn} and two thresholds, minsup and minconf, • Find all association rules X → Y with support ≥ minsup and confidence ≥ minconf • i.e. we want rules with high confidence and support • We call these rules interesting • We would like to • Design an efficient algorithm for mining association rules in large data sets • Develop an effective approach for distinguishing interesting rules from spurious ones
Generating Association Rules – Approach 1 (Naïve) • Enumerate all possible rules and select those of them that satisfy the minimum support and confidence thresholds • Not practical for large databases • For a given dataset with m items, the total number of possible rules is 3^m − 2^(m+1) + 1 (Why?*) • And most of these will be discarded! • We need a strategy for rule generation -- generate only the promising rules • rules that are likely to be interesting, or, more accurately, don’t generate rules that can’t be interesting. *hint: use the inclusion-exclusion principle
Generating Association Rules – Approach 2 • What do these rules have in common? A,B → C; A,C → B; B,C → A • The support of a rule X → Y depends only on the support of its itemset X ∪ Y • Answer: they have the same support, support({A,B,C}) • Hence, a better approach: find frequent itemsets first, then generate the rules • A frequent itemset is an itemset that occurs at least minsup times (support ≥ minsup) • If an itemset is infrequent, all the rules that contain it will have support < minsup and there is no need to generate them
Generating Association Rules – Approach 2 • 2-step approach: Step 1: Generate frequent itemsets -- Frequent Itemset Mining (i.e. support ≥ minsup) • e.g. {A,B,C} is frequent (so A,B → C, A,C → B and B,C → A satisfy the minsup threshold). Step 2: From them, extract rules that satisfy the confidence threshold (i.e. confidence ≥ minconf) • e.g. maybe only A,B → C and C,B → A are confident • Step 1 is the computationally difficult part (the next slides explain why, and a way to reduce the complexity….)
Frequent Itemset Generation (Step 1) – Brute-Force Approach • Enumerate all possible itemsets and scan the dataset to calculate the support for each of them • Example: I = {a,b,c,d,e} • Figure: the search space showing superset/subset relationships • Given d items, there are 2^d − 1 possible (non-empty) candidate itemsets => not practical for large d
Frequent Itemset Generation (Step 1) -- Apriori Principle (1) • Any subset of a frequent itemset is also frequent • Example: If {c,d,e} is frequent then {c,d}, {c,e}, {d,e}, {c}, {d} and {e} are also frequent
Frequent Itemset Generation (Step 1) -- Apriori Principle (2) • If an itemset is not frequent, any superset of it is also not frequent • Example: If we know that {a,b} is infrequent, the entire sub-graph of its supersets can be pruned • i.e. {a,b,c}, {a,b,d}, {a,b,e}, {a,b,c,d}, {a,b,c,e}, {a,b,d,e} and {a,b,c,d,e} are infrequent
Recall the 2-Step process for Association Rule Mining • Step 1: Find all frequent itemsets (so far: main ideas and concepts, i.e. the Apriori principle; later: algorithms) • Step 2: Generate the association rules from the frequent itemsets.
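A level-wise sketch of Step 1 on the made-up transaction matrix T from the earlier support/confidence example: candidates of size k are built only by extending frequent itemsets of size k−1, which is where the Apriori principle prunes the search (an illustration, not an optimized Apriori implementation):

minsup = 0.5;                          % minimum support threshold
numItems = size(T, 2);
frequent = {};                         % all frequent itemsets found so far
level = num2cell(1:numItems);          % level-1 candidates: the single items
while ~isempty(level)
    keep = {};
    for c = 1:numel(level)
        items = level{c};
        if mean(all(T(:, items), 2)) >= minsup   % support of this candidate itemset
            keep{end+1} = items;
        end
    end
    frequent = [frequent, keep];       % record the frequent itemsets of this size
    level = {};                        % extend each one by a larger item to form the next level
    for c = 1:numel(keep)
        for j = (max(keep{c}) + 1):numItems
            level{end+1} = [keep{c}, j];
        end
    end
end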
ARGen Algorithm (Step 2) • Generates interesting rules from the frequent itemsets • Already know the rules are frequent (Why?), so we just need to check confidence. • ARGen algorithm:
for each frequent itemset F
    generate all non-empty proper subsets S of F
    for each s in S do
        if confidence(s → F−s) ≥ minconf then output rule s → F−s
end
Example: F = {a,b,c}, S = {{a,b}, {a,c}, {b,c}, {a}, {b}, {c}}; rules output: {a,b} → {c}, etc.
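A MATLAB sketch of this loop for a single frequent itemset F, again using the made-up matrix T above (where columns 2 and 4 are Bread and PeanutButter); nchoosek enumerates the antecedent subsets s:

minconf = 0.5;
F = [2 4];                                     % e.g. the frequent itemset {Bread, PeanutButter}
suppF = mean(all(T(:, F), 2));                 % support of F -- already known to be >= minsup
for k = 1:numel(F) - 1                         % sizes of the non-empty proper subsets s
    subsets = nchoosek(F, k);                  % each row is one antecedent s
    for r = 1:size(subsets, 1)
        s = subsets(r, :);
        conf = suppF / mean(all(T(:, s), 2));  % confidence(s -> F-s) = support(F)/support(s)
        if conf >= minconf
            fprintf('%s -> %s (conf = %.2f)\n', mat2str(s), mat2str(setdiff(F, s)), conf);
        end
    end
end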
ARGen - Example • minsup = 30%, minconf = 50% • The set of frequent itemsets is L = {{Beer}, {Bread}, {Milk}, {PeanutButter}, {Bread, PeanutButter}} • Only the last itemset in L has two items, so only it can produce rules; its non-empty proper subsets, {Bread} and {PeanutButter}, are both frequent • => 2 rules will be generated
Bayes Classifier • A probabilistic framework for solving classification problems • Conditional probability: P(C | A) = P(A, C)/P(A) and P(A | C) = P(A, C)/P(C) • Bayes theorem: P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem • Given: • A doctor knows that meningitis causes stiff neck 50% of the time • Prior probability of any patient having meningitis is 1/50,000 • Prior probability of any patient having stiff neck is 1/20 • If a patient has stiff neck, what’s the probability he/she has meningitis?
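Plugging the slide's numbers into Bayes theorem (plain arithmetic, no toolbox needed):

P_S_given_M = 0.5;                        % P(stiff neck | meningitis)
P_M = 1/50000;                            % prior probability of meningitis
P_S = 1/20;                               % prior probability of stiff neck
P_M_given_S = P_S_given_M * P_M / P_S     % Bayes theorem: 0.0002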