Distribution Definitions • Discrete Probability Distribution • Continuous Probability Distribution • Cumulative Distribution Function
Discrete Distribution • A r.v. X is discrete if it takes countably many values {x1, x2, …} • The probability function or probability mass function for X is given by fX(x) = P(X = x) • A small worked example follows
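A minimal sketch of a probability mass function, using a fair six-sided die as a stand-in example (the die is an assumption, not the slide's original example). In Matlab:

x  = 1:6;            % the countable set of values {x1, ..., x6}
fX = ones(1, 6) / 6; % probability mass function: fX(x) = P(X = x) = 1/6
sum(fX)              % the probabilities sum to 1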
Continuous Distributions • A r.v. X is continuous if there exists a function fX ≥ 0, the probability density function (pdf), such that P(a ≤ X ≤ b) = ∫ab fX(x) dx for every a ≤ b, and fX integrates to 1 over the whole real line
Example: Continuous Distribution • Suppose X has the pdf fX(x) = 1 for 0 ≤ x ≤ 1 and fX(x) = 0 otherwise • This is the Uniform (0,1) distribution
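A quick numerical check of this density, assuming the Statistics Toolbox (the same toolbox the later binopdf/normcdf calls rely on). In Matlab:

unifpdf(0.3, 0, 1)                      % returns 1: inside the support [0, 1]
unifpdf(1.5, 0, 1)                      % returns 0: outside the support
integral(@(x) unifpdf(x, 0, 1), 0, 1)   % the density integrates to 1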
Binomial Distribution • A coin comes up Heads with probability p. Flip it n times and let X be the number of Heads. Assume the flips are independent. • Let f(x) = P(X = x); then f(x) = C(n, x) p^x (1 − p)^(n − x) for x = 0, 1, …, n, and f(x) = 0 otherwise, where C(n, x) is the binomial coefficient
Binomial Example • Let p = 0.5 and n = 5; then P(X = 4) = C(5, 4) (0.5)^4 (0.5)^1 = 0.15625 • In Matlab >> binopdf(4,5,0.5)
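The same number computed both ways, straight from the formula and via the toolbox function used on the slide:

p = 0.5; n = 5; x = 4;
nchoosek(n, x) * p^x * (1 - p)^(n - x)   % 0.1563, directly from the binomial formula
binopdf(x, n, p)                         % same value via the Statistics Toolbox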
Normal Distribution • X has a Normal (Gaussian) distribution with parameters μ and σ if its pdf is fX(x) = 1/(σ√(2π)) · exp(−(x − μ)²/(2σ²)) • X is standard Normal if μ = 0 and σ = 1. It is denoted by Z. • If X ~ N(μ, σ²) then Z = (X − μ)/σ ~ N(0, 1)
Normal Example • The number of spam emails received by an email server in a day follows a Normal distribution N(1000, 500) (mean 1000, standard deviation 500). What is the probability of receiving 2000 spam emails in a day? • Let X be the number of spam emails received in a day. We want P(X = 2000). • The answer is P(X = 2000) = 0, since X is continuous • It is more meaningful to ask for P(X ≥ 2000)
Normal Example • This is P(X ≥ 2000) = 1 − P(X < 2000) = 1 − F(2000) • In Matlab: >> 1 - normcdf(2000,1000,500) • The answer is 1 − 0.9772 = 0.0228, or 2.28% • This type of analysis is so common that there is a special name for the function F: the cumulative distribution function
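The same probability computed directly and after standardizing with Z = (X − 1000)/500:

1 - normcdf(2000, 1000, 500)   % P(X >= 2000) = 1 - F(2000) = 0.0228
1 - normcdf(2)                 % identical answer for the standardized P(Z >= 2)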
Conditional Independence • If A and B are independent then P(A|B) = P(A) • Multiplication rule: P(AB) = P(A|B)P(B) • Law of Total Probability: if A1, …, Ak partition the sample space, then P(B) = Σi P(B|Ai) P(Ai) (a small numerical sketch follows)
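A tiny numerical sketch of the law of total probability; the partition and the probabilities below are made-up numbers, not from the slides:

P_A         = [0.5 0.3 0.2];    % priors P(Ai) for a partition {A1, A2, A3}; they sum to 1
P_B_given_A = [0.9 0.4 0.1];    % conditional probabilities P(B | Ai)
P_B = sum(P_B_given_A .* P_A)   % law of total probability: P(B) = 0.59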
Question 1 • Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Answer to Question 1 • The quantity we want is the conditional probability P(G = F | D = Y). But what do G = F and D = Y mean? We have not even formally defined them.
Types of Clusterings • A clustering is a set of clusters • Important distinction between hierarchical and partitional sets of clusters • Partitional Clustering • A division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering • A set of nested clusters organized as a hierarchical tree
A Partitional Clustering • Figure: the original points and a partitional clustering of them
Hierarchical Clustering • Figures: a traditional hierarchical clustering with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram
K-means Clustering • Partitional clustering approach • Each cluster is associated with a centroid (center point) • Each point is assigned to the cluster with the closest centroid • Number of clusters, K, must be specified • The basic algorithm is very simple
K-means Clustering – Details • Initial centroids are often chosen randomly. • Clusters produced vary from one run to another. • The centroid is (typically) the mean of the points in the cluster. • ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. • K-means will converge for common similarity measures mentioned above. • Most of the convergence happens in the first few iterations. • Often the stopping condition is changed to ‘Until relatively few points change clusters’ • Complexity is O( n * K * I * d ) • n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
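A minimal MATLAB sketch of the basic algorithm under the choices above (random initial centroids, Euclidean closeness); the function and variable names are illustrative, and in practice the Statistics Toolbox functions kmeans and pdist2 (used below) would be the usual route:

function [labels, centroids] = simple_kmeans(X, K, maxIter)
% X is an n-by-d data matrix; K (the number of clusters) must be specified
n = size(X, 1);
centroids = X(randperm(n, K), :);        % initial centroids chosen randomly
for iter = 1:maxIter
    D = pdist2(X, centroids);            % n-by-K Euclidean distances
    [~, labels] = min(D, [], 2);         % assign each point to the closest centroid
    newCentroids = centroids;
    for k = 1:K
        if any(labels == k)
            newCentroids(k, :) = mean(X(labels == k, :), 1);   % centroid = mean of its points
        end
    end
    if isequal(newCentroids, centroids)  % stop when no centroid moves
        break
    end
    centroids = newCentroids;
end
end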
Evaluating K-means Clusters • Most common measure is Sum of Squared Error (SSE) • For each point, the error is the distance to the nearest cluster centroid • To get SSE, we square these errors and sum them: SSE = Σi Σ x∈Ci dist(mi, x)², where x is a data point in cluster Ci and mi is the representative point for cluster Ci • Can show that mi corresponds to the center (mean) of the cluster • Given two sets of clusters, we can choose the one with the smallest error • One easy way to reduce SSE is to increase K, the number of clusters • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
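Given the labels and centroids returned by the sketch above, SSE can be computed in a few lines (the subtraction relies on MATLAB's implicit expansion, R2016b or later):

SSE = 0;
for k = 1:K
    diffs = X(labels == k, :) - centroids(k, :);   % offsets from the cluster mean mi
    SSE = SSE + sum(sum(diffs .^ 2));              % add the squared distances for cluster Ci
end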
Hierarchical Clustering • Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram • A tree-like diagram that records the sequence of merges or splits
Strengths of Hierarchical Clustering • Do not have to assume any particular number of clusters • Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level • They may correspond to meaningful taxonomies • Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering • Two main types of hierarchical clustering • Agglomerative: • Start with the points as individual clusters • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left • Divisive: • Start with one, all-inclusive cluster • At each step, split a cluster until each cluster contains a single point (or there are k clusters) • Traditional hierarchical algorithms use a similarity or distance matrix • Merge or split one cluster at a time
Agglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm is straightforward • Compute the proximity matrix • Let each data point be a cluster • Repeat • Merge the two closest clusters • Update the proximity matrix • Until only a single cluster remains • Key operation is the computation of the proximity of two clusters • Different approaches to defining the distance between clusters distinguish the different algorithms
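With the Statistics Toolbox, this loop is what pdist/linkage/dendrogram implement; a minimal sketch on toy data (the 20 random 2-D points are just an illustration):

X = rand(20, 2);                      % toy data: 20 points in 2-D
D = pdist(X);                         % proximity (distance) matrix, condensed form
Z = linkage(D, 'single');             % agglomerative merges, single-link cluster proximity
dendrogram(Z);                        % the resulting tree of merges
labels = cluster(Z, 'maxclust', 3);   % 'cut' the dendrogram to obtain 3 clusters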
Missing Data • We think of clustering as a problem of estimating missing data. • The missing data are the cluster labels. • Clustering is only one example of a missing data problem. Several other problems can be formulated as missing data problems.
Missing Data Problem • Let D = {x(1),x(2),…x(n)} be a set of n observations. • Let H = {z(1),z(2),..z(n)} be a set of n values of a hidden variable Z. • z(i) corresponds to x(i) • Assume Z is discrete.
EM Algorithm • The log-likelihood of the observed data is l(θ) = log P(D | θ) = Σi log Σz P(x(i), z(i) = z | θ) • Not only do we have to estimate θ but also H • Let Q(H) be a probability distribution on the missing data.
EM Algorithm • l(θ) = log ΣH P(D, H | θ) = log ΣH Q(H) P(D, H | θ)/Q(H) ≥ ΣH Q(H) log [P(D, H | θ)/Q(H)] = F(Q, θ) • The inequality holds because of Jensen's inequality. This means that F(Q, θ) is a lower bound on l(θ) • Notice that the log of a sum has become a sum of logs
EM Algorithm • The EM Algorithm alternates between maximizing F with respect to Q with θ fixed (the E-step) and then maximizing F with respect to θ with Q fixed (the M-step).
EM Algorithm • It turns out that the E-step is just Q(H) = P(H | D, θ) • And, furthermore, with this choice the bound is tight: F(P(H | D, θ), θ) = l(θ) • Just plug in the current value of θ
EM Algorithm • Writing F(Q, θ) = ΣH Q(H) log P(D, H | θ) − ΣH Q(H) log Q(H), the M-step reduces to maximizing the first term with respect to θ, as there is no θ in the second term.
EM Algorithm for Mixture of Normals • Slide shows the mixture-of-Normals model together with its E-step (posterior responsibilities) and M-step (parameter updates)
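A minimal sketch of those two steps for a one-dimensional mixture of two Normals; the responsibilities R play the role of Q(H), normpdf assumes the Statistics Toolbox, and the toy data and initial values are arbitrary:

x = [randn(1, 100) - 2, randn(1, 100) + 2];      % toy data drawn from two components
K = 2;  n = numel(x);
mu = [-1 1];  sigma = [1 1];  p = [0.5 0.5];     % initial parameters theta
for iter = 1:100
    % E-step: responsibilities Q(z(i) = k) = P(z(i) = k | x(i), theta)
    R = zeros(K, n);
    for k = 1:K
        R(k, :) = p(k) * normpdf(x, mu(k), sigma(k));
    end
    R = R ./ sum(R, 1);                          % normalize over the K components (implicit expansion)
    % M-step: maximize the expected complete-data log-likelihood
    Nk = sum(R, 2)';                             % effective number of points per component
    for k = 1:K
        mu(k)    = sum(R(k, :) .* x) / Nk(k);
        sigma(k) = sqrt(sum(R(k, :) .* (x - mu(k)).^2) / Nk(k));
    end
    p = Nk / n;                                  % mixing proportions
end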
What is Association Rule Mining? • Association rule mining finds • combinations of items that typically occur together in a database (market-basket analysis) • Sequences of items that occur frequently (sequential analysis) in a database • Originally introduced for Market-basket analysis -- useful for analysing purchasing behaviour of customers.
Market-Basket Analysis – Examples • Where should strawberries be placed to maximize their sale? • Services purchased together by telecommunication customers (e.g. broadband Internet, call forwarding, etc.) help determine how to bundle these services together to maximize revenue • Unusual combinations of insurance claims can be a sign of fraud • Medical histories can give indications of complications based on combinations of treatments • Sport: analyzing game statistics (shots blocked, assists, and fouls) to gain a competitive advantage • “When player X is on the floor, player Y’s shot accuracy decreases from 75% to 30%” • Bhandari et al. (1997). Advanced Scout: data mining and knowledge discovery in NBA data, Data Mining and Knowledge Discovery, 1(1), pp. 121-125
Support and Confidence - Example • What is the support and confidence of the following rules? • {Beer} → {Bread} • {Bread, PeanutButter} → {Jelly} • support(X → Y) = support(X ∪ Y) • confidence(X → Y) = support(X ∪ Y)/support(X)
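The transaction table from the slide is not reproduced here, so the sketch below computes the two measures on a small made-up binary transaction matrix (rows = transactions, columns = items); the numbers are illustrative only:

%             Beer  Bread  Milk  PeanutButter  Jelly
T = logical([  1     1      0     0             0
               0     1      1     1             1
               1     1      1     0             0
               0     1      0     1             0 ]);
suppBeer      = mean(T(:, 1));            % support({Beer}) = fraction of transactions containing Beer
suppBeerBread = mean(T(:, 1) & T(:, 2));  % support({Beer} -> {Bread}) = support({Beer} u {Bread})
confBeerBread = suppBeerBread / suppBeer  % confidence({Beer} -> {Bread})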
Association Rule Mining Problem Definition • Given a set of transactions T = {t1, t2, …, tn} and two thresholds, minsup and minconf, • Find all association rules X → Y with support ≥ minsup and confidence ≥ minconf • i.e. we want rules with high confidence and support • We call these rules interesting • We would like to • Design an efficient algorithm for mining association rules in large data sets • Develop an effective approach for distinguishing interesting rules from spurious ones
Generating Association Rules – Approach 1 (Naïve) • Enumerate all possible rules and select those of them that satisfy the minimum support and confidence thresholds • Not practical for large databases • For a given dataset with m items, the total number of possible rules is 3^m − 2^(m+1) + 1 (Why?*) • And most of these will be discarded! • We need a strategy for rule generation -- generate only the promising rules • rules that are likely to be interesting, or, more accurately, don’t generate rules that can’t be interesting. *hint: use the inclusion-exclusion principle
Generating Association Rules – Approach 2 • What do these rules have in common? A,B → C; A,C → B; B,C → A • The support of a rule X → Y depends only on the support of its itemset X ∪ Y • Answer: they have the same support, support({A,B,C}) • Hence, a better approach: find frequent itemsets first, then generate the rules • A frequent itemset is an itemset that occurs at least minsup times (support ≥ minsup) • If an itemset is infrequent, all the rules that contain it will have support < minsup and there is no need to generate them
Generating Association Rules – Approach 2 • 2-step approach: Step 1: Generate frequent itemsets -- Frequent Itemset Mining (i.e. support ≥ minsup) • e.g. {A,B,C} is frequent (so A,B → C, A,C → B and B,C → A satisfy the minsup threshold). Step 2: From them, extract rules that satisfy the confidence threshold (i.e. confidence ≥ minconf) • e.g. maybe only A,B → C and C,B → A are confident • Step 1 is the computationally difficult part (the next slides explain why, and a way to reduce the complexity….)
Frequent Itemset Generation (Step 1) – Brute-Force Approach • Enumerate all possible itemsets and scan the dataset to calculate the support for each of them • Example: I = {a,b,c,d,e} • Figure: the search space showing superset/subset relationships • Given d items, there are 2^d − 1 possible (non-empty) candidate itemsets => not practical for large d
Frequent Itemset Generation (Step 1) -- Apriori Principle (1) • Any subset of a frequent itemset is also frequent • Example: If {c,d,e} is frequent then {c,d}, {c,e}, {d,e}, {c}, {d} and {e} are also frequent
Frequent Itemset Generation (Step 1) -- Apriori Principle (2) • If an itemset is not frequent, any superset of it is also not frequent • Example: If we know that {a,b} is infrequent, the entire sub-graph of its supersets can be pruned • i.e. {a,b,c}, {a,b,d}, {a,b,e}, {a,b,c,d}, {a,b,c,e}, {a,b,d,e} and {a,b,c,d,e} are infrequent
Recall the 2-Step process for Association Rule Mining • Step 1: Find all frequent itemsets (so far: main ideas and concepts, i.e. the Apriori principle; later: algorithms) • Step 2: Generate the association rules from the frequent itemsets.
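A level-wise sketch of Step 1 on the made-up transaction matrix T from the earlier support/confidence example: candidates of size k are built only by extending frequent itemsets of size k−1, which is where the Apriori principle prunes the search (an illustration, not an optimized Apriori implementation):

minsup = 0.5;                          % minimum support threshold
numItems = size(T, 2);
frequent = {};                         % all frequent itemsets found so far
level = num2cell(1:numItems);          % level-1 candidates: the single items
while ~isempty(level)
    keep = {};
    for c = 1:numel(level)
        items = level{c};
        if mean(all(T(:, items), 2)) >= minsup   % support of this candidate itemset
            keep{end+1} = items;
        end
    end
    frequent = [frequent, keep];       % record the frequent itemsets of this size
    level = {};                        % extend each one by a larger item to form the next level
    for c = 1:numel(keep)
        for j = (max(keep{c}) + 1):numItems
            level{end+1} = [keep{c}, j];
        end
    end
end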
ARGen Algorithm (Step 2) • Generates interesting rules from the frequent itemsets • Already know the rules are frequent (Why?), so we just need to check confidence. • ARGen algorithm:
for each frequent itemset F
    generate all non-empty proper subsets S of F
    for each s in S do
        if confidence(s → F−s) ≥ minconf then output rule s → F−s
end
Example: F = {a,b,c}, S = {{a,b}, {a,c}, {b,c}, {a}, {b}, {c}}; rules output: {a,b} → {c}, etc.
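A MATLAB sketch of this loop for a single frequent itemset F, again using the made-up matrix T above (where columns 2 and 4 are Bread and PeanutButter); nchoosek enumerates the antecedent subsets s:

minconf = 0.5;
F = [2 4];                                     % e.g. the frequent itemset {Bread, PeanutButter}
suppF = mean(all(T(:, F), 2));                 % support of F -- already known to be >= minsup
for k = 1:numel(F) - 1                         % sizes of the non-empty proper subsets s
    subsets = nchoosek(F, k);                  % each row is one antecedent s
    for r = 1:size(subsets, 1)
        s = subsets(r, :);
        conf = suppF / mean(all(T(:, s), 2));  % confidence(s -> F-s) = support(F)/support(s)
        if conf >= minconf
            fprintf('%s -> %s (conf = %.2f)\n', mat2str(s), mat2str(setdiff(F, s)), conf);
        end
    end
end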
ARGen - Example • minsup = 30%, minconf = 50% • The set of frequent itemsets is L = {{Beer}, {Bread}, {Milk}, {PeanutButter}, {Bread, PeanutButter}} • Only the last itemset in L has two items, so only it can produce rules; its non-empty proper subsets, {Bread} and {PeanutButter}, are both frequent • => 2 rules will be generated
Bayes Classifier • A probabilistic framework for solving classification problems • Conditional probability: P(C | A) = P(A, C)/P(A) and P(A | C) = P(A, C)/P(C) • Bayes theorem: P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem • Given: • A doctor knows that meningitis causes stiff neck 50% of the time • Prior probability of any patient having meningitis is 1/50,000 • Prior probability of any patient having stiff neck is 1/20 • If a patient has stiff neck, what’s the probability he/she has meningitis?
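Plugging the slide's numbers into Bayes theorem (plain arithmetic, no toolbox needed):

P_S_given_M = 0.5;                        % P(stiff neck | meningitis)
P_M = 1/50000;                            % prior probability of meningitis
P_S = 1/20;                               % prior probability of stiff neck
P_M_given_S = P_S_given_M * P_M / P_S     % Bayes theorem: 0.0002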