

Data Mining and Knowledge Acquisition — Chapter 5 III —. BIS 541, 2011-2012 Spring. Chapter 7: Classification and Prediction. Topics: Bayesian Classification, Model-Based Reasoning, Collaborative Filtering, Classification Accuracy.


**Data Mining and Knowledge Acquisition — Chapter 5 III —**
BIS 541, 2011-2012 Spring

**Chapter 7. Classification and Prediction**
• Bayesian Classification
• Model-Based Reasoning
• Collaborative Filtering
• Classification accuracy

**Bayesian Classification: Why?**
• Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
• Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
• Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

**Bayesian Theorem: Basics**
• Let X be a data sample whose class label is unknown
• Let H be the hypothesis that X belongs to class C
• For classification, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (the initial probability before any data is observed; reflects background knowledge)
• P(X): prior probability that the sample data is observed
• P(X|H): probability of observing sample X given that the hypothesis holds (the likelihood)

**Bayesian Theorem**
• Given training data X, the posterior probability of a hypothesis H follows Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
• Informally: posterior = likelihood × prior / evidence
• MAP (maximum a posteriori) hypothesis: choose the H that maximizes P(H|X)
• Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost

**Naïve Bayes Classifier**
• A simplifying assumption: attributes are conditionally independent given the class
• The probability of observing, say, two attribute values x1 and x2 together, given the current class C, is the product of the probabilities of each value taken separately, given the same class:
P([x1, x2] | C) = P(x1 | C) · P(x2 | C)
• No dependence relation between attributes
• Greatly reduces the computation cost: only the class distributions need to be counted
• Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) · P(Ci)

**Example**
• H: X is an apple
• P(H): prior probability that X is an apple
• X: observed data — round and red
• P(H|X): probability that X is an apple, given that we observe it is red and round
• P(X|H): likelihood that a sample is red and round, given that it is an apple
• P(X): prior probability that a sample is red and round

**Applying Bayes' Theorem**
• P(H|X) = P(H,X) / P(X), and similarly P(X|H) = P(H,X) / P(H)
• Hence P(H,X) = P(X|H) P(H), so
P(H|X) = P(X|H) P(H) / P(X)
• That is, P(H|X) is calculated from P(X|H), P(H), and P(X)

**Bayesian Classification**
• The classification problem may be formalized using a posteriori probabilities:
• P(Ci|X) = probability that the sample tuple X = <x1, …, xk> is of class Ci; there are m classes Ci, i = 1 to m
• E.g. P(class=N | outlook=sunny, windy=true, …)
• Idea: assign to sample X the class label Ci such that P(Ci|X) is maximal:
P(Ci|X) > P(Cj|X) for all 1 ≤ j ≤ m, j ≠ i

**Estimating A Posteriori Probabilities**
• Bayes' theorem: P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
• P(X) is constant for all classes
• P(Ci) = relative frequency of class Ci samples
• The Ci maximizing P(Ci|X) is the Ci maximizing P(X|Ci) · P(Ci)
• Problem: computing P(X|Ci) directly is infeasible!

**Naïve Bayesian Classification**
• Naïve assumption: attribute independence, so
P(x1, …, xk | Ci) = P(x1|Ci) · … · P(xk|Ci)
• If the i-th attribute is categorical: P(xi|Ci) is estimated as the relative frequency of samples having value xi for the i-th attribute within class Ci, i.e. sik / si
• If the i-th attribute is continuous: P(xi|Ci) is estimated through a Gaussian density function
• Computationally easy in both cases

**Training Dataset**
Classes:
• C1: buys_computer = 'yes'
• C2: buys_computer = 'no'
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

**Given the New Customer**
What is the probability of buying a computer for X = (age <= 30, income = medium, student = yes, credit_rating = fair)?
Compute P(buys_computer = yes | X) and P(buys_computer = no | X).
Decision: report both probabilities, or choose the class with the maximum conditional probability.

**Compute**
P(buys_computer = yes | X) = P(X | yes) · P(yes) / P(X)
P(buys_computer = no | X) = P(X | no) · P(no) / P(X)
P(X) is common to both, so drop it.
Decision: maximum of
• P(X | yes) · P(yes)
• P(X | no) · P(no)

**Naïve Bayesian Classifier: Example**
• Compute P(X|Ci) for each class:
P(X | yes) · P(yes) =
P(age="<=30" | buys_computer="yes") ·
P(income="medium" | buys_computer="yes") ·
P(student="yes" | buys_computer="yes") ·
P(credit_rating="fair" | buys_computer="yes") ·
P(buys_computer="yes")

P(X | no) · P(no) =
P(age="<=30" | buys_computer="no") ·
P(income="medium" | buys_computer="no") ·
P(student="yes" | buys_computer="no") ·
P(credit_rating="fair" | buys_computer="no") ·
P(buys_computer="no")

**Naïve Bayesian Classifier: Example (continued)**
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
P(buys_computer="yes") = 9/14 = 0.643
P(buys_computer="no") = 5/14 = 0.357

P(X | buys_computer="yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer="yes") · P(buys_computer="yes") = 0.044 × 0.643 = 0.028
P(X | buys_computer="no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X | buys_computer="no") · P(buys_computer="no") = 0.019 × 0.357 = 0.007
X belongs to class "buys_computer = yes"

**Class Probabilities**
• P(yes|X) = P(X|yes) · P(yes) / P(X)
• P(no|X) = P(X|no) · P(no) / P(X)
• What is P(X)?
P(X) = P(X|yes) · P(yes) + P(X|no) · P(no) = 0.028 + 0.007 = 0.035
• So P(yes|X) = 0.028 / 0.035 ≈ 0.8 and P(no|X) = 0.007 / 0.035 ≈ 0.2
• Hence P(yes|X) + P(no|X) = 1

**Naïve Bayesian Classifier: Comments**
• Advantages:
• Easy to implement
• Good results obtained in most cases
• Disadvantages:
• Assumes class-conditional independence, with a corresponding loss of accuracy
• In practice, dependencies exist among variables
• E.g., hospital patients — profile: age, family history, etc.; symptoms: fever, cough, etc.; disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by the naïve Bayesian classifier
• How to deal with these dependencies?
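Before turning to dependencies, the worked buys_computer example above can be sketched directly in code. A minimal sketch, with the counts hard-coded from the slides' 14-record training set (`counts` and `score` are illustrative names, not from the slides):

```python
# Naive Bayes on the buys_computer example: counts taken from the slides.
# P(x_i | C) is estimated as (count of x_i in class C) / (class size).
counts = {
    "yes": {"n": 9, "age<=30": 2, "income=medium": 4, "student=yes": 6, "credit=fair": 6},
    "no":  {"n": 5, "age<=30": 3, "income=medium": 2, "student=yes": 1, "credit=fair": 2},
}
total = 14  # records in the training set

def score(cls, attrs):
    """Unnormalized posterior P(X|C) * P(C) under the independence assumption."""
    c = counts[cls]
    p = c["n"] / total          # prior P(C)
    for a in attrs:
        p *= c[a] / c["n"]      # conditional P(a|C)
    return p

X = ["age<=30", "income=medium", "student=yes", "credit=fair"]
s_yes, s_no = score("yes", X), score("no", X)
print(round(s_yes, 3), round(s_no, 3))   # ~0.028 vs ~0.007, as on the slides
print("yes" if s_yes > s_no else "no")   # predicted class
# Normalizing: P(yes|X) = s_yes / (s_yes + s_no) ≈ 0.8
```

Dropping P(X) and comparing the two unnormalized scores, as the slides do, gives the same decision as comparing the normalized posteriors.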
• Bayesian Belief Networks

**Bayesian Networks**
• A Bayesian belief network allows a subset of the variables to be conditionally independent
• A graphical model of causal relationships:
• Represents dependencies among the variables
• Gives a specification of the joint probability distribution
• Nodes: random variables; links: dependencies
• Example graph over nodes X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P
• The graph has no loops or cycles

**Bayesian Belief Network: An Example**
• Variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea
• The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents, FamilyHistory and Smoker:

|     | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S) |
|-----|---------|----------|----------|-----------|
| LC  | 0.8     | 0.5      | 0.7      | 0.1       |
| ~LC | 0.2     | 0.5      | 0.3      | 0.9       |

**Learning Bayesian Networks**
• Several cases:
• Network structure known and all variables observable: learn only the CPTs
• Network structure known, some variables hidden: gradient descent, analogous to neural network learning
• Network structure unknown, all variables observable: search the model space to reconstruct the graph topology
• Structure unknown, all variables hidden: no good algorithms are known for this case
• Reference: D. Heckerman, Bayesian networks for data mining

**Chapter 7. Classification and Prediction**
• Bayesian Classification
• Model-Based Reasoning
• Collaborative Filtering
• Classification accuracy

**Other Classification Methods**
• k-nearest neighbor classifier
• Case-based reasoning
• Genetic algorithms
• Rough set approach
• Fuzzy set approaches

**Instance-Based Methods**
• Instance-based learning: store training examples and delay processing ("lazy evaluation") until a new instance must be classified
• Typical approaches:
• k-nearest neighbor: instances represented as points in a Euclidean space
• Locally weighted regression: constructs a local approximation
• Case-based reasoning: uses symbolic representations and knowledge-based inference

**Nearest Neighbor Approaches**
Based on the concept of similarity:
• Memory-Based Reasoning (MBR): results are based on analogous situations in the past
• Collaborative Filtering: results use preferences in addition to analogous situations from the past

**Memory-Based Reasoning (MBR)**
• Our ability to reason from experience depends on our ability to recognize appropriate examples from the past:
• Traffic patterns/routes
• Movies
• Food
• We identify similar examples and apply what we know or have learned to the current situation
• In MBR, these similar examples are referred to as neighbors

**MBR Applications**
• Fraud detection
• Customer response prediction
• Medical treatments
• Classifying responses: MBR can process free-text responses and assign codes

**MBR Strengths**
• Ability to use data "as is": utilizes both a distance function and a combination function between data records to determine how "neighborly" they are
• Ability to adapt: adding new data makes it possible for MBR to learn new things
• Good results without lengthy training

**MBR Example — Rents in Tuxedo, NY**
• Classify nearest neighbors based on descriptive variables — population and median home prices (not geography in this example)
• The range midpoints in the two neighbors are $1,000 and $1,250, so the Tuxedo rent estimate is $1,125; a second method yields a rent of $977
• The actual midpoint rent in Tuxedo turns out to be $1,250 by one method and $907 by the other

**MBR Challenges**
• Choosing appropriate historical data for use in training
• Choosing the most efficient way to represent the training data
• Choosing the distance function, the combination function, and the number of neighbors

**Distance Function**
• For numerical variables:
• Absolute value of the difference: |A − B|; e.g. d(27, 51) = |27 − 51| = 24
• Square of the difference: (A − B)²; e.g. d(27, 51) = (27 − 51)² = 24² = 576
• Normalized absolute value: |A − B| / (maximum difference); e.g. d(27, 51) = 24/25 = 0.96
• Standardized absolute value: |A − B| / standard deviation
• For categorical variables (similar to clustering), e.g. gender:
d(male, male) = 0, d(female, female) = 0, d(male, female) = 1, d(female, male) = 1

**Combining Distances Between Variables**
• Manhattan (summation): dsum(A,B) = dgender(A,B) + dsalary(A,B) + dage(A,B)
• Normalized summation: dsum(A,B) / max dsum
• Euclidean: deuc(A,B) = sqrt(dgender(A,B)² + dsalary(A,B)² + dage(A,B)²)

**The Combination Function**
• For categorical target variables:
• Voting: majority rule
• Weighted voting, with weights inversely proportional to distance
• For numerical target variables:
• Take the average
• Weighted average, with weights inversely proportional to distance

**Collaborative Filtering**
• Lots of human examples of this:
• Best teachers, best courses, best restaurants (ambiance, service, food, price)
• Recommending a dentist, mechanic, PC repair shop, blank CDs/DVDs, wines, B&Bs, etc.
• CF is a variant of MBR particularly well suited to personalized recommendations

**Collaborative Filtering (continued)**
• Starts with a history of people's personal preferences
• Uses a distance function: people who like the same things are "close"
• Uses "votes" weighted by distance, so close neighbors' votes count more
• Essentially, the judgments of a peer group are what matter

• Knowing that lots of people liked something is not sufficient; who liked it is also important:
• A friend whose past recommendations were good (or bad)
• A high-profile person's opinion seems to carry influence
• Collaborative filtering automates this everyday word-of-mouth activity

**Preparing Recommendations for Collaborative Filtering**
• Build the new customer's profile: ask the customer to rate a selection of items
• Compare this new profile to other customers' profiles using some measure of similarity
• Use some combination of the ratings from similar customers to predict what the new customer would select among items he or she has NOT yet rated

**Collaborative Filtering Example**
• What rating would Nathaniel give to Planet of the Apes?
• Simon, at distance 2, rated it −1
• Amelia, at distance 4, rated it −4
• Using a weighted average with weights inverse to distance, he is predicted to rate it −2:
(0.5 × (−1) + 0.25 × (−4)) / (0.5 + 0.25) = −2
• Nathaniel can enter his own rating after seeing the movie, which may be close to or far from the prediction

**Chapter 7. Classification and Prediction**
• Bayesian Classification
• Model-Based Reasoning
• Collaborative Filtering
• Classification accuracy

**Holdout Estimation**
• What to do if the amount of data is limited?
• The holdout method reserves a certain amount for testing and uses the remainder for training
• Usually one third for testing, the rest for training
• Problem: the samples might not be representative; for example, a class might be missing from the test data
• An advanced version uses stratification, which ensures that each class is represented in approximately equal proportions in both subsets

**Repeated Holdout Method**
• The holdout estimate can be made more reliable by repeating the process with different subsamples
• In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
• The error rates from the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
• Still not optimal: the different test sets overlap
• Can we prevent overlapping?

**Cross-Validation**
• Cross-validation avoids overlapping test sets:
• First step: split the data into k subsets of equal size
• Second step: use each subset in turn for testing and the remainder for training
• This is called k-fold cross-validation
• Often the subsets are stratified before the cross-validation is performed
• The error estimates are averaged to yield an overall error estimate

**More on Cross-Validation**
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten?
• Extensive experiments have shown that this is the best choice for getting an accurate estimate
• There is also some theoretical evidence for this
• Stratification reduces the estimate's variance
• Even better: repeated stratified cross-validation, e.g. ten-fold cross-validation repeated ten times with the results averaged (further reduces the variance)

**Leave-One-Out Cross-Validation**
• Leave-one-out is a particular form of cross-validation:
• Set the number of folds to the number of training instances
• I.e., for n training instances, build the classifier n times
• Makes the best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: nearest neighbor)

**Leave-One-Out CV and Stratification**
• Disadvantage of leave-one-out CV: stratification is not possible
• It guarantees a non-stratified sample, because there is only one instance in the test set!
• Extreme example: a random dataset split equally into two classes
• The best inducer predicts the majority class: 50% accuracy on fresh data
• The leave-one-out CV estimate is 100% error!

**The Bootstrap**
• CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set
• The bootstrap uses sampling with replacement to form the training set:
• Sample a dataset of n instances n times with replacement to form a new dataset of n instances
• Use this data as the training set
• Use the instances from the original dataset that do not occur in the new training set for testing
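The evaluation schemes above reduce to simple index bookkeeping. A minimal sketch of k-fold splitting and a bootstrap training sample (function names are illustrative, not from the slides; stratification is omitted for brevity):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint test folds (k-fold cross-validation)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]   # fold i tests on idx[i::k], trains on the rest

def bootstrap_split(n, seed=0):
    """Sampling with replacement: training set of size n; unpicked rows form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    test = [i for i in range(n) if i not in set(train)]
    return train, test

folds = k_fold_indices(100, 10)
assert sorted(i for f in folds for i in f) == list(range(100))  # folds partition the data

train, test = bootstrap_split(1000)
# On average about 36.8% of rows are left out of a bootstrap sample ((1 - 1/n)^n -> e^-1),
# which is why the held-out instances make a natural test set.
print(len(test) / 1000)
```

Each fold here is used once for testing while the other k − 1 folds train the model, exactly as described in the cross-validation slides.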

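The distance and combination functions from the nearest-neighbor slides can likewise be sketched. This is a sketch only: the fields (gender, salary, age), the normalization constants, and the target accessor are illustrative assumptions, not values from the slides.

```python
import math

def dist(a, b, max_sal=100_000, max_age=100):
    """Per-variable distances as on the slides: 0/1 for categorical,
    normalized absolute difference for numeric, combined Euclidean-style."""
    d_gender = 0.0 if a["gender"] == b["gender"] else 1.0
    d_salary = abs(a["salary"] - b["salary"]) / max_sal
    d_age = abs(a["age"] - b["age"]) / max_age
    return math.sqrt(d_gender**2 + d_salary**2 + d_age**2)

def predict(target_of, neighbors, x, k=3):
    """Combination function for a numeric target: weighted average of the
    k nearest neighbors, weights inversely proportional to distance."""
    ranked = sorted(neighbors, key=lambda n: dist(x, n))[:k]
    w = [1.0 / (dist(x, n) + 1e-9) for n in ranked]  # epsilon avoids division by zero
    return sum(wi * target_of(n) for wi, n in zip(w, ranked)) / sum(w)

# Hypothetical records in the spirit of the rent example:
neighbors = [
    {"gender": "m", "salary": 50_000, "age": 30, "rent": 1000},
    {"gender": "f", "salary": 60_000, "age": 40, "rent": 1250},
    {"gender": "m", "salary": 55_000, "age": 35, "rent": 1100},
]
x = {"gender": "m", "salary": 52_000, "age": 32}
r = predict(lambda n: n["rent"], neighbors, x, k=2)
print(r)  # a weighted average of the two nearest neighbors' rents
```

For a categorical target, the same `ranked`/`w` machinery would feed weighted voting instead of a weighted average.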