
Statistical Techniques for Data Mining

Learn about linear regression, logistic regression, Bayes classifier, agglomerative clustering, and conceptual clustering in data mining.


Presentation Transcript


  1. Chapter 11 Statistical Techniques

  2. Chapter Objectives
  • Understand when linear regression is an appropriate data mining technique.
  • Know how to perform linear regression with Microsoft Excel’s LINEST function.
  • Know that logistic regression can be used to build supervised learner models for datasets having a binary outcome.
  • Understand how Bayes classifier is able to build supervised models for datasets having categorical data, numeric data, or a combination of both data types.

  3. Chapter Objectives
  • Know how agglomerative clustering is applied to partition data instances into disjoint clusters.
  • Understand that conceptual clustering is an unsupervised data mining technique that builds a concept hierarchy to partition data instances.
  • Know that the EM algorithm uses a statistical parameter adjustment technique to cluster data instances.
  • Understand the basic features that differentiate statistical and machine learning data mining methods.

  4.–8. Linear Regression Analysis (figures only; no text transcribed)
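The linear regression slides were not transcribed, but the least-squares fit that Excel’s LINEST performs can be sketched in plain Python with NumPy. The toy dataset below is invented for illustration; the sketch also computes the coefficient of determination (r²) defined in the key terms:

```python
import numpy as np

# Hypothetical dataset: estimate y from two independent variables x1, x2.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([9.1, 7.9, 19.2, 17.8, 26.1])

# Append a column of ones so the model includes an intercept,
# mirroring LINEST's default behavior of fitting a constant term.
A = np.column_stack([X, np.ones(len(X))])
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
b1, b2, b0 = coeffs

# Coefficient of determination: how much of the variation in the
# dependent variable the linear combination explains.
y_hat = A @ coeffs
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"y ≈ {b1:.3f}*x1 + {b2:.3f}*x2 + {b0:.3f}, r^2 = {r_squared:.3f}")
```

Because the toy data lies close to the plane y = 2·x1 + 3·x2 + 1, the recovered coefficients land near those values and r² is close to 1, illustrating the "nearly linear relationship" condition stated in the chapter summary.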

  9.–10. Logistic Regression (figures only; no text transcribed)
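As the chapter summary notes, logistic regression bounds the output between 0 and 1 so it can represent a probability of class membership. A minimal sketch, with made-up one-dimensional data; gradient descent on the log-loss is one common way to fit the coefficients, not necessarily the procedure the book uses:

```python
import numpy as np

def sigmoid(z):
    # Bounds any real value to (0, 1) so the output can be read as a
    # conditional probability of class membership.
    z = np.clip(z, -30.0, 30.0)  # guard against overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical binary-outcome data: low x tends to class 0, high x to class 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Fit slope w and intercept b by gradient descent on the log-loss.
w, b = 0.0, 0.0
learning_rate = 0.5
for _ in range(2000):
    p = sigmoid(w * x + b)
    w -= learning_rate * np.mean((p - y) * x)
    b -= learning_rate * np.mean(p - y)

p_low = sigmoid(w * 1.0 + b)   # small probability of class 1
p_high = sigmoid(w * 6.0 + b)  # large probability of class 1
print(p_low, p_high)
```

Unlike a straight-line regression, both predictions stay inside (0, 1) no matter how extreme x becomes, which is exactly the value restriction the summary says linear regression cannot honor.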

  11.–13. Bayes Classifier (figures only; no text transcribed)
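Bayes classifier applies Bayes theorem under the assumption that attributes are independent and equally important. A minimal sketch on a made-up categorical dataset (in practice a Laplace correction is usually added so unseen attribute values don’t force a zero posterior):

```python
from collections import Counter

# Hypothetical categorical training set: each instance is
# (outlook, windy) with a yes/no class label.
data = [
    ("sunny",    "true",  "no"),
    ("sunny",    "false", "no"),
    ("rain",     "false", "yes"),
    ("rain",     "true",  "no"),
    ("overcast", "false", "yes"),
    ("overcast", "true",  "yes"),
]

classes = Counter(label for *_, label in data)
n = len(data)

def likelihood(attr_index, value, label):
    # P(attribute = value | class = label), estimated by counting.
    in_class = [row for row in data if row[-1] == label]
    matches = sum(1 for row in in_class if row[attr_index] == value)
    return matches / len(in_class)

def posterior(instance):
    # Bayes theorem with the naive independence assumption:
    # P(C | E) is proportional to P(C) * product of P(E_i | C).
    scores = {}
    for label, count in classes.items():
        p = count / n  # a priori probability of the class
        for i, value in enumerate(instance):
            p *= likelihood(i, value, label)
        scores[label] = p
    total = sum(scores.values())
    return {label: p / total for label, p in scores.items()}

result = posterior(("sunny", "true"))
print(result)
```

In this toy set every "sunny" instance is labeled "no", so the posterior for "no" dominates; the zero likelihood of "sunny" under "yes" is the situation a Laplace correction would soften.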

  14.–19. Clustering Algorithms (figures only; no text transcribed)
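The chapter summary describes agglomerative clustering as starting with one cluster per instance and repeatedly merging the most similar pair. The sketch below uses simple Euclidean distance and single-link merging; both are just one of the several similarity and merging options the summary mentions, and the data points are invented:

```python
import math

def euclidean(a, b):
    # Simple Euclidean distance between two real-valued instances.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, target_clusters):
    # Each instance initially represents its own cluster; each
    # iteration merges the closest pair (single-link: distance
    # between the nearest members of the two clusters).
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.1)]
clusters = agglomerate(points, 3)
print(clusters)
```

Stopping at three clusters groups the two pairs of nearby points and leaves the outlier alone; in a full run the merging would continue until a single cluster remained, with the best level of the hierarchy chosen afterward.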

  20. Heuristics or Statistics?
  • Here is one way to categorize inductive problem-solving methods:
  • Query and visualization techniques
  • Machine learning techniques
  • Statistical techniques
  • Query and visualization techniques generally fall into one of three groups:
  • Query tools
  • OLAP tools
  • Visualization tools

  21. Chapter Summary Data mining techniques come in many shapes and forms. A favorite statistical technique for estimation and prediction problems is linear regression. Linear regression attempts to model the variation in a dependent variable as a linear combination of one or more independent variables. Linear regression is an appropriate data mining strategy when the relationship between the dependent and independent variables is nearly linear. Microsoft Excel’s LINEST function provides an easy mechanism for performing multiple linear regression.

  22. Chapter Summary Linear regression is a poor choice when the outcome is binary. The problem lies in the fact that the value restriction placed on the dependent variable is not observed by the regression equation. That is, because linear regression produces a straight-line function, values of the dependent variable are unbounded in both the positive and negative directions. For the two-outcome case, logistic regression is a better choice. Logistic regression is a nonlinear regression technique that associates a conditional probability value with each data instance.

  23. Chapter Summary Bayes classifier offers a simple yet powerful supervised classification technique. The model assumes all input attributes to be of equal importance and independent of one another. Even though these assumptions are likely to be false, Bayes classifier still works quite well in practice. Bayes classifier can be applied to datasets containing both categorical and numeric data. Also, unlike many statistical classifiers, Bayes classifier can be applied to datasets containing a wealth of missing items.

  24. Chapter Summary Agglomerative clustering is a favorite unsupervised clustering technique. Agglomerative clustering begins by assuming each data instance represents its own cluster. Each iteration of the algorithm merges the most similar pair of clusters. Several options for computing instance and cluster similarity scores and cluster merging procedures exist. Also, when the data to be clustered is real-valued, defining a measure of instance similarity can be a challenge. One common approach is to use simple Euclidean distance. A widespread application of agglomerative clustering is its use as a prelude to other clustering techniques.

  25. Chapter Summary Conceptual clustering is an unsupervised technique that incorporates incremental learning to form a hierarchy of concepts. The concept hierarchy takes the form of a tree structure where the root node represents the highest level of concept generalization. Conceptual clustering systems are particularly appealing because the trees they form have been shown to consistently determine psychologically preferred levels in human classification hierarchies. Also, conceptual clustering systems lend themselves well to explaining their behavior. A major problem with conceptual clustering systems is that instance ordering can have a marked impact on the results of the clustering. A nonrepresentative ordering of data instances can lead to a less than optimal clustering.
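Conceptual clustering systems typically score candidate partitions with the category utility measure defined in the key terms: the gain in expected correct attribute-value predictions from placing instances in clusters rather than treating them as one group. A minimal, non-incremental sketch of that score (real systems apply it incrementally while building the hierarchy; the example partitions are invented):

```python
from collections import Counter

def category_utility(clusters):
    # clusters: a partition given as a list of clusters; each cluster
    # is a list of instances, each instance a tuple of categorical
    # attribute values.
    all_instances = [inst for c in clusters for inst in c]
    n = len(all_instances)
    n_attrs = len(all_instances[0])

    def sq_sum(instances):
        # Sum over attributes and values of P(A_i = v)^2 within the
        # given group: the expected fraction of correct guesses.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(inst[i] for inst in instances)
            total += sum((c / len(instances)) ** 2 for c in counts.values())
        return total

    baseline = sq_sum(all_instances)  # guessing with no clusters at all
    gain = sum((len(c) / n) * (sq_sum(c) - baseline) for c in clusters)
    return gain / len(clusters)      # averaged over the clusters

# A partition that separates the two attribute patterns scores higher
# than one that mixes them.
good = [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]
bad  = [[("a", "x"), ("b", "y")], [("a", "x"), ("b", "y")]]
print(category_utility(good), category_utility(bad))
```

The "good" partition makes every attribute value perfectly predictable from cluster membership, so its score is positive; the "bad" partition predicts no better than the baseline and scores zero.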

  26. Chapter Summary The EM (expectation-maximization) algorithm is a statistical technique that makes use of the finite Gaussian mixtures model. The mixtures model assigns each individual data instance a probability that it would have a certain set of attribute values given it was a member of a specified cluster. The model assumes all attributes to be independent random variables. The EM algorithm is similar to the K-Means procedure in that a set of parameters is recomputed until a desired convergence value is achieved. A lack of explanation about what has been discovered is a problem with EM as it is with many clustering systems. Applying a supervised model to analyze the results of an unsupervised clustering is one technique to help explain the results of an EM clustering.
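The mixtures-model description above can be sketched for the simplest case: a one-dimensional, two-component Gaussian mixture, alternating an expectation step (soft cluster assignments) with a maximization step (parameter re-estimation). The data, initial guesses, and iteration count are invented for illustration:

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and std dev sigma.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical 1-D data: one group near 0, another near 10.
data = [-0.3, 0.1, 0.4, -0.1, 9.8, 10.2, 9.9, 10.1]

# Initial parameter guesses for a two-component mixture.
mu = [1.0, 8.0]
sigma = [1.0, 1.0]
weight = [0.5, 0.5]

for _ in range(50):
    # E step: probability each instance belongs to each component.
    resp = []
    for x in data:
        p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M step: recompute parameters from the soft assignments,
    # repeating until the estimates settle (cf. the K-Means analogy).
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
        sigma[k] = max(math.sqrt(var), 1e-3)  # floor to avoid collapse
        weight[k] = nk / len(data)

print(mu)  # component means settle near the two groups
```

As with K-Means, the loop simply recomputes parameters until they stabilize; unlike K-Means, each instance keeps a probability of membership in every cluster rather than a hard assignment.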

  27. Key Terms A priori probability. The probability a hypothesis is true lacking evidence to support or reject the hypothesis. Agglomerative clustering. An unsupervised technique where each data instance initially represents its own cluster. Successive iterations of the algorithm merge pairs of highly similar clusters until all instances become members of a single cluster. In the last step, a decision is made about which clustering is the best final result. Basic-level nodes. The nodes in a concept hierarchy that represent concepts easily identified by humans.

  28. Key Terms Bayes classifier. A supervised learning approach that classifies new instances by using Bayes theorem. Bayes theorem. The probability of a hypothesis given some evidence is equal to the probability of the evidence given the hypothesis, times the probability of the hypothesis, divided by the probability of the evidence. Bayesian Information Criterion (BIC). The BIC gives the posterior odds for one data mining model against another model assuming neither model is favored initially.

  29. Key Terms Category utility. An unsupervised evaluation function that measures the gain in the “expected number” of correct attribute-value predictions for a specific object if it were placed within a given category or cluster. Coefficient of determination. For a regression analysis, the correlation between actual and estimated values for the dependent variable. Concept hierarchy. A tree structure where each node of the tree represents a concept at some level of abstraction. Nodes toward the top of the tree are the most general. Leaf nodes represent individual data instances.

  30. Key Terms Conceptual clustering. An incremental unsupervised clustering method that creates a concept hierarchy from a set of input instances. Conditional probability. The conditional probability of evidence E given hypothesis H, denoted by P(E | H), is the probability E is true given H is true. Incremental learning. A form of learning that is supported in an unsupervised environment where instances are presented sequentially. As each new instance is seen, the learning model is modified to reflect the addition of the new instance.

  31. Key Terms Linear regression. A statistical technique that models the variation in a numeric dependent variable as a linear combination of one or several independent variables. Logistic regression. A nonlinear regression technique for problems having a binary outcome. A created regression equation limits the values of the output attribute to values between 0 and 1. This allows output values to represent a probability of class membership.

  32. Key Terms Logit. The natural logarithm of the odds ratio p(y = 1 | x)/[1 − p(y = 1 | x)], where p(y = 1 | x) is the conditional probability that the outcome y is 1 given the feature vector x. Mixture. A set of n probability distributions where each distribution represents a cluster. Model tree. A decision tree where each leaf node contains a linear regression equation.

  33. Key Terms Regression. The process of developing an expression that predicts a numeric output value. Regression tree. A decision tree where leaf nodes contain averaged numeric values. Simple linear regression. A regression equation with a single independent variable. Slope-intercept form. A linear equation of the form y = ax + b where a is the slope of the line and b is the y-intercept.

  34. THE END
