1 / 27

Data Mining (and machine learning)

Data Mining (and machine learning). DM Lecture 3: Basic Statistics for data miners. Overview of My Lectures. All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html 25/9 Overview of DM (and of these 8 lectures)

maryortiz
Download Presentation

Data Mining (and machine learning)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Mining(and machine learning) DM Lecture 3: Basic Statistics for data miners David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  2. Overview of My Lectures All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html • 25/9 Overview of DM (and of these 8 lectures) • 02/10:     Data Cleaning - usually a necessary first step for large amounts of data • 09/10  Basic Statistics for Data Miners - essential knowledge, and very useful • 16/10 Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry • NO THURSDAY LECTURE OCTOBER 23rd • 30/10 Cluster Analysis and Clustering - simple algs that tell you much about the data • NO THURSDAY LECTURE November 6th • 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data • 20/11: Regression - the simplest algorithm for predicting data/class values • 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  3. Today you will see The mostimportant theorem in science David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  4. Statistical Data Mining • Definitions • – Population, Sample, Statistic • • Simple Statistics • – Mean, Mode, Median • – Range, Variance, Standard Deviation • • Probability Distributions • – Normal distribution David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  5. Fundamental Statistics Definitions • A Population is the total collection of all items/individuals/events under consideration • • A Sample is that part of a population which has been • observed or selected for analysis • E.g. all students is a population. Students at HWU is a sample; this class is a sample, etc … • • A Statistic is a measure which can be computed to describe a characteristic of the sample (e.g. the sample mean) • The reason for doing this is almost always to estimate (i.e. make a good guess) things about that characteristic in the population David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  6. E.g. • This class is a sample from the population of students at HWU • (it can also be considered as a sample of other populations – like what?) • One statistic of this sample is your mean weight. Suppose that is 65Kg. I.e. this is the sample mean. • Is 65Kg a good estimate for the mean weight of the population? • Another statistic: suppose 10% of you are married. Is this a good estimate for the proportion that are married in the population? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  7. Some Simple Statistics • The Mean (average) is the sum of the values in a sample divided by the number of values • The Median is the midpoint of the values in a sample (50% above; 50% below) after they have been ordered (e.g. from the smallest to the largest) • The Mode is the value that appears most frequently in a sample • The Range is the difference between the smallest and largest values in a sample • The Variance is a measure of the dispersion of the values in a sample – how closely the observations cluster around the mean of the sample • The Standard Deviation is the square root of the variance of a sample David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  8. Standard Deviation and other `moment’s • The m-th moment about the mean (μ) of a sample is: Where n is the number of items in the sample. • The first moment (m = 1) is 0! • The second moment (m = 2) is the variance • (and: square root of the variance is the standard deviation) • The third moment can be used in tests for skewness • The fourth moment can be used in tests for kurtosis David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  9. Distributions / Histograms A Normal (aka Gaussian) distribution (image from Mathworld) David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  10. Distributions / Histograms Uniform distributions. Every possible value tends to be equally likely David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  11. Probability Distributions • If a population is expected to match a standard probability distribution then a wealth of statistical knowledge and results can be brought to bear on its analysis • Many standard statistical techniques are based on the assumption that the underlying distribution of a population is Normal (Gaussian) • Statistical tests have been developed to determine whether a sampled population is normally distributed David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  12. An important aside … This is the standard deviation of a sample This is slightly different, called the sample standard deviation Std is square root of Sample Std is square root of David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  13. A closer look at the normal distribution This is the ND with mean mu and std sigma David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  14. More than just a pretty bell shape Suppose mean of your sample is 1.8; and suppose std of your sample is 0.12 Theory tells us that if a population is Normal, the sample std is a fairly good guess at the population std So, we can say with some confidence, for example, that 99.7% of the populationlies between 1.44 and 2.16 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  15. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  16. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  17. The Central Limit Theorem Sir Francis Galton (Natural Inheritance, 1889) described the Central Limit Theorem as: “I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (from the wikipedia article) David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  18. the more tosses of the • coin in each expt, the more • the closer the distribution • of heads is to a Normal • distribution. • Same with : • dist of sum of two dice • dists of heights, weights, • hours watching TV, etc … (from the wikipedia article) David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  19. The Central Limit Theorem is this: • As more and more samples are taken from a population the distribution of the sample means conforms to a normal distribution • • The average of the samples more and more closely approximates the average of the entire population • • A very powerful and useful theorem • • The normal distribution is such a common and useful distribution that additional statistics have been developed to measure how closely a population conforms to it and to test for divergence from it due to skewness and kurtosis David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  20. Remember, MUCH of science relies on making guesses about populations The CLT helps us make the guesses reasonable rather than crazy. Assuming normal dist, the stats of a sample tells us lots about the stats of the population And, assuming normal dist helps us detect errors and outliers – how? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  21. Testing for Normality:the χ2 goodness-of-fit test This is the classic test of whether a data sample is normally distributed or not • We first group our data into k classes so that we can form a frequency distribution (the number of data items in each class) • We calculate the mean and standard deviation of our sample and define a normal distribution based on these values. • We now need to see if the number of data items in each of our classes matches the number predicted by the normal distribution David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  22. The normal distribution - with mean mu and std sigma This tells you how to calculate the probability (frequency) for any value x David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  23. The goodness of fit test simply measures the difference between the bars and the curve – adding up the squared difference for each bar. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  24. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  25. We can also test for skewness and kurtosis, using higher order moments David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  26. The take-home lesson (for those new to statistics) Your data contains 100 values for x, and you have good reason to believe that x is normally distributed. Thanks to the Central Limit Theorem, you can: • Make a lot of good estimates about the statistics of the population • Find outliers and spot other problems in the data It’s better to test for Normality though, and also test for skewness and kurtosis, so that you can say: “probably around 0.3% of people use their mobile for >8 hrs per day, although the sample is somewhat skewed to the left so this may be an underestimate …” David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  27. Next week – an actual Data Mining Algorithm! David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

More Related