1 / 80

Data Mining (and machine learning)

Data Mining (and machine learning). Lecture 2: Some common data processing tasks. Today. About classification A simple classification method (1-NN) Normalisation of instances and fields Discretization of fields Coursework B. Some notation. A dataset has N records {R 1 ,R 2 ,…,R N },

barid
Download Presentation

Data Mining (and machine learning)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Mining(and machine learning) Lecture 2: Some common data processing tasks David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  2. Today • About classification • A simple classification method (1-NN) • Normalisation of instances and fields • Discretization of fields • Coursework B David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  3. Some notation A dataset has N records {R1,R2,…,RN}, … each of which has F fields {A1,A2,…,AF}. Field AF is the target field – the one we want to predict. v(i,j) is the value of field Aj in record Ri So, for example, what is David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  4. Some more terminology Sometimes the values of a field are numeric. We call this a numeric field (!) Sometimes the values of a field are words or categories. E.g. {red, blue, green} {true, false} {male, female} {low, medium, high} this type of field is called a categorical field, or a nominal field. David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  5. So, what is this mostly about? You have some data: David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  6. So, what is this mostly about? You have some data: You get a new line of data, but it is not classified David Corne Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  7. So, what is this mostly about? You have some data: You get a new line of data, but it is not classified You need to predict the class value David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  8. Broadly how it’s done In order to be able to predict the class value of any new line of data that comes along, this is what happens: David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  9. Broadly how it’s done In order to be able to predict the class value of any new line of data that comes along, this is what happens: Get your already-classified dataset Run a machine learning method on these data, which produces a classifier Use the classifier to make predictions of the class value for new instances David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  10. E.g. here is some data we saw before If we run a machine learning method like RIPPER, the classifer that comes out is a rule (or a set of rules), e.g.: If (A1<5) AND (A5<2) THEN class is A If (A2>1) THEN class is B David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  11. E.g. here is some data we saw before If instead we do machine learning with a Neural Network, or Naive Bayes, or Regression, etc …, then the output classifier is (more or less) a thresholded mathematical function of the attribute values. The type of function varies a lot between methods, but, it might be e.g.: With the convention that: if this is < 0 the class is A, otherwise B David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  12. Or the classifier could be a decision tree, such as: A1>2 ? no yes A4>7 ? A4<5 ? yes no Class is A Class is B Class is B David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  13. Applications • Just a reminder: you should already be developing an awareness • of this from previous lectures, from browsing data repositories, and • from using your imagination. • Business, Commerce, Engineering, Medicine, and Science provide • endless important applications for this process, in which: • there are existing classified datasets, that we can learn from • and for which producing a classifier that can predict class • values for new data instances is highly valuable. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  14. An Important Issue Overfitting: It is very possible to learn a classifier that gives accurate performance on the existing classified dataset, but which performs very poorly in its predictions on new, previously unseen data. I will have more to say on overfitting, and how to combat it, in later lectures. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  15. The main issue for today Dataset preparation: There are various things we can and should do to our dataset so that we make it easier, or indeed possible, to apply our chosen machine learning method. E.g. we may want to apply the ID3 Decision Tree learning algorithm – but this only works with categorical data. We may want to apply a Neural Network, but this needs the data to be all numeric, and for all values to be relatively small (e.g. between -1 and 1). Etc.. David Corne Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  16. Useful interlude: 1-NN classification • Almost the simplest possible classification method. • The `classifier’ is just the dataset itself • Predict the class of a new instance new by finding the instance c in the dataset that is closest to new, and predicting that it will have the same class value as c David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  17. How to work out distance between two instances? Depends on the data, and on your common sense. More on that later; but for now let’s say we are only dealing with numeric data. One way to work out distance is the SSD (sum-squared distance) If you have two records a and b, then the distance between them can be given by: Just add up the squared differences of the fields, omitting the class field. Why don’t I take the square root of this? David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  18. Visualisation of 1-NN Suppose you have data with two numeric fields and one class field, which is either red or green. We can treat the data as 2D points, coloured by the class. Suppose this is the pre-classified dataset: David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  19. Visualisation of 1-NN Suppose you have data with two numeric fields and one class field, which is either red or green. We can treat the data as 2D points, coloured by the class. Suppose this is the pre-classified dataset: A new unclassified instance comes along – what is the class? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  20. Visualisation of 1-NN Suppose you have data with two numeric fields and one class field, which is either red or green. We can treat the data as 2D points, coloured by the class. Suppose this is the pre-classified dataset: The way you answer this visually is very similar to the way 1-NN works David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  21. Life is difficult / interesting Just an aside to point out that, in many interesting and important classification problems, the situation is more like this: David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  22. The performance of 1-NN classification Calculate it for any dataset of R records like this: Initialise accuracy to 0. For each record r Find the closest other record, o, in the dataset If the class value of o is the same as the class value of r, than add 1 to accuracy Return 100 * accuracy/R Thereby, you obtain a percentage accuracy result David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  23. Scaling, Normalisation and Discretization So we will learn these things: Ways to do normalisation and scaling (which may be necessary or sensible, depending on how the data were generated, and on what DM/ML methods you want to use) Ways to do discretization (for turning numeric fields into categorical fields) David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  24. Normalisation You have data about word-counts for specific words on web pages. You want to be able to predict whether the page is about sport or business. What category would 1-NN predict for the 4th record? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  25. Normalisation The closest record to the 4th is the 3rd, so it would be predicted to be sport. But this is probably wrong. Why? David Corne, r, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  26. Normalisation The closest record to the 4th is the 3rd, so it would be predicted to be sport. But this is probably wrong. Why? How could you pre-process the data to help ML methods? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  27. Record-wise Normalisation In these data, the sensible thing to do is transform each record so that it is normalised by the total number of words. This makes them more comparable, each providing a “fingerprint” in terms of the relative proportions of the probe words. Sometimes this is necessary, sometimes it is useful – it just takes common sense. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  28. Record-wise normalised version Each row is simply scaled so that he numeric fields add up to 1 1-NN now shows record 4 closer to 1 and 2 David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  29. Record-wise Normalisation Using the notation in slide 3 – given a dataset of N numeric fields, how would you represent record-wise normalization of record r ? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  30. Min-Max Normalisation This is columnwise normalisation, done separately for each field. For each numeric attribute, scale it so that each value is in a specific range [a, b]. E.g. usually [0,1], sometimes [-1, 1], etc. E.g. if we use [0,1], we do this for each record r. Where min(i) is the smallest value for field i in the dataset, max(i) is the largest, and range(i) is max(i) – min(i). David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  31. When min-max normalisation might be useful David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  32. What is the closest record to record 5? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  33. What is the closest record to record 5? Will this be a good predictor for DMML Mark? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  34. What is the closest record to record 5? Will this be a good predictor for DMML Mark? Which field will be most important for predicting Mark? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  35. What is the closest record to record 5? Will this be a good predictor for DMML Mark? Which field will be most important for predicting Mark? How does min-max normalisation help? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  36. Discretization We can’t run the ID3 decision tree algorithm on this. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  37. Discretization We can run the ID3 decision tree algorithm on this. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  38. Discretization Discretization is simply a process that converts a numerical field into a class field. How might you do this? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  39. Equal Width Binning (EWB) EWB(5), for example, means you discretize into 5 values, where each value has equal `width’. If field f values range from 0 to 100, then, then each bin has width 20. In the converted dataset, we can just label this bins 1,2,…,5, or we can use appropriate linguistic terms, such as: {very low, low, medium, high, very high} David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  40. Equal Frequency Binning (EFB) EFB(3), for example, means you discretize into 3 values, where each bin covers around the same amount of data. David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  41. Contrived Example for EWB(2) David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  42. Other discretization methods There are others, which almost all differ from EWB and EFB in one respect: they use class value information in attempt to find bins that make good sense with regard to classification. We may look at those later David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  43. Finally: Missing values Often, a dataset has missing values for some fields in some records. E.g. a certain blood test was not taken for some patients. A questionnaire response for some question was unreadable, etc… David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  44. Coursework B There are several ways to deal with missing values. Find out yourself Do some research with google, or google scholar (or your favourite other search engine). Write a report of only ~100 words, entitled “Methods for dealing with missing values in datasets” Provide at least 2 relevant references (to papers or URLs) at the end of your report. The whole thing should only take 30mins—1hr David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  45. Next week Basic statistics Histograms A little more normalisation and discretization Coursework 1 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  46. Extra slides on Data Quality / Data Cleaning I am not teaching Data Quality / Data Cleaning as part of F21DL this year. But here are some slides you may be interested in anyway David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  47. On Data Quality Suppose you have a database sitting in front of you, and I ask ``Is it a good quality database?’’ What is your answer? What does quality depend on? Note: this is about the data themselves, not the system in use to access it. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  48. A Conventional Definition of Data Quality Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent ; meaning … David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  49. A Conventional Definition of Data Quality, continued … Accurate: This refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  50. A Conventional Definition of Data Quality, continued … Complete: This refers to whether or not the database really contains everything it is supposed to contain. E.g. a patient’s medical records should contain references to all medication prescribed to date for that patient. The BBC TV Licensing DB should contain an entry for every address in the country. Does it? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

More Related