Data Mining (and machine learning)

Data Mining(and machine learning) Lecture 2: Some common data processing tasks David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Today • About classification • A simple classification method (1-NN) • Normalisation of instances and fields • Discretization of fields • Coursework B David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Some notation A dataset has N records {R1,R2,…,RN}, … each of which has F fields {A1,A2,…,AF}. Field AF is the target field – the one we want to predict. v(i,j) is the value of field Aj in record Ri So, for example, what is David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Some more terminology Sometimes the values of a field are numeric. We call this a numeric field (!) Sometimes the values of a field are words or categories. E.g. {red, blue, green} {true, false} {male, female} {low, medium, high} this type of field is called a categorical field, or a nominal field. David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

So, what is this mostly about? You have some data: David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

So, what is this mostly about? You have some data: You get a new line of data, but it is not classified David Corne Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

So, what is this mostly about? You have some data: You get a new line of data, but it is not classified You need to predict the class value David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Broadly how it’s done In order to be able to predict the class value of any new line of data that comes along, this is what happens: David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Broadly how it’s done In order to be able to predict the class value of any new line of data that comes along, this is what happens: Get your already-classified dataset Run a machine learning method on these data, which produces a classifier Use the classifier to make predictions of the class value for new instances David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

E.g. here is some data we saw before If we run a machine learning method like RIPPER, the classifer that comes out is a rule (or a set of rules), e.g.: If (A1<5) AND (A5<2) THEN class is A If (A2>1) THEN class is B David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

E.g. here is some data we saw before If instead we do machine learning with a Neural Network, or Naive Bayes, or Regression, etc …, then the output classifier is (more or less) a thresholded mathematical function of the attribute values. The type of function varies a lot between methods, but, it might be e.g.: With the convention that: if this is < 0 the class is A, otherwise B David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Or the classifier could be a decision tree, such as: A1>2 ? no yes A4>7 ? A4<5 ? yes no Class is A Class is B Class is B David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Applications • Just a reminder: you should already be developing an awareness • of this from previous lectures, from browsing data repositories, and • from using your imagination. • Business, Commerce, Engineering, Medicine, and Science provide • endless important applications for this process, in which: • there are existing classified datasets, that we can learn from • and for which producing a classifier that can predict class • values for new data instances is highly valuable. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

An Important Issue Overfitting: It is very possible to learn a classifier that gives accurate performance on the existing classified dataset, but which performs very poorly in its predictions on new, previously unseen data. I will have more to say on overfitting, and how to combat it, in later lectures. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

The main issue for today Dataset preparation: There are various things we can and should do to our dataset so that we make it easier, or indeed possible, to apply our chosen machine learning method. E.g. we may want to apply the ID3 Decision Tree learning algorithm – but this only works with categorical data. We may want to apply a Neural Network, but this needs the data to be all numeric, and for all values to be relatively small (e.g. between -1 and 1). Etc.. David Corne Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Useful interlude: 1-NN classification • Almost the simplest possible classification method. • The `classifier’ is just the dataset itself • Predict the class of a new instance new by finding the instance c in the dataset that is closest to new, and predicting that it will have the same class value as c David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

How to work out distance between two instances? Depends on the data, and on your common sense. More on that later; but for now let’s say we are only dealing with numeric data. One way to work out distance is the SSD (sum-squared distance) If you have two records a and b, then the distance between them can be given by: Just add up the squared differences of the fields, omitting the class field. Why don’t I take the square root of this? David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Visualisation of 1-NN Suppose you have data with two numeric fields and one class field, which is either red or green. We can treat the data as 2D points, coloured by the class. Suppose this is the pre-classified dataset: David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Visualisation of 1-NN Suppose you have data with two numeric fields and one class field, which is either red or green. We can treat the data as 2D points, coloured by the class. Suppose this is the pre-classified dataset: A new unclassified instance comes along – what is the class? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Visualisation of 1-NN Suppose you have data with two numeric fields and one class field, which is either red or green. We can treat the data as 2D points, coloured by the class. Suppose this is the pre-classified dataset: The way you answer this visually is very similar to the way 1-NN works David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Life is difficult / interesting Just an aside to point out that, in many interesting and important classification problems, the situation is more like this: David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

The performance of 1-NN classification Calculate it for any dataset of R records like this: Initialise accuracy to 0. For each record r Find the closest other record, o, in the dataset If the class value of o is the same as the class value of r, than add 1 to accuracy Return 100 * accuracy/R Thereby, you obtain a percentage accuracy result David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Scaling, Normalisation and Discretization So we will learn these things: Ways to do normalisation and scaling (which may be necessary or sensible, depending on how the data were generated, and on what DM/ML methods you want to use) Ways to do discretization (for turning numeric fields into categorical fields) David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Normalisation You have data about word-counts for specific words on web pages. You want to be able to predict whether the page is about sport or business. What category would 1-NN predict for the 4th record? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Normalisation The closest record to the 4th is the 3rd, so it would be predicted to be sport. But this is probably wrong. Why? David Corne, r, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Normalisation The closest record to the 4th is the 3rd, so it would be predicted to be sport. But this is probably wrong. Why? How could you pre-process the data to help ML methods? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Record-wise Normalisation In these data, the sensible thing to do is transform each record so that it is normalised by the total number of words. This makes them more comparable, each providing a “fingerprint” in terms of the relative proportions of the probe words. Sometimes this is necessary, sometimes it is useful – it just takes common sense. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Record-wise normalised version Each row is simply scaled so that he numeric fields add up to 1 1-NN now shows record 4 closer to 1 and 2 David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Record-wise Normalisation Using the notation in slide 3 – given a dataset of N numeric fields, how would you represent record-wise normalization of record r ? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Min-Max Normalisation This is columnwise normalisation, done separately for each field. For each numeric attribute, scale it so that each value is in a specific range [a, b]. E.g. usually [0,1], sometimes [-1, 1], etc. E.g. if we use [0,1], we do this for each record r. Where min(i) is the smallest value for field i in the dataset, max(i) is the largest, and range(i) is max(i) – min(i). David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

When min-max normalisation might be useful David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

What is the closest record to record 5? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

What is the closest record to record 5? Will this be a good predictor for DMML Mark? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

What is the closest record to record 5? Will this be a good predictor for DMML Mark? Which field will be most important for predicting Mark? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

What is the closest record to record 5? Will this be a good predictor for DMML Mark? Which field will be most important for predicting Mark? How does min-max normalisation help? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Discretization We can’t run the ID3 decision tree algorithm on this. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Discretization We can run the ID3 decision tree algorithm on this. David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Discretization Discretization is simply a process that converts a numerical field into a class field. How might you do this? David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Equal Width Binning (EWB) EWB(5), for example, means you discretize into 5 values, where each value has equal `width’. If field f values range from 0 to 100, then, then each bin has width 20. In the converted dataset, we can just label this bins 1,2,…,5, or we can use appropriate linguistic terms, such as: {very low, low, medium, high, very high} David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Equal Frequency Binning (EFB) EFB(3), for example, means you discretize into 3 values, where each bin covers around the same amount of data. David Corne, , Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Contrived Example for EWB(2) David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Other discretization methods There are others, which almost all differ from EWB and EFB in one respect: they use class value information in attempt to find bins that make good sense with regard to classification. We may look at those later David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Finally: Missing values Often, a dataset has missing values for some fields in some records. E.g. a certain blood test was not taken for some patients. A questionnaire response for some question was unreadable, etc… David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Coursework B There are several ways to deal with missing values. Find out yourself Do some research with google, or google scholar (or your favourite other search engine). Write a report of only ~100 words, entitled “Methods for dealing with missing values in datasets” Provide at least 2 relevant references (to papers or URLs) at the end of your report. The whole thing should only take 30mins—1hr David Corne, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Next week Basic statistics Histograms A little more normalisation and discretization Coursework 1 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Extra slides on Data Quality / Data Cleaning I am not teaching Data Quality / Data Cleaning as part of F21DL this year. But here are some slides you may be interested in anyway David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

On Data Quality Suppose you have a database sitting in front of you, and I ask ``Is it a good quality database?’’ What is your answer? What does quality depend on? Note: this is about the data themselves, not the system in use to access it. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

A Conventional Definition of Data Quality Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent ; meaning … David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

A Conventional Definition of Data Quality, continued … Accurate: This refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

A Conventional Definition of Data Quality, continued … Complete: This refers to whether or not the database really contains everything it is supposed to contain. E.g. a patient’s medical records should contain references to all medication prescribed to date for that patient. The BBC TV Licensing DB should contain an entry for every address in the country. Does it? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Data Mining (and machine learning)