Data Mining

Data for Data Mining. By Prof. Muhammad Amir Alam.


Presentation Transcript


  1. Data Mining: Data for Data Mining. By Prof. Muhammad Amir Alam

  2. Data for Data Mining Data for data mining comes in many forms: from computer files typed in by human operators, business information in SQL or some other standard database format, information recorded automatically by equipment such as fault logging devices, to streams of binary data transmitted from satellites. The universe of objects is normally very large and we have only a small part of it. Usually we want to extract information from the data available to us that we hope is applicable to the large volume of data that we have not yet seen.

  3. Standard Formulation We will assume that for any data mining application we have a universe of objects that are of interest. For example: • all human beings, alive or dead • possibly all the patients at a hospital • all the rocks on the moon • all the pages stored in the World Wide Web

  4. Example (Dataset) The set of variable values corresponding to each of the objects is called a record or (more commonly) an instance. This dataset is an example of labelled data, where one attribute is given special significance and the aim is to predict its value. When there is no such significant attribute we call the data unlabelled.
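As a minimal sketch of these terms, assume a hypothetical toy dataset (the attribute names and values below are invented for illustration and do not come from the slides). Each instance is held as a dictionary, and one attribute is singled out as the label to be predicted:

```python
# Hypothetical toy dataset: each dict is one instance (record).
# Attribute names and values are invented for illustration only.
instances = [
    {"age": 34, "smoker": "no", "diagnosis": "healthy"},
    {"age": 61, "smoker": "yes", "diagnosis": "ill"},
    {"age": 47, "smoker": "no", "diagnosis": "healthy"},
]

LABEL = "diagnosis"  # the attribute given special significance


def split_labelled(rows, label):
    """Separate each instance into its ordinary attributes and its label."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [row[label] for row in rows]
    return features, labels


features, labels = split_labelled(instances, LABEL)
print(features[0])  # ordinary attributes of the first instance
print(labels)       # the values we would try to predict
```

Dropping the `diagnosis` attribute entirely would turn the same records into unlabelled data.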

  5. Types of Variable Six main types of variable can be distinguished. Nominal Variables: A variable used to put objects into categories, e.g. the name or colour of an object. A nominal variable may be numerical in form, but the numerical values have no mathematical interpretation. For example we might label 10 people as numbers 1, 2, 3, . . . , 10, but any arithmetic with such values would be meaningless. Binary Variables: A binary variable is a special case of a nominal variable that takes only two possible values: true or false, 1 or 0, etc.

  6. Ordinal Variables: Ordinal variables are similar to nominal variables, except that an ordinal variable has values that can be arranged in a meaningful order, e.g. small, medium, large. Integer Variables: Integer variables are ones that take values that are genuine integers, for example ‘number of children’. Unlike nominal variables that are numerical in form, arithmetic with integer variables is meaningful (1 child + 2 children = 3 children, etc.). Interval-scaled Variables: Interval-scaled variables are variables that take numerical values which are measured at equal intervals from a zero point or origin. The origin is arbitrary, however, and does not imply a true absence of the measured characteristic (as with 0 degrees Celsius). Ratio-scaled Variables: Ratio-scaled variables are similar to interval-scaled variables except that the zero point does reflect the absence of the measured characteristic.
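The difference between interval-scaled and ratio-scaled variables can be illustrated with temperature, an example chosen here for illustration rather than taken from the slides. Celsius is interval-scaled (its zero is an arbitrary origin), while Kelvin is ratio-scaled (its zero is a true absence of thermal energy). A minimal sketch:

```python
def celsius_to_kelvin(c):
    """Convert an interval-scaled Celsius reading to ratio-scaled Kelvin."""
    return c + 273.15


t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
diff_c = t2_c - t1_c

# The ratio of Celsius values can be computed but is not meaningful:
# 20 degrees Celsius is not "twice as hot" as 10 degrees Celsius.
naive_ratio = t2_c / t1_c

# On the Kelvin scale the zero point is a true zero, so ratios are meaningful.
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(diff_c)
print(naive_ratio)
print(round(true_ratio, 3))
```

The naive ratio is 2, whereas the ratio on the Kelvin scale is only slightly above 1.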

  7. Data Preparation Data Cleaning: Even when the data is in the standard form, it cannot be assumed to be error free. Erroneous values can arise from: • measurement errors • subjective judgments • malfunctioning or misuse of automatic recording equipment. For example, the number 69.72 may accidentally be entered as 6.972, or “brown” as “bbrown”, although errors of this second kind are generally less harmful.
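A minimal sketch of two simple cleaning checks follows. The set of valid colours and the plausible numeric range are hypothetical, assumed purely for illustration. The sketch shows why the two example errors differ: “bbrown” is not a possible value and is easy to flag, whereas 6.972 is still a possible value and passes a simple range check.

```python
# Hypothetical valid categories and plausible range, assumed for illustration.
VALID_COLOURS = {"brown", "black", "white"}
PLAUSIBLE_RANGE = (0.0, 200.0)


def find_invalid_categories(values, valid):
    """Return values that are not in the set of allowed categories."""
    return [v for v in values if v not in valid]


def find_out_of_range(values, low, high):
    """Return numeric values falling outside the plausible range."""
    return [v for v in values if not (low <= v <= high)]


colours = ["brown", "bbrown", "black"]
weights = [69.72, 6.972, 71.30]

print(find_invalid_categories(colours, VALID_COLOURS))  # ['bbrown']
print(find_out_of_range(weights, *PLAUSIBLE_RANGE))      # []
```

Catching the misplaced decimal point would need a stronger check, such as comparing each value with the others recorded for the same attribute.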

  8. Missing Values In many real-world datasets, data values are not recorded for all attributes. This can happen for various reasons: • a malfunction of the equipment used to record the data • a data collection form to which additional fields were added after some data had been collected • information that could not be obtained, e.g. about a hospital patient. There are many techniques for dealing with missing values. Two of them are: 1. Discard instances 2. Replace by the most frequent / average value

  9. Dealing with Missing Values Discard Instances: This is the simplest strategy: delete all instances where there is at least one missing value and use the remainder. Replace by Most Frequent/Average Value: A straightforward but effective way of estimating a missing value of a categorical attribute is to use its most frequently occurring (non-missing) value. This is easy to justify if the attribute values are very unbalanced. For example, if attribute X has possible values a, b and c which occur in proportions 80%, 15% and 5% respectively, it seems reasonable to estimate any missing values of attribute X by the value a. For a continuous attribute, the average of the non-missing values is generally used instead.
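Both strategies can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes instances are dictionaries, that a missing value is represented by `None`, and that the attribute being filled has at least one non-missing value. The function names and the sample rows are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing value


def discard_instances(rows):
    """Keep only instances with no missing value in any attribute."""
    return [row for row in rows if all(v is not MISSING for v in row.values())]


def most_frequent(rows, attribute):
    """Return the most frequent non-missing value of a categorical attribute.

    Assumes at least one instance has a non-missing value for the attribute.
    """
    counts = Counter(
        row[attribute] for row in rows if row[attribute] is not MISSING
    )
    return counts.most_common(1)[0][0]


def fill_most_frequent(rows, attribute):
    """Return copies of the instances with missing values of one attribute filled."""
    fill = most_frequent(rows, attribute)
    return [
        {**row, attribute: fill if row[attribute] is MISSING else row[attribute]}
        for row in rows
    ]


# Hypothetical sample data for illustration.
rows = [
    {"X": "a", "Y": 1},
    {"X": "a", "Y": 2},
    {"X": None, "Y": 3},
    {"X": "b", "Y": 4},
    {"X": "a", "Y": None},
]

print(len(discard_instances(rows)))        # 3
print(fill_most_frequent(rows, "X")[2])    # {'X': 'a', 'Y': 3}
```

Discarding loses two of the five instances here, while filling keeps all of them at the cost of introducing an estimated value.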
