Data mining chapter 2 input concepts instances and attributes

Data Mining Chapter 2 Input: Concepts, Instances, and Attributes

Kirk Scott


  • Hopefully the idea of instances and attributes is clear

  • Assuming there is something in the data to be mined, either this is the concept, or the concept is inherent in this

  • Earlier data mining was defined as finding a structural representation

  • Essentially the same idea is now expressed as finding a concept description


Concept Description

  • The concept description needs to be:

  • Intelligible

    • It can be understood, discussed, disputed

  • Operational

    • It can be applied to actual examples



Reiteration of Types of Discovery

  • Classification

    • Prediction

  • Clustering

    • Outliers

  • Association

  • Each of these is a concept

  • Successful accomplishment of these for a data set is a concept description


Recall Examples

  • Weather, contact lenses, iris, labor contracts

  • All were essentially classification problems

  • In general, the assumption is that classes are mutually exclusive

  • In complicated problems, data sets may be classified in multiple ways

  • This means individual instances can be “multilabeled”


Supervised Learning

  • Classification learning is supervised

  • There is a training set

  • A structural representation is derived by examining a set of instances where the classification is known

  • How to test this?

  • Apply the results to another data set with known classifications
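The train-then-test idea above can be sketched in a few lines. This is only an illustration, not the book's method: the data is made up, the function names are hypothetical, and the "structural representation" is just the majority class for each value of one attribute.

```python
from collections import Counter, defaultdict


def train_one_attribute_rule(training, attribute, label):
    """Learn the majority label for each value of a single attribute."""
    counts = defaultdict(Counter)
    for row in training:
        counts[row[attribute]][row[label]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def accuracy(rule, test, attribute, label):
    """Fraction of test instances whose known label the rule reproduces."""
    correct = sum(1 for row in test if rule.get(row[attribute]) == row[label])
    return correct / len(test)


# Made-up training set: the classification (play) is known for every instance
training = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
]
# A separate made-up data set, also with known classifications, used for testing
test = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rainy", "play": "no"},
]

rule = train_one_attribute_rule(training, "outlook", "play")
test_accuracy = accuracy(rule, test, "outlook", "play")
```

The point is only that the rule is derived from one data set and scored on another; any real learning scheme would take the place of `train_one_attribute_rule`.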


Association Rules

  • In any given data set there can be many association rules

  • The total may approach n(n – 1) / 2 for n attributes

  • The book doesn’t use the terms support and confidence, but it discusses these concepts

  • These terms will be introduced


Support for Association Rules

  • Let an association rule X = (x1, x2, …, xi) → y be given in a data set with m instances

  • The support for X → y is the count of the number of instances where the combination of x values, X, occurs in the data set, divided by m

  • In other words, the association rule may be interesting if it occurs frequently enough


Confidence for Association Rules

  • Confidence here is based on the statistical use of the term

  • The confidence for X → y is the count of the number of occurrences in the data set where this relationship holds true divided by the number of occurrences of X overall

  • The book describes this idea as accuracy

  • In other words, the association is interesting the more likely it is that X does determine y
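Support and confidence, as defined on these slides, can be computed directly by counting. The sketch below follows the slide definitions (support counts occurrences of X over all m instances; note that many texts count instances where X and y occur together instead). The data and the function name are made up for illustration.

```python
def support_and_confidence(instances, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    instances  : list of dicts mapping attribute name -> value
    antecedent : dict holding the x values that make up X
    consequent : dict holding the single attribute value y
    """
    m = len(instances)
    # Instances where the whole combination of x values occurs
    x_matches = [row for row in instances
                 if all(row.get(k) == v for k, v in antecedent.items())]
    # Of those, the instances where y holds as well
    xy_matches = [row for row in x_matches
                  if all(row.get(k) == v for k, v in consequent.items())]
    support = len(x_matches) / m if m else 0.0
    confidence = len(xy_matches) / len(x_matches) if x_matches else 0.0
    return support, confidence


# Made-up weather-style instances, for illustration only
weather = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": False, "play": "yes"},
]
support, confidence = support_and_confidence(
    weather, {"outlook": "sunny"}, {"play": "no"})
```

Here X (outlook = sunny) occurs in 3 of 5 instances, and the rule holds in 2 of those 3.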


Clustering

  • We haven’t gotten to the details yet, but this is an interesting data mining problem

  • Given a data set without predefined classes, is it possible to determine classes that the instances fall into?

  • Having determined the classes, can you then classify future instances into them?

  • Outliers are instances that you can definitely say do not fall into any of the classes


Numerical Prediction

  • This is a variation on classification

  • Given n attribute values, determine the (n + 1)st attribute value

  • Recall the CPU performance problem

  • It would be a simple matter to dream up sample data where the weather data predicted how long you would play rather than a simple yes or no

  • (The book does so)



  • The authors are trying to present some important ideas

  • In case their presentation isn’t clear, I present it here in a slightly different way

  • The basic premise goes back to this question:

  • What form does a data set have to be in in order to apply data mining techniques to it?


Data Sets Should Be Tabular

  • The simple answer based on the examples presented so far:

  • The data has to be in tabular form, instances with attributes

  • The remainder of the discussion will revolve around questions related to normalization in db


Not All Data is Naturally Tabular

  • Some data is not most naturally represented in tabular form

  • Consider OO db’s, where the natural representation is tree-like

  • How should such a representation be converted to tabular form that is amenable to data mining?


Correctly Normalized Data May Fall into Multiple Tables

  • You might also have data which naturally falls into >1 table

  • Or, you might have data (god forbid) that has been normalized into >1 table

  • How do you make it conform to the single table model (instances with attributes) for data mining?



Denormalization

  • The situation goes against the grain of correct database design

  • The classification, association, and clustering you intend to do may cross db entity boundaries

  • The fact that you want to do mining on a single tabular representation of the data means you have to denormalize



The Book’s Family Examples

  • Family relationships are typically viewed in tree-like form

  • The book considers a family tree and the relationship “is a sister of”

  • The factors for inferring sisterhood:

  • Two people, one female

  • The same (or at least one common) parents for both people


Two People in the Same Table

  • Suppose you want to do this in tabular form

  • You end up with the two people who might be in a sisterhood relationship in the same table

  • These occurrences of people are matched with a classification, yes or no



  • In theory, you might restrict your attention only to those rows where the classification was yes

  • This restriction is known as the “closed world assumption” in data mining

  • Unfortunately, it is hardly ever the case that you have a problem where this kind of simplifying assumption applies

  • You have to deal with all cases


Two People with Attributes in the Same Table

  • Suppose the two people are only listed by name in the table, without parent information

  • The classification might be correct, but this is of no help

  • There are no attributes to infer sisterhood from

  • The table has to include attributes about the two people, namely parent information


The Connection with Normalization

  • There is a problem with denormalized data mining which is completely analogous to the normalization problem

  • Suppose you have two people in the same instance (the same row) with their attributes

  • By definition, you will have stray dependencies

  • The Person identifiers determine the attribute values


  • So far we’ve considered classification

  • However, what would happen if you mined for associations?

  • The algorithm would find the perfectly true, but already known associations between the pk identifiers of the people and their attribute fields

  • This is not helpful

  • It’s a waste of effort


Recursive Relationships

  • Recall the monarch and product-assembly examples from db

  • These give tables in recursive relationships with themselves or others

  • In terms of the book’s example, how do you deal with parenthood when there is a potentially unlimited sequence of ancestors?



One-to-Many Relationships

  • A denormalized table might be the result of joining two tables in a pk-fk relationship

  • If the classification is on the “one” side of the relationship, then you have multiple instances in the table which are not independent

  • In data mining this is called a multi-instance situation
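A small sketch may make the multi-instance situation concrete. The tables, field names, and the `join_one_to_many` helper are all hypothetical; the join simply copies the "one" side's fields, including the classification, onto every matching "many" row.

```python
def join_one_to_many(one_rows, many_rows, key):
    """Denormalize a pk-fk pair: attach the 'one' row to each matching 'many' row."""
    one_by_key = {row[key]: row for row in one_rows}
    return [{**one_by_key[row[key]], **row}
            for row in many_rows if row[key] in one_by_key]


# Hypothetical "one" table: the classification (risk) lives here
customers = [
    {"customer_id": 1, "risk": "low"},
    {"customer_id": 2, "risk": "high"},
]
# Hypothetical "many" table: several rows share the same fk value
orders = [
    {"customer_id": 1, "amount": 20},
    {"customer_id": 1, "amount": 35},
    {"customer_id": 2, "amount": 500},
]

flat = join_one_to_many(customers, orders, "customer_id")
risks = [row["risk"] for row in flat]
```

The first two rows of `flat` carry the same classification because they describe the same customer, so they are not independent instances.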



Summary of 2.2

  • The fundamental practical idea here is that data sets have to be manipulated into a form that’s suitable for mining

  • This is the input side of data mining

  • The reality is that denormalized tables may be required

  • Data mining can facetiously be referred to as file mining since the required form does not necessarily agree with db theory


  • The situation can be restated in this way:

  • Assemble the query results first; then mine them

  • This leads to an open question:

  • Would it be possible to develop a data mining system that could encompass >1 table, crawling through the pk-fk relationships like a query, finding associations?


2.3 What’s in an Attribute?


  • This subsection falls into two parts:

  • 1. Some ideas that go back to db design and normalization questions

  • 2. Some ideas having to do with data type


Design and Normalization

  • You could include different kinds (subtypes) of entities in the same table

  • To make this work you would have to include all of the fields of all of the kinds of entities

  • The fields that didn’t apply to a particular instance would be null

  • The book uses transportation vehicles as an example: ships and trucks



Data Types

  • The simplest distinction is numeric vs. categorical

  • Some synonyms for categorical: symbolic, nominal, enumerated, discrete

  • There are also two-valued variables known as Boolean or dichotomy


Spectrum of Data Types

  • 1. Nominal = unordered, unmeasurable named categories

  • Example: sunny, overcast, rainy

  • 2. Ordinal = named categories that can be put into a logical order but which have no intrinsic numeric value and no defined distance between them (support < or >)

  • Example: hot, mild, cool








  • Weka



  • The Weka or woodhen (Gallirallus australis) is a flightless bird species of the rail family. It is endemic to New Zealand, where four subspecies are recognized. Weka are sturdy brown birds, about the size of a chicken. As omnivores, they feed mainly on invertebrates and fruit. Weka usually lay eggs between August and January; both sexes help to incubate.


  • Behaviour

  • Where the Weka is relatively common, their furtive curiosity leads them to search around houses and camps for food scraps, or anything unfamiliar and transportable.[2]


Gathering the Data Together

  • In a large organization, different departments may manage their own data

  • Global level data mining will require integration of data from multiple databases

  • If you’re lucky, the organization has already created a unified archive, a data warehouse

  • Interesting mining may also require integrating external data into the data set


Aggregation

  • It may be necessary to aggregate data in order to mine it successfully

  • You may have data on parameters of interest spread through multiple instances

  • To be useful to problem solution, it may be necessary to add the values of data points together, for example


  • The type of aggregation is important

  • Remember the aggregation operators in db: COUNT, SUM, AVERAGE, etc.

  • The level of aggregation is important

  • Remember GROUP BY in db

  • Do you aggregate all instances, or is it useful to do it by subsets of some sort?
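Both choices, the type of aggregation and the level of aggregation, show up in a small sketch. The purchase data and the `aggregate_by` helper are made up; the function mimics COUNT, SUM, and AVERAGE over a GROUP BY.

```python
from collections import defaultdict


def aggregate_by(rows, group_key, value_key):
    """Mimic SELECT group_key, COUNT(*), SUM(v), AVG(v) ... GROUP BY group_key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_key]].append(row[value_key])
    return {
        group: {"count": len(vals), "sum": sum(vals),
                "average": sum(vals) / len(vals)}
        for group, vals in buckets.items()
    }


# Hypothetical data: the quantity of interest is spread over several instances
purchases = [
    {"customer": "A", "amount": 10.0},
    {"customer": "A", "amount": 30.0},
    {"customer": "B", "amount": 5.0},
]

# Level of aggregation 1: by subsets (one result per customer)
per_customer = aggregate_by(purchases, "customer", "amount")

# Level of aggregation 2: all instances at once (a single constant group)
overall = aggregate_by([{**p, "everyone": "all"} for p in purchases],
                       "everyone", "amount")
```

Whether the per-customer totals or the overall total is the useful attribute depends on the problem being mined.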


ARFF (Format)

  • ARFF is the regular version of the data format for Weka

  • XRFF is the XML version

  • In ARFF:

  • % marks a comment

  • @ marks file descriptor information, relation, attributes, and data
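As a rough illustration of the layout just described, the sketch below builds the text of a tiny ARFF file. The `to_arff` helper, the relation name, and the data row are hypothetical; the output uses only the % comment marker and the @relation, @attribute, and @data descriptors.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes : list of (name, kind) pairs, where kind is either the string
                 "numeric" or a list of nominal values
    rows       : list of tuples, one per instance
    """
    lines = ["% Hypothetical example file", "@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute %s %s" % (name, kind))
        else:
            # Nominal attributes list their allowed values inside braces
            lines.append("@attribute %s {%s}" % (name, ", ".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines)


arff_text = to_arff(
    "weather",
    [("outlook", ["sunny", "overcast", "rainy"]),
     ("temperature", "numeric"),
     ("play", ["yes", "no"])],
    [("sunny", 85, "no")],
)
arff_lines = arff_text.splitlines()
```

Printing `arff_text` gives a comment line, the relation name, three attribute declarations, and one data row.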





Weka Has Three Additional Attribute Types

  • String = the moral equivalent of VARCHAR in db

  • Date = the equivalent of DATE in db

  • Relational = Stay tuned; this will require some explanation


Relational-Valued Attributes

  • The book gives an example which is OK, but it’s not necessarily presented in the clearest way possible

  • My plan is to first give a bunch of explanatory background

  • Then explain the book’s example in a slightly different order than it does


Relational Background

  • Recall that multivalued problems can be viewed as mining the result of a 1-m join

  • In preparing a data set for mining, this is what a relational-valued attribute is:

  • It is an attribute that can contain or consist of multiple instances of the same kind of set of values, where these sets belong together for some reason


  • In a 1-m pk-fk join, the multiple sets are the rows of the many table which belong together because they share the same fk value

  • In case this general overview isn’t clear, the idea can be illustrated with mothers and children


Mothers and Children

  • Suppose you ran this query:

  • SELECT *

  • FROM Mother, Child

  • WHERE Mother.motherid = Child.motherid

  • GROUP BY motherid

  • Children of the same mother would be grouped together



  • This is where relational-valued attributes come in

  • From a relational point of view, the representation is wrong

  • First normal form says you have flat files with no repeating groups

  • But for data mining purposes, in ARFF format, you want the repeating groups
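The "repeating group" idea can be sketched by nesting the joined rows under their shared key. The rows, field names, and `nest_by_key` helper are made up for illustration; the result has one entry per mother, each holding the set of child rows that belong together.

```python
from collections import defaultdict


def nest_by_key(joined_rows, key, child_fields):
    """Collect the 'many' rows that share a key value into one repeating group."""
    groups = defaultdict(list)
    for row in joined_rows:
        groups[row[key]].append(tuple(row[field] for field in child_fields))
    return dict(groups)


# Hypothetical result of joining Mother and Child on motherid
joined = [
    {"motherid": 1, "child_name": "Ann", "child_age": 7},
    {"motherid": 1, "child_name": "Bob", "child_age": 4},
    {"motherid": 2, "child_name": "Cy", "child_age": 9},
]

nested = nest_by_key(joined, "motherid", ["child_name", "child_age"])
```

This nested shape violates first normal form, which is exactly the structure a relational-valued attribute is meant to hold.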


Explaining the Book’s Example

  • The book adapts the weather/play a game data to a multivalued example

  • The new twist is this: Games extend over 2 days, not just one

  • Each day is still a single instance

  • But for each game, there are two of these instances which belong together







  • In the body of the ARFF table, the multivalued entries are structured in this way:

  • The data for the multiple days that belong together for a single bag_ID is enclosed in quotation marks

  • Within the quotation marks, the individual sets of day data are separated by “\n”, the new line character
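The structure just described can be sketched as a small formatting function. The `bag_to_arff_row` name and the day values are made up, and the sketch assumes the separator is stored in the file as the two characters backslash and n rather than as an actual line break.

```python
def bag_to_arff_row(bag_id, days, label):
    """Format one multi-instance data row: bag id, quoted bag of days, class."""
    day_strings = [", ".join(str(value) for value in day) for day in days]
    # Assumption: a literal backslash followed by n separates the day data
    bag = "\\n".join(day_strings)
    return '%s, "%s", %s' % (bag_id, bag, label)


# Hypothetical two-day game: each tuple is one day's weather
row = bag_to_arff_row(
    1,
    [("sunny", 85, 85, "false"), ("sunny", 80, 90, "true")],
    "no",
)
```

The two days stay recognizable as separate instances, but they travel together as one quoted value for the bag.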



Sparse Data

  • Some data sets are sparse

  • In this context the book means 0’s for numerical values, not nulls

  • Rather than listing everything, a row can be economically expressed by showing only the values present


  • In ARFF, the attributes for a row are:

  • Identified by number starting with 0

  • Followed by the value

  • Separated by commas

  • Enclosed in braces

  • E.g.:

  • {1 X, 6 Y, 10 "Class A"}

  • This doesn’t work for nulls; you still have to include ?’s
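The rules above can be sketched as a conversion from a full row to the sparse form. The `to_sparse_row` helper is hypothetical, and it assumes that `None` stands for a null in the input; zeros are dropped, but nulls are still written out as ?.

```python
def to_sparse_row(values):
    """Convert a dense row to sparse form: {index value, index value, ...}."""
    parts = []
    for index, value in enumerate(values):
        if value is None:
            # Nulls cannot be omitted; they still need an explicit ?
            parts.append("%d ?" % index)
        elif value != 0:
            text = str(value)
            if isinstance(value, str) and " " in value:
                text = '"%s"' % value  # quote values that contain spaces
            parts.append("%d %s" % (index, text))
    return "{" + ", ".join(parts) + "}"


dense = [0, "X", 0, 0, 0, 0, "Y", 0, 0, 0, "Class A"]
sparse = to_sparse_row(dense)
with_null = to_sparse_row([0, None, 5])
```

The first call reproduces the example row from the slide; the second shows a null surviving as an explicit ?.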


Attribute Types

  • The bottom line is that ARFF only has two fundamental types: nominal and numeric

  • String attributes are effectively nominal

  • Date attributes are effectively numeric

  • (Recall the discussions of stuff like this in db)

  • The rest of this subsection has to do with numeric types in particular


Numerics as Ordinals

  • The important point is this:

  • Different data mining algorithms treat vanilla numeric values differently

  • One algorithm may treat numerics as ordinals, where subtraction applies, generating rules based on <, =, > comparisons


Numerics as Ratio Values

  • Another algorithm may treat numerics as ratio values

  • Recall that all arithmetic operations are defined in this case

  • The algorithm may normalize ratio values


Normalization

  • Normalization means putting values into a range, most commonly the range 0 to 1

  • A simple approach for positive values: Divide any given data value by the maximum present

  • Another simple approach for positive values: Subtract the minimum from the data value and divide by (max – min)
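Both approaches are one-liners in practice. The function names below are made up, and the sample values are for illustration only.

```python
def divide_by_max(values):
    """Scale positive values by dividing each one by the maximum present."""
    largest = max(values)
    return [v / largest for v in values]


def min_max(values):
    """Subtract the minimum and divide by (max - min), giving the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; max - min is 0")
    return [(v - lo) / (hi - lo) for v in values]


by_max = divide_by_max([2, 4, 8])
by_min_max = min_max([10, 20, 30])
```

Note the difference: dividing by the maximum maps the largest value to 1 but does not map the smallest to 0, while the second approach pins both ends of the range.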


Standardization

  • Values can also be statistically standardized

  • Each data point is converted using this approach:

  • x_standardized = (x – μ) / σ

  • This puts the values into a distribution where the mean is 0 and the standard deviation is 1
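A minimal sketch of the conversion, using Python's standard `statistics` module. It assumes σ is the population standard deviation of the values at hand (`pstdev`); the sample standard deviation would be an equally plausible reading. The data is made up.

```python
import statistics


def standardize(values):
    """Convert each x to (x - mean) / standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all values are identical; standard deviation is 0")
    return [(x - mu) / sigma for x in values]


# Made-up values with mean 5 and population standard deviation 2
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

After the conversion the values have mean 0 and standard deviation 1, as the slide states.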


Distance as an Example of Ratio Values

  • Consider the calculation of distance in n dimensional space, 2-space for example

  • Calculating the square root of the sum of the squares of the differences of the coordinates involves using arithmetic operators other than subtraction

  • Normalization is implicated in a situation like this


  • Given some (x, y) space, suppose x is in the range 0 to 10 and y is in the range 0 to 100

  • Do you normalize both x and y before calculating distances or not?

  • Another way of stating this is, do x and y make corresponding contributions to the measure of distance between two data points or not?
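The question can be explored with a quick sketch. The two points are made up, and the ranges are the ones from the slide (x from 0 to 10, y from 0 to 100); the helper names are hypothetical.

```python
import math


def euclidean(p, q):
    """Square root of the sum of the squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def scale(point, ranges):
    """Put each coordinate into the range 0 to 1 using its known (min, max)."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, ranges))


ranges = [(0, 10), (0, 100)]  # x range, y range
a, b = (2, 10), (8, 20)

raw = euclidean(a, b)
scaled = euclidean(scale(a, ranges), scale(b, ranges))
```

In the raw calculation the y difference (10) outweighs the x difference (6), even though 6 is most of the x range and 10 is a small part of the y range. After scaling, the x difference dominates instead.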


Nominal Attributes and Distance

  • This is a crude measure of distance for nominal attributes:

  • If two instances have the same value for that attribute, the distance between them, measured on that attribute, is 0

  • If two instances have a different value for an attribute, the distance between them is 1
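This crude measure is easy to write down. The sketch below also sums it over several nominal attributes, which is one simple way to combine them and is an assumption here, not something the slides specify. The instances are made up.

```python
def nominal_distance(a, b):
    """0 if the two nominal values match, 1 if they differ."""
    return 0 if a == b else 1


def instance_distance(x, y, attributes):
    """Assumed combination rule: add up the per-attribute 0/1 distances."""
    return sum(nominal_distance(x[name], y[name]) for name in attributes)


day1 = {"outlook": "sunny", "temperature": "hot", "windy": "false"}
day2 = {"outlook": "sunny", "temperature": "mild", "windy": "true"}

d = instance_distance(day1, day2, ["outlook", "temperature", "windy"])
```

The two days agree on outlook and differ on the other two attributes, so the total is 2.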



Nominal vs. Numeric

  • Just like in db, the assertion is made that an id “number” field should be TEXT

  • In data mining there may be attributes containing numeric digits which are simply nominal fields and should be mined as such


  • Finally, some algorithms support nominals but not ordinals

  • In the contact lens data, young < pre-presbyopic < presbyopic

  • If their ordinal relationship is not recognized, a complete and correct set of rules can still be mined

  • However, a complete and correct set of rules about 1/3 as large can be mined in a system that recognizes the relationship


Missing Values

  • This is essentially a discussion of nulls

  • The only new element consists of two questions:

  • Can you infer anything from the absence of values?

  • Would it be possible to meaningfully code why values are absent and mine something from this?


Inaccurate Values

  • This is essentially a discussion of data integrity

  • Both data miners and regular db users have to cope with faulty data one way or the other

  • The authors say this is especially important when mining

  • It’s especially important to the data miner if the data miner ascribes more significance to an attribute than a regular user does


The End

