Data Mining Chapter 2 Input: Concepts, Instances, and Attributes

Data Mining Chapter 2 Input: Concepts, Instances, and Attributes Kirk Scott

Hopefully the idea of instances and attributes is clear • Assuming there is something in the data to be mined, either that something is the concept, or the concept is inherent in it • Earlier, data mining was defined as finding a structural representation • Essentially the same idea is now expressed as finding a concept description

Concept Description • The concept description needs to be: • Intelligible • It can be understood, discussed, disputed • Operational • It can be applied to actual examples

2.1 What’s a Concept?

Reiteration of Types of Discovery • Classification • Prediction • Clustering • Outliers • Association • Each of these is a concept • Successful accomplishment of these for a data set is a concept description

Recall Examples • Weather, contact lenses, iris, labor contracts • All were essentially classification problems • In general, the assumption is that classes are mutually exclusive • In complicated problems, data sets may be classified in multiple ways • This means individual instances can be “multilabeled”

Supervised Learning • Classification learning is supervised • There is a training set • A structural representation is derived by examining a set of instances where the classification is known • How to test this? • Apply the results to another data set with known classifications

Association Rules • In any given data set there can be many association rules • The total may approach n(n – 1) / 2 for n attributes • The book doesn’t use the terms support and confidence, but it discusses these concepts • These terms will be introduced

Support for Association Rules • Let an association rule X = (x1, x2, …, xi) → y be given in a data set with m instances • The support for X → y is the count of the number of instances where the combination of x values, X, occurs in the data set, divided by m • In other words, the association rule may be interesting if it occurs frequently enough

Confidence for Association Rules • Confidence here is based on the statistical use of the term • The confidence for X → y is the count of the number of occurrences in the data set where this relationship holds true divided by the number of occurrences of X overall • The book describes this idea as accuracy • In other words, the association is interesting the more likely it is that X does determine y
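The two measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instances, the attribute names, and the function names `matches`, `support`, and `confidence` are all hypothetical, and support is computed as these slides define it (occurrences of X divided by m). Many treatments instead count the instances where X and y occur together.

```python
# Hypothetical toy data: each instance is a dict of attribute -> value.
instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]


def matches(instance, conditions):
    """True if the instance has every attribute value listed in conditions."""
    return all(instance.get(attr) == value for attr, value in conditions.items())


def support(instances, x):
    """Fraction of the m instances in which the combination X occurs."""
    return sum(matches(inst, x) for inst in instances) / len(instances)


def confidence(instances, x, y):
    """Among the instances where X occurs, the fraction where y also holds."""
    x_count = sum(matches(inst, x) for inst in instances)
    if x_count == 0:
        return 0.0  # X never occurs, so the rule has no evidence either way
    both = sum(matches(inst, x) and matches(inst, y) for inst in instances)
    return both / x_count


# Rule under test: windy = false -> play = yes
rule_x = {"windy": "false"}
rule_y = {"play": "yes"}
rule_support = support(instances, rule_x)
rule_confidence = confidence(instances, rule_x, rule_y)
```

Here X occurs in 3 of the 5 instances, and y holds in 2 of those 3, so the rule is fairly frequent but far from certain.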

Clustering • We haven’t gotten to the details yet, but this is an interesting data mining problem • Given a data set without predefined classes, is it possible to determine classes that the instances fall into? • Having determined the classes, can you then classify future instances into them? • Outliers are instances that you can definitely say do not fall into any of the classes

Numerical Prediction • This is a variation on classification • Given n attribute values, determine the (n + 1)st attribute value • Recall the CPU performance problem • It would be a simple matter to dream up sample data where the weather data predicted how long you would play rather than a simple yes or no • (The book does so)

2.2 What’s in an Example?

The authors are trying to present some important ideas • In case their presentation isn’t clear, I present it here in a slightly different way • The basic premise goes back to this question: • What form does a data set have to be in in order to apply data mining techniques to it?

Data Sets Should Be Tabular • The simple answer based on the examples presented so far: • The data has to be in tabular form, instances with attributes • The remainder of the discussion will revolve around questions related to normalization in db

Not All Data is Naturally Tabular • Some data is not most naturally represented in tabular form • Consider OO db’s, where the natural representation is tree-like • How should such a representation be converted to tabular form that is amenable to data mining?

Correctly Normalized Data May Fall into Multiple Tables • You might also have data which naturally falls into >1 table • Or, you might have data (god forbid) that has been normalized into >1 table • How do you make it conform to the single table model (instances with attributes) for data mining?

Tree-like data and multi-table data may be related questions • It would not be surprising to find that a conversion of a tree to a table resulted in >1 table

Denormalization • The situation goes against the grain of correct database design • The classification, association, and clustering you intend to do may cross db entity boundaries • The fact that you want to do mining on a single tabular representation of the data means you have to denormalize

In short, you combine multiple tables back into one table • The end result is the monstrosity that is railed against in normalization theory: • The monolithic, one-table db
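Combining tables back into one can be sketched as a plain join in Python. The `customers` and `orders` tables, their fields, and the `denormalize` function are hypothetical and chosen only for illustration; a real data set would more likely be joined with a query before mining.

```python
# Hypothetical normalized tables in a pk-fk relationship.
customers = {
    1: {"name": "Ann", "region": "north"},
    2: {"name": "Bob", "region": "south"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
]


def denormalize(orders, customers):
    """Copy each parent row's fields into every child row that references it."""
    flat = []
    for order in orders:
        row = dict(order)  # start from the child ("many" side) row
        row.update(customers[order["customer_id"]])  # add the parent's fields
        flat.append(row)
    return flat


flat_table = denormalize(orders, customers)
```

Note that Ann's name and region now appear in two rows. That repetition is exactly the redundancy normalization is meant to remove, and it is the price of the single-table form.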

The Book’s Family Examples • Family relationships are typically viewed in tree-like form • The book considers a family tree and the relationship “is a sister of” • The factors for inferring sisterhood: • Two people, one female • The same (or at least one common) parents for both people

Two People in the Same Table • Suppose you want to do this in tabular form • You end up with the two people who might be in a sisterhood relationship in the same table • These occurrences of people are matched with a classification, yes or no

Recall that according to normalization, a truly one-to-one relationship can be stored in a single table • Pairings of all people would result in lots of instances/rows where the classification was simply no • This isn’t too convenient

In theory, you might restrict your attention only to those rows where the classification was yes • This restriction is known as the “closed world assumption” in data mining • Unfortunately, it is hardly ever the case that you have a problem where this kind of simplifying assumption applies • You have to deal with all cases

Two People with Attributes in the Same Table • Suppose the two people are only listed by name in the table, without parent information • The classification might be correct, but this is of no help • There are no attributes to infer sisterhood from • The table has to include attributes about the two people, namely parent information
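A table of person pairs with parent attributes can be sketched as follows. The people, their parents, and the function `sister_rows` are hypothetical, and the rule used to fill in the classification (second person is female and shares at least one parent with the first) is a simplified reading of the factors listed earlier, not the book's exact table.

```python
from itertools import permutations

# Hypothetical people; each has a gender and a set of parents.
people = {
    "Ann": {"gender": "female", "parents": {"Pat", "Sam"}},
    "Ben": {"gender": "male", "parents": {"Pat", "Sam"}},
    "Cat": {"gender": "female", "parents": {"Lee", "Kim"}},
}


def sister_rows(people):
    """One row per ordered pair: is the second person a sister of the first?"""
    rows = []
    for first, second in permutations(people, 2):
        a, b = people[first], people[second]
        shares_parent = bool(a["parents"] & b["parents"])
        is_sister = b["gender"] == "female" and shares_parent
        rows.append({
            "first": first,
            "first_parents": sorted(a["parents"]),
            "second": second,
            "second_gender": b["gender"],
            "second_parents": sorted(b["parents"]),
            "sister_of": "yes" if is_sister else "no",
        })
    return rows


rows = sister_rows(people)
yes_rows = [r for r in rows if r["sister_of"] == "yes"]
```

Even with three people there are six ordered pairs and only one "yes" row, which shows how quickly the "no" rows come to dominate.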

The Connection with Normalization • There is a problem with denormalized data mining which is completely analogous to the normalization problem • Suppose you have two people in the same instance (the same row) with their attributes • By definition, you will have stray dependencies • The Person identifiers determine the attribute values

So far we’ve considered classification • However, what would happen if you mined for associations? • The algorithm would find the perfectly true, but already known associations between the pk identifiers of the people and their attribute fields • This is not helpful • It’s a waste of effort

Recursive Relationships • Recall the monarch and product-assembly examples from db • These give tables in recursive relationships with themselves or others • In terms of the book’s example, how do you deal with parenthood when there is a potentially unlimited sequence of ancestors?

In general, the answer is that you would need recursive rules • Mining recursive rules is a step beyond classification, association, etc. • The good news is that this topic will not be covered further • It’s simply of interest to know that such problems can arise
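To see what a recursive rule looks like, here is a hand-written one for "is an ancestor of". It is not mined from data; it only illustrates the shape of the rule a mining scheme would have to discover. The `parent_of` table and the function `is_ancestor` are hypothetical, and the sketch assumes the parent links contain no cycles.

```python
# Hypothetical parent links: person -> list of that person's parents.
parent_of = {
    "Ann": ["Pat", "Sam"],
    "Pat": ["Gus"],
    "Sam": [],
    "Gus": [],
}


def is_ancestor(candidate, person, parent_of):
    """candidate is an ancestor of person if candidate is a parent of person,
    or candidate is an ancestor of one of person's parents (the recursive case)."""
    for parent in parent_of.get(person, []):
        if parent == candidate or is_ancestor(candidate, parent, parent_of):
            return True
    return False
```

The rule refers to itself, so it covers any number of generations without needing one attribute column per generation.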

One-to-Many Relationships • A denormalized table might be the result of joining two tables in a pk-fk relationship • If the classification is on the “one” side of the relationship, then you have multiple instances in the table which are not independent • In data mining this is called a multi-instance situation

The multiple instances belonging to one classification together actually form one example of the concept under consideration in such a problem • Data mining algorithms have been developed to handle cases like these • They will be presented with the other algorithms later
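The idea that several rows together form one example can be sketched by grouping a joined table into "bags". The rows, the field names, and the function `to_bags` are hypothetical. This shows only the data arrangement, not any particular multi-instance learning algorithm.

```python
from collections import defaultdict

# Hypothetical joined rows: the label belongs to the customer (the "one" side),
# so it repeats on every row for that customer.
joined_rows = [
    {"customer_id": 1, "amount": 25.0, "label": "loyal"},
    {"customer_id": 1, "amount": 40.0, "label": "loyal"},
    {"customer_id": 2, "amount": 15.0, "label": "new"},
]


def to_bags(rows, key, label):
    """Group rows sharing the same key into one bag carrying a single label."""
    bags = defaultdict(lambda: {"label": None, "instances": []})
    for row in rows:
        bag = bags[row[key]]
        bag["label"] = row[label]  # the same value on every row of the bag
        bag["instances"].append(
            {k: v for k, v in row.items() if k not in (key, label)}
        )
    return dict(bags)


bags = to_bags(joined_rows, key="customer_id", label="label")
```

Each bag, not each row, is then one example of the concept.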

Summary of 2.2 • The fundamental practical idea here is that data sets have to be manipulated into a form that’s suitable for mining • This is the input side of data mining • The reality is that denormalized tables may be required • Data mining can facetiously be referred to as file mining since the required form does not necessarily agree with db theory

The situation can be restated in this way: • Assemble the query results first; then mine them • This leads to an open question: • Would it be possible to develop a data mining system that could encompass >1 table, crawling through the pk-fk relationships like a query, finding associations?

2.3 What’s in an Attribute?

This subsection falls into two parts: • 1. Some ideas that go back to db design and normalization questions • 2. Some ideas having to do with data type

Design and Normalization • You could include different kinds (subtypes) of entities in the same table • To make this work you would have to include all of the fields of all of the kinds of entities • The fields that didn’t apply to a particular instance would be null • The book uses transportation vehicles as an example: ships and trucks

You could also have fields in a table that depend on each other (ack) • The book gives married T/F and spouse’s name as examples • Again, you can handle this with null values

Data Types • The simplest distinction is numeric vs. categorical • Some synonyms for categorical: symbolic, nominal, enumerated, discrete • There are also two-valued variables known as Boolean or dichotomy

Spectrum of Data Types • 1. Nominal = unordered, unmeasurable named categories • Example: sunny, overcast, rainy • 2. Ordinal = named categories that can be put into a logical order but which have no intrinsic numeric value and no defined distance between them (support < or >) • Example: hot, mild, cool

3. Interval = numeric values where the distance between them makes sense (support subtraction) but other operations do not • Example: Time expressed in years

4. Ratio = numeric values where all operations make sense • These are real or continuous (or possibly integer) values on a scale with a natural 0 point • Example: Physical distance
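The four levels can be summarized by which operations are meaningful at each one. The sketch below is illustrative only; the names `MEANINGFUL_OPS`, `TEMPERATURE_ORDER`, `ordinal_less`, and `supports` are hypothetical.

```python
# Operations that make sense at each level; each level adds to the one before.
MEANINGFUL_OPS = {
    "nominal": {"equality"},
    "ordinal": {"equality", "ordering"},
    "interval": {"equality", "ordering", "difference"},
    "ratio": {"equality", "ordering", "difference", "ratio"},
}

# Ordinal values need an explicit order: plain string comparison would sort
# these alphabetically (cool, hot, mild), which is not the logical order.
TEMPERATURE_ORDER = ["cool", "mild", "hot"]


def ordinal_less(a, b, order=TEMPERATURE_ORDER):
    """Compare two ordinal values by their position in the declared order."""
    return order.index(a) < order.index(b)


def supports(level, operation):
    """True if the operation is meaningful for attributes at this level."""
    return operation in MEANINGFUL_OPS[level]
```

For example, `ordinal_less("mild", "hot")` is true, while `supports("interval", "ratio")` is false: the difference between two years is meaningful, but saying one year is "twice" another is not.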

In principle, data mining has to handle all possible types of data • In practice, applied systems typically have some useful subset of the type distinctions given above • You adapt your data to the types provided

2.4 Preparing the Input

In practice, preparing the data can take more time and effort than doing the mining • Data needs to be in the format required by whatever mining software you’re using • In Weka, this is ARFF = Attribute-Relation File Format
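An ARFF file is plain text: a relation name, one declaration per attribute, and then the data rows. The sketch below builds such a file as a string. The helper `to_arff` is hypothetical (it is not part of Weka), the rows are illustrative values only, and the sketch assumes no value needs quoting or is missing.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and data rows.

    A type is either the string "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append(f"@attribute {name} {{{', '.join(kind)}}}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


# Illustrative attributes and rows in the style of the weather example.
weather_attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("play", ["yes", "no"]),
]
weather_rows = [
    ["sunny", 85, "no"],
    ["overcast", 83, "yes"],
    ["rainy", 70, "yes"],
]
arff_text = to_arff("weather", weather_attributes, weather_rows)
```

Printing `arff_text` gives a file that starts with `@relation weather`, declares the three attributes, and lists one comma-separated instance per line after `@data`.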

Real data tends to be low in quality • Think data integrity and completeness • “Cleaning” the data before mining it pays off

Weka • From Wikipedia, the free encyclopedia

The Weka or woodhen (Gallirallus australis) is a flightless bird species of the rail family. It is endemic to New Zealand, where four subspecies are recognized. Weka are sturdy brown birds, about the size of a chicken. As omnivores, they feed mainly on invertebrates and fruit. Weka usually lay eggs between August and January; both sexes help to incubate.

Behaviour • … • Where the Weka is relatively common, their furtive curiosity leads them to search around houses and camps for food scraps, or anything unfamiliar and transportable.[2]
