- 57 Views
- Uploaded on
- Presentation posted in: General

Data Mining Chapter 2 Input: Concepts, Instances, and Attributes

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Data MiningChapter 2Input: Concepts, Instances, and Attributes

Kirk Scott

- Hopefully the idea of instances and attributes is clear
- Assuming there is something in the data to be mined, either this is the concept, or the concept is inherent in this
- Earlier data mining was defined as finding a structural representation
- Essentially the same idea is now expressed as finding a concept description

- The concept description needs to be:
- Intelligible
- It can be understood, discussed, disputed

- Operation
- It can be applied to actual examples

- Classification
- Prediction

- Clustering
- Outliers

- Association
- Each of these is a concept
- Successful accomplishment of these for a data set is a concept description

- Weather, contact lenses, iris, labor contracts
- All were essentially classification problems
- In general, the assumption is that classes are mutually exclusive
- In complicated problems, data sets may be classified in multiple ways
- This means individual instances can be “multilabeld”

- Classification learning is supervised
- There is a training set
- A structural representation is derived by examining a set of instances where the classification is known
- How to test this?
- Apply the results to another data set with known classifications

- In any given data set there can be many association rules
- The total may approach n(n – 1) / 2 for n attributes
- The book doesn’t use the terms support and confidence, but it discusses these concepts
- These terms will be introduced

- Let an association rule X = (x1, x2, …, xi)y be given in a data set with m instances
- The support for Xy is the count of the number of instances where the combination of x values, X, occurs in the data set, divided by m
- In other words, the association rule may be interesting if it occurs frequently enough

- Confidence here is based on the statistical use of the term
- The confidence for Xy is the count of the number of occurrences in the data set where this relationship holds true divided by the number of occurrences of X overall
- The book describes this idea as accuracy
- In other words, the association is interesting the more likely it is that X does determine y

- We haven’t gotten the details yet, but this is an interesting data mining problem
- Given a data set without predefined classes, is it possible to determine classes that the instances fall into?
- Having determined the classes, can you then classify future instances into them?
- Outliers are instances that you can definitely say do not fall into any of the classes

- This is a variation on classification
- Given n attribute values, determine the (n + 1)st attributed value
- Recall the CPU performance problem
- It would be a simple matter to dream up sample data where the weather data predicted how long you would play rather than a simple yes or no
- (The book does so)

- The authors are trying to present some important ideas
- In case their presentation isn’t clear, I present it here in a slightly different way
- The basic premise goes back to this question:
- What form does a data set have to be in in order to apply data mining techniques to it?

- The simple answer based on the examples presented so far:
- The data has to be in tabular form, instances with attributes
- The remainder of the discussion will revolve around questions related to normalization in db

- Some data is not most naturally represented in tabular form
- Consider OO db’s, where the natural representation is tree-like
- How should such a representation be converted to tabular form that is amenable to data mining?

- You might also have data which naturally falls into >1 table
- Or, you might have data (god forbid) that has been normalized into >1 table
- How do you make it conform to the single table model (instances with attributes) for data mining?

- Tree-like data and multi-table data may be related questions
- It would not be surprising to find that a conversion of a tree to a table resulted in >1 table

- The situation goes against the grain of correct database design
- The classification, association, and clustering you intend to do may cross db entity boundaries
- The fact that you want to do mining on a single tabular representation of the data means you have to denormalize

- In short, you combine multiple tables back into one table
- The end result is the monstrosity that is railed against in normalization theory:
- The monolithic, one-table db

- Family relationships are typically viewed in tree-like form
- The book considers a family tree and the relationship “is a sister of”
- The factors for inferring sisterhood:
- Two people, one female
- The same (or at least one common) parents for both people

- Suppose you want to do this in tabular form
- You end up with the two people who might be in a sisterhood relationship in the same table
- These occurrences of people are matched with a classification, yes or no

- Recall that according to normalization, a truly one-to-one relationship can be stored in a single table
- Pairings of all people would result in lots of instances/rows where the classification was simply no
- This isn’t too convenient

- In theory, you might restrict your attention only to those rows where the classification was yes
- This restriction is known as the “closed world assumption” in data mining
- Unfortunately, it is hardly ever the case that you have a problem where this kind of simplifying assumption applies
- You have to deal with all cases

- Suppose the two people are only listed by name in the table, without parent information
- The classification might be correct, but this is of no help
- There are no attributes to infer sisterhood from
- The table has to include attributes about the two people, namely parent information

- There is a problem with denormalized data mining which is completely analogous to the normalization problem
- Suppose you have two people in the same instance (the same row) with their attributes
- By definition, you will have stray dependencies
- The Person identifiers determine the attributes values

- So far we’ve considered classification
- However, what would happen if you mined for associations?
- The algorithm would find the perfectly true, but already known associations between the pk identifiers of the people and their attribute fields
- This is not helpful
- It’s a waste of effort

- Recall the monarch and product-assembly examples from db
- These give tables in recursive relationships with themselves or others
- In terms of the book’s example, how do you deal with parenthood when there is a potentially unlimited sequence of ancestors?

- In general, the answer is that you would need recursive rules
- Mining recursive rules is a step beyond classification, association, etc.
- The good news is that this topic will not be covered further
- It’s simply of interest to know that such problems can arise

- A denormalized table might be the result joining two tables in a pk-fk relationship
- If the classification is on the “one” side of the relationship, then you have multiple instances in the table which are not independent
- In data mining this is called a multi-instance situation

- The multiple instances belonging to one classification together actually form one example of the concept under consideration in such a problem
- Data mining algorithms have been developed to handle cases like these
- They will be presented with the other algorithms later

- The fundamental practical idea here is that data sets have to be manipulated into a form that’s suitable for mining
- This is the input side of data mining
- The reality is that denormalized tables may be required
- Data mining can be facetiously be referred to as file mining since the required form does not necessarily agree with db theory

- The situation can be restated in this way:
- Assemble the query results first; then mine them
- This leads to an open question:
- Would it be possible to develop a data mining system that could encompass >1 table, crawling through the pk-fk relationships like a query, finding assocations?

- This subsection falls into two parts:
- 1. Some ideas that go back to db design and normalization questions
- 2. Some ideas having to do with data type

- You could include different kinds (subtypes) of entities in the same table
- To make this work you would have to include all of the fields of all of the kinds of entities
- The fields that didn’t apply to a particular instance would be null
- The book uses transportation vehicles as an example: ships and trucks

- You could also have fields in a table that depend on each other (ack)
- The book gives married T/F and spouse’s name as examples
- Again, you can handle this with null values

- The simplest distinction is numeric vs. categorical
- Some synonyms for categorical: symbolic, nominal, enumerated, discrete
- There are also two-valued variables known as Boolean or dichotomy

- 1. Nominal = unordered, unmeasurable named categories
- Example: sunny, overcast, rainy
- 2. Ordinal = named categories that can be put into a logical order but which have no intrinsic numeric value and no defined distance between them (support < or >)
- Example: hot, mild, cool

- 3. Interval = numeric values where the distance between them makes sense (support subtraction) but other operations do not
- Example: Time expressed in years

- 4. Ratio = numeric values where all operations make sense
- These are real or continuous (or possibly integer) values on a scale with a natural 0 point
- Example: Physical distance

- In principle, data mining has to handle all possible types of data
- In practice, applied systems typically have some useful subset of the type distinctions given above
- You adapt your data to the types provided

- In practice, preparing the data can take more time and effort than doing the mining
- Data needs to be in the format required by whatever mining software you’re using
- In Weka, this is ARFF = attribute relation file format

- Real data tends to be low in quality
- Think data integrity and completeness
- “Cleaning” the data before mining it pays off

- Weka
- From Wikipedia, the free encyclopedia
- Jump to: navigation, search
- For other uses, see Weka (disambiguation).

- The Weka or woodhen (Gallirallusaustralis) is a flightless bird species of the railfamily. It is endemic to New Zealand, where four subspecies are recognized. Weka are sturdy brown birds, about the size of a chicken. As omnivores, they feed mainly on invertebrates and fruit. Weka usually lay eggs between August and January; both sexes help to incubate.

- Behaviour
- …
- Where the Weka is relatively common, their furtive curiosity leads them to search around houses and camps for food scraps, or anything unfamiliar and transportable.[2]

- In a large organization, different departments may manage their own data
- Global level data mining will require integration of data from multiple databases
- If you’re lucky, the organization has already created a unified archive, a data warehouse
- Interesting mining may also require integrating external data into the data set

- It may be necessary to aggregate data in order to mine it successfully
- You may have data on parameters of interest spread through multiple instances
- To be useful to problem solution, it may be necessary to add the values of data points together, for example

- The type of aggregation is important
- Remember the aggregation operators in db: COUNT, SUM, AVERAGE, etc.
- The level of aggregation is important
- Remember GROUP BY in db
- Do you aggregate all instances, or is it useful to do it by subsets of some sort?

- ARFF is the regular version of the data format for Weka
- XRFF is the XML version
- In ARFF:
- % marks a comment
- @ marks file descriptor information, relation, attributes, and data

- For a categorical attribute, the set of values is given in braces
- Categorical values containing spaces have to be put in quotation marks
- Numeric attributes are simply identified as numeric

- Instances in the file are given line by line
- They are separate by new lines
- Attributes values in instances are separated by commas
- Missing values (nulls) are indicated with a question mark

- In an ARFF file, a classification attribute, if there is one, is treated no differently than any others
- The format is equally suited to classification, association, or cluster mining
- Figure 2.2, on the following overhead, shows the weather data set in ARFF

- String = the moral equivalent of VARCHAR in db
- Date = the equivalent of DATE in db
- Relational = Stay tuned; this will require some explanation

- The book gives an example which is OK, but it’s not necessarily presented in the clearest way possible
- My plan is to first give a bunch of explanatory background
- Then explain the book’s example in a slightly different order than it does

- Recall that multivalued problems can be viewed as mining the result of a 1-m join
- In preparing a data set for mining, this is what a relational-valued attribute is:
- It is an attribute that can contain or consist of multiple instances of the same kind of set of values, where these sets belong together for some reason

- In a 1-m, pk-fk join, the multiple sets are the rows of the many table which belong together because they share the same fk value
- In case this general overview isn’t clear, the idea can be illustrated with mothers and children

- Suppose you ran this query:
- SELECT *
- FROM Mother, Child
- WHERE Mother.motherid = Child.motherid
- GROUP BY motherid
- Children of the same mother would be grouped together

- In data mining, it is possible that you would want to elicit information about children in general
- You might also want to elicit information that generally held for children of the same mother
- Conceptually, you would be mining information about siblinghood

- This is where relational-valued attributes come in
- From a relational point of view, the representation is wrong
- First normal form says you have flat files with no repeating groups
- But for data mining purposes, in ARFF format, you want the repeating groups

- The weather adapts the weather/play a game data to a multivalued example
- The new twist is this: Games extend over 2 days, not just one
- Each day is still a single instance
- But for each game, there are two of these instances which belong together

- The instances in the rows of the table representing game information will be multivalued
- Each game will contain two days’ worth of weather data
- These two days’ worth of data are the contents of one relational-valued attribute in the overall data set

- Note that in general, relational attributes, multivalued attributes, are not limited to 2 sets of values
- This is just an artifact of their example, where games last exactly 2 days

- The book also uses terminology which could be clearer
- In their new weather table they name the relational attribute “bag”
- Suffice it to say that it would have been clearer if they had named the attribute game_days or something like that

- There are three major attributes in the book’s table:
- bag_ID (id for sets of days belonging to games)
- bag (multivalued relational attribute containing days belonging to games, grouped by game)
- play (the classification to play, yes or no)

- The bag attribute has 4 (familiar) attributes describing the multivalued instances (of day):
- outlook
- temperature
- humidity
- windy

- In the body of the ARFF table, the multivalued entries are structured in this way:
- The data for the multiple days that belong together for a single bag_ID is enclosed in quotation marks
- Within the quotation marks, the individual sets of day data are separated by “\n”, the new line character

- The book’s ARFF table is shown in Figure 2.3 on the following overhead

- Some data sets are sparse
- In this context the book means 0’s for numerical values, not nulls
- Rather than listing everything, a row can be economically expressed by showing only the values present

- In ARFF, the attributes for a row are:
- identified by number starting with 0
- Followed by the value
- Separated by commas
- Enclosed in braces
- E.g.:
- {1 X, 6 Y, 10 “Class A”}
- This doesn’t work for nulls; you still have to include ?’s

- The bottom line is that ARFF only has two fundamental types: nominal and numeric
- String attributes are effectively nominal
- Date attributes are effectively numeric
- (Recall the discussions of stuff like this in db)
- The rest of this subsection has to do with numeric types in particular

- The important point is this:
- Different data mining algorithms treat vanilla numeric values differently
- One algorithm may treat numerics as ordinals, where subtraction applies, generating rules based on <, =, > comparisons

- Another algorithm may treat numerics as ratio values
- Recall that all arithmetic operations are defined in this case
- The algorithm may normalize ratio values

- Normalization means putting values into a range, most commonly the range 01
- A simple approach for positive values: Divide any given data value by the maximum present
- Another simple approach for positive values: Subtract the minimum from the data value and divide by (max – min)

- Values can also be statistically standardized
- Each data point is converted using this approach:
- xstandardized = (x – μ) / σ
- This puts the values into a distribution where the mean is 0 and the standard deviation is 1

- Consider the calculation of distance in n dimensional space, 2-space for example
- Calculating the square root of the sum of the squares of the differences of the coordinates involves using arithmetic operators other than subtraction
- Normalization is implicated in a situation like this

- Given some (x, y) space, suppose x is in the range 010 and y is in the range 0100
- Do you normalize both x and y before calculating distances or not?
- Another way of stating this is, do x and y make corresponding contributions to the measure of distance between two data points or not?

- This is a crude measure of distance for nominal attributes:
- If two instances have the same value for that attribute, the distance between them, measure on that attribute is 0
- If two instances have a different value for an attribute, the distance between them is 1

- There are cases where nominal attributes can be reverse engineered back to numerics
- One example from the book: Zip codes and geographic location coordinates
- Recall that zip codes came up in db in a similar way as determinants of geographic locations

- Just like in db the assertion is made that an id “number” field should be TEXT—
- In data mining there may be attributes containing numeric digits which are simply nominal fields and should be mined as such

- Finally, some algorithms support nominals but not ordinals
- In the contact lens data, young < pre-presbyopic < presbyopic
- If their ordinal relationships is not recognized, a complete and correct set of rules can still be mined
- However a complete and correct set of rules about 1/3 as large can be mined in a system that recognizes the relationship

- This is essentially a discussion of nulls
- The only new element consists of two questions:
- Can you infer anything from the absence of values?
- Would it be possible to meaningfully code why values are absent and mine something from this?

- This is essentially a discussion of data integrity
- Both data miners and regular db users have to cope with faulty data one way or the other
- The authors say this is especially important when mining
- It’s especially important to the data miner if the data miner ascribes more significance to an attribute than a regular user does