
# An introduction to data mining -- Who should provide Cake for PGF?






### An introduction to data mining -- Who should provide Cake for PGF?

Peng Yin

MI 6

16/10/2011

• Overview of data mining

• Background

• Dataset and Tables

• What is data mining

• Decision tree

• Decision tree analysis

• Common Uses of Data Mining

• We received notice from a secret organization that an extremely dangerous group, named NCL-MS, is hiding in Newcastle in the north of England.

• It is a tiny subset of the PG students' personal secrets.

• Some important data were missing; I obtained them from CIS. Thanks to Robin Henderson.

• Used Attributes

• Attributes are either real-valued or symbol-valued (color-coded on the original slide).

• Successfully loaded the dataset with 10 attributes and 15 records.

• Well, we can look at histograms…

[Histograms of the Sex attribute (Female, Male) and the Group attribute (Pure, Applied, Stats)]

• A better name for a histogram: a one-dimensional contingency table.

• Recipe for making a k-dimensional contingency table:

• 1. Pick k attributes from your dataset. Call them a1, a2, …, ak.

• 2. For every possible combination of values a1=x1, a2=x2, …, ak=xk, record how frequently that combination occurs.

• Fun fact: A database person would call this a “k-dimensional datacube”
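The recipe above is only a few lines of code. Here is a minimal sketch with a made-up dataset; `contingency_table` is my own helper name, not a library function:

```python
# Build a k-dimensional contingency table ("datacube") by counting
# every combination of values for k chosen attributes.
# The records below are invented for illustration.
from collections import Counter

records = [
    {"year": "1st", "sex": "Female", "wealth": "Rich"},
    {"year": "1st", "sex": "Male",   "wealth": "Poor"},
    {"year": "2nd", "sex": "Male",   "wealth": "Rich"},
    {"year": "2nd", "sex": "Male",   "wealth": "Rich"},
    {"year": "3rd", "sex": "Female", "wealth": "Poor"},
]

def contingency_table(records, attributes):
    """Count how often each combination of attribute values occurs."""
    return Counter(tuple(r[a] for a in attributes) for r in records)

# A 2-dimensional table over (year, wealth):
table = contingency_table(records, ["year", "wealth"])
print(table[("2nd", "Rich")])  # 2
```

A k of 1 gives back an ordinary histogram; higher k gives the datacube a database person would recognize.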

For each pair of values for the attributes (year, wealth) we can see how many records match.

• It is easier to see “interesting” things if we stretch out the histogram bars.

• These are harder to look at!

[2-d contingency table: Year (1st–4th) × Wealth (Rich, Poor), split by Sex (Male, Female)]

• Software packages and database add-ons to do this are known as OLAP tools

• They usually include point and click navigation to view slices and aggregates of contingency tables

• They usually include nice histogram visualization
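As a toy version of the slice-and-aggregate operations these tools automate (the cube contents and the `roll_up` helper are made up for illustration), a 3-d datacube can be rolled up over one attribute to produce a 2-d contingency table:

```python
# A 3-d datacube: counts keyed by (year, sex, wealth) tuples.
from collections import Counter

cube = Counter({
    ("1st", "Male",   "Rich"): 3,
    ("1st", "Female", "Poor"): 2,
    ("2nd", "Male",   "Rich"): 1,
    ("2nd", "Male",   "Poor"): 4,
})

def roll_up(cube, drop_axis):
    """Aggregate out one axis, e.g. drop sex to get a (year, wealth) table."""
    rolled = Counter()
    for key, count in cube.items():
        rolled[key[:drop_axis] + key[drop_axis + 1:]] += count
    return rolled

year_by_wealth = roll_up(cube, 1)  # aggregate out sex (axis 1)
print(year_by_wealth[("1st", "Rich")])  # 3
```

An OLAP tool wraps exactly this kind of aggregation in point-and-click navigation.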

• Why would people want to look at contingency tables?

• With 10 attributes, how many 1-d contingency tables are there?

• How many 2-d contingency tables?

• How many 3-d tables?

• With 100 attributes, how many 3-d tables are there?

• With 10 attributes, how many 1-d contingency tables are there? 10

• How many 2-d contingency tables? 10 × 9 / 2 = 45

• How many 3-d tables? 120

• With 100 attributes, how many 3-d tables are there? 161,700
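These counts are just binomial coefficients: a k-d table is a choice of k of the m attributes, so there are C(m, k) of them. A quick check:

```python
# Number of k-dimensional contingency tables over m attributes
# is "m choose k".
from math import comb

print(comb(10, 1))   # 10 one-dimensional tables
print(comb(10, 2))   # 45
print(comb(10, 3))   # 120
print(comb(100, 3))  # 161700
```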

• Looking at one contingency table: can be as much fun as reading an interesting book.

• Looking at ten tables: as much fun as watching BBC One.

• Looking at 100 tables: as much fun as watching an infomercial.

• Looking at 100,000 tables: as much fun as a three-week November vacation in Sunderland with a dying weasel.

• Data Mining is all about automating the process of searching for patterns in the data.

• Which patterns are interesting?

• Which might be mere illusions?

• And how can they be exploited?

That is what we’ll look at now.

And the answer will turn out to be PGF cake with decision tree learning.

• Information Gain for measuring association between inputs and outputs

• Learning a decision tree classifier from data

• Simplification and automation of the overall statistical process, from data source(s) to model application

• Changed over the years

• Replace statistician → better models, less grunge work

• 1 + 1 = 0

• Many different data mining algorithms / tools available

• Statistical expertise required to compare different techniques

• Build intelligence into the software

• Decision Trees

• Nearest Neighbour Classification

• Neural Networks

• Rule Induction

• K-means Clustering

• A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.

• To decide which attribute should be tested first, simply find the one with the highest information gain.

• Then recurse…
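The split criterion can be sketched directly: information gain is the entropy of the output minus the expected entropy after splitting on an attribute. The toy records below are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of output values, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, output):
    """Output entropy minus expected output entropy after splitting on attr."""
    labels = [r[output] for r in records]
    remainder = 0.0
    for value in set(r[attr] for r in records):
        subset = [r[output] for r in records if r[attr] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return entropy(labels) - remainder

records = [
    {"year": "1st", "wealth": "Poor"},
    {"year": "1st", "wealth": "Poor"},
    {"year": "4th", "wealth": "Rich"},
    {"year": "4th", "wealth": "Rich"},
]
# "year" perfectly predicts "wealth", so the gain equals the
# full output entropy of 1 bit:
print(information_gain(records, "year", "wealth"))  # 1.0
```

Picking the attribute to test first is then just an argmax of this quantity over the candidate attributes.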

[Decision tree splitting on year: separate branches of records for 1st-, 2nd-, 3rd- and 4th-year students, each leaf predicting “rich”]

• Don’t split a node if all matching records have the same output value.

• Don’t split a node if none of the attributes can create multiple non-empty children.

• Base Case One: If all records in the current data subset have the same output, then don’t recurse.

• Base Case Two: If all records have exactly the same set of input attributes, then don’t recurse.

Basic Decision Tree Building, Summarized

• Build Tree (Dataset, Output)

• If all output values are the same in Dataset, return a leaf node that says “predict this unique output”

• If all input values are the same, return a leaf node that says “predict the majority output”

• Else find attribute X with highest Info Gain

• Suppose X has nX distinct values (i.e. X has arity nX).

• Create and return a non-leaf node with nX children.

• The i’th child should be built by calling

Build Tree (DSi, Output)

• where DSi consists of all those records in Dataset for which X = the i’th distinct value of X.
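The Build Tree pseudocode above maps to a short recursive function. This is a sketch of the ID3-style algorithm the slides describe; the function names and tuple-based tree representation are my own choices:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of output values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, output):
    """Reduction in output entropy from splitting on attr."""
    labels = [r[output] for r in records]
    gain = entropy(labels)
    for v in set(r[attr] for r in records):
        sub = [r[output] for r in records if r[attr] == v]
        gain -= len(sub) / len(records) * entropy(sub)
    return gain

def build_tree(records, attrs, output):
    labels = [r[output] for r in records]
    # Base case one: every record has the same output -> predict it.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case two: no attribute can create multiple non-empty
    # children -> predict the majority output.
    candidates = [a for a in attrs if len(set(r[a] for r in records)) > 1]
    if not candidates:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise split on the attribute X with highest info gain,
    # building one child per distinct value of X (DSi = records
    # with X equal to that value).
    best = max(candidates, key=lambda a: info_gain(records, a, output))
    return (best, {
        v: build_tree([r for r in records if r[best] == v], attrs, output)
        for v in set(r[best] for r in records)
    })

def predict(tree, record):
    """Walk the tree: internal nodes are (attribute, children) pairs."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[record[attr]]
    return tree
```

On the toy year/wealth data from earlier slides, `build_tree` splits on year first (it has the highest gain) and each branch immediately hits base case one.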

• Data warehousing

• SQL / Ad Hoc Queries / Reporting

• Software Agents

• Online Analytical Processing (OLAP)

• Data Visualization

• Direct mail marketing

• Web site personalization

• Credit card fraud detection

• Gas & jewelry

• Bioinformatics

• Text analysis

• SAS lie detector

• Market basket analysis

• Beer & baby diapers

The information gains…

• Andrew Moore

• http://www.autonlab.org/tutorials/

• Doug Alexander

• http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/

• Information gain

• http://en.wikipedia.org/wiki/Information_gain_in_decision_trees