C4.5 Demo
C4.5 Demo

Andrew Rosenberg

CS4701 11/30/04

What is c4.5?
  • c4.5 is a program that creates a decision tree based on a set of labeled input data.
  • This decision tree can then be tested against unseen labeled test data to quantify how well it generalizes.
Running c4.5
  • On cunix.columbia.edu
    • ~amr2104/c4.5/bin/c4.5 -u -f filestem
  • On cluster.cs.columbia.edu
    • ~amaxwell/c4.5/bin/c4.5 -u -f filestem
  • c4.5 expects to find 3 files
    • filestem.names
    • filestem.data
    • filestem.test
File Format: .names
  • The file begins with a comma-separated list of classes ending with a period, followed by a blank line.
    • E.g., >50K, <=50K.
  • Each remaining line describes one attribute in the following format (note the end-of-line period):
    • attribute: {ignore, discrete n, continuous, list of values}.
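
The format above can be sketched in a few lines of Python. This is only an illustration, not part of c4.5: the helper name format_names and its argument layout are assumptions made for this example.

```python
def format_names(classes, attributes):
    """Return the text of a .names file (hypothetical helper, not part of c4.5).

    classes:    list of class labels, e.g. [">50K", "<=50K"]
    attributes: list of (name, spec) pairs; spec is a string such as
                "continuous" or "ignore", or a list of allowed values.
    """
    # Class line ends with a period and is followed by a blank line.
    lines = [", ".join(classes) + ".", ""]
    for name, spec in attributes:
        if isinstance(spec, (list, tuple)):
            spec = ", ".join(spec)
        # Every attribute line also ends with a period.
        lines.append(f"{name}: {spec}.")
    return "\n".join(lines) + "\n"


example = format_names(
    [">50K", "<=50K"],
    [("age", "continuous"), ("sex", ["Female", "Male"])],
)
print(example, end="")
```

Writing the returned string to filestem.names gives c4.5 the class list and attribute definitions it expects.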
Example: census.names

>50K, <=50K.

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, etc.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, etc.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, etc.

occupation: Tech-support, Craft-repair, Other-service, Sales, etc.

relationship: Wife, Own-child, Husband, Not-in-family, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, etc.

File Format: .data, .test
  • Each line in these data files is a comma-separated list of attribute values, followed by a class label and a period.
    • The attributes must be in the same order as described in the .names file.
    • Unavailable values can be entered as ‘?’.
  • When creating test sets, make sure the test data points are removed from the training data.
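
As a rough sketch of the two points above, the Python below parses one record and drops held-out rows from the training lines. The function names parse_record and drop_test_rows are hypothetical, and matching rows by exact text is an assumption: it also drops any training row that happens to be identical to a test row.

```python
def parse_record(line):
    """Split one .data/.test line into (attribute_values, class_label)."""
    text = line.strip()
    if text.endswith("."):
        text = text[:-1]  # drop the end-of-line period
    fields = [field.strip() for field in text.split(",")]
    # '?' marks an unavailable value; represent it as None.
    values = [None if field == "?" else field for field in fields[:-1]]
    return values, fields[-1]


def drop_test_rows(train_lines, test_lines):
    """Return the training lines that do not also appear in the test set."""
    held_out = {line.strip() for line in test_lines}
    return [line for line in train_lines if line.strip() not in held_out]


record = ("18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, "
          "White, Female, 0, 0, 30, United-States, <=50K.")
values, label = parse_record(record)
print(label, values[1])
```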
Example: adult.test

25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.

38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.

28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.

44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K.

18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.

34, Private, 198693, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.

29, ?, 227026, HS-grad, 9, Never-married, ?, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K.

63, Self-emp-not-inc, 104626, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 32, United-States, >50K.

24, Private, 369667, Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K.

55, Private, 104996, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K.

65, Private, 184454, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 6418, 0, 40, United-States, >50K.

36, Federal-gov, 212465, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K.

c4.5 Output
  • The decision tree proper.
    • Each leaf is followed by (weighted training examples/weighted training errors).
  • Tables of training error and testing error.
  • Confusion matrix.
  • You’ll want to redirect the output of c4.5 to a text file for later viewing.
    • E.g., c4.5 -u -f filestem > filestem.results
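
When c4.5 is driven from a script, the same redirection might be done as in the Python sketch below. The helper names c45_command and run_c45 are hypothetical, and the sketch assumes a c4.5 binary is on the PATH (or is passed in through binary); only the -u and -f flags shown on the slide are used.

```python
import subprocess


def c45_command(filestem, binary="c4.5"):
    """Build the command line from the slide: c4.5 -u -f filestem."""
    return [binary, "-u", "-f", filestem]


def run_c45(filestem, binary="c4.5"):
    """Run c4.5 and save its standard output to filestem.results."""
    results_path = filestem + ".results"
    with open(results_path, "w") as out:
        # check=True raises CalledProcessError if c4.5 exits with an error.
        subprocess.run(c45_command(filestem, binary), stdout=out, check=True)
    return results_path


print(" ".join(c45_command("filestem")))
```

For example, run_c45("census", binary="/path/to/c4.5") would write census.results next to the input files, where the binary path is a placeholder.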
Example output

capital-gain > 6849 : >50K (203.0/6.2)

| capital-gain <= 6849 :

| | capital-gain > 6514 : <=50K (7.0/1.3)

| | capital-gain <= 6514 :

| | | marital-status = Married-civ-spouse: >50K (18.0/1.3)

| | | marital-status = Divorced: <=50K (2.0/1.0)

| | | marital-status = Never-married: >50K (0.0)

| | | marital-status = Separated: >50K (0.0)

| | | marital-status = Widowed: >50K (0.0)

| | | marital-status = Married-spouse-absent: >50K (0.0)

| | | marital-status = Married-AF-spouse: >50K (0.0)

Tree saved

Evaluation on training data (4660 items):

     Before Pruning           After Pruning
    ----------------   ---------------------------
    Size      Errors   Size      Errors   Estimate

    1692  366( 7.9%)     92  659(14.1%)    (16.0%)   <<

Evaluation on test data (2376 items):

     Before Pruning           After Pruning
    ----------------   ---------------------------
    Size      Errors   Size      Errors   Estimate

    1692  421(17.7%)     92  354(14.9%)    (16.0%)   <<

      (a)  (b)    <-classified as
     ---- ----
      328  251    (a): class >50K
      103 1694    (b): class <=50K
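
The test error in the table can be checked against the confusion matrix: the off-diagonal entries are the misclassified items. A small Python check, with the matrix values copied from the example output (rows are true classes, columns are the classes assigned by the tree):

```python
# Rows: true class (a) >50K, (b) <=50K; columns: class assigned by the tree.
confusion = [
    [328, 251],
    [103, 1694],
]


def error_rate(matrix):
    """Fraction of items that fall off the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


errors = confusion[0][1] + confusion[1][0]
rate = error_rate(confusion)
print(errors, f"{rate:.1%}")  # 354 14.9%
```

This matches the 354 (14.9%) errors reported after pruning on the 2376 test items.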

k-fold Cross Validation
  • Start with one large data set.
  • Using a script, randomly divide this data set into k sets.
  • At each iteration, use k-1 sets to train the decision tree, and the remaining set to test the model.
  • Repeat this k times and take the average testing error.
  • The average error estimates how well the learning algorithm generalizes on this data set.
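
One way to write the splitting script is sketched below in Python. The names k_fold_splits and average_error are hypothetical, the fixed seed is an assumption made for reproducibility, and the per-fold error values in the last line are placeholders, not results.

```python
import random


def k_fold_splits(records, k, seed=0):
    """Yield (train, test) pairs for k-fold cross validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # random division of the data
    # Deal the shuffled records into k folds of nearly equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Train on the other k-1 folds.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


def average_error(fold_errors):
    """Mean of the per-fold test error rates."""
    return sum(fold_errors) / len(fold_errors)


splits = list(k_fold_splits(range(10), k=5))
print(len(splits), average_error([0.10, 0.20, 0.30]))
```

In practice each train list would be written to filestem.data and each test list to filestem.test before running c4.5, and the test error from each run would be collected for the average.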