1. Extending SAS/Enterprise Miner: a user's experience
2. Outline Simulated and public domain data
Portable SAS Code Nodes
problem independent
Still in development
not as robust as professional nodes
One example works outside EM, accessing the information stored within EM
3. Dictionary
4. Score vs Training data We often want to make a prediction in a Score dataset, where we do not know the outcome, based on a model derived from a Training dataset, where we do.
It is important that the Training data represent the Score data.
An exact match is often impossible, so we should understand any differences.
5. Simulated and public domain data Simulated
Target = x + e(0,s)
Target = x + x^2 + e(0,s)
Target = x + e(0, x*s)
Boston Housing data
http://lib.stat.cmu.edu/datasets/
relatively small
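The three simulated targets listed above could be generated along these lines. This is a minimal sketch, not the presenter's code: it assumes e(0,s) means normal noise with mean 0 and standard deviation s, that the third case scales the standard deviation by x, and the dataset name, variable names, seed, sample size and value of s are all made up for illustration.

```sas
/* Hypothetical simulation of the three targets; all names are illustrative */
data sim;
   call streaminit(2024);                       /* fixed seed so the run is repeatable */
   s = 0.5;                                     /* assumed noise standard deviation */
   do i = 1 to 1000;
      x = 4 * rand('uniform');                  /* x is strictly between 0 and 4 */
      t_linear = x + rand('normal', 0, s);          /* Target = x + e(0,s) */
      t_quad   = x + x**2 + rand('normal', 0, s);   /* Target = x + x^2 + e(0,s) */
      t_hetero = x + rand('normal', 0, x * s);      /* Target = x + e(0, x*s) */
      output;
   end;
   drop i s;
run;
```

Because the true relationship is known for each target, a model fitted to these columns can be checked against what it ought to find.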
6. 1. Comparing distributions Visual comparison of distributions on Training and Score data
SAS code node problem independent
Portable
To avoid the problem of a model that does not generalise
Incorrect assignment of variables
interval when it should be nominal is the most common case
StatExplore node can do some of the tasks described here, but not as well.
7. Outline of code Sample to min(10,000, size of smallest dataset)
For interval variables plot a cumulative distribution plot
Proc Sort, Datastep, Proc GPLOT
For binary/nominal/ordinal use ANNOTATE to plot dot graphs
if necessary restricting to the 10 most common levels
check for levels absent from one of the sources
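For the interval-variable step in the outline above (Proc Sort, Datastep, Proc GPLOT), a cumulative distribution comparison might look like the sketch below. The two input tables are simulated stand-ins, and every dataset and variable name is an assumption rather than the presenter's actual code. The sampling step is omitted for brevity.

```sas
/* Simulated stand-ins for the Training and Score tables (hypothetical) */
data train score;
   call streaminit(101);
   do i = 1 to 500;
      x = rand('normal', 0, 1);   output train;
      x = rand('normal', 0.5, 1); output score;   /* shifted on purpose */
   end;
   drop i;
run;

/* Stack the two sources and flag where each row came from */
data both;
   set train(in=inTrain) score;
   length source $8;
   source = ifc(inTrain, 'Training', 'Score');
   if not missing(x);
run;

proc sort data=both;
   by source x;
run;

/* Number of rows per source, needed to turn ranks into proportions */
proc sql;
   create table counts as
   select source, count(*) as n
   from both
   group by source
   order by source;
quit;

/* Running rank within each source divided by its row count */
data cdf;
   merge both counts;
   by source;
   if first.source then k = 0;
   k + 1;
   cumprop = k / n;
run;

/* One step-function line per source */
symbol1 interpol=stepj value=none color=blue;
symbol2 interpol=stepj value=none color=red;
proc gplot data=cdf;
   plot cumprop*x=source;
run;
quit;
```

If the two curves lie close together, the variable is distributed similarly in both sources. A visible horizontal gap, as in this deliberately shifted example, suggests the Training data may not represent the Score data for that variable.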
12. 2. Examining residuals from models SAS code node
Current code is restricted to models with an Interval target variable, otherwise portable
7 outputs
residual squared, and smoothed [KDE]
13. 2. Examining residuals from models, continued Density of residuals, [KDE and UNIVARIATE]
/* Histogram of the residuals with a fitted normal curve */
proc univariate data=residuals;
   var %em_residual;
   histogram %em_residual / normal name='plot4b';
   title2 "using PROC UNIVARIATE";
run;
Residual vs Predicted, and smoothed
PROCs KDE and GCONTOUR
Boxplot of residuals
PROCs Rank, Sort, Boxplot
14. Statistics 101 refresher In regression models, patterns in the plot of residuals (predicted - actual) versus predicted values can indicate deficiencies in the model.
Classic examples:
Not having a quadratic term when the data requires it gives a parabolic plot
If there is heteroskedasticity the plot of residuals will have a fan shape
See, for example, Wikipedia
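The first classic example can be reproduced with a small simulation. The sketch below is illustrative only: the data, seed and names are invented. It fits a straight line to data that are truly quadratic and plots the residuals against the predicted values. Note that the R= residual from PROC REG is observed minus predicted.

```sas
/* Simulated data whose true model is quadratic (hypothetical names) */
data sim;
   call streaminit(7);
   do i = 1 to 500;
      x = 4 * rand('uniform');
      y = x + x**2 + rand('normal', 0, 0.5);
      output;
   end;
   drop i;
run;

/* Deliberately omit the quadratic term from the model */
proc reg data=sim noprint;
   model y = x;
   output out=res p=pred r=resid;
run;
quit;

/* Residual versus predicted, with a reference line at zero */
symbol1 interpol=none value=dot;
proc gplot data=res;
   plot resid*pred / vref=0;
run;
quit;
```

The residuals should trace a parabola around the zero line. Adding x*x as a second regressor should remove the pattern.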
24. 3. Distributions of scored values Post modelling test
Given the model, and distribution of values in the score data, what is the distribution of the scored values?
27. 4. Delving into innards of SAS/EM Extracting the significant variables determined from a Variable Selection node
Multiple models
Alternative to a lot of copying and pasting
Use EMWS<n>.Varsel_effectds or Varsel_importnc
R-squared or relative importance
Added traffic lighting for extra information
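When working outside EM, a first step might be to point a library at the workspace folder and inspect one of the tables named above. In this sketch the project path is a placeholder, and the assumption that the workspace sits under a Workspaces folder of the project should be checked against your own installation. The column layout of the table is not assumed here, which is why the code only lists and prints it.

```sas
/* Hypothetical project location; replace with your own EM workspace folder */
libname emws1 "C:\EMProjects\MyProject\Workspaces\EMWS1" access=readonly;

/* List the columns before relying on any of them */
proc contents data=emws1.varsel_effectds;
run;

/* Look at the first few rows */
proc print data=emws1.varsel_effectds(obs=20);
run;
```

Opening the library read-only avoids accidentally altering tables that EM itself depends on.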
29. Issues Order of graphs
1, 10, 11, ..., 19, 2, 21, etc. (text order rather than numeric order)
Lack of an error message does not imply an error-free run
30. A quote All models are wrong but some models are useful
from a chapter of a 1979 book, by George E. P. Box, industrial statistician
also attributed to W. Edwards Deming
31. An ideal model Compare distributions in Training and Score
checking variables correctly assigned
Compare again post Variable Selection
Examine residuals from each model
Examine distribution of scored values
33. Acknowledgments Staff in the Tax Office
SAS Technical Support
some Code Node features not fully documented
how to re-run a node when nothing other than the code has changed (in development mode)
toggle Use Priors [Property Sheet -> Advanced]
how to compare distributions after a Variable Selection node
there are still issues with the width of code.