Extending SAS

1. Extending SAS/Enterprise Miner: a user's experience

2. Outline
   - Simulated and public domain data
   - Portable SAS Code Nodes: problem independent
   - Still in development: not as robust as professional nodes
   - One example works outside of EM, accessing the information stored within EM

3. Dictionary

4. Score vs Training data
   - We often want to make a prediction in a Score dataset, where we do not know the outcome, based on a model derived from a Training dataset, where we do.
   - It is important that the Training data represent the Score data.
   - Exact agreement is often impossible, so we should understand where the two differ.

5. Simulated and public domain data
   - Simulated, where e(0, s) is a random error with mean 0 and scale s:
       Target = x + e(0, s)
       Target = x + x^2 + e(0, s)
       Target = x + e(0, x*s)
   - Boston Housing data, http://lib.stat.cmu.edu/datasets/ (relatively small)
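The three simulated targets on this slide could be generated along the following lines. This is only a sketch: the slide does not give the error distribution, the range of x, or the sample size, so the normal errors (with s treated as a standard deviation), the uniform x on (0, 10), the 1,000 rows, the seed, and the dataset and variable names are all assumptions made for illustration.

```sas
/* Sketch only: dataset name, seed, sample size, range of x and the   */
/* normal error distribution are assumptions, not taken from the deck */
data work.simulated;
    call streaminit(12345);                 /* fixed seed for a repeatable run   */
    s = 1;                                  /* assumed error scale               */
    do i = 1 to 1000;
        x = rand('uniform') * 10;           /* x strictly between 0 and 10       */
        target_linear    = x + rand('normal', 0, s);         /* x + e(0, s)       */
        target_quadratic = x + x**2 + rand('normal', 0, s);  /* x + x^2 + e(0, s) */
        target_hetero    = x + rand('normal', 0, x * s);     /* x + e(0, x*s)     */
        output;
    end;
    drop i s;
run;
```

The second and third targets are the ones that should later show up as a parabolic and a fan-shaped residual plot when a plain linear model is fitted.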

6. 1. Comparing distributions
   - Visual comparison of distributions on Training and Score data
   - SAS code node: problem independent, portable
   - To avoid the problem of a model that does not generalise
   - Incorrect assignment of variables: interval when it should be nominal is the most common
   - The StatExplore node can do some of the tasks described here, but not as well.

7. Outline of code
   - Sample to min(10,000, size of smallest dataset)
   - For interval variables, plot a cumulative distribution plot: Proc Sort, Datastep, Proc GPLOT
   - For binary/nominal/ordinal variables, use ANNOTATE to plot "dot" graphs
       - if necessary, restricting to the 10 most common levels
       - check for levels absent from one of the sources
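The interval-variable step (Proc Sort, Datastep, Proc GPLOT) might look roughly like the sketch below. It is not the presenter's code: the input datasets work.train and work.score, the variable x, and the use of PROC FREQ to count rows are assumptions, and the sampling step is left out for brevity.

```sas
/* Sketch only: work.train, work.score and the variable x are hypothetical */
data work.both;
    length source $8;
    set work.train(in=in_train keep=x) work.score(keep=x);
    source = ifc(in_train, 'Train', 'Score');   /* label each row by origin */
    if not missing(x);                          /* drop missing values      */
run;

proc sort data=work.both;
    by source x;
run;

/* Rows per source, so that a running rank can be turned into a proportion */
proc freq data=work.both noprint;
    tables source / out=work.counts(keep=source count);
run;

data work.cdf;
    merge work.both work.counts;
    by source;
    if first.source then rank = 0;
    rank + 1;                                   /* running count within source */
    cum_pct = rank / count;                     /* empirical cumulative share  */
run;

symbol1 interpol=join value=none color=blue;
symbol2 interpol=join value=none color=red;

proc gplot data=work.cdf;
    plot cum_pct * x = source;                  /* one curve per source */
run;
quit;
```

Two curves that lie on top of each other suggest the Training and Score distributions of x agree; a visible gap is the cue to investigate.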

12. 2. Examining residuals from models
   - SAS code node
   - Current code is restricted to models with an Interval target variable; otherwise portable
   - 7 outputs
   - Residual squared, and smoothed [KDE]

13. 2. Examining residuals from models, continued
   - Density of residuals [KDE and UNIVARIATE]
       proc univariate data = residuals;
           var %em_residual;
           histogram %em_residual / normal name = 'plot4b';
           title2 "using PROC UNIVARIATE";
       run;
   - Residual vs Predicted, and smoothed: PROCs KDE and GCONTOUR
   - Boxplot of residuals: PROCs Rank, Sort, Boxplot
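The last two outputs named on this slide could be produced roughly as follows. This is a sketch under assumptions rather than the code behind the slides: the dataset work.residuals and its variables predicted and residual are hypothetical, grouping the boxplots by decile of the predicted value is a guess at how PROC RANK is used, and the value1, value2 and density names are the columns the BIVAR statement's OUT= dataset is expected to contain.

```sas
/* Sketch only: work.residuals, predicted and residual are hypothetical names */

/* Smoothed residual vs predicted: bivariate kernel density, then a contour plot */
proc kde data=work.residuals;
    bivar predicted residual / out=work.density;
run;

proc gcontour data=work.density;
    plot value2 * value1 = density;   /* value1 = predicted, value2 = residual */
run;
quit;

/* Boxplots of residuals within deciles of the predicted value (assumed grouping) */
proc rank data=work.residuals groups=10 out=work.ranked;
    var predicted;
    ranks pred_decile;                /* 0 to 9 */
run;

proc sort data=work.ranked;
    by pred_decile;
run;

proc boxplot data=work.ranked;
    plot residual * pred_decile;
run;
```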

14. Statistics 101 refresher
   - In regression models, patterns in the plot of residual (predicted - actual) versus predicted can indicate deficiencies in the model.
   - Classic examples:
       - Not having a quadratic term when the data requires it gives a parabolic plot
       - If there is heteroskedasticity, the plot of residuals will have a fan shape
   - See, for example, Wikipedia
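The first classic example can be reproduced with a small self-contained sketch: simulate data that needs a quadratic term, fit a straight line, and plot residual against predicted. The dataset names, seed, sample size and normal errors are assumptions chosen for illustration.

```sas
/* Sketch only: names, seed, sample size and error distribution are assumed */
data work.quad;
    call streaminit(2024);
    do i = 1 to 500;
        x = rand('uniform') * 10;
        y = x + x**2 + rand('normal', 0, 1);   /* truth includes x squared */
        output;
    end;
    drop i;
run;

proc reg data=work.quad noprint;
    model y = x;                               /* deliberately omits x squared */
    output out=work.fit predicted=pred;
run;
quit;

data work.fit;
    set work.fit;
    resid = pred - y;                          /* predicted - actual, as on the slide */
run;

symbol1 interpol=none value=dot;

proc gplot data=work.fit;
    plot resid * pred;                         /* expect a parabolic pattern */
run;
quit;
```

Replacing the simulated y with `x + rand('normal', 0, x)` should give the fan shape associated with heteroskedasticity instead.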

24. 3. Distributions of scored values
   - Post-modelling test
   - Given the model, and the distribution of values in the score data, what is the distribution of the scored values?

27. 4. Delving into innards of SAS/EM
   - Extracting the significant variables determined from a Variable Selection node
   - Multiple models
   - Alternative to a lot of copying and pasting
   - Use EMWS<n>.Varsel_effectds or Varsel_importnc
   - R2 or relative importance
   - Added traffic lighting for extra information
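Before writing code against these tables, it may help to look at what they contain. The sketch below only inspects one of them; EMWS1 is a stand-in for EMWS<n>, and it assumes that libref is already assigned (as it would be inside an EM session, or after a LIBNAME statement pointing at the project's workspace folder). No column names are assumed.

```sas
/* Sketch only: EMWS1 stands in for EMWS<n> and must already be assigned */
proc contents data=EMWS1.Varsel_effectds varnum;   /* list columns in creation order */
run;

proc print data=EMWS1.Varsel_effectds(obs=10);     /* peek at the first 10 rows */
run;
```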

29. Issues
   - Order of graphs: 1, 10, 11 ... 19, 2, 21 etc.
   - Lack of an error message does not imply an error-free run

30. A quote
   - "All models are wrong but some are useful"
   - From a chapter of a 1979 book by George E. P. Box, industrial statistician
   - Also attributed to W. Edwards Deming

31. An ideal model
   - Compare distributions in Training and Score, checking variables are correctly assigned
   - Compare again post Variable Selection
   - Examine residuals from each model
   - Examine distribution of scored values

33. Acknowledgments
   - Staff in the Tax Office
   - SAS Technical Support
       - some Code Node features not fully documented
       - how to re-run a node when nothing other than the code has changed (in development mode): toggle "Use Priors" [Property Sheet -> Advanced]
       - how to compare distributions after a Variable Selection node
   - Still issues with the width of code.
