Extending SAS

nysa

Presentation Transcript


    1. Extending SAS/Enterprise Miner: a user's experience

    2. Outline. Simulated and public domain data. Portable SAS Code Nodes: problem independent; still in development, so not as robust as professional nodes. One example works outside of EM, accessing the information stored within EM.

    3. Dictionary

    4. Score vs Training data. We often want to make a prediction in a Score dataset, where we do not know the outcome, based on a model derived from a Training dataset, where we do. It is important that the Training data represent the Score data. An exact match is often impossible, so we should understand any differences.

    5. Simulated and public domain data. Simulated: Target = x + e(0, s); Target = x + x^2 + e(0, s); Target = x + e(0, x s). Public domain: Boston Housing data (http://lib.stat.cmu.edu/datasets/), relatively small.
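The three simulated targets could be generated with a DATA step along the following lines. This is only a sketch: the slide does not define e(0, s), so it is assumed here to be a normal error with mean 0 and standard deviation s, and the dataset name, seed, sample size, and range of x are all hypothetical.

```sas
/* Sketch only: names, seed, sample size and the range of x are assumptions. */
data work.simulated;
    call streaminit(12345);              /* fixed seed so the run is repeatable */
    s = 1;                               /* assumed error scale */
    do i = 1 to 1000;
        x = rand('uniform') * 10;        /* assumed range for x: 0 to 10 */
        /* Target = x + e(0, s): linear, constant error spread */
        target_linear = x + rand('normal', 0, s);
        /* Target = x + x^2 + e(0, s): needs a quadratic term */
        target_quad = x + x**2 + rand('normal', 0, s);
        /* Target = x + e(0, x s): error spread grows with x */
        target_hetero = x + rand('normal', 0, x * s);
        output;
    end;
    drop i s;
run;
```

Fitting a straight line to target_quad or target_hetero should then give residuals with a systematic pattern, which is what makes such data useful for testing residual diagnostics.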

    6. 1. Comparing distributions. Visual comparison of distributions on Training and Score data. SAS code node: problem independent, portable. To avoid the problem of a model that does not generalise. Incorrect assignment of variables: interval when it should be nominal is the most common. The StatExplore node can do some of the tasks described here, but not as well.

    7. Outline of code. Sample to min(10,000, size of smallest dataset). For interval variables, plot a cumulative distribution plot (Proc Sort, Datastep, Proc GPLOT). For binary/nominal/ordinal variables, use ANNOTATE to plot dot graphs, if necessary restricting to the 10 most common levels; check for levels absent from one of the sources.
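The interval-variable step (Proc Sort, Datastep, Proc GPLOT) might look roughly like the sketch below. It is not the presenter's code: the input datasets work.train and work.score and the variable x are hypothetical, the sampling step is left out, and PROC MEANS is added only to obtain the row counts.

```sas
/* Sketch only: work.train, work.score and the variable x are hypothetical. */
data work.both;
    length source $8;
    set work.train(in=in_train keep=x) work.score(keep=x);
    if in_train then source = 'Training';
    else source = 'Score';
    if not missing(x);                   /* keep non-missing values only */
run;

proc sort data=work.both;
    by source x;
run;

/* Count rows per source so that the rank can be scaled to a proportion. */
proc means data=work.both noprint;
    by source;
    var x;
    output out=work.counts(keep=source n_obs) n=n_obs;
run;

data work.ecdf;
    merge work.both work.counts;
    by source;
    if first.source then rank = 0;
    rank + 1;                            /* running count within each source */
    cum_prop = rank / n_obs;             /* empirical cumulative proportion */
run;

/* One step curve per source, overlaid on the same axes. */
symbol1 interpol=steplj value=none color=blue;
symbol2 interpol=steplj value=none color=red;

proc gplot data=work.ecdf;
    plot cum_prop * x = source;
run;
quit;
```

If the two curves lie close together, the variable is distributed similarly in the Training and Score data; a clear gap between them flags a difference worth understanding before modelling.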

    12. 2. Examining residuals from models. SAS code node. Current code is restricted to models with an Interval target variable, otherwise portable. 7 outputs: residual squared, and smoothed [KDE]

    13. 2. Examining residuals from models, continued. Density of residuals [KDE and UNIVARIATE]: proc univariate data=residuals; var %em_residual; histogram %em_residual / normal name='plot4b'; title2 "using PROC UNIVARIATE"; run; Residual vs Predicted, and smoothed (PROCs KDE and GCONTOUR). Boxplot of residuals (PROCs Rank, Sort, Boxplot).

    14. Statistics 101 refresher. In regression models, patterns in the plot of residual (predicted - actual) versus predicted can indicate deficiencies in the model. Classic examples: not having a quadratic term when the data requires it gives a parabolic plot; if there is heteroskedasticity, the plot of residuals will have a fan shape. See, for example, Wikipedia.
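The parabolic pattern can be reproduced with a small self-contained sketch: data are simulated with a quadratic term, a straight line is fitted, and the residuals are plotted against the predicted values. All dataset and variable names, the seed, and the sample size are hypothetical. Note that PROC REG defines the residual as actual minus predicted, the opposite sign to the slide, which flips the plot vertically but leaves its shape unchanged.

```sas
/* Sketch only: names, seed and sample size are assumptions. */
data work.demo;
    call streaminit(2024);
    do i = 1 to 500;
        x = rand('uniform') * 10;
        y = x + x**2 + rand('normal', 0, 1);   /* true model is quadratic */
        output;
    end;
    drop i;
run;

/* Deliberately omit the quadratic term from the fitted model. */
proc reg data=work.demo noprint;
    model y = x;
    output out=work.resid p=predicted r=residual;
run;
quit;

symbol1 interpol=none value=dot color=blue;

proc gplot data=work.resid;
    plot residual * predicted / vref=0;        /* reference line at zero */
run;
quit;
```

Replacing the error term with rand('normal', 0, x) and dropping x**2 from the simulated y should give the fan shape associated with heteroskedasticity instead.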

    24. 3. Distributions of scored values. Post-modelling test. Given the model and the distribution of values in the score data, what is the distribution of the scored values?

    27. 4. Delving into the innards of SAS/EM. Extracting the significant variables determined from a Variable Selection node. Multiple models. Alternative to a lot of copying and pasting. Use EMWS<n>.Varsel_effectds or Varsel_importnc (R2 or relative importance). Added traffic lighting for extra information.

    29. Issues. Order of graphs: 1, 10, 11 ... 19, 2, 21 etc. The lack of a reported error does not imply an error-free run.
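An order such as 1, 10, 11 ... 19, 2 is what results when numbered names are sorted as text. If that is the cause here, which is an assumption and not something the slide states, one possible workaround is to zero-pad the number in each graph name. The macro below is a hypothetical illustration that writes padded names to the log.

```sas
/* Sketch only: the macro name and the "plot" prefix are hypothetical. */
%macro padded_names(n=12);
    %local i;
    %do i = 1 %to &n;
        /* The Z2. format pads to two digits: plot01, plot02, ..., plot12 */
        %put plot%sysfunc(putn(&i, z2.));
    %end;
%mend padded_names;

%padded_names(n=12)
```

Names built this way sort in the same order whether they are compared as text or as numbers, provided the pad width covers the largest number in use.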

    30. A quote: "All models are wrong but some are useful". From a chapter of a 1979 book by George E. P. Box, industrial statistician; also attributed to W. Edwards Deming.

    31. An ideal model. Compare distributions in Training and Score, checking that variables are correctly assigned. Compare again post Variable Selection. Examine residuals from each model. Examine distribution of scored values.

    33. Acknowledgments. Staff in the Tax Office. SAS Technical Support: some Code Node features are not fully documented; how to re-run a node when nothing other than the code has changed (in development mode): toggle Use Priors [Property Sheet -> Advanced]; how to compare distributions after a Variable Selection node; still issues with the width of code.
