Notes for SAS programming

Notes for SAS programming Econ424 Fall 2009

Why SAS? • Able to process large data set(s) • Easy to cope with multiple variables • Able to track all the operations on the data set(s) • Generate systematic output • Summary statistics • Graphs • Regression results • Most government agencies and private sectors use SAS

Where to find SAS? • The MS Window version of SAS is only available on campus • class website, “computer resources”, “On-campus computer lab” • list computer lab location, hours and software • You can access SAS remotely via glue.umd.edu, but you cannot use the interactive windows • use a secured telnet (e.g. ssh) to remotely login glue.umd.edu with your directory ID and password • type “tap sas” to tell the system that you want to access SAS • use any text editor (say pico) to edit your sas program (say myprog.sas). You can also create the text-only .sas file in your PC and (securely) ftp it into glue. • In glue, type “sas myprog.sas &” will send the sas program to run in the background. • The computer will automatically generate myprog.log to tell you how each command runs in sas. If your program produces any output, the output will be automatically saved in myprog.lst. All in the same directory as your .sas file.

Roadmap • Thinking in “SAS” • Basic rules • Read in data • Data cleaning commands • Summary statistics • Combine two or more datasets • Hypothesis testing • Regression

Thinking in “SAS” • What is a program? • Algorithm, recipe, set of instructions • How is programming done in SAS? • SAS is like programming in any language: • Step by step instructions • Can create your own routines to process data • Most instructions are set up in a logical manner • SAS is NOT like other languages: • Some syntax is peculiar to SAS • Written specifically for statistics so it isn’t all-purpose • Canned processes that you cannot edit nor can you see the code

Thinking in “SAS” • Creating a program • What is your problem? (take project 3 as an example) • How can you find a solution? • What steps need to be taken to find an answer? • Do I need to read in data? • What variables do I need? • Where is the data? • What format is the data in? • How do I need to clean the data? • Are there outliers? • Are there any unexpected values in the data? • How do I need to transform the data? • Are the variables in the form that I need?

Basic rules (1) – organize files • .sas – program file • .log – notes, errors, warnings • .lst – output • .sas7bdat – data file • library – a cabinet to put data in • Default: Work library • temporary, erased after you close the session • Permanent library • libname mylib “m:\”; • mylib.mydata = a sas data file named “mydata” in library “mylib” • run and recall .sas

Basic rules (2) -- program • every command ends with ; • format does not matter if x=1 then y=1; else y=2; is the same as if x=1 then y=1; else y=2; • case insensitive • comment * this is comment; /* this is comment */;

Basic rule (3) – variable • Type • numeric (default, 8 digit, . stands for missing value) • character ($, default 8 digit, blank stands for missing) • Variable names • <=32 characters if SAS 9.0 or above • <=8 characters if SAS 8 or below • case insensitive • Must start with letter or “_” _name, my_name, zip5, u_and_me -name, my-name, 5zip, per%, u&me, my@w, my$sign

Basic rules (4) – data step create a new data set called “newdata” in the temporary library • Data step DATA newdata; set proj3rawdata; fracuninsured=uninsured/total; percentuninsured=fracuninsured*100; run; use the data set called “proj3rawdata” in the temporary library Define new variables input data “proj3rawdata” obs1 obs2 … obs n output data “newdata” obs1 … obs n define “fracuninsured” define “percentuninsured”

Basic rules (5) – proc step • PROC step PROC PRINT data=newdata; var fracuninsured percentuninsured; title “print out new data”; run; Action Data source Signal the end of PROC step, could be ignored if this is followed by a Data or Proc step

Read in data (1) – table editor • Tools – table editor • choose an existing library or define a new library • rename variable • save as

Read in data (2) – by program • Data format • Datalines (enter the data in the program) • Existing data text, comma or tab delimited • Existing data text, fixed width • From an existing excel file

The following several slides won’t be covered in class. But you are welcome to use them by yourselves.

Read in data (2) -- datalines data testdata1; infile datalines; input id height weight gender $ age; datalines; 1 68 144 M 23 2 78 . M 34 3 62 99 F 37 ; /* you only need one semicolon at the end of all data lines, but the semicolon must stand alone in one line */ proc contents data=testdata1; run; proc print data=testdata1; run; No ; until you finish all the data lines

Read in data (3) – more datalines /* read in data in fixed columns */ data testdata1; infile datalines; input id 1 height 2-3 weight 4-6 gender $7 age 8-9; datalines; 168144M23 278 M34 36299 F37 ;

Read in data (4) – more datalines data testdata1; infile datalines; input id : 1. height : 2. weight : 3. gender : $1. age : 2.; datalines; 1 68 144 M 23 2 78 . M 34 3 62 99 F 37 ;

Read in data (4) – more datalines *alternatively; data testdata1; infile datalines; informat id 1. height 2. weight 3. gender $1. age 2.; input id height weight gender age; datalines; 1 68 144 M 23 2 78 . M 34 3 62 99 F 37 ;

Read in data (5) – .csv data testdata1; infile datalines dlm=‘,’ dsd missover; /* what if you do not have dsd and/or missover */ input id height weight gender $ age ; datalines; 1, 68, 144, M, 23 2, 78, , M, 34 3, 62, 99, F, 37 ; /* what if you forget to type 23 */ run;

Read in data (6) – .csv file /* save the project 3 cleaned raw data in a comma delimited file (.csv) in M:\ before running the following codes */ libname mylib "M:\"; filename mycsv "M:\project3-rawdata-cleaned.csv"; data mylib.proj3rawdata; infile mycsv firstobs=2 dlm=',' dsd missover lrecl=900; input year: 4. state: $2. total insured uninsured; run; proc contents data=mylib.proj3rawdata; run; proc print data=mylib.proj3rawdata; run; Note the options in the infile command! Restrict your filename <=8 characters Be consistent with yourself Be consistent with yourself

Read in data (7) – from excel filename myexcel “M:\project3-rawdata-cleaned.xls”; proc import datafile=myexcel out=proj3rawdata2 DBMS=excel replace; run; data proj3rawdata3; set proj3rawdata2; fracuninsured=uninsured/total; percentuninsured=fracuninsured*100; run; proc contents data=proj3rawdata2; run; proc print data=proj3rawdata2; run; proc contents data=proj3rawdata3; run; proc print data=proj3rawdata3; run; No ; until the end of the whole sentence Be consistent with yourself

Read in data (7) – from excel Be careful ! SAS will read the first line as variable names, and assume the raw data start from the second row. SAS assigns numeric and character type automatically. Sometime it does make mistake.

Data cleaning (1) – if then Format: IF condition THEN action; ELSE IF condition THEN action; ELSE action; Note: • the if-then-else can be nested as many as you want • if you need multiple actions instead of one action, use “DO; action1; action2; END; ”

Data cleaning (1) – if then • = or EQ means equals • ~= or NE means not equal • > or GT means greater than • < or LT means less than • >= or GE means greater than or equal • <= or LE means less than or equal • in means subset • if gender in (‘M’, ‘F’) then ..; • Multiple conditions: AND (&), OR(|)

Data cleaning (1) – if then *reading in program of proj3rawdata3 is on page 21; data proj3rawdata3; set proj3rawdata3; IF fracuninsured<0.15 THEN uninsuregrp=0; ELSE uninsuregrp=1; run; proc contents data=proj3rawdata3; run; proc print data=proj3rawdata3; run; Note: (1)the code is less efficient if you replace ELSE ..; with IF fracuninsured>=0.15 THEN ..; (2)missing value is always counted as the smallest negative, so fracuninsured=. will satisfy the condition fracuinsured<0.15. If you want to ignore the missing obs set the condition as 0<=fracuninsured<0.15.

Data cleaning (1) – if then * Multiple actions in each branch; data proj3rawdata3; set proj3rawdata3; IF fracuninsured<0.15 AND uninsured>1000000 THEN DO; uninsuredgrp=0; uninsuredpop=‘over 1 million'; END; ELSE DO; uninsuredgrp=1; uninsuredpop=‘less than 1 million'; END; run; proc print data=proj3rawdata3; run; the do-end pair acts as brackets

Data cleaning (1) – if then *Use if commands to choose a subsample; data proj3subsample; /* note here we generate a new data set */ set proj3rawdata3; IF fracuninsured=. Then delete; If fracuninsured<=0.1; run; proc print data=proj3subsample; run;

Data cleaning (1) – exercise still use proj3rawdata. define newgrp = 1 if fracuninsured <0.1 (low) 2 if 0.1<=fracuninsured<0.15 (mid-low) 3 if 0.15<=fracuninsured<0.2 (mid-high) 4 if fracuninsured>=0.2 (high).

Data cleaning (1) – exercise answer data proj3rawdata3; set proj3rawdata3; if fracuninsured<0.1 then newgrp=1; else if fracuninsured<0.15 then newgrp=2; else if fracuninsured<0.2 then newgrp=3; else newgrp=4; run; proc contents data=proj3rawdata3; run; proc print data=proj3rawdata3; run; Question:What if one observation has fracuninsured=.?

Save data * Save in sas format; libname mylib “M:\”; data mylib,proj3rawdata3; set proj3rawdata3; run; * Export data to excel; Proc export data=proj3rawdata3 outfile=“M:\proj3data-fromsas.xls” dbms=excel replace; Run; You can also export a sas data file into a comma delimited text file if you write dbms=csv. No ; here

Pages 31-34 are optional material for data cleaning. They are not required, but you may find them useful in the future.We skip them in the regular class.

Data cleaning (2) – convert variable type Numeric to character: age1=put(age, $2.); age and age1 have the same contents but different formats Character to numeric: age2=input(age1, 1.); now age and age2 are both numeric, but age2 is chopped at the first digit Take a sub string of a character age3=substr(age1,2,1); now age3 is a sub string of age1, starting from the second digit of age1 (the meaning of “2”) and having one digit in total (the meaning of “1”).

Data cleaning (2) - example * we want to convert studid 012345678 to 012-345-678; data testdata2; infile datalines; input studid : 9. studname : $1.; datalines; 012345678 A 135792468 B 009876543 C ; proc print; run; data testdata2; set testdata2; if studid LT 1E+7 then studid1= '00’||compress(put(studid, $9.)); else if 1E+7 LE studid LT 1E+8 then studid1='0'||compress(put(studid, $9.)); else studid1= put(studid, $9.); studid2=substr(studid1,1,3)||'-'||substr(studid1,4,3)||'-'||substr(studid1,7,3); proc print; run;

Data cleaning (2) - exercise You have the following data, variables in sequence are SSN, score1, score2, score3, score4, score5: 123-45-6789 100 98 96 95 92 344-56-7234 69 79 82 65 88 898-23-1234 80 80 82 86 92 Calculate the average and standard deviation of the five scores for each individual. Use if-then command to find out who has the highest average score, and report his SSN without dashes.

Data summary roadmap • Proc contents – variable definitions • Proc print – raw data • Proc format – make your print look nicer • Proc sort – sort the data • Proc means – basic summary statistics • Proc univariate – detailed summary stat • Proc freq – frequency count • Proc chart – histogram • proc plot – scatter plot

proc format (1) *Continue the data cleaning exercise on page 29; data proj3rawdata3; set proj3rawdata3; if fracuninsured<0.1then newgrp=1; else if fracuninsured<0.15 then newgrp=2; else if fracuninsured<0.2 then newgrp=3; else newgrp=4; run; Defining “group” as a numeric variable will save space

proc format (2) proc format; value newgroup 1=‘low’ 2=‘mid-low’ 3=‘mid-high’ 4=‘high’; run; proc print data=proj3rawdata3; format newgrp newgroup. fracuninsured 4.2; title ‘report new group in words’; label fracuninsured=‘fraction uninsured’ newgrp=‘new group 1-4’; var fracuninsured newgrp; run; no ; here until the end of the value command no ; here until the end of the label command

proc sort proc sort data=proj3rawdata3; by year state; run; proc sort data=proj3rawdata3 out=proj3rawdata3_sorted; by year descending fracuninsured; run; * note that missing value is always counted as the smallest;

proc means and proc univariate By default, proc means report mean, stdev, min, max Could choose what to report: proc means data=proj3rawdata3 n mean median; proc means data=proj3rawdata3; class newgrp; var insured uninsured fracuninsured; run; proc sort data=proj3rawdata3; by newgrp; run; proc univariate data=proj3rawdata3; by newgrp; var insured uninsured fracuninsured; run; By default, proc univariate report median, and many other statistics

Notes on proc means and proc univariate *if you do not use class or by command, the statistics are based on the full sample. If you use class or by var x, the statistics are based on the subsample defined by each value of var x. *You can use class or by in proc means, but only by in proc univariate; *whenever you use “by var x”, the data set should be sorted by var x beforehand;

proc means and proc univariateallow multiple groups data proj3rawdata3; set proj3rawdata3; if totalpop<6000000 then popgrp=“low”; else popgrp=“high”; run; proc means data=proj3rawdata3; class newgrp popgrp; var fracuninsured; run; proc sort data=proj3rawdata3; by newgrp popgrp; run; proc univariate data=proj3rawdata3; by newgrp popgrp; var fracuninsured; run;

proc freq * Remember we already generate a variable called newgrp to indicate categories of fraction uninsured and a variable called popgrp to indicate categories of population size; proc freq data=proj3rawdata3; tables newgrp popgrp newgrp*popgrp; run; One dimension frequency table Two-dimension frequency table

proc chart – histogram for categorical variables proc chart data=proj3rawdata3; title ‘histogram for newgrp’; vbar newgrp; run; proc chart data=proj3rawdata3; title ‘frequency by two variables’; vbar newgrp / group=popgrp; run;

proc chart – histogram for continuous variable proc chart data=proj3rawdata3; title “histogram for continuous variable’; vbar fracuninsured; run; proc chart data=proj3rawdata3; title ‘histogram with specific midpoints’; vbar fracuninsured / midpoints=0 to 1 by 0.05; run;

proc plot – scatter plot proc plot data=proj3rawdata3; title ‘scatter plot of fracuninsured and totalpop’; plot fracuninsured*totalpop; run;

scatter plot is less informative for categorical variables proc plot data=proj3rawdata3; title ‘scatter plot of newgrp and popgrp’; plot newgrp*popgrp; run;

fancy proc means proc means data=proj3rawdata3; class newgrp popgrp; var uninsured fracuninsured; output out = summary1 mean = avguninsured avgfracuninsured; run; proc print data=summary1; run;

The following page may be useful in practice, but I am not going to cover it in class.

some summary stat. in proc print * Assume we have already defined newgrp and popgrp in proj3rawdata3; proc sort data=proj3rawdata3; by popgrp; run; proc print data=proj3rawdata3 n; where fracuninsured>=0.1; by popgrp; sum totalpop; var totalpop insured uninsured fracuninsured; run;

How to handle multiple data sets? • Add more observations to an existing data and the new observations follow the same data structure as the old one  append • Add more variables to an existing data and the new variables refer to the same subjects as in the old data  merge • Sometimes we may need to change data structure to fit in append or merge ….

Notes for SAS programming