Notes for SAS programming

Notes for SAS programming Econ424 Fall 2007

Why SAS? • Able to process large data set(s) • Easy to cope with multiple variables • Able to track all the operations on the data set(s) • Generate systematic output • Summary statistics • Graphs • Regression results • Most government agencies and private sectors use SAS

Roadmap • Basic rules • Read in data • Data cleaning commands • Summary statistics • Combine two or more datasets • Hypothesis testing • Regression

Basic rules (1) – organize files • .sas – program file • .log – notes, errors, warnings • .lst – output • .sas7bdat – data file • library – a cabinet to put data in • Default: Work library • temporary, erased after you close the session • Permanent library • libname mylib “m:\”; • mylib.mydata = a sas data file named “mydata” in library “mylib” • run and recall .sas

Basic rules (2) -- program • every command ends with ; • format does not matter if x=1 then y=1; else y=2; is the same as if x=1 then y=1; else y=2; • case insensitive • comment * this is comment; /* this is comment */;

Basic rule (3) – variable • Type • numeric (default, 8 digit, . stands for missing value) • character ($, default 8 digit, blank stands for missing) • Variable names • <=32 characters if SAS 9.0 or above • <=8 characters if SAS 8 or below • case insensitive • Must start with letter or “_” _name, my_name, zip5, u_and_me -name, my-name, 5zip, per%, u&me, my@w, my$sign

input data “distance” obs1 obs2 … obs n output data “distancenew” obs1 … obs n define “miles” define “kilometers” Basic rules (4) – data step • Data step DATA distancenew; set distance; miles=26.22; kilometers=1.61*miles;

Basic rules (5) – proc step • PROC step PROC PRINT data=distancenew; var miles kilometers; title “print out distancenew”; run;

Read in data (1) – table editor • Tools – table editor • choose an existing library or define a new library • rename variable • save as

Read in data (2) -- datalines data testdata1; infile datalines; input id height weight gender $ age; datalines; 1 68 144 M 23 2 78 . M 34 3 62 99 F 37 ; /* you only need one semicolon at the end of all data lines, but the semicolon must stand alone in one line */ proc contents data=testdata1; run; proc print data=testdata1; run; No ; until you finish all the data lines

Read in data (3) – more datalines /* read in data in fixed columns */ data testdata1; infile datalines; input id 1 height 2-3 weight 4-6 gender $7 age 8-9; datalines; 168144M23 278 M34 36299 F37 ;

Read in data (4) – more datalines data testdata1; infile datalines; input id : 1. height : 2. weight : 3. gender : $1. age : 2.; datalines; 1 68 144 M 23 2 78 . M 34 3 62 99 F 37 ;

Read in data (4) – more datalines *alternatively; data testdata1; infile datalines; informat id 1. height 2. weight 3. gender $1. age 2.; input id height weight gender age; datalines; 1 68 144 M 23 2 78 . M 34 3 62 99 F 37 ;

Read in data (5) – .csv data testdata1; infile datalines dlm=‘,’ dsd missover; /* what if you do not have dsd and/or missover */ input id height weight gender $ age ; datalines; 1, 68, 144, M, 23 2, 78, , M, 34 3, 62, 99, F, 37 ; /* what if you forget to type 23 */ run;

Read in data (6) – .csv file /* save the midterm data in a comma delimited file (.csv) before running the following codes */ libname mylib "M:\"; filename midterm "M:\midterm-fall2005-1997rank.csv"; data mylib.midterm; infile midterm firstobs=3 dlm=',' dsd missover lrecl=900; input year: 4. name: $50. score reputation accept_rate graduate_rate; run; proc contents data=mylib.midterm; run; proc print data=mylib.midterm; run; Note the options in the infile command!

Read in data (7) – from excel filename myexcel “M:\midterm-fall2005-1997rank.xls”; proc import datafile=myexcel out=readin1 DBMS=excel replace; run; data readin1; set readin1; acceptper=accept_rate*100; gradper=graduate_rate*100; run; proc contents data=readin1; run; proc print data=readin1; run; No ; until the end of the whole sentence

Read in data (7) – from excel Be careful ! SAS will read the first line as variable names, our original data has two non-data lines – the first as labels, the second as variable names. You need to delete the first line to make sure correct data input into SAS. What happens if you don’t delete the first line? What happens if you delete the second line? SAS assigns numeric and character type automatically. Sometime it does make mistake.

Data cleaning (1) – if then Format: IF condition THEN action; ELSE IF condition THEN action; ELSE action; Note: • the if-then-else can be nested as many as you want • if you need multiple actions instead of one action, use “DO; action1; action2; END; ”

Data cleaning (1) – if then • = or EQ means equals • ~= or NE means not equal • > or GT means greater than • < or LT means less than • >= or GE means greater than or equal • <= or LE means less than or equal • in means subset • if gender in (‘M’, ‘F’) then ..; • Multiple conditions: AND (&), OR(|)

Data cleaning (1) – if then *reading in program of testdata1 is on page 14; data testdata1; set mylib.testdata1; IF gender='M' AND age<=25 THEN testgroup=0; ELSE testgroup=1; run; proc contents data=testdata1; run; proc print data=testdata1; run; Note: (1)the code is less efficient if you replace ELSE ..; with IF gender~= 'M' OR age>25 THEN ..; (2)missing value is always counted as the smallest negative, so age=. will satisfy the condition age<=25. If you want to ignore the obs with age=. set the condition as 0<age<=25.

Data cleaning (1) – if then * Multiple actions in each branch; data testdata1; set testdata1; IF gender='M' AND age<=25 THEN DO; group=0; activity='ballgame'; END; ELSE DO; group=1; activity='movie'; END; run; proc print data=testdata1; run;

Data cleaning (1) – if then *Use if commands to choose a subsample; data testdata2; /* note here we generate a new data set */ set testdata1; IF age=. Then delete; If age>18 & weight~=.; run; proc print data=testdata2; run;

Data cleaning (1) – exercise still use testdata1. define weight in pounds bmi=--------------------------------- x 703 (height in inches)^2 define group = 1 if bmi<18.5 (underweight) 2 if 18.5 <= bmi < 25 (normal) 3 if 25<= bmi <30 (overweight) 4 if bmi>=30 (obese).

Data cleaning (1) – exercise answer data testdata1; set testdata1; bmi=(weight/(height*height))*703; if bmi<18.5 then group=1; else if bmi<25 then group=2; else if bmi<30 then group=3; else group=4; run; proc contents data=testdata1; run; proc print data=testdata1; run; Question:What if one observation has weight=.?

Pages 24-27 are optional material for data cleaning. They are not required, but you may find them useful in the future.We skip them in the regular class.

Data cleaning (2) – convert variable type Numeric to character: age1=put(age, $2.); age and age1 have the same contents but different formats Character to numeric: age2=input(age1, 1.); now age and age2 are both numeric, but age2 is chopped at the first digit Take a sub string of a character age3=substr(age1,2,1); now age3 is a sub string of age1, starting from the second digit of age1 (the meaning of “2”) and having one digit in total (the meaning of “1”).

Data cleaning (2) - example * we want to convert studid 012345678 to 012-345-678; data testdata2; infile datalines; input studid : 9. studname : $1.; datalines; 012345678 A 135792468 B 009876543 C ; proc print; run; data testdata2; set testdata2; if studid LT 1E+7 then studid1= '00’||compress(put(studid, $9.)); else if 1E+7 LE studid LT 1E+8 then studid1='0'||compress(put(studid, $9.)); else studid1= put(studid, $9.); studid2=substr(studid1,1,3)||'-'||substr(studid1,4,3)||'-'||substr(studid1,7,3); proc print; run;

Data cleaning (2) - exercise You have the following data, variables in sequence are SSN, score1, score2, score3, score4, score5: 123-45-6789 100 98 96 95 92 344-56-7234 69 79 82 65 88 898-23-1234 80 80 82 86 92 Calculate the average and standard deviation of the five scores for each individual. Use if-then command to find out who has the highest average score, and report his SSN without dashes.

Data summary roadmap • Proc contents – variable definitions • Proc print – raw data • Proc format – make your print look nicer • Proc sort – sort the data • Proc means – basic summary statistics • Proc univariate – detailed summary stat • Proc freq – frequency count • Proc chart – histogram • proc plot – scatter plot

proc format (1) *Continue the data cleaning exercise on page 23; data testdata1; set testdata1; bmi=(weight/(height*height))*703; if bmi<18.5 then group=1; else if bmi<25 then group=2; else if bmi<30 then group=3; else group=4; run; Defining “group” as a numeric variable will save space

proc format (2) proc format; value bmigroup 1=‘under weight’ 2=‘normal weight’ 3=‘overweight’ 4=‘obese’; run; proc print data=testdata1; format group bmigroup. bmi 2.1; title ‘report bmi group in words’; label bmi=‘body mass index’ group=‘bmi group 1-4’; var bmi group; run; no ; here until the end of the value command no ; here until the end of the label command

proc sort proc sort data=testdata1; by gender age; run; proc sort data=testdata1 out=testdata2; by gender descending age; run; * note that missing value is always counted as the smallest;

proc means and proc univariate By default, proc means report mean, stdev, min, max Could choose what to report: proc means data=testdata1 n mean median; proc means data=testdata1; class gender; var height weight bmi; run; proc sort data=testdata1; by gender; run; proc univariate data=testdata1; by gender; var bmi; run; By default, proc univariate report median, and many other statistics

Notes on proc means and proc univariate *if you do not use class or by command, the statistics are based on the full sample. If you use class or by var x, the statistics are based on the subsample defined by each value of var x. *You can use class or by in proc means, but only by in proc univariate; *whenever you use “by var x”, the data set should be sorted by var x beforehand;

More example onproc means and proc univariate data readin1; set readin1; if accept_rate<0.20 then acceptgrp=“low”; else acceptgrp=“high”; run; proc means data=readin1; class acceptgrp; var reputation accept_rate graduate_rate; run; proc sort data=readin1; by acceptgrp; run; proc univariate data=readin1; by acceptgrp; var reputation; run; By default, proc means report mean, stdev, min, max Could choose what to report: proc means data=readin1 n mean median; By default, proc univariate report median, and many other statistics

proc freq data readin1; set readin1; if graduate_rate<0.5 then gradgrp=“low”; else if 0.5<=graduate_rate<0.8 then gradgrp=‘middle’; else gradgrp=‘high’; run; proc freq data=readin1; tables acceptgrp gradgrp acceptgrp*gradgrp; run;

proc chart -- histogram proc chart data=readin1; title ‘histogram for acceptgrp’; vbar acceptgrp; run; proc chart data=readin1; title ‘frequency by two variables’; vbar acceptgrp / group=gradgrp; run;

proc chart – continuous variable proc chart data=readin1; title “histogram for continuous variable’; vbar accept_rate; run; proc chart data=readin1; title ‘histogram with specific midpoints’; vbar accept_rate / midpoints=0 to 1 by 0.1; run;

proc plot – scatter plot proc plot data=readin1; title ‘scatter plot of accept rate and grad rate’; plot accept_rate*graduate_rate; run;

fancy proc means proc means data=readin1; class acceptgrp gradgrp; var score reputation; output out = summary1 mean = avgscore avgreputn; run; proc print data=summary1; run;

The following page may be useful in practice, but I am not going to cover it in class.

some summary stat. in proc print filename myexcel “M:\midterm-fall2005-1997rank.xls”;; proc import datafile=myexcel out=readin1 DBMS=excel replace; run; data readin1; set readin1; if accept_rate<0.2 then acceptgrp=‘low’; else if 0.2<=accept_rate<0.5 then acceptgrp=‘middle’; else acceptgrp=‘high’; run; proc sort data=readin1; by acceptgrp; run; proc print data=readin1 n; where graduate_rate>=0.8; by acceptgrp; sum score; var name score accept_rate graduate_rate; run;

merge and append readin1: name score reputation acceptgrp gradgrp Harvard 100 3.9 low high summary1: acceptgrp gradgrp avgscore avgreputn low high 80 3.3 merged: name score reputation acceptgrp gradgrp avgscore avgreputn Harvard 100 3.9 low high 80 3.3 appended: name score reputation acceptgrp gradgrp avgscore avgreputn Harvard 100 3.9 low high . . . . low high 80 3.3

merge two datasets proc sort data=readin1; by acceptgrp gradgrp; run; proc sort data=summary1; by acceptgrp gradgrp; run; data merged; merge readin1 (in=one) summary1 (in=two); by acceptgrp gradgrp; if one=1 & two=1; run; What if this line is “if one=1 OR two=1;”?

Keep track of matched and unmatched records data allrecords; merge readin1 (in=one) summary1 (in=two); by acceptgrp gradgrp; myone=one; mytwo=two; if one=1 or two=1; run; proc freq data=allrecords; tables myone*mytwo; run; SAS will drop variables “one” and “two” automatically at the end of the DATA step. If you want to keep them, you can copy them into new variables “myone” and “mytwo”

be careful about merge! • always put the merged data into a new data set • must sort by the key variables before merge • ok for one-to-one, multi-to-one, one-to-multi, but no good for multi-to-multi • be careful of what records you want to keep, and what records you want to delete • what if variable x appears in both datasets, but x is not in the “by” statement? • after the merge x takes the value defined in the last dataset of the “merge” statement

append data appended; set readin1 summary1; run; proc print data=appended; run; proc print data=merged; run;

Application : Project 5 rest_rating: Restname Ratings Price Loc1 Phone1 Loc2 Phone2 Addie’s 1 30 Rockville 301-881-0081 Andalucia 2 30 Rockville 301-770-1880 Haandi 2 10 Bethesda 301-718-0121 Falls Church 703-533-3501 city_demo: Loc State Pop Income Bethesda MD 62936 38976 Falls Church VA 9578 26709 Rockville MD 44835 21484 How to merge the two data sets?

Steps to merge the datasets • reshape data set 1 – focus on loc1 rest_rating: Restname Ratings Price Loc1 Phone1 Loc2 Phone2 Addie’s 1 30 Rockville 301-881-0081 Andalucia 2 30 Rockville 301-770-1880 Haandi 2 10 Bethesda 301-718-0121 Falls Church 703-533-3501 restloc1 restname .. loc phone Addie’s Rockville 301-881-0081 Andalucia Rockville 301-770-1880 Haandi Bethesda 301-718-0121 data restloc1; set rest_rating; loc=loc1; phone=phone1; drop loc1 loc2 loc3 loc4 phone1 phone2 phone3 phone4; run;

Steps to merge the datasets • reshape data set 1 – focus on loc2 rest_rating: Restname Ratings Price Loc1 Phone1 Loc2 Phone2 Addie’s 1 30 Rockville 301-881-0081 Andalucia 2 30 Rockville 301-770-1880 Haandi 2 10 Bethesda 301-718-0121 Falls Church 703-533-3501 restloc2 restname .. loc phone Haandi Falls Church 703-533-3501 data restloc2; set rest_rating; loc=loc2; phone=phone2; drop loc1 loc2 loc3 loc4 phone1 phone2 phone3 phone4; If loc=‘’ then delete; run;

Notes for SAS programming