DATA STEP by DATA STEP you’ll go far

##### Presentation Transcript

1. DATA STEP by DATA STEP you’ll go far. Aaron J. Rabushka, Statistical Programmer, INC Research, Inc., Austin, TX

2. Note that the code examples in this presentation were developed and run under SAS V9.2 (TS2M2), and that the coloring in the displays of code and results comes from the author without necessarily representing actual SAS displays.

3. A FEW BASICS The SAS system offers a halfway house between canned routines and procedural programming. Its pre-programmed procedures save a lot of work and time since programmers do not have to re-code standardized and routinized procedures and utilities every time they use them.

4. A FEW BASICS Most SAS code goes into STEPs, either DATA STEPs or PROC (PROCedure) STEPs. OPEN CODE refers to instructions not associated with either of these (e.g., OPTIONS statements). Some DATA STEP features will seem very familiar to procedural programmers, and some will seem annoyingly foreign.

5. A FEW BASICS Every SAS DATA STEP begins with the keyword DATA. Note that here it is not followed by an equal sign: DATA= is a different construct, an option that references an already existing data set, typically in a PROC statement.

6. A FEW BASICS SAS has two data types, NUMERIC and CHARACTER. SAS users derive all of their variables from these two types. SAS does not have special types for LOGICAL or DATE fields.
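A minimal sketch of how logical and date values fit into the two types. The dataset and variable names (`type_demo`, `is_adult`, `visit`) are made up for illustration and do not come from the presentation.

```sas
data type_demo;
  length country $ 12;       /* character variable */
  country  = "PARAGUAY";
  age      = 25;             /* numeric variable */
  is_adult = (age >= 18);    /* a comparison yields numeric 1 (true) or 0 (false) */
  visit    = '01JAN1960'd;   /* date literal, stored as the number 0 */
  format visit date9.;       /* the format changes only how the number is displayed */
run;
```

SAS counts dates as days from 1 January 1960, so `visit` holds an ordinary number and the `DATE9.` format only controls how it is displayed.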

7. A FEW BASICS If the programmer does not name a dataset in the DATA statement, the system names it DATA with a sequence number appended (DATA1, DATA2, and so on).

8.
       data;
         x = 1;
         output;
       run;

       data;
         y = 10;
         output;
       run;

9.
       1    data;
       2    x = 1;
       3    output;
       4    run;

       NOTE: The data set WORK.DATA1 has 1 observations and 1 variables.
       NOTE: DATA statement used (Total process time):
             real time           0.04 seconds
             cpu time            0.01 seconds

       5    data;
       6    y = 10;
       7    output;
       8    run;

       NOTE: The data set WORK.DATA2 has 1 observations and 1 variables.
       NOTE: DATA statement used (Total process time):
             real time           0.01 seconds
             cpu time            0.01 seconds

10. A FEW BASICS If the programmer does not name a dataset in the DATA statement, the system names it DATA with a sequence number appended (DATA1, DATA2, and so on). This practice is not recommended, as it can cause world-class confusion.

11. A FEW BASICS SAS dataset names officially have two parts, a library name and a dataset name, separated by a period. If the programmer does not specify a library name, the SAS system attaches WORK. to the dataset name that the programmer assigns. The programmer does not need to write WORK. in the code.

12.
        data demonstration;
          x = 1;
          output;
        run;

13.
        10   data demonstration;
        11   x = 1;
        12   output;
        13   run;

        NOTE: The data set WORK.DEMONSTRATION has 1 observations and 1 variables.
        NOTE: DATA statement used (Total process time):
              real time           0.01 seconds
              cpu time            0.00 seconds

14. A FEW BASICS SAS dataset names officially have two parts, a library name and a dataset name, separated by a period. If the programmer does not specify a library name, the SAS system attaches WORK. to the dataset name that the programmer assigns. The programmer does not need to write WORK. in the code. WORK. files disappear when the SAS session ends.

15. A FEW BASICS SAS datasets that need to be saved, or that were saved into libraries during previous SAS sessions, must be referenced by both their library names and their dataset names every time the program uses them. The programmer must first declare each library name with a LIBNAME statement.

16.
        *NOTE THAT LIBRARY DEFINITIONS ARE OPERATING-SYSTEM SPECIFIC;
        libname ajrdata "h:\";

        data ajrdata.demonstration;
          x = 1;
          output;
        run;

17.
        17   libname ajrdata "h:\";
        NOTE: Libref AJRDATA was successfully assigned as follows:
              Engine:        V9
              Physical Name: h:\
        18
        19
        20   data ajrdata.demonstration;
        21   x = 1;
        22   output;
        23   run;

        NOTE: The data set AJRDATA.DEMONSTRATION has 1 observations and 1 variables.
        NOTE: DATA statement used (Total process time):
              real time           0.21 seconds
              cpu time            0.00 seconds

18. A FEW BASICS Dataset names can have at most 32 characters and must start with a letter or underscore.

19.
        data this_is_an_example_of_a_dataset_name_that_is_too_long;
          x = 1;
          output;
        run;

20.
        25   data this_is_an_example_of_a_dataset_name_that_is_too_long;
                  -----------------------------------------------------
                  307
        ERROR 307-185: The data set name cannot have more than 32 characters.

        26   x = 1;
        27   output;
        28   run;

        NOTE: The SAS System stopped processing this step because of errors.
        NOTE: DATA statement used (Total process time):
              real time           0.00 seconds
              cpu time            0.00 seconds

21.
        data 123_this_will_not_work;
          x = 1;
          output;
        run;

22.
        80   data 123_this_will_not_work;
                  ---
                  22
                  200
        ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string, /, ;, _DATA_, _LAST_, _NULL_.
        ERROR 200-322: The symbol is not recognized and will be ignored.

        81   x = 1;
        82   output;
        83   run;

        NOTE: The SAS System stopped processing this step because of errors.
        WARNING: The data set WORK._THIS_WILL_NOT_WORK may be incomplete. When this step was stopped there were 0 observations and 1 variables.
        NOTE: DATA statement used (Total process time):
              real time           0.03 seconds
              cpu time            0.01 seconds

23. A FEW BASICS If a programmer uses the name of a dataset that already exists then SAS will simply write the new dataset over the old one of that name, without warning.

24.
        data one_num;
          x = 1;
          output;
        run;

        data one_num;
          y = 10;
          output;
        run;

        proc print data=one_num;
          title1 "one_num";
          title2 "note that this contains the data";
          title3 "from the second DATA ONE_NUM step";
        run;

25.
        65   data one_num;
        66   x = 1;
        67   output;
        68   run;

        NOTE: The data set WORK.ONE_NUM has 1 observations and 1 variables.
        NOTE: DATA statement used (Total process time):
              real time           0.03 seconds
              cpu time            0.01 seconds

        69   data one_num;
        70   y = 10;
        71   output;
        72   run;

        NOTE: The data set WORK.ONE_NUM has 1 observations and 1 variables.
        NOTE: DATA statement used (Total process time):
              real time           0.03 seconds
              cpu time            0.00 seconds

26.
        74   proc print data=one_num;
        75   title1 "one_num";
        76   title2 "note that this contains the data";
        77   title3 "from the second DATA ONE_NUM step";
        78   run;

        NOTE: There were 1 observations read from the data set WORK.ONE_NUM.
        NOTE: PROCEDURE PRINT used (Total process time):
              real time           0.01 seconds
              cpu time            0.00 seconds

27.
        one_num
        note that this contains the data
        from the second DATA ONE_NUM step

        Obs     y
          1    10

28. A FEW BASICS A DATA statement can create a single dataset or multiple datasets:

        DATA SUBJECTS;
        DATA MEN WOMEN;
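As a rough illustration of the multiple-dataset form, the sketch below routes observations to one of two datasets by naming the target in an OUTPUT statement. The variable names and the three data lines are invented for the example and are not from the presentation.

```sas
data men women;
  input subject sex $;
  /* OUTPUT followed by a dataset name writes the observation only to that dataset */
  if sex = "MALE" then output men;
  else if sex = "FEMALE" then output women;
  datalines;
1 MALE
2 FEMALE
3 MALE
;
run;
```

With these three lines, WORK.MEN should receive two observations and WORK.WOMEN one. Both datasets contain the same variables, `subject` and `sex`.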

29. GETTING DATA INTO SAS DATASETS SAS users usually refer to records in datasets as observations. SAS DATA STEPs operate as implied loops which iterate as necessary to handle the data involved.
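One way to watch the implied loop at work is to copy the automatic variable `_N_`, which counts DATA STEP iterations, into an ordinary variable. This is only a sketch; the dataset name `loop_demo` and the variable `iteration` are assumptions made for illustration.

```sas
data loop_demo;
  input age;
  iteration = _n_;   /* _N_ itself is not written to the dataset, so copy it */
  datalines;
54
35
40
;
run;

proc print data=loop_demo;
  title1 "one observation per pass through the DATA STEP";
run;
```

No explicit loop or OUTPUT statement appears in the step. SAS reads one data line per pass, writes the observation at the bottom of the step, and stops when the input runs out.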

30. GETTING DATA INTO SAS DATASETS A programmer can assign data values directly through assignment statements.

31.
        data assignments;
          length country $ 12;
          subject = 25;
          country = "PARAGUAY";
        run;

32.
        13
        14   data assignments;
        15   length country $ 12;
        16   subject = 25;
        17   country = "PARAGUAY";
        18   run;

        NOTE: The data set WORK.ASSIGNMENTS has 1 observations and 2 variables.
        NOTE: DATA statement used (Total process time):
              real time           0.03 seconds
              cpu time            0.00 seconds

33.
        assignments

        Obs    country     subject
          1    PARAGUAY         25

34. GETTING DATA INTO SAS DATASETS A programmer can assign data values by including a DATALINES or CARDS section in a DATA STEP. Note that SAS accepts these two interchangeably even when no actual cards are involved.

35.
        data free_form;
          input age sex $;
          datalines;
        54 MALE
        35 MALE
        40 FEMALE
        29 FEMALE
        ;;;;

        proc print data=free_form;
          title1 "data free_form";
        run;

36.
        1
        2    data free_form;
        3    input age sex $;
        4    datalines;

        NOTE: The data set WORK.FREE_FORM has 4 observations and 2 variables.
        NOTE: DATA statement used (Total process time):
              real time           0.14 seconds
              cpu time            0.03 seconds

        9    ;;;;
        10   proc print data=free_form;
        11   title1 "data free_form";
        12   run;

        NOTE: There were 4 observations read from the data set WORK.FREE_FORM.
        NOTE: PROCEDURE PRINT used (Total process time):
              real time           0.17 seconds
              cpu time            0.04 seconds

37.
        data free_form

        Obs    age    sex
          1     54    MALE
          2     35    MALE
          3     40    FEMALE
          4     29    FEMALE

38. GETTING DATA INTO SAS DATASETS FROM EXTERNAL FLAT OR DELIMITED FILES INFILE statements reference and describe external source files. INPUT statements direct SAS to read and incorporate the data from these external source files.

39. GETTING DATA INTO SAS DATASETS FROM EXTERNAL FLAT OR DELIMITED FILES INFILE locates and describes an external data source. The syntax of INFILE statements varies by operating system. EXAMPLES: WINDOWS:

        data test;
          infile 'c:\work\space\sasajr\test.dat';

40. GETTING DATA INTO SAS DATASETS FROM EXTERNAL FLAT OR DELIMITED FILES UNIX:

        data test;
          infile 'users/sasajr/test.dat';

    MAINFRAME:

        //FILEIN DD DSN=YAHUPITZ.AJRDATA,DISP=SHR
        . . .
        DATA TEST;
          INFILE FILEIN;

41. GETTING DATA INTO SAS DATASETS FROM EXTERNAL FLAT OR DELIMITED FILES A FILENAME statement in open SAS code can also refer to an external file. It is also operating-system-specific. Example from Windows:

        FILENAME TESTDATA 'c:\work\space\sasajr\test.dat';
        . . .
        DATA TEST;
          INFILE TESTDATA;

42. GETTING DATA INTO SAS DATASETS FROM EXTERNAL FLAT OR DELIMITED FILES INFILE statements can also describe a file as delimited with the DLM option, which identifies the delimiter used in the file in question. EXAMPLE FOR A COMMA-DELIMITED FILE:

        infile 'users/sasajr/test.dat' dlm = ',';

    This is useful in turning .CSV files from Excel spreadsheets into SAS datasets.

43. GETTING DATA INTO SAS DATASETS FROM EXTERNAL FLAT OR DELIMITED FILES Two options that are useful with delimited-file INFILE statements are DSD, which treats two consecutive delimiters as enclosing a missing value, and MISSOVER, which keeps SAS from reading data from the following line when the current record ends before every variable is filled in (the remaining variables are set to missing instead). EXAMPLE FOR A COMMA-DELIMITED FILE:

        infile 'users/sasajr/test.dat' dlm = ',' dsd missover;
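The effect of the two options can be tried without an external file by pointing INFILE at in-stream data. This is only a sketch: the dataset name, the `country` variable, and the three records are invented for illustration.

```sas
data delimited_demo;
  /* INFILE DATALINES applies the INFILE options to the in-stream records */
  infile datalines dlm = ',' dsd missover;
  input age sex $ country $;
  datalines;
54,MALE,PERU
35,,CHILE
40,FEMALE
;
run;

proc print data=delimited_demo;
  title1 "data delimited_demo";
run;
```

In the second record the two adjacent commas leave `sex` missing, thanks to DSD, so `CHILE` still lands in `country`. In the third record MISSOVER sets `country` to missing rather than letting INPUT move on to the next line to look for a value.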

44. GETTING DATA INTO SAS DATASETS FROM EXTERNAL FLAT OR DELIMITED FILES Once the data source is identified with either INFILE or DATALINES, INPUT creates the variables in the resultant SAS dataset. The simplest form of an INPUT statement is often called free-form input. It does not require the data to be laid out consistently in columns. By default, character values read this way are limited to 8 characters (longer values are truncated) and cannot include spaces. Note the use of the dollar sign to indicate that a variable is character rather than numeric.

45.
        data free_form;
          input age sex $;
          datalines;
        54 MALE
        35 MALE
        40 FEMALE
        29 FEMALE
        ;;;;

        proc print data=free_form;
          title1 "data free_form";
        run;

46.
        25   data free_form;
        26   input age sex $;
        27   datalines;

        NOTE: The data set WORK.FREE_FORM has 4 observations and 2 variables.
        NOTE: DATA statement used (Total process time):
              real time           0.03 seconds
              cpu time            0.00 seconds

        32   ;;;;
        33
        34   proc print data=free_form;
        35   title1 "data free_form";
        36   run;

        NOTE: There were 4 observations read from the data set WORK.FREE_FORM.
        NOTE: PROCEDURE PRINT used (Total process time):
              real time           0.01 seconds
              cpu time            0.01 seconds

47.
        data free_form

        Obs    age    sex
          1     54    MALE
          2     35    MALE
          3     40    FEMALE
          4     29    FEMALE

48. GETTING DATA INTO SAS DATASETS FROM EXTERNAL FLAT OR DELIMITED FILES If the data are column-aligned in their source you can use column pointers, each of which consists of an @ sign followed by a column number, to indicate where each variable starts within the source record.

49.
        data column_aligned;
          input @1 age @4 sex $;
          datalines;
        54 MALE
        35 MALE
        40 FEMALE
        29 FEMALE
        ;;;;

50.
        38   data column_aligned;
        39   input @1 age @4 sex $;
        40   datalines;

        NOTE: The data set WORK.COLUMN_ALIGNED has 4 observations and 2 variables.
        NOTE: DATA statement used (Total process time):
              real time           0.01 seconds
              cpu time            0.00 seconds

        45   ;;;;
        46
        47   proc print data = column_aligned;
        48   title1 "data column_aligned";
        49   run;

        NOTE: There were 4 observations read from the data set WORK.COLUMN_ALIGNED.
        NOTE: PROCEDURE PRINT used (Total process time):
              real time           0.01 seconds
              cpu time            0.01 seconds