1 / 46

Chapter Six: Understanding DATA Step Processing

Chapter Six: Understanding DATA Step Processing. Objectives: Identify the two phases that occur when a DATA step is processed Interpret automatics variables Identify the processing phase in which an error occurs Debug SAS DATA steps

Download Presentation

Chapter Six: Understanding DATA Step Processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Chapter Six: Understanding DATA Step Processing Objectives: • Identify the two phases that occur when a DATA step is processed • Interpret automatics variables • Identify the processing phase in which an error occurs • Debug SAS DATA steps • Test programs by limiting the number of observations that are created • Flag errors in the SAS lag

  2. 1 1 2 1---5----0----5----0 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 98212/12/00dfw 5 85 43912/13/00LAX 14196 98212/13/00DFW 15116 43112/14/00LaX 17166 98212/14/00DFW 7 88 11412/15/00LAX 187 98212/15/00DFW 14 31 Reading Raw Data Files Data for flights from New York to Dallas (DFW) and Los Angeles (LAX) are stored in a raw data file. Create a SAS data set from the raw data.

  3. Read an external raw data set and Create a Permanent SAS Data Sets The external raw data set, dfwlax.dat is saved in the following: C:\Math660\RawData folder. The new SAS data set, dfwlax.sas7bdat is created in the WORK Library. data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; NOTE: The data set DFWLAX has 10 observations and 5 variables.

  4. How SAS Processing Programs • Phase 1: Compilation phase: • Each SAS statement is scanned for syntax errors. • When compilation phase is completed, the descriptor portion of the new data set is created. • If there are syntax errors, SAS usually stop processing the next step of reading the data. • If there is no syntax error, SAS proceeds to conduct the Execution Phase. • Phase 2: Execution Phase: • Each data set is read by following the infile / input statements. • Data reading errors are checked and reported in SAS log. • Once every variable is read, the data case is saved as an observation in the newly created SAS data set. • Once all cases are processed, the execution step is completed.

  5. Compilation Phase • For external raw data set (such as .dat, .txt data types): • An Input Buffer is created to hold a data case from the external file (such as .dat, .txt files) No Input Buffer is created when read a SAS data set (such as using SET, MERGE to read a SAS data set). • A Program Data Vector is created with two automatic variables: • _N_ : for tracing the data cases in sequence. • _ERROR_: for tracing if an error has occurred during the data input. • As the INPUT statement is compiled, a slot is added to the Program data Vector for each variable, including the new variables from the assignment statement. • The attributes of the variable are determined by the expression of the statement. • At the end of the Data Step, the compilation phase is completed, the Data Descriptor portion of the data set is created, which include: • Name of the data set, • Number of observations and variables • Name and attributes of the variables • No observation is read to the Data Set at this phase until Execution Phase starts.

  6. Looking Behind the Scenes The DATA step is processed in two phases: • Compilation phase: Creates the Data Descriptor information of the SAS data set: data set attributes and variable attributes • Execution phase: Read the data records based on the instruction described in the Data Step, and creates the Data Portion of the SAS data set. Compilation Execution data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run;

  7. Looking Behind the Scenes At compile time, SAS creates • an input bufferto hold the current raw data file record that is being processed • a program data vector (PDV)to hold the current SAS observation • the descriptor portionof the output data set.

  8. Compiling the DATA Step data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; ...

  9. Compiling the DATA Step data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; Input Buffer ...

  10. Input Buffer PDV Compiling the DATA Step data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; ...

  11. Input Buffer PDV Compiling the DATA Step data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; ...

  12. Input Buffer PDV Compiling the DATA Step data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; ...

  13. Input Buffer PDV Compiling the DATA Step data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; ...

  14. Input Buffer PDV Compiling the DATA Step data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; ...

  15. Input Buffer Compiling the DATA Step data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; PDV Program Data Vector dfwlax descriptor portion ...

  16. Flow for Execution Phase: Compile Phase Initialize Variables to Missing (PDV) End of File? Execute INPUT Statement Yes Execute Other Statements Next Step No Output to SAS Data Set

  17. Input Buffer Execution Phase – when reading an External Raw Data Set data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; PDV dfwlax

  18. Execution Phase Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  19. data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data Input Buffer PDV dfwlax ...

  20. Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  21. Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  22. Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  23. Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  24. Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  25. Automatic return Input Buffer Automatic output data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  26. Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data Reinitialize variables to missing PDV dfwlax ...

  27. Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  28. Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  29. Input Buffer data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data PDV dfwlax ...

  30. Automatic return Automatic output data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data Input Buffer PDV dfwlax ...

  31. Executing the DATA Step data work.dfwlax; infile 'raw-data-file'; input Flight $ 1-3 Date $ 4-11 Dest $ 12-14 FirstClass 15-17 Economy 18-20; run; 43912/11/00LAX 20137 92112/11/00DFW 20131 11412/12/00LAX 15170 Raw Data Input Buffer PDV dfwlax

  32. The automatic Variables Created during Execution Phase Two automatic variables are added to PDV at Execution Phase, but are not written to the data set. • _N_ : counts # of times the Data step begins to execute. • _ERROR_ : signals the occurrence of any error during execution phase. Default is 0, which means no error. When one or more errors occur, the value is set to 1. The variable _N_ is first initialized to be 1. Once an iteration of reading an observation is completed (executing the RUN statement), N is added by 1. The variable _ERROR_ = 0 if no error. Once it has a data execution error, it changes to 1 and retains to be 1 afterwards.

  33. When does the Execution Phase stop? Each observation is completed after executing the RUN statement or at the end of the Data Step. This is called an iteration. The process moves back to the top of the data step for next iteration. The process continues until all records in the data set are executed. SAS recognizes the execution process is completed by recognizing the END-OF_FILE_MARKER. At the end of a raw data set, there is an end-of-file-marker. The execution phase continues until the END-OF-file_marker is reached. The data portion of the new data set is complete. NOTE: The order of the variables in the new SAS data set is the order of the variables appeared in the Data Step.

  34. Execution Phase – Reading External data Compile Program Initialize Variables to Missing (PDV) End of File? Execute INPUT Statement Yes Execute Other Statements Next Step No Output to SAS Data Set

  35. Execution Phase – Reading A SAS Data Compile Program Program Data Vector is initialized only once at the beginning. Initialize Variables to Missing (PDV) End of File? Execute SET, MERGE, Statement Yes Execute Other Statements Next Step No Output to SAS Data Set

  36. When reading variables from an external raw data sets, the values of each variable in the Data Step are initialized to missing at the beginning of the execution. However, there are some exceptions: • Variables that are named in a RETAIN statement will not be initialized to missing. • Variables that are created in a SUM statement will not be initialized to missing. • Data element in a _TEMPORTARY_ array will not be initialized to missing. • Any variables that are created with options in the FILE or INFILE statements will not be initialized. • Any automatic variables defined to retain its values will not be initialized.

  37. Debugging a Data Step Common Errors during the Compilation Phase: • Misspelled keywords and names. • Missing semicolons. • Unbalanced quotation marks. • Invalid options. SAS Actions for Compilation Error: • Print the word ERROR followed by the error message in SAS Log. • Complied, but not executed the step when error occurred and print the warning: • NOTE: The SAS System stopped processing this step because of errors. • Some errors can be explained fully, but, many are not. You need to check the syntax carefully to identify the errors and take actions.

  38. Errors in the Execution Phase When syntax is correct, SAS begins to read data in the execution phase. If errors occur during this phase, Common actions by SAS are: • A note, warning or error message is displayed in SAS log. • The values in the Program Data Vector are displayed in the log. • The processing of the phase may continue or stop. NOTE: Misspelling the fileref in the INFILE statement is not a syntax error. Since SAS does not valid the user provided reference until the execution phase. Incorrectly identify a variable type is an execution error, not a syntax error. Since SAS does not verify the variable type until data values are read. Errors in reading data values often result in missing data in the newly created data set.

  39. Some tips for Testing SAS Programs • Create _NULL_ Data Set, instead of a new SAS data set: • By using the Data Step: DATA _NULL_; you do not actually create a new SAS data set. This will save time and confusion regarding to what version of the new data set is created. • Use OBS = option in the INFILE statement to limit # of observations in the program testing stage. When compiling and testing a program, it is not necessary to read all data cases for testing. This is especially important if the data set is large. By using OBS=10 in the INFILE statement will only read 10 cases for testing both compilation and executing phases. • Use PUT statement to write specific error messages to SAS log for identifying some specific errors users.

  40. PUT statement to generate diagnosis information for checking data errors General Syntax; PUT Specifications; Where specifications may include • A character string, • One or more data set variables, • Automatic variables _N_, _ERROR_, • Automatic variable _ALL_ Ex: PUT ‘Data Error Checking:’ _ALL_;

  41. Use IF-THEN_ELSE statement along with the PUT statement to conditionally check for Values and test the program Read the loans.dat data and practice using IF-THEN-ELSE with PUT statements to check the data values: Data loan; Infile ‘C:\Math660\RawData\loan.dat’; Input Code $ account $ amount rate months payment; If code = ‘1’ then type = ‘Variable Interest Loan’ ; Else if code = ‘2’ then type = ‘Fixed Interest Loan’ ; Else PUT ‘This is an Invalid Code’ code = _N_ = _ERROR_ = ; Run;

  42. Use some summary PROC procedures for Validating and Cleaning Data Various PROC procedures are handy for helping to identify potential data errors, especially when there is NO error found in the Execution Phase. This is an important step in data analysis: Validating and Cleaning data The commonly used procedures are (More will be given later): PROC FREQ DATA = sas-data-set <options> ; TABLES variables; PROC MEANS DATA = sas-data-set <options> ; VAR variables; PROC UNIVARIATE DATA=sas-data-set <options>; VAR variables;

  43. Some Examples PROC FREQ data = loan; Tables code ; RUN; PROC MEANS data = loan; VAR amount months payment;

  44. Exercise 1 • Open dfwlax and dfwlax_n data sets, and check the difference. • You should see that • the values for Date in dfwlax.dat is Character type; • while, Date in dfwlax_n.dat is a Numeric type. • Write a SAS program to • read the external data dfwlax_n.dat data set, save it in your Work library • The () is the startCol.-endcol for each variable is: • Flight $(1-3), Date (4-11), Dest $(12-14), FirstClass(15-17), Economy (18-20) • Add PUT statement to check the possible typing error for Dest by listing the Case ID and the incorrect Dest values in the SAS log. • NOTE: Dest can be either LAX or DFW. • Print the newly created SAS dada set and use the format DATE9. to display the Date variable. • Try a wrong fileref in the INFILE statement to observe that this is not a syntax error, but an execution error

  45. Solution use FILENAME, INFILE • filename dfwfile 'c:\Math660\rawdata\dfwlax_n.dat'; • data dfwlax; • infile dfwfile; • input flight $ 1-3 date 4-11 dest $ 12-14 firstclass 15-17 economy 18-20; • if dest not in ('LAX' 'DFW') then • put 'Case ID =' _N_ 'Destination = ' dest; • run; • procprint; run;

  46. Solution use only INFILE data dfwlax; infile 'c:\Math660\rawdata\dfwlax_n.dat'; input flight $ 1-3 date 4-11 dest $ 12-14 firstclass 15-17 economy 18-20; if dest not in ('LAX' 'DFW') then put 'Case ID =' _N_ 'Destination = ' dest; run; procprint; run;

More Related