1 / 29

Subsetting and Merging SAS Data Sets

Learn how to read, subset, and merge SAS data sets to create new, permanent data sets. Includes examples and best practices.

rsharp
Download Presentation

Subsetting and Merging SAS Data Sets

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lesson 7 - Topics • Reading SAS data sets • Sub-setting and merging SAS data sets • Permanent SAS data sets • Programs 10-13 in course notes • LSB 2:19-21 3:7 6:1-2,4-5,9,11-13

  2. Working With SAS Data Sets • Reading SAS dataset • SET Statement • Merging SAS datasets • MERGE Statement Done within a DATA step

  3. SET STATEMENT • Reads SAS data set (one row at a time) • Replaces INFILE and INPUT statements used when reading in raw data • KEEP brings in selected variables (columns) • Where brings in selected observations (rows) DATA new; SET old (KEEP = varlist); WHERE = condition; RUN; This creates a new data set called new that has the variables in varlist and selected observations from old.

  4. PROGRAM 10 Making SAS Datasets from Other SAS Datasets; DATA tdata; INFILE‘C:\SAS_Files\tomhs.data' ; INPUT @ 1 ptid $10. @ 12 clinic $1. @ 25 group 1. @ 30 sex 1. @ 123 sbp12 3. @ 14 randdate $10. ; RUN; * Making a new dataset containing only men; DATA men; SET tdata; * reads the existing dataset; WHERE sex = 1; This does the selection; ifgroup in(1,2,3,4,5)thenactive = 1;else ifgroup in(6)thenactive = 2; KEEP ptid clinic group sbp12 randdate active; RUN;

  5. * Making a new dataset containing only women; DATA women; SET tdata; WHERE sex = 2; ifgroup in(1,2,3,4,5)thenactive = 1;else ifgroup in(6)thenactive = 2; KEEP ptid clinic group sbp12 randdate active; RUN; We now have 3 datasets “active” tdata men women

  6. * Making both datasets in one data step; DATA men women; SET tdata; ifgroup in(1,2,3,4,5)thenactive = 1;else ifgroup in(6)thenactive = 2; ifsex = 1 then OUTPUTmen;else if sex= 2 then OUTPUT women; KEEP ptid clinic randdate group sbp12 active; RUN; Partial Log: NOTE: There were 100 obs read from WORK.TDATA NOTE: The data wet WORK.MEN has 73 obs and 7 variables NOTE: The data set WORK.WOMEN has 27 obs and 7 variables If SAS sees an OUTPUT statement then SAS will output only then; if there is no OUTPUT statement SAS outputs at the end of the data step.

  7. KEEP OPTION vs KEEP STATEMENT Purpose: Restricts variables read-in or written out DATA highbp; * Brings in only these variables; SET tdata (KEEP = ptid sex sbp12); RUN; DATA highbp; SET tdata ; * Reads in all variables; * Writes out only these variables; KEEP ptid sex sbp12; RUN; * There is also a DROP option/statement;

  8. WHERE Versus Logical IF DATA highbp; * Brings in only certain rows; SET tdata; WHERE sbp12 > 140; RUN; DATA highbp; SET tdata ; * Brings in all rows, outputs only certain rows (can be a new variable); IF sbp12 > 140;* If true then continue; RUN;

  9. PROGRAM 11 - Merging SAS Datasets DATA clinic; INFILE DATALINES; INPUT id $ sbp ; DATALINES; C03615 115 B00979 107 B00644 138 D01348 142 A01088 117 B01408 121 B00025 130 B00714 144 A01166 113 … more data ; DATA lab; INFILE DATALINES; INPUT id $ glucose; DATALINES; C03615 102 B00644 089 D01348 111 A01088 093 B01408 094 B00025 077 B00714 100 A01166 113 D00942 103 … more data ;

  10. * Creating merged dataset; PROC SORT DATA= clinic; BY id; PROC SORT DATA= lab; BY id; DATA study; MERGE clinic lab; BY id ; RUN; Note: The BY statement is very important!

  11. Merged Dataset Obs id sbp glucose 1 A00869 110 99 2 A01088 117 93 3 A01166 113 113 4 B00025 130 77 5 B00644 138 89 6 B00714 144 100 7 B00867 114 98 8 B00979 107 . 18 D00942 . 103 20 D01809 129 .

  12. What if you want only observations that are in both datasets? DATA study; MERGE clinic (IN=in1) lab (IN=in2); BY id; if in1 and in2; RUN; PROCPRINTDATA=study; TITLE‘Patients with Clinic and Lab'; RUN;

  13. Logical Statements * Must be in 1st dataset; if in1; * Same as: if in1 = 1; * Must be in 2nd dataset; if in2; * Must be in both datasets; if in1 and in2;

  14. Things to Remember When Merging Datasets • Need to have common variable name in each dataset to use as linking variable • Variables in dataset with no match will be set to missing • Rows matched that have same variable names will be assigned right-most dataset value • Always remember the BY statement in the merge

  15. Temporary vs Permanent SAS Datasets Temporary (or working) SAS dataset - After SAS session is over the dataset is deleted. DATA bp; * bp is deleted after SAS session; (rest of program) Permanent SAS dataset - After program is run the dataset is saved and is available for use in future programs. You need to tell SAS where to store/retrieve the dataset. Note: For PC SAS the working dataset is available until you end the SAS session.

  16. Reasons to Create Permanent SAS Datasets • Read raw data and compute calculated variables only once • All variables have assigned names and labels. • Data is ready to be analyzed. • Dataset can be sent to other computers or users.

  17. Creating a Permanent Dataset LIBNAMEmylib‘C:\My SAS Datasets’; LIBNAME – assigns a directory (folder) reference name. In this example the directory ‘C:\My SAS Datasets’ is assigned a reference name of mylib. DATAmylib.sescore; Tells SAS to create a dataset called sescore in the directory referenced by mylib, which is ‘C:\My SAS Datasets’.

  18. Examples of LIBNAME Statements LIBNAME mylib ‘C:\My SAS Files'; LIBNAME class ‘C:\My SAS Files' ; LIBNAME ph6420 'C:\My SAS Files\SASClass\' ; LIBNAME points to a directory (folder) DATA mylib.datasetname; DATA class.datasetname; DATA ph6420.datasetname; On UNIX and PC the file will be called datasetname.sas7bdat

  19. PROGRAM 12 LIBNAMEmylib‘C:\SAS_Files'; DATAmylib.sescore; INFILE‘C:\SAS_Files\tomhs.data'LRECL =400; INPUT @ 1 ptid $10. @ 12 clinic $1. @ 14 randdate mmddyy10. @ 25 group 1. @ 49 educ 1. @ 85 wtbl 5.1 @ 97 wt12 5.1 @115 sbpbl 3.0 @123 sbp12 3.0 @236 (sebl_1-sebl_20) (1. +1) @276 (se12_1-se12_20) (1. +1) ;

  20. wtd12 = wt12 - wtbl; sbpd12 = sbp12 - sbpbl; sescrbl = MEAN (OF sebl_1 - sebl_20) ; sescr12 = MEAN (OF se12_1 - se12_20) ; sescrd12 = sescr12 - sescrbl ; LABEL educ = 'Highest Education Level'; LABEL wt12 = 'Weight (lbs) at 12 Months'; LABEL wtbl = 'Weight (lbs) at Baseline'; LABEL wtd12 = 'Weight Change at Baseline'; LABEL sbpbl = 'Systolic BP (mmHg) at Baseline'; LABEL sbp12 = 'Systolic BP (mmHg) at 12 Months'; LABEL sbpd12 = 'Systolic BP Change at 12 Months'; LABEL group = 'Treatment Group (1-6)'; LABEL sescrbl = 'Side Effect at Baseline'; LABEL sescr12 = 'Side Effect at 12 Months'; LABEL sescrd12 = 'Side Effect Change Score'; FORMAT randdate mmddyy10. ; DROP sebl_1-sebl_20 se12_1-se12_20 ;

  21. 60 LIBNAME mylib 'C:\SAS_Files'; NOTE: Libref MYLIB was successfully assigned as follows: Engine: V9 Physical Name: C:\SAS_Files DATA mylib.sescore; NOTE: The infile 'C:\SAS_Files\tomhs.data' is: File Name=C:\SAS_Files\tomhs.data, RECFM=V,LRECL=400 NOTE: 100 records were read from the infile 'C:\SAS_Files\tomhs.data'. NOTE: The data set MYLIB.SESCORE has 100 observations and 14 variables.

  22. PROCCONTENTSDATA=mylib.sescore VARNUM ; TITLE'Description of Variables in Dataset SESCORE' ; RUN; What is inside a SAS dataset? Data Names, labels, and formats of all variables PROC CONTENTS reads the descriptor portion of the dataset

  23. Description of Variables in Dataset SESCORE The CONTENTS Procedure Data Set Name: MYLIB.SESCORE Observations: 100 Member Type: DATA Variables: 14 Engine: V9 Indexes: 0 Created: 10:59 Wednesday, August 11,2004 Observation Length: 112 Last Modified: 10:59 Wednesday, August 11,2004 Deleted Observations: 0 Protection: Compressed: NO Data Set Type: Sorted: NO Label: -----Engine/Host Dependent Information----- File Name: C:\SAS_Files\sescore.sas7bdat Release Created: 9.1.3 Host Created: XP_PRO File Size (bytes): 24576 Note: mylib is not a part of the dataset name

  24. Variables listed in creation order # Variable Type Len Pos Format Label ----------------------------------------------------------------------------- 1 ptid Char 10 96 Patient ID 2 clinic Char 1 106 Clinical Center 3 randdate Num 8 0 MMDDYY10. Randomization Date 4 group Num 8 8 Treatment Group (1-6) 5 educ Num 8 16 Highest Education Level 6 wtbl Num 8 24 Weight (lbs) at Baseine 7 wt12 Num 8 32 Weight (lbs) at 12 Months 8 sbpbl Num 8 40 Systolic BP (mmHg) at Baseline 9 sbp12 Num 8 48 Systolic BP (mmHg) at 12 Months 10 wtd12 Num 8 56 Weight Change at Baseline 11 sbpd12 Num 8 64 Systolic BP Change at 12 Months 12 sescrbl Num 8 72 Side Efect at Baseline 13 sescr12 Num 8 80 Side Efect at 12 Months 14 sescrd12 Num 8 88 Side Efect Change Score This becomes the documentation of the dataset

  25. Using PROC COPY to copy work dataset to permanent dataset Make a work dataset first – then when you know that is working correctly copy the work dataset to a permanent dataset. LIBNAMEmylib‘C:\SAS_Files'; DATA sescore; …. RUN; PROC COPYIN=work OUT=mylib; SELECT sescore; RUN;

  26. PROGRAM 13 LIBNAMEclass‘C:\SAS_Files' ; * Tells SAS where to find the SAS dataset; PROCMEANSDATA=class.sescore ; TITLE'Means of All Numeric Variables on SAS Permanent Dataset'; RUN; PROCCORRDATA=class.sescore; VAR wtd12 sbpd12 sescrd12; TITLE'Correlation Matrix of 3 Change Variables'; RUN; What if dataset was moved to a different folder? Just need to change LIBNAME

  27. Means of All Numeric Variables on SAS Permanent Dataset The MEANS Procedure Variable Label N Mean ------------------------------------------------------------------ randdate Randomization Date 100 10101.29 group Treatment Group (1-6) 100 3.62 educ Highest Education Level 99 6.00 wtbl Weight (lbs) at Baseline 100 191.76 wt12 Weight (lbs) at 12 Months 92 180.33 sbpbl Systolic BP (mmHg) at Baseline 100 139.92 sbp12 Systolic BP (mmHg) at 12 Months 92 124.04 wtd12 Weight Change at Baseline 92 -11.53 sbpd12 Systolic BP Change at 12 Months 92 -15.64 sescrbl Side Effect at Baseline 100 1.19 sescr12 Side Effect at 12 Months 95 1.16 sescrd12 Side Effect Change Score 95 -0.03 ------------------------------------------------------------------

  28. Pearson Correlation Coefficients Prob > |r| under H0: Rho=0 Number of Observations wtd12 sbpd12 sescrd12 wtd12 1.00000 0.23986 0.15341 Weight Change at Baseline 0.0213 0.1443 92 92 92 sbpd12 0.23986 1.00000 0.05679 Systolic BP Change at 12 Months 0.0213 0.5908 92 92 92 sescrd12 0.15341 0.05679 1.00000 Side Efect Change Score 0.1443 0.5908 92 92 95

  29. *---------------------------------------------------------------**---------------------------------------------------------------* Often you will read the permanent SAS dataset in a DATA step to modify or add variables. Usually these will be put on a new temporary SAS dataset. The SET statement reads a SAS dataset *---------------------------------------------------------------*; LIBNAME class 'C:\SAS_Files' DATA rxdata; SET class.sescore; if group in(1,2,3,4,5) then rx = 1; else rx = 2; RUN; PROCMEANSDATA=rxdata NMEANMAXDEC=2FW=7; CLASS group; VAR sbpd12 wtd12 sescrd12; TITLE'Change in SBP, Weight, and Side Effect Score by Treatment'; RUN;

More Related