1 / 39

The Complexities and Options of Merging/Lookups with Base/SAS

The Complexities and Options of Merging/Lookups with Base/SAS . Jennifer Sinodis Metris Companies. Task. Need to Identify specific records in the Detail File based on criteria in the Lookup Information Table. Scenario. Have Two Files:

arawn
Download Presentation

The Complexities and Options of Merging/Lookups with Base/SAS

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Complexities and Options of Merging/Lookups with Base/SAS Jennifer Sinodis Metris Companies

  2. Task • Need to Identify specific records in the Detail File based on criteria in • the Lookup Information Table Scenario • Have Two Files: • Detail Data in a Comma Delimited Text File (csv) with a Unique Key • Lookup Information in a SAS Data Set

  3. SAS Options • MERGE Statement • PROC SQL • PUT Statement • PROC FORMAT • MODIFY Statement

  4. Build Master_Data.csv /* Define Library for Permanent Data Sets */ LIBNAME VALSUG '/home/valsug'; /* Create Data Set to recreate Scenario */ DATA VALSUG.MASTER_DATA_CHK; INPUT ACCT_NO BRANCH_NO $ PRODUCT $; CARDS; 111222 1000A CC 111333 1000B CC 111555 1000C HE 111666 1000D CC 111777 1000B HE 111888 1000E CC 111999 1000A HE ; RUN; /* For Example: Detail Data File is unique by Account Number (Acct_No) with Branch Number (Branch_No), and Product Type (Product) */

  5. Build Master_Data.csv (continued) /* Export SAS Data Set to create Comma Delimited Text File (.csv) */ DATA _NULL_; SET VALSUG.MASTER_DATA_CHK; FILE '/home/25495/sample_master_data.csv' delimiter = ',' DSD dropover LRECL=32767; FORMAT ACCT_NO BEST12.; FORMAT BRANCH_NO $8.; FORMAT PRODUCT $8.; PUT ACCT_NO @; PUT BRANCH_NO $ @; PUT PRODUCT $; RUN;

  6. View Master_Data.csv

  7. Build Branch_Table /* Create Lookup Branch Table */ DATA VALSUG.BRANCH_TABLE; INPUT BRANCH_NO $ BRANCH_NAME $ REGION $; CARDS; 1000A GREEN MIDWEST 1000B EASTON WEST 1000C MIDVILLE WEST 1000D COLBY EAST 1000E SUDDER EAST 1000F ALBANY EAST ; RUN; /* For Example: Lookup Table is Branch Information unique by Branch Number (Branch_No with Branch Name (Branch_Name) and Region Information */

  8. View Branch_Table

  9. Using MERGE Statement /* In order to use MERGE Statement, must convert csv file to a SAS Data Set */ /*---------------------------------------------------------------------------------*/ DATA MASTER_DATA; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; RUN;

  10. Using MERGE Statement - LOG 54 /*---------------------------------------------------------------------------------*/ 55 DATA MASTER_DATA; 56 INFILE '/home/sample_master_data.csv' delimiter = ','; 57 INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; 58 RUN; NOTE: The infile '/home/sample_master_data.csv' is: File Name=/home/sample_master_data.csv, Owner Name=*****,Group Name=staff, Access Permission=rw-r--r--, File Size (bytes)=112 NOTE: 7 records were read from the infile '/home/sample_master_data.csv'. The minimum record length was 15. The maximum record length was 15. NOTE: The data set WORK.MASTER_DATA has 7 observations and 3 variables. NOTE: DATA statement used: real time 0.12 seconds cpu time 0.03 seconds

  11. View Master_Data

  12. Using MERGE Statement (continued) /* Both Data Sets must be Sorted to Use MERGE Statement */ PROC SORT DATA=MASTER_DATA; BY BRANCH_NO; RUN; PROC SORT DATA=VALSUG.BRANCH_TABLE; BY BRANCH_NO; RUN; DATA MASTER_DATA_EAST; MERGE MASTER_DATA (IN=INA) VALSUG.BRANCH_TABLE (IN=INB); BY BRANCH_NO; IF INA; IF REGION IN ('EAST') THEN OUTPUT; RUN; TITLE 'Using MERGE Statement'; PROC PRINT; RUN; /*---------------------------------------------------------------------------------*/

  13. Using MERGE Statement (continued) - LOG 60 /* USING MERGE STATEMENT */ 61 PROC SORT DATA=MASTER_DATA; 62 BY BRANCH_NO; 63 RUN; NOTE: There were 7 observations read from the data set WORK.MASTER_DATA. NOTE: The data set WORK.MASTER_DATA has 7 observations and 3 variables. NOTE: PROCEDURE SORT used: real time 0.06 seconds cpu time 0.01 seconds 64 65 PROC SORT DATA=VALSUG.BRANCH_TABLE; 66 BY BRANCH_NO; 67 RUN; NOTE: There were 6 observations read from the data set VALSUG.BRANCH_TABLE. NOTE: The data set VALSUG.BRANCH_TABLE has 6 observations and 3 variables. NOTE: PROCEDURE SORT used: real time 0.04 seconds cpu time 0.01 seconds

  14. Using MERGE Statement (continued) - LOG 69 DATA MASTER_DATA_EAST; 70 MERGE MASTER_DATA (IN=INA) VALSUG.BRANCH_TABLE (IN=INB); 71 BY BRANCH_NO; 72 IF INA; 73 IF REGION IN ('EAST') THEN OUTPUT; 74 RUN; NOTE: There were 7 observations read from the data set WORK.MASTER_DATA. NOTE: There were 6 observations read from the data set VALSUG.BRANCH_TABLE. NOTE: The data set WORK.MASTER_DATA_EAST has 2 observations and 5 variables. NOTE: DATA statement used: real time 0.08 seconds cpu time 0.02 seconds 75 76 TITLE 'Using MERGE Statement'; 77 PROC PRINT; 78 RUN; NOTE: There were 2 observations read from the data set WORK.MASTER_DATA_EAST. NOTE: PROCEDURE PRINT used: real time 0.02 seconds cpu time 0.00 seconds

  15. Using MERGE Statement - OUTPUT Using MERGE Statement BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME REGION 1 111666 1000D CC COLBY EAST 2 111888 1000E CC SUDDER EAST

  16. Using PROC SQL /* In order to use PROC SQL, must convert csv file to a SAS Data Set */ /*---------------------------------------------------------------------------------*/ DATA MASTER_DATA; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; RUN; PROC SQL; CREATE TABLE MASTER_DATA_EAST2 AS SELECT A.*, B.BRANCH_NAME FROM MASTER_DATA A LEFT JOIN VALSUG.BRANCH_TABLE B ON A.BRANCH_NO = B.BRANCH_NO WHERE REGION EQ 'EAST'; QUIT; TITLE 'Using PROC SQL'; PROC PRINT; RUN; /*---------------------------------------------------------------------------------*/

  17. Using PROC SQL - LOG 91 PROC SQL; 92 CREATE TABLE MASTER_DATA_EAST2 AS 93 SELECT A.*, B.BRANCH_NAME 94 FROM MASTER_DATA A LEFT JOIN VALSUG.BRANCH_TABLE B 95 ON A.BRANCH_NO = B.BRANCH_NO 96 WHERE REGION EQ 'EAST'; NOTE: Table WORK.MASTER_DATA_EAST2 created, with 2 rows and 4 columns. 97 QUIT; NOTE: PROCEDURE SQL used: real time 0.09 seconds cpu time 0.02 seconds 98 99 TITLE 'Using PROC SQL '; 100 PROC PRINT; 101 RUN; NOTE: There were 2 observations read from the data set WORK.MASTER_DATA_EAST2. NOTE: PROCEDURE PRINT used: real time 0.02 seconds cpu time 0.02 seconds

  18. Using PROC SQL - OUTPUT Using PROC SQL BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME 1 111666 1000D CC COLBY 2 111888 1000E CC SUDDER

  19. Using PUT Statement /* In order to use PUT Statement, must convert csv file to a SAS Data Set */ /*---------------------------------------------------------------------------------*/ DATA MASTER_DATA; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; RUN; DATA _NULL_; FILE '/home/Build_Branch_Info.sas'; SET VALSUG.BRANCH_TABLE; PUT "IF BRANCH_NO EQ '" BRANCH_NO + (-1) "' THEN DO;"; PUT "BRANCH_NAME = '" BRANCH_NAME + (-1) "'; "; PUT "REGION = '" REGION + (-1) "'; END; "; RUN;

  20. Using PUT Statement - LOG 113 DATA _NULL_; 114 FILE '/home/Build_Branch_Info.sas'; 115 SET VALSUG.BRANCH_TABLE; 116 PUT "IF BRANCH_NO EQ '" BRANCH_NO + (-1) "' THEN DO;"; 117 PUT "BRANCH_NAME = '" BRANCH_NAME + (-1) "'; "; 118 PUT "REGION = '" REGION + (-1) "'; END; "; 119 RUN; NOTE: The file '/home/Build_Branch_Info.sas' is: File Name=/home/Build_Branch_Info.sas, Owner Name=*****,Group Name=staff, Access Permission=rw-r--r-- NOTE: 18 records were written to the file '/home/Build_Branch_Info.sas'. The minimum record length was 22. The maximum record length was 32. NOTE: There were 6 observations read from the data set VALSUG.BRANCH_TABLE. NOTE: DATA statement used: real time 0.13 seconds cpu time 0.00 seconds

  21. View Build_Branch_Info.sas IF BRANCH_NO EQ '1000A' THEN DO; BRANCH_NAME = 'GREEN'; REGION = 'MIDWEST'; END; IF BRANCH_NO EQ '1000B' THEN DO; BRANCH_NAME = 'EASTON'; REGION = 'WEST'; END; IF BRANCH_NO EQ '1000C' THEN DO; BRANCH_NAME = 'MIDVILLE'; REGION = 'WEST'; END; IF BRANCH_NO EQ '1000D' THEN DO; BRANCH_NAME = 'COLBY'; REGION = 'EAST'; END; IF BRANCH_NO EQ '1000E' THEN DO; BRANCH_NAME = 'SUDDER'; REGION = 'EAST'; END; IF BRANCH_NO EQ '1000F' THEN DO; BRANCH_NAME = 'ALBANY'; REGION = 'EAST'; END;

  22. Using PUT Statement (continued) DATA MASTER_DATA_EAST3; SET MASTER_DATA; LENGTH REGION $8. BRANCH_NAME $8.; %inc '/home/Build_Branch_Info.sas'; IF REGION EQ 'EAST' THEN OUTPUT; RUN; TITLE 'Using PUT Statement '; PROC PRINT; RUN; /*---------------------------------------------------------------------------------*/

  23. Using PUT Statement (continued) - LOG 121 DATA MASTER_DATA_EAST3; 122 SET MASTER_DATA; 123 124 LENGTH REGION $8. BRANCH_NAME $8.; 125 %inc '/home/Build_Branch_Info.sas'; 144 145 IF REGION EQ 'EAST' THEN OUTPUT; 146 RUN; NOTE: There were 7 observations read from the data set WORK.MASTER_DATA. NOTE: The data set WORK.MASTER_DATA_EAST3 has 2 observations and 5 variables. NOTE: DATA statement used: real time 0.10 seconds cpu time 0.01 seconds 148 TITLE 'Using PUT Statement '; 149 PROC PRINT; 150 RUN; NOTE: There were 2 observations read from the data set WORK.MASTER_DATA_EAST3. NOTE: PROCEDURE PRINT used: real time 0.03 seconds cpu time 0.00 seconds

  24. Using PUT Statement - OUTPUT Using PUT Statement BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME 1 111666 1000D CC COLBY 2 111888 1000E CC SUDDER

  25. Using PROC FORMAT • /* Create a format based on Branch Table */ • /*---------------------------------------------------------------------------------*/ • DATA FORMAT(KEEP = START LABEL FMTNAME); • SET VALSUG.BRANCH_TABLE; • WHERE REGION EQ 'EAST'; • START = BRANCH_NO; • LABEL = 'EAST'; • FMTNAME = '$EBRANCH'; • RUN; • PROC FORMAT CNTLIN=FORMAT; • RUN; • *CNTLIN=input-control-SAS-data-set • specifies a SAS data set from which PROC FORMAT builds informats and formats. • CNTLIN= builds formats and informats without using a VALUE, PICTURE, or INVALUE statement.

  26. View Format Data Set

  27. Using PROC FORMAT - LOG 158 DATA FORMAT(KEEP = START LABEL FMTNAME); 159 SET VALSUG.OLD_BRANCH_TABLE; 160 WHERE REGION EQ 'EAST'; 161 START = BRANCH_NO; 162 LABEL = 'EAST'; 163 FMTNAME = '$EBRANCH'; 164 RUN; NOTE: There were 3 observations read from the data set VALSUG.OLD_BRANCH_TABLE. WHERE REGION='EAST'; NOTE: The data set WORK.FORMAT has 3 observations and 3 variables. NOTE: DATA statement used: real time 0.08 seconds cpu time 0.01 seconds 172 PROC FORMAT CNTLIN=FORMAT; NOTE: Format $EBRANCH has been output. 173 RUN; NOTE: PROCEDURE FORMAT used: real time 0.13 seconds cpu time 0.02 seconds NOTE: There were 3 observations read from the data set WORK.FORMAT.

  28. Using PROC FORMAT (continued) DATA _NULL_; FILE '/home/Build_Branch_Name.sas'; SET VALSUG. BRANCH_TABLE; PUT "IF BRANCH_NO EQ '" BRANCH_NO + (-1) "' THEN BRANCH_NAME = '" BRANCH_NAME + (-1) "';"; RUN; DATA MASTER_DATA_EAST4; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO @8 BRANCH_NO $ @; IF PUT(BRANCH_NO, $EBRANCH.) = 'EAST’ ; INPUT @14 PRODUCT $2.; LENGTH BRANCH_NAME $8.; %inc '/home/Build_Branch_Name.sas'; RUN; TITLE 'Using PROC FORMAT '; PROC PRINT; RUN; /*---------------------------------------------------------------------------------*/

  29. Using PROC FORMAT (continued) - LOG 175 DATA MASTER_DATA_EAST4; 176 INFILE '/home/sample_master_data.csv' delimiter = ','; 177 INPUT @1 ACCT_NO @8 BRANCH_NO $ @; /* Use the Trailing @ */ 178 IF PUT(BRANCH_NO, $EBRANCH.) = 'EAST’ ; 179 INPUT @14 PRODUCT $2.; 180 LENGTH BRANCH_NAME $8.; 182 %inc '/home/25495/Build_Branch_Name.sas'; 189 190 RUN; NOTE: The infile '/home/sample_master_data.csv' is: File Name=/home/sample_master_data.csv, Owner Name=*****,Group Name=staff, Access Permission=rw-r--r--, File Size (bytes)=112 NOTE: 7 records were read from the infile '/home/sample_master_data.csv'. The minimum record length was 15. The maximum record length was 15. NOTE: The data set WORK.MASTER_DATA_EAST4 has 2 observations and 4 variables. NOTE: DATA statement used: real time 0.17 seconds cpu time 0.04 seconds

  30. Using PROC FORMAT - OUTPUT Using PROC FORMAT BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME 1 111666 1000D CC COLBY 2 111888 1000E CC SUDDER

  31. Using MODIFY Statement /* In order to use MODIFY Statement, must convert csv file to a SAS Data Set */ /* Must also create variables need from lookup table */ /*---------------------------------------------------------------------------------*/ DATA MASTER_DATA; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; LENGTH BRANCH_NAME $8. REGION $8.; BRANCH_NAME = ' '; REGION = ' '; RUN; /* Create an Index on the Master_Data */ PROC DATASETS LIBRARY=WORK NOLIST; MODIFY MASTER_DATA5; INDEX CREATE BRANCH_NO; RUN; /* Sort Lookup Table by same Indexed Variable on Master_Data */ PROC SORT DATA=VALSUG.BRANCH_TABLE OUT=VALSUG. BRANCH_TABLE (RENAME=(BRANCH_NAME=BRANCH_NAMEX REGION = REGIONX)); BY BRANCH_NO; RUN;

  32. Using MODIFY Statement - LOG 202 DATA MASTER_DATA5 ; 203 INFILE '/home/25495/sample_master_data.csv' delimiter = ','; 204 INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; 205 206 LENGTH BRANCH_NAME $8. REGION $8.; 207 BRANCH_NAME = ' '; 208 REGION = ' '; 209 RUN; NOTE: The infile '/home/sample_master_data.csv' is: File Name=/home/sample_master_data.csv, Owner Name=*****,Group Name=staff, Access Permission=rw-r--r--, File Size (bytes)=112 NOTE: 7 records were read from the infile '/home/sample_master_data.csv'. The minimum record length was 15. The maximum record length was 15. NOTE: The data set WORK.MASTER_DATA5 has 7 observations and 5 variables. NOTE: DATA statement used: real time 0.15 seconds cpu time 0.03 seconds

  33. Using MODIFY Statement - LOG 211 PROC DATASETS LIBRARY=WORK NOLIST; 212 MODIFY MASTER_DATA5; 213 INDEX CREATE BRANCH_NO; NOTE: Simple index BRANCH_NO has been defined. 214 RUN; NOTE: PROCEDURE DATASETS used: real time 0.18 seconds cpu time 0.00 seconds 216 PROC SORT DATA=VALSUG.BRANCH_TABLE OUT=VALSUG.BRANCH_TABLE 217 (RENAME=(BRANCH_NAME=BRANCH_NAMEX REGION = REGIONX)); 218 BY BRANCH_NO; 219 RUN; NOTE: Input data set is already sorted; it has been copied to the output data set. NOTE: There were 6 observations read from the data set VALSUG.BRANCH_TABLE. NOTE: The data set VALSUG.BRANCH_TABLE has 6 observations and 3 variables. NOTE: PROCEDURE SORT used: real time 0.06 seconds cpu time 0.01 seconds

  34. Using MODIFY Statement (continued) DATA MASTER_DATA5; SET VALSUG.BRANCH_TABLE; DO UNTIL (_IORC_ = %SYSRC(_dsenom)); MODIFY MASTER_DATA5 KEY= BRANCH_NO; IF _IORC_ = 0 THEN DO; BRANCH_NAME = BRANCH_NAMEX; REGION = REGIONX; IF REGION = 'EAST' THEN REPLACE; ELSE REMOVE; END; ELSE DO; _ERROR_=0; END; END; RUN; TITLE 'Using DATA MODIFY '; PROC PRINT DATA=MASTER_DATA5; RUN; /*-----------------------------------------------------------------*/ • _IORC_ = 0 indicates that the MODIFY statement was successful, and that a match was found in the transaction data set. • Override the _ERROR_ flag to prevent SAS from dumping out any records in the transaction data set that are not in the master data set. • Use the autocall macro %SYSRC with _IORC_ to monitor success of MODIFY statement. • _dsenom indicates no matching observation, used with the KEY= option.

  35. Using MODIFY Statement (continued) - LOG 228 DATA MASTER_DATA5; 229 SET VALSUG.OLD_BRANCH_TABLE; 230 DO UNTIL (_IORC_ = %SYSRC(_dsenom)); 231 MODIFY MASTER_DATA5 KEY=BRANCH_NO; 232 IF _IORC_ = 0 THEN DO; 233 BRANCH_NAME = BRANCH_NAMEX; 234 REGION = REGIONX; 235 IF REGION = 'EAST' THEN REPLACE; 236 ELSE REMOVE; 237 END; 238 ELSE DO; 239 _ERROR_=0; 240 END; 241 END; 242 RUN; NOTE: There were 6 observations read from the data set VALSUG.OLD_BRANCH_TABLE. NOTE: The data set WORK.MASTER_DATA5 has been updated. There were 2 observations rewritten, 0 observations added and 5 observations deleted. NOTE: DATA statement used: real time 0.21 seconds cpu time 0.03 seconds

  36. Using MODIFY Statement - OUTPUT Using MODIFY Statement BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME REGION 1 111666 1000D CC COLBY EAST 2 111888 1000E CC SUDDER EAST

  37. Conclusion • Resulting data sets were all very similar, and accomplished our task. • Exactly which scenario you should use really depends on the data you are working with and your available resources. • Ask yourself the following: • Are the data sets indexed? • Do I have a space or processing issue with rebuilding my entire master data set? • Do I need to beware of CPU issues?

More Related