Ensuring Data Integrity with PROC COMPARE and Beyond Scott A. Miller May 21, 2018 (Iowa SAS Users Group)May 22, 2018 (Nebraska SAS Users Group)
Why Compare Data Sets? • Fixing a Bug • Did we actually fix the bug? • Did we create new problems in the process? • Reconciliation / Validation • Do the values in the new system match the old system? • Do the values in our report match the original source?
Benefits of Comparison • Catch problems before rollout • Documentation for Change Management • Facilitate thorough testing
Basic Syntax • This will compare all “matching” variables in both data sets. • The Base data set will be listed first in reports. proccomparebase=DSNAME1 compare=DSNAME2; run;
What is a matching variable? • To be a “match” for PROC COMPARE… • Must have the same name • Must have the same data type • NUMERIC or CHARACTER • Not required: • LENGTH • LABEL • FORMAT
Example #1 • Three changes: • Dropped color (red) • Added miles (green) • Reduced prices (blue)
Variables not in both data sets • By default, • By default, all you get is a count • Would you like to see the variable names? • Use the LISTVAR option
Variables not in both data sets proccomparebase=cars1 compare=cars2 listvar; run;
Variable names don’t match? • Just rename them with a data set option. proccomparebase=cars2 compare=cars2b (rename=(description=desc lstprice=price)); run;
Example #2 • Two changes: • Dropped 2010 BMW M3 (red) • Added 2002 Porsche 911 (green)
What happened? • By default, rows are compared side-by-side • What do we want?
What do we want • By default, rows are compared side-by-side • What do we want? • Compare by key value
ID Statement • With the ID statement, records with matching ID values will be compared. • Ideally, use a key that is unique and immutable proccomparebase=cars3 compare=cars4; id VIN; run;
ID Statement WARNING: The data set WORK.CARS3 is not sorted by the ID variables. Observations will be matched using NOTSORTED logic with the assumption that the observations correspond one-to-one. NOTE: At observation 2 the current and previous ID values are: VIN=1111. VIN=5555. ERROR: The ID variable values do not match at observation 2 in the base data set WORK.CARS3 and observation 2 in the comparison data set WORK.CARS4. (When one or both data sets are not sorted by the ID variables, or when NOTSORTED is specified, the observations must match one-to-one.) • The simple solution is to sort the data sets.
ID Statement procsortdata=cars3; by VIN; run; procsortdata=cars4; by VIN; run; proccomparebase=cars3 compare=cars4; id VIN; run;
Example #2 (reminder) • Two changes: • Dropped 2010 BMW M3 (red) • Added 2002 Porsche 911 (green)
Example #2: Report (using ID) • This part tells us 1 observation added, 1 observation deleted • This part tells us all observations with matching IDs are identical
Going Beyond the Basics • What we’ve covered up to this point will provide most of what you need to get started with PROC COMPARE. • Now, let’s cover some issues you may run into, and their solutions.
Truncated Numbers • Let’s compare these two data sets: • Identical, except ending in “.11” vs “.22”
Truncated Numbers • We are missing the last two digits
FORMAT Statement • We can specify a format value comparison report. • Caveats: • Limited to space available -- 10 characters max * • This is not big enough for DATETIME • Display the time only portion with either timeampm10. or tod10. • * using SAS 9.4 M4 proccomparebase=m1 compare=m2; format num1 best10. num2 10.8; run;
Truncated Numbers • Report now shows last two digits
MAXPRINT option • By default, the variable comparison report shows… • Max of 50 differences per variable. • Max of 500 total differences total. • You can override the defaults with MAXPRINT. proccomparebase=ds1 compare=ds2 maxprint= (# per variable, # total); run;
MAXPRINT option • If you have a lot variables: • If you want a lot of examples for one variable: proccomparebase=ds1 compare=ds2 maxprint=(10, 500); run; proccomparebase=ds1 (keep=varname1)compare=ds2 (keep=varname1) maxprint=(250, 250); run;
Same but different • Sometimes PROC COMPARE will flag two numbers as different even though they look the same. • Consider these two equations: • X = A * B * C • X = A * (B * C) • We know that mathematically, these are identical
Same but different • As expected, m1 and m2 are identical data m1; set numbers; x= a * b * c; run; data m2; set numbers; x= a *(b * c); run; numbers
Same but different • What happens when we use PROC COMPARE?
Same but different • What is going on here? • A difference of -1.46E-11 ? • -0.0000000000146 • Root cause: binary approximation of floating point • Beyond the scope of this presentation
Same but different • Many things can trigger this small difference • Rounding won’t always fix this • Real problems get buried in the noise • We need to be able to ignore small differences!
METHOD parameter • We can ignore small differences with the METHOD parameter • EXACT (default) • ABSOLUTE – will ignore differences smaller than the value specified by CRITERION • PERCENT and RELATIVE – threshold is based on the magnitude of value that is changing
Same but different • This code will ignore differences smaller than 0.01 proccomparebase=m1 compare=m2 method=absolutecriterion=0.01; run;
Going Beyond PROC COMPARE • PROC COMPARE can do some amazing things, but has its limitations • We can also compare data sets using our own code • Standard DATA steps • PROC SQL
Find Unmatched Observations • Finding Unmatched Observations is a frequently required task • Especially useful with interdependent variables • We can print samples from both data sets with all variables • Use this to determine root cause
Find Unmatched Observations dataunmatched_aunmatched_b; merge data1 (in=in_a) data2 (in=in_b); byunq_id _ALL_; if(notin_b)thenoutputunmatched_a; elseif(not in_a)thenoutputunmatched_b; run; • Replace unq_id with variable(s) that are unique ID. • Both data sets must be sorted by this unique ID
How Does This Work? dataunmatched_aunmatched_b; merge data1 (in=in_a) data2 (in=in_b); byunq_id _ALL_; if(notin_b)thenoutputunmatched_a; elseif(not in_a)thenoutputunmatched_b; run; • Observations that exactly match on all vars in BY statement are merged • _ALL_ means all variables • Unmatched records are output
Find Unmatched Observations • So which observations would be output?
Find Unmatched Observations • So which observations would be output? • unmatched_a has 3 observations • unmatched_b has 2 observations
Find Unmatched Observations • We’ve dropped var4 from second data set. • Now which observations would be output?
Find Unmatched Observations • The variable lists must match • Use KEEP or DROP • Protip: can automate with SQL dictionary table and macro variable ERROR: BY variable var4 is not on input data set WORK.DATA2. NOTE: The SAS System stopped processing this step because of errors.
Quantify the change • PROC COMPARE is good at telling you… • Is it different? • Is it the same? • What if you want to know… • Was the change significant? • Did things get better or worse?
Quantify the change • We have two data sets with a thousand scores… • What can we say about these scores?