Ensuring Data Integrity with PROC COMPARE and Beyond - PowerPoint PPT Presentation

miyo
ensuring data integrity with proc compare and beyond n.
Skip this Video
Loading SlideShow in 5 Seconds..
Ensuring Data Integrity with PROC COMPARE and Beyond PowerPoint Presentation
Download Presentation
Ensuring Data Integrity with PROC COMPARE and Beyond

play fullscreen
1 / 67
Download Presentation
Ensuring Data Integrity with PROC COMPARE and Beyond
199 Views
Download Presentation

Ensuring Data Integrity with PROC COMPARE and Beyond

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Ensuring Data Integrity with PROC COMPARE and Beyond Scott A. Miller May 21, 2018 (Iowa SAS Users Group)May 22, 2018 (Nebraska SAS Users Group)

  2. Why Compare Data Sets? • Fixing a Bug • Did we actually fix the bug? • Did we create new problems in the process? • Reconciliation / Validation • Do the values in the new system match the old system? • Do the values in our report match the original source?

  3. Benefits of Comparison • Catch problems before rollout • Documentation for Change Management • Facilitate thorough testing

  4. PROC COMPARE: Start with the Basics

  5. Basic Syntax • This will compare all “matching” variables in both data sets. • The Base data set will be listed first in reports. proccomparebase=DSNAME1 compare=DSNAME2; run;

  6. What is a matching variable? • To be a “match” for PROC COMPARE… • Must have the same name • Must have the same data type • NUMERIC or CHARACTER • Not required: • LENGTH • LABEL • FORMAT

  7. Example #1 • Three changes: • Dropped color (red) • Added miles (green) • Reduced prices (blue)

  8. Example #1: Report Part 1

  9. Example #1: Report Part 2

  10. Example #1: Report Part 3

  11. Example #1: Report Part 4

  12. Variables not in both data sets • By default, • By default, all you get is a count • Would you like to see the variable names? • Use the LISTVAR option

  13. Variables not in both data sets proccomparebase=cars1 compare=cars2 listvar; run;

  14. Variable names don’t match? • Just rename them with a data set option. proccomparebase=cars2 compare=cars2b (rename=(description=desc lstprice=price)); run;

  15. Example #2 • Two changes: • Dropped 2010 BMW M3 (red) • Added 2002 Porsche 911 (green)

  16. Example #2: Report

  17. What happened? • By default, rows are compared side-by-side • What do we want?

  18. What do we want • By default, rows are compared side-by-side • What do we want? • Compare by key value

  19. ID Statement • With the ID statement, records with matching ID values will be compared. • Ideally, use a key that is unique and immutable proccomparebase=cars3 compare=cars4; id VIN; run;

  20. ID Statement WARNING: The data set WORK.CARS3 is not sorted by the ID variables. Observations will be matched using NOTSORTED logic with the assumption that the observations correspond one-to-one. NOTE: At observation 2 the current and previous ID values are: VIN=1111. VIN=5555. ERROR: The ID variable values do not match at observation 2 in the base data set WORK.CARS3 and observation 2 in the comparison data set WORK.CARS4. (When one or both data sets are not sorted by the ID variables, or when NOTSORTED is specified, the observations must match one-to-one.) • The simple solution is to sort the data sets.

  21. ID Statement procsortdata=cars3; by VIN; run; procsortdata=cars4; by VIN; run; proccomparebase=cars3 compare=cars4; id VIN; run;

  22. Example #2 (reminder) • Two changes: • Dropped 2010 BMW M3 (red) • Added 2002 Porsche 911 (green)

  23. Example #2: Report (using ID)

  24. Example #2: Report (using ID) • This part tells us 1 observation added, 1 observation deleted • This part tells us all observations with matching IDs are identical

  25. PROC COMPARE: Beyond the Basics

  26. Going Beyond the Basics • What we’ve covered up to this point will provide most of what you need to get started with PROC COMPARE. • Now, let’s cover some issues you may run into, and their solutions.

  27. Truncated Numbers • Let’s compare these two data sets: • Identical, except ending in “.11” vs “.22”

  28. Truncated Numbers • We are missing the last two digits

  29. FORMAT Statement • We can specify a format value comparison report. • Caveats: • Limited to space available -- 10 characters max * • This is not big enough for DATETIME • Display the time only portion with either timeampm10. or tod10. • * using SAS 9.4 M4 proccomparebase=m1 compare=m2; format num1 best10. num2 10.8; run;

  30. Truncated Numbers • Report now shows last two digits

  31. MAXPRINT option • By default, the variable comparison report shows… • Max of 50 differences per variable. • Max of 500 total differences total. • You can override the defaults with MAXPRINT. proccomparebase=ds1 compare=ds2 maxprint= (# per variable, # total); run;

  32. MAXPRINT option • If you have a lot variables: • If you want a lot of examples for one variable: proccomparebase=ds1 compare=ds2 maxprint=(10, 500); run; proccomparebase=ds1 (keep=varname1)compare=ds2 (keep=varname1) maxprint=(250, 250); run;

  33. Same but different • Sometimes PROC COMPARE will flag two numbers as different even though they look the same. • Consider these two equations: • X = A * B * C • X = A * (B * C) • We know that mathematically, these are identical

  34. Same but different • As expected, m1 and m2 are identical data m1; set numbers; x= a * b * c; run; data m2; set numbers; x= a *(b * c); run; numbers

  35. Same but different • What happens when we use PROC COMPARE?

  36. Same but different • What is going on here? • A difference of -1.46E-11 ? • -0.0000000000146 • Root cause: binary approximation of floating point • Beyond the scope of this presentation

  37. Same but different • Many things can trigger this small difference • Rounding won’t always fix this • Real problems get buried in the noise • We need to be able to ignore small differences!

  38. METHOD parameter • We can ignore small differences with the METHOD parameter • EXACT (default) • ABSOLUTE – will ignore differences smaller than the value specified by CRITERION • PERCENT and RELATIVE – threshold is based on the magnitude of value that is changing

  39. Same but different • This code will ignore differences smaller than 0.01 proccomparebase=m1 compare=m2 method=absolutecriterion=0.01; run;

  40. Going Beyond PROC COMPARE

  41. Going Beyond PROC COMPARE • PROC COMPARE can do some amazing things, but has its limitations • We can also compare data sets using our own code • Standard DATA steps • PROC SQL

  42. Find Unmatched Observations • Finding Unmatched Observations is a frequently required task • Especially useful with interdependent variables • We can print samples from both data sets with all variables • Use this to determine root cause

  43. Find Unmatched Observations dataunmatched_aunmatched_b; merge data1 (in=in_a) data2 (in=in_b); byunq_id _ALL_; if(notin_b)thenoutputunmatched_a; elseif(not in_a)thenoutputunmatched_b; run; • Replace unq_id with variable(s) that are unique ID. • Both data sets must be sorted by this unique ID

  44. How Does This Work? dataunmatched_aunmatched_b; merge data1 (in=in_a) data2 (in=in_b); byunq_id _ALL_; if(notin_b)thenoutputunmatched_a; elseif(not in_a)thenoutputunmatched_b; run; • Observations that exactly match on all vars in BY statement are merged • _ALL_ means all variables • Unmatched records are output

  45. Find Unmatched Observations • So which observations would be output?

  46. Find Unmatched Observations • So which observations would be output? • unmatched_a has 3 observations • unmatched_b has 2 observations

  47. Find Unmatched Observations • We’ve dropped var4 from second data set. • Now which observations would be output?

  48. Find Unmatched Observations • The variable lists must match • Use KEEP or DROP • Protip: can automate with SQL dictionary table and macro variable ERROR: BY variable var4 is not on input data set WORK.DATA2. NOTE: The SAS System stopped processing this step because of errors.

  49. Quantify the change • PROC COMPARE is good at telling you… • Is it different? • Is it the same? • What if you want to know… • Was the change significant? • Did things get better or worse?

  50. Quantify the change • We have two data sets with a thousand scores… • What can we say about these scores?