Ensuring Data Integrity with PROC COMPARE and Beyond

Ensuring Data Integrity with PROC COMPARE and Beyond Scott A. Miller May 21, 2018 (Iowa SAS Users Group)May 22, 2018 (Nebraska SAS Users Group)

Why Compare Data Sets? • Fixing a Bug • Did we actually fix the bug? • Did we create new problems in the process? • Reconciliation / Validation • Do the values in the new system match the old system? • Do the values in our report match the original source?

Benefits of Comparison • Catch problems before rollout • Documentation for Change Management • Facilitate thorough testing

PROC COMPARE: Start with the Basics

Basic Syntax • This will compare all “matching” variables in both data sets. • The Base data set will be listed first in reports. proccomparebase=DSNAME1 compare=DSNAME2; run;

What is a matching variable? • To be a “match” for PROC COMPARE… • Must have the same name • Must have the same data type • NUMERIC or CHARACTER • Not required: • LENGTH • LABEL • FORMAT

Example #1 • Three changes: • Dropped color (red) • Added miles (green) • Reduced prices (blue)

Example #1: Report Part 1

Variables not in both data sets • By default, • By default, all you get is a count • Would you like to see the variable names? • Use the LISTVAR option

Variables not in both data sets proccomparebase=cars1 compare=cars2 listvar; run;

Variable names don’t match? • Just rename them with a data set option. proccomparebase=cars2 compare=cars2b (rename=(description=desc lstprice=price)); run;

Example #2 • Two changes: • Dropped 2010 BMW M3 (red) • Added 2002 Porsche 911 (green)

Example #2: Report

What happened? • By default, rows are compared side-by-side • What do we want?

What do we want • By default, rows are compared side-by-side • What do we want? • Compare by key value

ID Statement • With the ID statement, records with matching ID values will be compared. • Ideally, use a key that is unique and immutable proccomparebase=cars3 compare=cars4; id VIN; run;

ID Statement WARNING: The data set WORK.CARS3 is not sorted by the ID variables. Observations will be matched using NOTSORTED logic with the assumption that the observations correspond one-to-one. NOTE: At observation 2 the current and previous ID values are: VIN=1111. VIN=5555. ERROR: The ID variable values do not match at observation 2 in the base data set WORK.CARS3 and observation 2 in the comparison data set WORK.CARS4. (When one or both data sets are not sorted by the ID variables, or when NOTSORTED is specified, the observations must match one-to-one.) • The simple solution is to sort the data sets.

ID Statement procsortdata=cars3; by VIN; run; procsortdata=cars4; by VIN; run; proccomparebase=cars3 compare=cars4; id VIN; run;

Example #2 (reminder) • Two changes: • Dropped 2010 BMW M3 (red) • Added 2002 Porsche 911 (green)

Example #2: Report (using ID)

Example #2: Report (using ID) • This part tells us 1 observation added, 1 observation deleted • This part tells us all observations with matching IDs are identical

PROC COMPARE: Beyond the Basics

Going Beyond the Basics • What we’ve covered up to this point will provide most of what you need to get started with PROC COMPARE. • Now, let’s cover some issues you may run into, and their solutions.

Truncated Numbers • Let’s compare these two data sets: • Identical, except ending in “.11” vs “.22”

Truncated Numbers • We are missing the last two digits

FORMAT Statement • We can specify a format value comparison report. • Caveats: • Limited to space available -- 10 characters max * • This is not big enough for DATETIME • Display the time only portion with either timeampm10. or tod10. • * using SAS 9.4 M4 proccomparebase=m1 compare=m2; format num1 best10. num2 10.8; run;

Truncated Numbers • Report now shows last two digits

MAXPRINT option • By default, the variable comparison report shows… • Max of 50 differences per variable. • Max of 500 total differences total. • You can override the defaults with MAXPRINT. proccomparebase=ds1 compare=ds2 maxprint= (# per variable, # total); run;

MAXPRINT option • If you have a lot variables: • If you want a lot of examples for one variable: proccomparebase=ds1 compare=ds2 maxprint=(10, 500); run; proccomparebase=ds1 (keep=varname1)compare=ds2 (keep=varname1) maxprint=(250, 250); run;

Same but different • Sometimes PROC COMPARE will flag two numbers as different even though they look the same. • Consider these two equations: • X = A * B * C • X = A * (B * C) • We know that mathematically, these are identical

Same but different • As expected, m1 and m2 are identical data m1; set numbers; x= a * b * c; run; data m2; set numbers; x= a *(b * c); run; numbers

Same but different • What happens when we use PROC COMPARE?

Same but different • What is going on here? • A difference of -1.46E-11 ? • -0.0000000000146 • Root cause: binary approximation of floating point • Beyond the scope of this presentation

Same but different • Many things can trigger this small difference • Rounding won’t always fix this • Real problems get buried in the noise • We need to be able to ignore small differences!

METHOD parameter • We can ignore small differences with the METHOD parameter • EXACT (default) • ABSOLUTE – will ignore differences smaller than the value specified by CRITERION • PERCENT and RELATIVE – threshold is based on the magnitude of value that is changing

Same but different • This code will ignore differences smaller than 0.01 proccomparebase=m1 compare=m2 method=absolutecriterion=0.01; run;

Going Beyond PROC COMPARE

Going Beyond PROC COMPARE • PROC COMPARE can do some amazing things, but has its limitations • We can also compare data sets using our own code • Standard DATA steps • PROC SQL

Find Unmatched Observations • Finding Unmatched Observations is a frequently required task • Especially useful with interdependent variables • We can print samples from both data sets with all variables • Use this to determine root cause

Find Unmatched Observations dataunmatched_aunmatched_b; merge data1 (in=in_a) data2 (in=in_b); byunq_id _ALL_; if(notin_b)thenoutputunmatched_a; elseif(not in_a)thenoutputunmatched_b; run; • Replace unq_id with variable(s) that are unique ID. • Both data sets must be sorted by this unique ID

How Does This Work? dataunmatched_aunmatched_b; merge data1 (in=in_a) data2 (in=in_b); byunq_id _ALL_; if(notin_b)thenoutputunmatched_a; elseif(not in_a)thenoutputunmatched_b; run; • Observations that exactly match on all vars in BY statement are merged • _ALL_ means all variables • Unmatched records are output

Find Unmatched Observations • So which observations would be output?

Find Unmatched Observations • So which observations would be output? • unmatched_a has 3 observations • unmatched_b has 2 observations

Find Unmatched Observations • We’ve dropped var4 from second data set. • Now which observations would be output?

Find Unmatched Observations • The variable lists must match • Use KEEP or DROP • Protip: can automate with SQL dictionary table and macro variable ERROR: BY variable var4 is not on input data set WORK.DATA2. NOTE: The SAS System stopped processing this step because of errors.

Quantify the change • PROC COMPARE is good at telling you… • Is it different? • Is it the same? • What if you want to know… • Was the change significant? • Did things get better or worse?

Quantify the change • We have two data sets with a thousand scores… • What can we say about these scores?

Ensuring Data Integrity with PROC COMPARE and Beyond

Ensuring Data Integrity with PROC COMPARE and Beyond

Presentation Transcript

Ensuring Capacity, Integrity and Accountability of Regulators and Supervisors

Ensuring the Integrity of Financial Information

Ensuring Compliance of Patient Data with Big Data and BI

Ensuring Operating System Kernel Integrity with Osck

Ensuring the Accuracy and Integrity o f Publicly Accessible Procurement Data

Ensuring Operating System Kernel Integrity with OSck

Data Integrity

Ensuring File-System Integrity with Declarative Invariants

Compare with/Compare to

University of Florida’s dchecker: Software for ensuring semantic data integrity

Data Integrity

Data Integrity

Module 14 Ensuring Data Integrity through Constraints

Go Compare! Flagging up some under-used options in PROC COMPARE

Data and Metadata Integrity

Managing data integrity with confidence

Data Integrity

Data Integrity

Data Integrity and Security

Ensuring Capacity, Integrity and Accountability of Regulators and Supervisors

Ensuring Academic Integrity with Online Proctoring

Data Integrity