Data Cleaning 101: The Ron Cody Story

Presentation Transcript


  1. Data Cleaning 101: The Ron Cody Story. DAWG, April 26, 2007. Katherine Semrau

  2. So, who is Ron Cody anyway? Professor at RWJ Medical School, expert SAS programmer, SAS book author, and avid cyclist

  3. 5 rules of data management • What can go wrong, will. • Nothing is ever as simple as it first appears. • Everything takes longer than you expect. • One size does NOT fit all. • A calm sea does not make a skilled sailor (African proverb). Source: Suzette Levenson’s 5 Rules

  4. Outline • Purpose and Flow • Data cleaning from A to Z • Types of Data Cleaning • SAS Coding & examples • Reporting Errors & Corrections • Conclusions

  5. Purpose of Data Cleaning • To verify that the dataset to be used for analysis accurately reflects the truth

  6. Flow of Data Cleaning

  7. First things first… Data cleanliness starts with: • A good form • Is it clear? Readable? Understandable? In the appropriate language? AND • Legible handwriting

  8. Once data is collected… • Review by the data collector • Check by a second pair of eyes • Both asking the following: • Is the form complete (i.e. all pages)? • Are there blanks? • Are there strange answers? • Are multiple choices made when only one choice is allowed?

  9. On to data entry… • Double Data Entry (CSPro, Access, EpiInfo...) • TeleForms • The data entry specialist will come across problems that were missed initially and need to be queried
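
One way to reconcile double data entry once both passes are in SAS is PROC COMPARE. This is only a sketch: the data set names (entry1, entry2) and the variables (id, sex, age) are hypothetical and do not come from the slides.

```sas
/* Hypothetical data: the same three forms keyed by two different people */
data entry1;
    input id sex $ age;
    datalines;
1 F 34
2 M 41
3 F 29
;
run;

data entry2;
    input id sex $ age;
    datalines;
1 F 34
2 M 14
3 F 29
;
run;

/* With an ID statement, both data sets must be sorted by the ID variable */
proc sort data=entry1; by id; run;
proc sort data=entry2; by id; run;

/* Report every value that differs between the first and second entry */
proc compare base=entry1 compare=entry2 listall;
    id id;
run;
```

Any value flagged here (in this made-up data, the age for id 2) goes back to the paper form to decide which entry is right.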

  10. On to data entry… • Queries sent to field site for variables that don’t make sense or are illegible • Original data collector should be asked about the query • Continue with data entry or wait for the query return?

  11. Now to data checking & cleaning… • So now you have a database with data entered…now what? • Where to begin…. • Start small and basic • Then move to more complex cleaning • Get your data dictionary out

  12. Types of data checking • Validity Checks • Range Checks • Logic Checks • Missing Data
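
For the missing-data check, a commonly used idiom is to collapse every value to "Missing" or "Present" with a format and let PROC FREQ do the counting. The sketch below assumes a hypothetical data set named study; the format names (missfmt, $missfmt) are also made up for illustration.

```sas
/* Hypothetical data with some missing values (a lone period reads as missing) */
data study;
    input id sex $ age weight;
    datalines;
1 F 34 3200
2 . 41 2900
3 M .  3100
4 F 29 .
;
run;

/* Formats that collapse each value to Missing or Present */
proc format;
    value $missfmt ' ' = 'Missing' other = 'Present';
    value  missfmt  .  = 'Missing' other = 'Present';
run;

/* One small table per variable showing how many values are missing */
proc freq data=study;
    format _char_ $missfmt. _numeric_ missfmt.;
    tables _all_ / missing nocum nopercent;
run;
```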

  13. What’s in the database? (proc contents) • “proc contents” should be run first to make sure all the expected variables are present • Check variable names • Format of variable • Character vs. Numeric
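
A minimal sketch of this first look, assuming a hypothetical data set named study (the variables are invented for the example):

```sas
/* Hypothetical data set standing in for the entered database */
data study;
    input id sex $ age weight;
    datalines;
1 F 34 3200
2 M 41 2900
;
run;

/* List variable names, types (character vs. numeric), lengths and formats.
   VARNUM prints the variables in their order in the data set, which is
   easier to check against the data dictionary than alphabetical order. */
proc contents data=study varnum;
run;
```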

  14. Validity Checks (proc freq) • “proc freq” is your friend • Use the standard “proc freq” just to list the answers that were actually entered • Do the answers make sense? • Are many values missing? • Are there strange values outside what is expected? • Example: Gender: M, F
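
The gender example might look like the sketch below. The data set (study) and variable names (id, sex) are hypothetical; the DATA _NULL_ step is one possible way to list the offending records once PROC FREQ shows there is a problem.

```sas
/* Hypothetical data: valid codes are M and F only */
data study;
    input id sex $;
    datalines;
1 F
2 M
3 f
4 X
5 .
;
run;

/* Every distinct value that was entered, with missing shown as its own row */
proc freq data=study;
    tables sex / missing nocum;
run;

/* Write the IDs of records with an invalid or missing code to the log */
data _null_;
    set study;
    if sex not in ('M', 'F') then
        put 'Invalid or missing sex: ' id= sex=;
run;
```

Note that the lowercase f is flagged too, because SAS character comparisons are case sensitive.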

  15. Range Checks (proc freq & proc means) • Do the values fall within expected limits? • Age: 0-100 years • Weight: 2500g-4000g • Proc freq, proc means (example) • Proc Tabulate
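
A sketch of a range check using the limits from this slide. The data set and variable names are hypothetical, and the sample values are made up to trigger the checks.

```sas
/* Hypothetical data: age in years, weight in grams */
data study;
    input id age weight;
    datalines;
1 34 3200
2 150 2900
3 27 320
4 . 3500
;
run;

/* Quick look at the extremes and at how many values are missing */
proc means data=study n nmiss min max;
    var age weight;
run;

/* List the records outside the expected limits.
   A missing numeric value sorts below every number in SAS, so missing
   values are excluded here and left to the missing-data check. */
data _null_;
    set study;
    if not missing(age) and (age < 0 or age > 100) then
        put 'Age out of range: ' id= age=;
    if not missing(weight) and (weight < 2500 or weight > 4000) then
        put 'Weight out of range: ' id= weight=;
run;
```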

  16. Logic Checks • Use “if..then” statements in the data step • Use “where” statements to check that the skip patterns were followed
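
Both approaches are sketched below on an invented skip pattern: the question about weeks of pregnancy should be blank when pregnant is N. All data set and variable names are hypothetical.

```sas
/* Hypothetical data: weeks should only be filled in when pregnant = 'Y' */
data study;
    input id sex $ pregnant $ weeks;
    datalines;
1 F Y 20
2 F N .
3 M Y 12
4 F N 8
;
run;

/* if..then checks in a data step, written to the log */
data _null_;
    set study;
    if sex = 'M' and pregnant = 'Y' then
        put 'Male recorded as pregnant: ' id=;
    if pregnant = 'N' and not missing(weeks) then
        put 'Skip pattern not followed: ' id= weeks=;
run;

/* The same skip-pattern check with a WHERE statement */
proc print data=study;
    where pregnant = 'N' and weeks is not missing;
    var id pregnant weeks;
run;
```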

  17. Duplicate Records • Records mistakenly entered twice • Two options: • NODUPKEY • Eliminates duplicate records by variables you identify… • But you have to be very careful • Proc sort data=name out=name NODUPKEY; (needs a BY statement naming those variables) • Proc sort NODUP; by _ALL_; • Eliminates records that are EXACTLY identical for all variables
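
The two options side by side, as a sketch on hypothetical data. The output data set names are made up; DUPOUT= is an addition that is not on the slide, used here to keep the removed records for review.

```sas
/* Hypothetical data: id 2 was entered twice; id 3 has two different records */
data study;
    input id sex $ age;
    datalines;
1 F 34
2 M 41
2 M 41
3 F 29
3 F 92
;
run;

/* Option 1: drop records that are identical on every variable.
   Sorting by _ALL_ puts identical records next to each other. */
proc sort data=study out=no_exact_dups noduprecs;
    by _all_;
run;

/* Option 2: keep one record per key. The records removed are saved
   in DROPPED so they can be reviewed rather than lost silently. */
proc sort data=study out=one_per_id nodupkey dupout=dropped;
    by id;
run;

proc print data=dropped;
run;
```

The id 3 pair shows why NODUPKEY calls for care: the two records disagree on age, and one of them is removed without any check of which one is correct.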

  18. What to do when you find an error… • Generate report • Send to field staff or review from original charts • Receive answers • Correction of answers • In database or • In coding • Document, document, document
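
A sketch of the report-and-correct cycle, assuming hypothetical names throughout (study, queries, study_clean, problem) and an invented correction. It shows the "in coding" route, where the raw data stay untouched and every fix is documented in the program.

```sas
/* Hypothetical data with two problems */
data study;
    input id sex $ age;
    datalines;
1 F 34
2 M 150
3 X 29
;
run;

/* Collect one row per problem so the report can go to field staff */
data queries;
    length problem $40;
    set study;
    if sex not in ('M', 'F') then do;
        problem = 'Invalid or missing sex';
        output;
    end;
    if not missing(age) and (age < 0 or age > 100) then do;
        problem = 'Age out of range';
        output;
    end;
    keep id sex age problem;
run;

proc print data=queries noobs;
    title 'Data queries for field staff';
run;
title;

/* Corrections made in code once the answers come back */
data study_clean;
    set study;
    /* Invented example: chart review found that id 2 is aged 51 */
    if id = 2 then age = 51;
run;
```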

  19. Back to the flow of data cleaning

  20. When is data cleaning done? • As batches of forms are processed, or • On a time schedule (e.g. monthly), or… • NOT just before the analytic dataset is created

  21. When is the data cleaning as good as it is going to get? • How many rounds do you go? • Who decides how clean it needs to be? • Conversation between data analyst, PI, field staff…

  22. Conclusions • Data cleaning is a very important step • Planning is key • Think of it as the CSI of the data
