Data cleaning: hints and tips

Felicity Clemens

Stata Users’ Group meeting

London, 17 & 18th May 2005

Felicity Clemens 18 May 2005

Introduction
  • Data cleaning – one of the most time consuming jobs of all!
  • Many ways of attacking the same problem when using Stata
  • The talk will describe some common problems and propose possible solutions
  • These are mostly reminders!

Contents
  • Introduction to the first datasets
  • Identifying and removing duplicates – by hand
  • Merging data and uses of the merge command
  • Generating a moving target variable

The study
  • A case-control study carried out across three central European countries
  • Exposure of interest: exposure to chemicals in the environment
  • Outcome of interest: cancer

Identifying duplicates in a dataset
  • This can be done automatically (using the duplicates set of commands)
  • We will demonstrate a manual method of identifying duplicates
  • Two different possibilities:
    • The same data have been entered on more than one occasion;
    • Different data have been entered using the same identifier (id numbers)
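A minimal sketch of the manual approach, separating the two possibilities above. The toy data and the variable names (id, age, sex, n_id, n_exact, duptype) are hypothetical and not taken from the study; the slides do not show the author's own code.

```stata
* Hypothetical toy data: id 2 was entered twice with identical values,
* id 3 was used for two different people
clear
input id age str1 sex
1 54 "F"
2 61 "M"
2 61 "M"
3 47 "F"
3 70 "M"
4 58 "M"
end

* Manual method: number of records sharing each id
bysort id: generate n_id = _N

* Number of records sharing the id AND every other value (exact copies)
bysort id age sex: generate n_exact = _N

* 0 = unique id, 1 = same data entered more than once,
* 2 = different data entered under the same id
generate byte duptype = 0
replace duptype = 1 if n_id > 1 & n_exact == n_id
replace duptype = 2 if n_id > 1 & n_exact < n_id
list, sepby(id)

* Automatic equivalent for comparison
duplicates report id
duplicates tag id, generate(dup_id)

* Remove exact copies by hand, keeping the first record of each set;
* records with duptype == 2 are left for checking against the source forms
bysort id age sex: drop if _n > 1
assert _N == 5
```

Only exact copies are dropped automatically here; records that share an id but disagree need a decision about which entry is correct, so the sketch just flags them.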

The merge command

A necessary command in the data management of most large studies.

There are many different uses of the merge command. We look at two of them:

  • Simple merge on id
  • Multiple merge on id
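The two uses might look like the sketch below. It uses the current merge 1:1 syntax, which is newer than this 2005 talk (older releases used a different form), and it reads "multiple merge" as merging several files on the same id, which is an assumption. All file contents and variable names (case, exposure, age) are hypothetical.

```stata
* Hypothetical using file 1: exposure measurements
clear
input id exposure
1 0.4
2 1.2
5 0.9
end
tempfile exposures
save `exposures'

* Hypothetical using file 2: ages
clear
input id age
1 54
2 61
4 58
end
tempfile ages
save `ages'

* Hypothetical master file: case-control status
clear
input id case
1 1
2 0
3 1
end

* Simple merge on id: at most one record per id in each file
merge 1:1 id using `exposures'
* _merge is 1 (master only), 2 (using only) or 3 (in both)
tabulate _merge
drop _merge

* Merging a further file on the same id
merge 1:1 id using `ages'
tabulate _merge
```

Tabulating _merge after each step is a cheap check that the ids line up as expected before the variable is dropped and the next file is merged.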

Identifying a moving target
  • Scenario: we have data for each town giving the chemical concentration for each year between 1982 and 2002
  • Problem: we need to identify, counting backwards from 2002, the year in which the chemical concentration changed from its 2002 level
  • Why? We need to overwrite the 2002 value with a new value, and overwrite backwards until the value changed

Identifying a moving target (2)

Identifying a moving target (3)

We will use the forval loop to examine the relationship between each year’s observed value and the observed value for the previous year
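A sketch of how such a loop could work, assuming the data are held in wide format with one row per town and one variable per year. The variable names (conc1998 to conc2002, startyear, newvalue), the shortened year range and the replacement value are all hypothetical; the real data run from 1982 to 2002.

```stata
* Hypothetical wide data: one row per town, one concentration per year
clear
input town conc1998 conc1999 conc2000 conc2001 conc2002
1 5 5 7 7 7
2 3 4 4 4 4
3 2 2 2 2 9
4 6 6 6 6 6
end

* startyear = earliest year of the unbroken run at the 2002 level.
* Walk backwards from 2002 and compare each year with the previous year;
* the first difference found marks the start of the run.
generate startyear = .
forvalues y = 2002(-1)1999 {
    local prev = `y' - 1
    replace startyear = `y' if missing(startyear) & conc`prev' != conc`y'
}
* No change within the data: the run covers every year
replace startyear = 1998 if missing(startyear)

assert startyear == 2000 if town == 1
assert startyear == 1999 if town == 2
assert startyear == 2002 if town == 3
assert startyear == 1998 if town == 4

* Overwrite 2002 and every earlier year in the same run
* (newvalue is a made-up replacement for illustration)
generate newvalue = conc2002 + 1
forvalues y = 1998/2002 {
    replace conc`y' = newvalue if `y' >= startyear
}
list
```

The sketch treats a missing concentration like any other value when comparing years, so real data with gaps would need an explicit rule for them.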

Summary
  • Identifying duplicates – can be done by hand or automatically using the “duplicates” set of commands
  • Use of the merge command – to merge on a specific variable, and to merge multiple datasets
  • Generating a moving target variable – the use of the “forval” loop
