1 / 35

Generating new variables and manipulating data with STATA

Generating new variables and manipulating data with STATA. Biostatistics 212 Lecture 3. Housekeeping. Questions re: Log and Do files?. Today. What we did in Lab 1, and why it was unrealistic What does “data cleaning” mean? How to generate a variable

romney
Download Presentation

Generating new variables and manipulating data with STATA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3

  2. Housekeeping • Questions re: Log and Do files?

  3. Today... • What we did in Lab 1, and why it was unrealistic • What does “data cleaning” mean? • How to generate a variable • How to manipulate the data in your new variable • How to label variables and otherwise document your work • Examples

  4. Last time… • What was unrealistic?

  5. Last time… • What was unrealistic? • The dataset came as a Stata .dta file

  6. Last time… • What was unrealistic? • The dataset came as a Stata .dta file • The variables were ready to analyze

  7. Last time… • What was unrealistic? • The dataset came as a Stata .dta file • The variables were ready to analyze • Most variables were labeled

  8. Last time… • i.e. – The data was “clean”

  9. How your data will arrive • On paper forms • In a text file (comma or tab delimited) • In Excel • In Access • In another data format (SAS, etc)

  10. Importing into Stata • Options: • Copy and Paste • insheet, infile, fdause, other flexible Stata commands • A convenience program like “Stat/Transfer”

  11. Importing into Stata • Make sure it worked • Look at the data

  12. Importing into Stata • Demo – neonatal opiate withdrawal data

  13. Exploring your data • Figure out what all those variables mean • Options • Browse, describe, summarize, list in STATA • Refer to a data dictionary • Refer to a data collection form • Guess, or ask the person who gave it to you

  14. Exploring your data • Demo: Neonatal opiate withdrawal data

  15. Exploring your data • Demo: Neonatal opiate withdrawal data • Problems arise… • Sex is m/f, not 1/0 • Gestational age has nonsense values (0, 60) • Breastfeeding has a bunch of weird text values • Drug variables coded y or blank • Many variable names are obscure

  16. Cleaning your data • You must “clean” your data so it is ready to analyze.

  17. Cleaning your data • Cleaning tasks • Check for consistency and clean up nonsense data and outliers • Deal with missing values • Code all dichotomous variables 1/0 • Categorize variables as needed (for Table 1, etc) • Derive new variables • Rename variables • With common sense, or with a consistent scheme • Label variables • Label the VALUES of coded variables

  18. Cleaning your data • The importance of documentation • Retracing your steps • Document every step using a “do” file

  19. Data cleaningBasic skill 1 – Making a new variable • Creating new variables generate newvar = expression

  20. Data cleaningBasic skill 1 – Making a new variable • Creating new variables generate newvar = expression • An “expression” can be: • A number (constant) - generate allzeros = 0 • A variable - generate ageclone = age • A function - generate agesqrt = sqrt(age)

  21. Data cleaningBasic skill 2 – Getting rid of variables/observations • Getting rid of a variable drop var • Getting rid of observations drop if boolean exp

  22. Data cleaningBasic skill 3 – Manipulating values of a variable • Changing the values of a variable replace var = exp [if boolean exp] • A boolean expression evaluates to true or false for each observation

  23. Data cleaningBasic skill 3 – Manipulating values of a variable • Examples generate male = 0 replace male = 1 if sex==“male” generate ageover50 = 0 replace ageover 50 = 1 if age>50 generate complexvar = age replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)

  24. Data cleaningBasic skill 3 – Manipulating values of a variable • Logical operators for boolean expressions: EnglishStata Equal to == Not equal to !=, ~= Greater than > Greater than/equal to >= Less than < Less than/equal to <= And & Or |

  25. Data cleaningBasic skill 3 – Manipulating values of a variable • Mathematical operators: EnglishStata Add + Subtract - Multiply * Divide / To the power of ^ Natural log of ln(expression) Base 10 log of log10(expression) Etcetera…

  26. Data cleaningBasic skill 3 – Manipulating values of a variable • Another way to manipulate data Recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean expression] • More complicated, but more flexible command than replace

  27. Data cleaningBasic skill 3 – Manipulating values of a variable • Examples Generate male = 0 Recode male 0=1 if sex==“male” Generate raceethnic = race Recode raceethnic 1=6 if ethnic==“hispanic” (Replace raceethnic = 6 if ethnic==“hispanic” & race==1) Generate tertilescac = cac Recode min/54=1 55/82=2 83/max=3

  28. Data cleaningBasic skill 4 – Labeling things • You can label: • A dataset label data “label” • A variable label var varname “label” • Values of a variable (2-step process) label define labelnamevalue1 “label1” [value2 “value2”…] Label values varnamelabelname

  29. Data cleaningBasic skill 5 –Dealing with missing values • Missing values are important, easy to forget • . for numbers • “” for text • tab var1 var2, missing • Watch the total “n” for tab, summarize commands, regression analyses, etc.

  30. Data cleaning • Demo: Neonatal opiate withdrawal data

  31. Cleaning your data • Cleaning tasks • Check for consistency and clean up non-sense data • Deal with missing values • Code all dichotomous variables 1/0 • Categorize variables meaningfully (for Table 1, etc) • Derive new variables • Rename variables • With common sense, or with a consistent scheme • Label variables • Label the VALUES of coded variables

  32. Data cleaning • At the end of the day you have: • 1 raw data file, original format • 1 raw data file, Stata format • 1 do file that cleans it up • 1 log file that documents the cleaning • 1 clean data file, Stata format

  33. Summary • Data cleaning • ALWAYS necessary to some extent • ALWAYS use a do file • NEVER overwrite original data • Check your work • Watch out for missing values • Label as much as you can

  34. Lab this week • It’s long • It’s hard • It’s important • Email lab to your section leader’s email • Due at the beginning of lecture next week

  35. Preview of next week… • Using Excel • What is it good for? • Formulas • Designing a good spreadsheet • Formatting

More Related