1 / 47

Generating new variables and manipulating data with STATA

Generating new variables and manipulating data with STATA. Biostatistics 212 Lecture 3. Housekeeping. Do and Log files Understand each command! Order them appropriately. cd “C:databiostat212” capture log close log using “filename.log”, replace set more off version 12 capture clear

malana
Download Presentation

Generating new variables and manipulating data with STATA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3

  2. Housekeeping • Do and Log files • Understand each command! • Order them appropriately

  3. cd “C:\data\biostat212\” capture log close log using “filename.log”, replace set more off version 12 capture clear capture clear matrix set memory 10m /* Comments here about what the do file does */ use “datasetname.dta”, clear describe, short *** Analysis * Add stuff here log close set more on

  4. cd “C:\data\biostat212\” capture log close log using “filename.log”, replace set more off version 12 capture clear capture clear matrix set memory 10m /* Comments here about what the do file does */ use “datasetname.dta”, clear describe, short *** Analysis * Add stuff here log close set more on

  5. cd “C:\data\biostat212\” capture log close log using “filename.log”, replace set more off version 12 capture clear capture clear matrix set memory 10m /* Comments here about what the do file does */ use “datasetname.dta”, clear describe, short *** Analysis * Add stuff here log close set more on

  6. cd “C:\data\biostat212\” capture log close log using “filename.log”, replace set more off version 12 capture clear capture clear matrix set memory 10m /* Comments here about what the do file does */ use “datasetname.dta”, clear describe, short *** Analysis * Add stuff here log close set more on No longer necessary in Stata 12

  7. Tells stata where to go, by default, for files files Closes a log if there is one open, or “captures” the error that would occur if there isn’t one open Opens a new log file Tells Stata not to pause for voluminous output cd “C:\data\biostat212\” capture log close log using “filename.log”, replace set more off version 12 capture clear capture clear matrix set memory 10m /* Comments here about what the do file does */ use “datasetname.dta”, clear describe, short *** Analysis * Add stuff here log close set more on So your do file still runs in “Stata 15” after commands have changed syntax No longer necessary in Stata 12 To remind yourself and/or communicate to others Opens your dataset Describes your dataset, with date last saved The stuff you actually want to get done and document! Close the log file Reset to default Stata setting (pause for voluminous output)

  8. cd “C:\data\biostat212\” capture log close log using “filename.log”, replace set more off version 12 capture clear capture clear matrix set memory 10m /* Comments here about what the do file does */ use “datasetname.dta”, clear describe, short *** Analysis * Add stuff here log close set more on • For today’s lab… • substitute: • add: clear insheet using “data.csv” [data cleaning commands] save “cleandata.dta”

  9. Housekeeping • Any other questions?

  10. Today... • What we did in Lab 1, and why it was unrealistic • What does “data cleaning” mean? • Importing data into Stata • How to generate a variable • How to manipulate the data in your new variable • How to label variables and otherwise document your work • Examples

  11. Last time… • What was unrealistic?

  12. Last time… • What was unrealistic? • The dataset came as a Stata .dta file

  13. Last time… • What was unrealistic? • The dataset came as a Stata .dta file • The variables were ready to analyze

  14. Last time… • What was unrealistic? • The dataset came as a Stata .dta file • The variables were ready to analyze • Most variables were labeled

  15. Last time… • i.e. – The data was “clean”

  16. How your data will arrive • On paper forms • In a text file (comma or tab delimited) • In Excel • In Access • In another data format (SAS, etc)

  17. Importing into Stata • From “delimited” text file: • Import directly using insheet, infile, or other flexible Stata commands (e.g., insheet using filename.csv) • From Excel or Access: • Save as a delimited text file (.csv) and import as above • Import directly using the import command (new in Stata 12) • Copy and Paste – Use Extreme Caution! • From SAS or other analysis program: • Export a delimited text file from within SAS and import as above • Use StatTransfer to directly convert into Stata dataset (or Access, etc)

  18. Importing into Stata Delimited text file: id,age,htn,dm,smoke,chol,cac,dead,obstime,anycac,caccat,agecat,male,ageover60,sex 1,25,0,0,0,0,0,0,1125,0,"0","<30",1,0,"male" 2,25,0,0,0,1,0,0,1125,0,"0","<30",0,0,"female" 3,25,1,0,1,0,0,0,1064,0,"0","<30",0,0,"female" 4,25,1,0,1,0,0,0,1064,0,"0","<30",0,0,"female" 5,25,0,0,0,0,0,0,913,0,"0","<30",1,0,"male" 6,25,0,0,1,0,0,0,852,0,"0","<30",0,0,"female" 7,25,0,0,0,1,3,0,699,1,"1-99","<30",1,0,"male" 8,26,0,0,0,1,0,0,1674,0,"0","<30",1,0,"male" 9,26,1,0,1,0,0,0,1460,0,"0","<30",0,0,"female" 10,26,1,0,1,0,0,0,1460,0,"0","<30",0,0,"female" 11,26,0,0,0,0,0,0,1368,0,"0","<30",1,0,"male" 12,26,0,0,0,0,0,0,1156,0,"0","<30",1,0,"male“ Etc Could use other characters to “delimit” (tab, |, etc; nothing that’s actually IN a variable value…) File extensions can be .txt, .csv, .raw, and other…

  19. Importing into Stata • Importing text data using a Stata command • NICE because you can do within a do file • insheet if you have delimited data • infile if you have fixed character length data • Watch out for variable names (in first row or not?) • Watch out for variable labels (import the labels instead of values? Don’t want to lose the labels if possible…) • Use “import” from the “File” dropdown menu to help build the command and find helpful options

  20. Importing into Stata • Whatever you do, make sure it worked! • Look at the data

  21. Importing into Stata • Demo – neonatal opiate withdrawal data • Import with insheet (save as .csv file first) • Import with import • Import with cut and paste from Excel - DANGER

  22. Exploring your data • Figure out what all those variables mean • Options • Browse, describe, summarize, list in STATA • Refer to a data dictionary • Refer to a data collection form • Guess, or ask the person who gave it to you

  23. Exploring your data • Demo: Neonatal opiate withdrawal data

  24. Exploring your data • Demo: Neonatal opiate withdrawal data • Problems arise… • Many variable names are obscure • Sex is m/f, not 1/0 • Gestational age has nonsense values (0, 60) • “Matdose” mostly numbers, but some long text values • Drug variables coded y or blank • Extra blank observations and variables

  25. Cleaning your data • You must “clean” your data so it is ready to analyze.

  26. Cleaning your data • What does the variable measure? • rename and/or label var so it’s clear • Find nonsense values and outliers • track down real value, recode as missing, “trim”, etc? • Deal with missing values • Too many values missing in some subjects? Coding consistent? • drop variable or observation? Impute? • Categorize as needed • generate a new numeric variable • recode (dichotomous variables coded as 1/0, watch missing values) • label define and then label values • Check • tab oldvar newvar, missing • bysort catvar: sum contvar

  27. Cleaning your data • The importance of documentation • Retracing your steps • Document every step using a “do” file

  28. Data cleaningBasic skill 1 – Making a new variable • Creating new variables generate newvar = expression

  29. Data cleaningBasic skill 1 – Making a new variable • Creating new variables generate newvar = expression • An “expression” can be: • A number (constant) - generate allzeros = 0 • A variable - generate ageclone = age • A function - generate agesqrt = sqrt(age)+1

  30. Data cleaningBasic skill 2 – Manipulating values of a variable • Changing the values of a variable replace var = exp [if boolean_expression] • A boolean expression evaluates to “true” or “false” for each observation

  31. Data cleaningBasic skill 2 – Manipulating values of a variable • Examples generate bmi = weight/(height^2) generate male = 0 replace male = 1 if sex=="male" generate ageover50 = 0 replace ageover50 = 1 if age>50 generate complexvar = age replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)

  32. Data cleaningBasic skill 2 – Manipulating values of a variable • Logical operators for boolean expressions: EnglishStata Equal to == Not equal to !=, ~= Greater than > Greater than/equal to >= Less than < Less than/equal to <= And & Or |

  33. Data cleaningBasic skill 2 – Manipulating values of a variable • Logical operators for boolean expressions: EnglishStata Equal to == Not equal to !=, ~= Greater than > Greater than/equal to >= Less than < Less than/equal to <= And & Or | What out for “type mismatch” errors: ==“string_value” ==numeric_value

  34. Data cleaningBasic skill 2 – Manipulating values of a variable • Mathematical operators: EnglishStata Add + Subtract - Multiply * Divide / To the power of ^ Natural log of ln(expression) Base 10 log of log10(expression) Etcetera…

  35. Data cleaningBasic skill 2 – Manipulating values of a variable • Another way to manipulate data recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean_expression] • More complicated, but more flexible command than replace

  36. Data cleaningBasic skill 2 – Manipulating values of a variable • Examples generate male = 0 recode male 0=1 if sex=="male" generate female = male recode female 1=0 0=1

  37. Data cleaningBasic skill 2 – Manipulating values of a variable • Examples generate raceethnic = race recode raceethnic 1=6 if ethnic=="hispanic" equivalent to: replace raceethnic = 6 if ethnic=="hispanic" & race==1 generate tertilescac = cac recode tertilescac min/54=1 55/82=2 83/max=3

  38. Data cleaningBasic skill 3 – Getting rid of variables/observations • Getting rid of a variable drop var • Getting rid of observations drop if boolean_expression

  39. Data cleaningBasic skill 4 – Labeling things • You can label: • A dataset label data “label” • A variable label var varname “label” • Values of a variable (2-step process) label define labelnamevalue1 “label1” [value2 “value2”…] label values varnamelabelname label define caccatlabel 0 “0” 1 “1-100” 2 “101-400” 3 “>400” label values caccat caccatlabel

  40. Data cleaningBasic skill 5 –Dealing with missing values • Missing values are important, easy to forget • . for numbers • “” for text • tab var1 var2 [, missing] • Watch the total “n” for tab, summarize commands, regression analyses, etc.

  41. Data cleaning • Demo: Neonatal opiate withdrawal data

  42. Data cleaning • Demo: Neonatal opiate withdrawal data • Drop nonsense rows and columns • Clean up year, sex, ga, bw, matdose, cocaine • Label, outliers, missing values, categorize/label values, check • Generate low birthweight or premature variable • One way to convert matdose to numeric • destring matdose, generate(matdosenum) force

  43. Cleaning your data (recap) • What does the variable measure? • rename or label var so it’s clear • Find nonsense values and outliers • recode as missing or track down real value? • Deal with missing values • Too many for some observations? Coding consistent? • drop variable or observation? • Categorize as needed • generate a new numeric variable • recode (dichotomous variables coded as 1/0, watch missing values) • label define and then label values • Check • tab oldvar newvar, missing • bysort catvar: sum contvar

  44. Data cleaning • At the end of the day you have: • 1 raw data file, original format • 1 do file that cleans it up • 1 log file that documents the cleaning • 1 clean data file, Stata format

  45. Summary • Data cleaning • ALWAYS necessary to some extent • ALWAYS use a do file • NEVER overwrite original data • Check your work • Watch out for missing values • Label as much as you can

  46. Lab this week • It’s long • It’s hard • It’s important • Email lab to your section leader’s email • Due at the beginning of lecture next week

  47. Preview of next week… • Using Excel • What is it good for? • Formulas • Designing a good spreadsheet • Formatting

More Related