Computing for Research I Spring 2011

Exploring Data with Stata
February 14
Primary Instructor: Elizabeth Garrett-Mayer, PhD
Today: Joan Cunningham, PhD, cunninj@musc.edu

Reminder of Help!!
• The most important part
• Two interactive options:
  • help ‘command’
  • help ‘search’
• Also LARGE PDFs that link from help files
• Plus:
  • advice
  • link to Stata
  • command line help
  • findit
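As a rough sketch of how the options above look at the command line (the topics egen, odds ratio, and spikeplot are only examples picked from later in this session, and findit needs an internet connection):

```stata
* Open the help file for a command you already know by name
help egen
* Search by keyword when you do not know the command name
search odds ratio
* Widen the search to web resources and user-written packages
findit spikeplot
```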

Dataset for Today: SC breast cancer registry data from 2004
• SCBC2004.v9.dta in your course outline
• Recap: all breast cancer cases in SC, 2004 (how do we know the year?)
• N = 2633, 55 variables
• Demographic and clinical information recorded

Useful for Input
• You can input directly from a spreadsheet
• Start with an empty Data Editor:
  • Open the Stata Data Editor window (icon at top)
  • “Select all” in the spreadsheet
  • Paste into the Data Editor
  • Close the Data Editor
  • Save your new dataset, using “Save as”
• NOTE: sometimes date variables give trouble, so check.

Example: SC breast cancer registry data from 2004
• All diagnoses of breast cancer in SC are recorded
• N = 2633, 55 variables
• Demographic and clinical information recorded
• Let’s read it in and explore it:
  • use cd
  • use insheet
  • use ‘use’
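The read-in steps above might look like the following minimal sketch. The folder name is hypothetical and must be replaced with wherever you saved the course files; the insheet line is left as a comment because the text file name in it is also hypothetical.

```stata
* Assumption: replace this hypothetical folder with the one holding the course files
cd "C:/stata_course"
* Load the Stata-format dataset, replacing anything currently in memory
use "SCBC2004.v9.dta", clear
* Confirm the size of what was read in (the slides report N = 2633, 55 variables)
describe, short
count
* For a delimited text file you would use insheet instead (hypothetical file name):
* insheet using "mydata.csv", clear
```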

“egen” with “group”
• Same example: egen has a function ‘group’ that can create a new categorical variable from an existing variable, including string variables
• Categories are defined by natural groupings within an existing variable: “Create racesex containing values 1, 2, ..., for the groups formed by race and sex and containing missing if race or sex are missing: . egen racesex = group(race sex)”
• We will use it to generate a categorical variable from a string variable

“egen” with “group”

. desc site

              storage   display    value
variable name   type    format     label      variable label
site            str4    %4s                   Site in breast

. tab site

    Site in |
     breast |      Freq.     Percent        Cum.
------------+-----------------------------------
       C500 |         11        0.42        0.42
       C501 |        109        4.14        4.56
       C502 |        222        8.43       12.99
       C503 |        105        3.99       16.98
       C504 |        840       31.90       48.88
       C505 |        126        4.79       53.67
       C506 |         12        0.46       54.12
       C508 |        555       21.08       75.20
       C509 |        653       24.80      100.00
------------+-----------------------------------
      Total |      2,633      100.00

“egen” with “group” (cont’d)

. egen sitecat = group(site)

. tab sitecat

group(site) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         11        0.42        0.42
          2 |        109        4.14        4.56
          3 |        222        8.43       12.99
          4 |        105        3.99       16.98
          5 |        840       31.90       48.88
          6 |        126        4.79       53.67
          7 |         12        0.46       54.12
          8 |        555       21.08       75.20
          9 |        653       24.80      100.00
------------+-----------------------------------
      Total |      2,633      100.00

. desc site*

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
site            str4    %4s                   Site in breast
sitecat         float   %9.0g                 group(site)

“egen” with “group” (cont’d)

. label define sitecat 1 "C500" 2 "C501" 3 "C502" 4 "C503" 5 "C504" 6 "C505" 7 "C506" 8 "C508" 9 "C509"

. label values sitecat sitecat

. tab sitecat

       Site |
categories, |
  from site |      Freq.     Percent        Cum.
------------+-----------------------------------
       C500 |         11        0.42        0.42
       C501 |        109        4.14        4.56
       C502 |        222        8.43       12.99
       C503 |        105        3.99       16.98
       C504 |        840       31.90       48.88
       C505 |        126        4.79       53.67
       C506 |         12        0.46       54.12
       C508 |        555       21.08       75.20
       C509 |        653       24.80      100.00
------------+-----------------------------------
      Total |      2,633      100.00

Note: the site codes skip C507, so groups 8 and 9 must be labeled C508 and C509 to match the original “tab site” output.

“egen” with “group” (cont’d)

. tab sitecat, nolabel

       Site |
categories, |
  from site |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         11        0.42        0.42
          2 |        109        4.14        4.56
          3 |        222        8.43       12.99
          4 |        105        3.99       16.98
          5 |        840       31.90       48.88
          6 |        126        4.79       53.67
          7 |         12        0.46       54.12
          8 |        555       21.08       75.20
          9 |        653       24.80      100.00
------------+-----------------------------------
      Total |      2,633      100.00

. tab sitecat

       Site |
categories, |
  from site |      Freq.     Percent        Cum.
------------+-----------------------------------
       C500 |         11        0.42        0.42
       C501 |        109        4.14        4.56
       C502 |        222        8.43       12.99
       C503 |        105        3.99       16.98
       C504 |        840       31.90       48.88
       C505 |        126        4.79       53.67
       C506 |         12        0.46       54.12
       C508 |        555       21.08       75.20
       C509 |        653       24.80      100.00
------------+-----------------------------------
      Total |      2,633      100.00

. desc site*

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
site            str4    %4s                   Site in breast
sitecat         float   %9.0g                 group(site)
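Typing the value labels by hand is easy to get wrong when the codes have gaps. As a possible alternative (a sketch, not the approach used in these slides), egen's group() function accepts a label option, and encode also builds a labeled numeric variable from a string. The new variable names sitecat2 and sitecat3 are made up for illustration.

```stata
* Assumes SCBC2004.v9.dta is in memory and sitecat2/sitecat3 do not exist yet
* group() with the label option copies the site codes in as value labels
egen sitecat2 = group(site), label
tab sitecat2
tab sitecat2, nolabel
* encode is an alternative that numbers the sorted string values and labels them
encode site, generate(sitecat3)
tab sitecat3
```

Either way the labels come from the data itself, so group 8 is labeled with whatever code actually occupies that position.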

Histogram
• Command line, or Graphics window

. gr_example sp500: histogram volume

Boxplot

. graph box age, by(race)

Spikeplot • spikeplot age, title(Age at Diagnosis)

Stem-and-leaf
• stem age
• stem age, lines(2)
• stem age, lines(1)
• tab age5cat2
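The graph commands above can be combined into one short exploration of a single variable. The sketch below assumes the course dataset is in memory and uses only age and black, both of which appear in these slides; the choice of options is illustrative.

```stata
* Assumes SCBC2004.v9.dta is in memory
* Numeric summary first: percentiles, skewness, extreme values
summarize age, detail
* Histogram on the frequency scale rather than the default density scale
histogram age, frequency title("Age at Diagnosis")
* Side-by-side boxplots in one panel, one box per level of black
graph box age, over(black)
```

over() draws the boxes on shared axes, while by() draws a separate panel for each group.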

Tables for epidemiologists (cc and cci)
• 2 x 2 tables
• Requires variables in form of 0,1

. tab nodecat black

. tab nodecat black if nodecat <9

black white
     nodes |         0          1 |     Total
-----------+----------------------+----------
         0 |     1,045        278 |     1,323
         1 |       600        223 |       823
-----------+----------------------+----------
     Total |     1,645        501 |     2,146

“cc”
cc is used with case-control and cross-sectional data. Point estimates and confidence intervals for the odds ratio are calculated, along with attributable or prevented fractions for the exposed and total population.
cci is the immediate form of cc.

“cc” and “tab”

. tab nodecat black if nodecat <9, chi
. tab nodecat black if nodecat <9, chi col
. tab nodecat black if nodecat <9, chi col nofreq
. cc black nodecat
. cc black nodecat if nodecat <9

Compare results using “tab …, chi” and “cc”

“cc” and “tab”

. tab nodecat black if nodecat <9, chi col nofreq

           | 0=White (EA); 1=Black
0=stage N0 |         (AA)
      1=N1 |         0          1 |     Total
-----------+----------------------+----------
         0 |     63.53      55.49 |     61.65
         1 |     36.47      44.51 |     38.35
-----------+----------------------+----------
     Total |    100.00     100.00 |    100.00

          Pearson chi2(1) =  10.4916   Pr = 0.001

. cc black nodecat if nodecat <9

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |       223         278  |        501       0.4451
        Controls |       600        1045  |       1645       0.3647
-----------------+------------------------+------------------------
           Total |       823        1323  |       2146       0.3835
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.397092       |    1.134208    1.719531 (exact)
 Attr. frac. ex. |         .2842277       |    .1183272    .4184462 (exact)
 Attr. frac. pop |         .1265125       |
                 +-------------------------------------------------
                               chi2(1) =    10.49  Pr>chi2 = 0.0012
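Because cci is the immediate form, the same table can be entered straight from the four cell counts printed above, with no dataset in memory. The sketch below also recomputes two of the point estimates by hand with display, as a check on how they are defined; the hand calculations are added here for illustration and are not part of the original slides.

```stata
* Cell order for cci: exposed cases, unexposed cases, exposed controls, unexposed controls
cci 223 278 600 1045
* Odds ratio by hand: cross-product ratio of the 2 x 2 table
display (223*1045)/(278*600)
* Attributable fraction among the exposed: (OR - 1)/OR
display ((223*1045)/(278*600) - 1)/((223*1045)/(278*600))
```

The two display results should agree with the odds ratio (1.397092) and Attr. frac. ex. (.2842277) reported by cc.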

“ir”
ir is used with incidence-rate (incidence-density or person-time) data. It calculates point estimates and confidence intervals for the incidence-rate ratio and difference, along with attributable or prevented fractions for the exposed and total population.
iri is the immediate form of ir.

“ir”

. webuse rm
. list
. ir deaths male pyears, by(age)
. ir deaths male pyears, by(age) level(90)
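The immediate form iri takes the counts directly. In the sketch below the four numbers are illustrative only (they are not from the course dataset or from the rm example): 41 and 15 stand for events among the exposed and unexposed, and 28010 and 19017 for the corresponding person-time.

```stata
* Argument order for iri: exposed cases, unexposed cases, exposed person-time, unexposed person-time
* The counts are illustrative, not taken from the course data
iri 41 15 28010 19017
* Same table with a 90% confidence interval
iri 41 15 28010 19017, level(90)
* Incidence-rate ratio by hand: ratio of the two rates
display (41/28010)/(15/19017)
```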

“mcc”
• The mcc and mcci commands only allow 1:1 matching.
• For n:1 matching, you must use clogit (conditional logistic regression) in Stata.
• mcc is nonparametric, and the equivalent for n:1 matching is much more complex. Usually, conditional logistic regression is used for n:1 matching, and there is no need to go to nonparametric methods unless your total sample size is small (< ∼30).
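A minimal sketch of the two routes described above. The mcci counts are illustrative only. The clogit lines assume that the Stata example dataset lowbirth2 is available through webuse and contains the outcome low, the exposure smoke, and the matched-set identifier pairid; with your own data you would substitute your own variable names.

```stata
* Immediate form for 1:1 matched pairs; the four counts are illustrative only
mcci 8 8 3 8
* Conditional logistic regression handles 1:1 and n:1 matched sets
* Assumption: lowbirth2 has outcome low, exposure smoke, matched-set id pairid
webuse lowbirth2, clear
clogit low smoke, group(pairid) or
```

The or option reports odds ratios instead of coefficients, which makes the clogit output easier to compare with mcc.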
