Computing for Research I Spring 2011

Computing for Research I, Spring 2011. Exploring Data with Stata, February 14. Primary Instructor: Elizabeth Garrett-Mayer, PhD. Today: Joan Cunningham, PhD (cunninj@musc.edu).


Presentation Transcript


  1. Computing for Research I, Spring 2011. Exploring Data with Stata, February 14. Primary Instructor: Elizabeth Garrett-Mayer, PhD. Today: Joan Cunningham, PhD, cunninj@musc.edu

  2. Reminder of Help!!
     • The most important part
     • Two interactive options:
        • help ‘command’
        • help ‘search’
     • Also LARGE pdfs that link from help files
     • Plus:
        • advice
        • link to Stata
        • command line help
        • findit
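As a minimal sketch of the help options listed above, the lines below can be typed in the Command window. The command name and keywords (`tabulate`, `odds ratio`, `spikeplot`) are only illustrative choices, not part of the original slides.

```stata
* Open the help file for a command you already know by name
help tabulate

* Keyword search of Stata's documentation and FAQs
search odds ratio

* Wider search that also covers user-written material on the web
findit spikeplot
```

`help` is the quickest route when the command name is known; `search` and `findit` are more useful when only the topic is known.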

  3. Dataset for Today: SC breast cancer registry data from 2004
     • SCBC2004.v9.dta in your course outline
     • Recap: all breast cancer cases in SC, 2004 (how do we know the year?)
     • N = 2633, 55 variables
     • Demographic and clinical information recorded

  4. Useful for Input
     • You can input directly from a spreadsheet
     • Start with an empty data editor
        • Open the Stata Data Editor window (icon at top)
        • “Select all” in the spreadsheet and copy
        • Paste into the Data Editor
        • Close the Data Editor
        • Save your new dataset, using “Save as”
     • NOTE: date variables sometimes give trouble, so check them.
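One way the date check mentioned above might look, as a sketch: pasted dates often arrive as text, and `date()` converts them to Stata daily dates. The variable names `dxdate_str` and `dxdate` and the two date values are hypothetical, and the month/day/year layout is an assumption about the spreadsheet.

```stata
* Hypothetical pasted data: dxdate_str stands in for a date column
* that arrived as text. Note that clear drops any data in memory.
clear
input str10 dxdate_str
"02/14/2004"
"11/03/2004"
end

* Convert month/day/year text to a Stata daily date and display it readably
generate dxdate = date(dxdate_str, "MDY")
format dxdate %td
list
describe
```

If `describe` shows a date column stored as a string type (`str#`), it probably needs this kind of conversion before it can be used in calculations.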

  5. Example: SC breast cancer registry data from 2004
     • All diagnoses of breast cancer in SC are recorded
     • N = 2633, 55 variables
     • Demographic and clinical information recorded
     • Let’s read it in and explore it
        • use cd
        • use insheet
        • use ‘use’
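The three read-in commands named above could be combined roughly as follows. The folder path and the `.csv` file name are hypothetical placeholders; only `SCBC2004.v9.dta` comes from the slides.

```stata
* Hypothetical folder; point this at wherever you saved the course files
cd "C:\stata_course"

* Load the Stata-format dataset named on the slides
use "SCBC2004.v9.dta", clear

* insheet reads delimited text instead; this .csv file name is hypothetical
* insheet using "SCBC2004.csv", comma clear

* Quick check against the slide: expect 2633 observations and 55 variables
describe, short
count
```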

  6. “egen” with “group”
     • Same example: egen has a function ‘group’ that can create a new categorical variable from an existing variable, including string variables
     • Categories are defined by natural groupings within an existing variable: “Create racesex containing values 1, 2, ..., for the groups formed by race and sex and containing missing if race or sex are missing”

          . egen racesex = group(race sex)

     • We will use it to generate a categorical variable from a string variable

  7. “egen” with “group”

          . desc site

                        storage   display    value
          variable name   type    format     label      variable label
          -------------------------------------------------------------------------------
          site            str4    %4s                   Site in breast

          . tab site

              Site in |
               breast |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                 C500 |         11        0.42        0.42
                 C501 |        109        4.14        4.56
                 C502 |        222        8.43       12.99
                 C503 |        105        3.99       16.98
                 C504 |        840       31.90       48.88
                 C505 |        126        4.79       53.67
                 C506 |         12        0.46       54.12
                 C508 |        555       21.08       75.20
                 C509 |        653       24.80      100.00
          ------------+-----------------------------------
                Total |      2,633      100.00

  8. “egen” with “group” (cont’d)

          . egen sitecat = group(site)

          . tab sitecat

          group(site) |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    1 |         11        0.42        0.42
                    2 |        109        4.14        4.56
                    3 |        222        8.43       12.99
                    4 |        105        3.99       16.98
                    5 |        840       31.90       48.88
                    6 |        126        4.79       53.67
                    7 |         12        0.46       54.12
                    8 |        555       21.08       75.20
                    9 |        653       24.80      100.00
          ------------+-----------------------------------
                Total |      2,633      100.00

          . desc site*

                        storage   display    value
          variable name   type    format     label      variable label
          -------------------------------------------------------------------------------
          site            str4    %4s                   Site in breast
          sitecat         float   %9.0g                 group(site)

  9. “egen” with “group” (cont’d)

          . label define sitecat 1 "C500" 2 "C501" 3 "C502" 4 "C503" 5 "C504" 6 "C505" 7 "C506" 8 "C508" 9 "C509"

          . label values sitecat sitecat

          . tab sitecat

                 Site |
          categories, |
            from site |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                 C500 |         11        0.42        0.42
                 C501 |        109        4.14        4.56
                 C502 |        222        8.43       12.99
                 C503 |        105        3.99       16.98
                 C504 |        840       31.90       48.88
                 C505 |        126        4.79       53.67
                 C506 |         12        0.46       54.12
                 C508 |        555       21.08       75.20
                 C509 |        653       24.80      100.00
          ------------+-----------------------------------
                Total |      2,633      100.00

  10. “egen” with “group” (cont’d)

          . tab sitecat, nolabel

                 Site |
          categories, |
            from site |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    1 |         11        0.42        0.42
                    2 |        109        4.14        4.56
                    3 |        222        8.43       12.99
                    4 |        105        3.99       16.98
                    5 |        840       31.90       48.88
                    6 |        126        4.79       53.67
                    7 |         12        0.46       54.12
                    8 |        555       21.08       75.20
                    9 |        653       24.80      100.00
          ------------+-----------------------------------
                Total |      2,633      100.00

          . tab sitecat

                 Site |
          categories, |
            from site |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                 C500 |         11        0.42        0.42
                 C501 |        109        4.14        4.56
                 C502 |        222        8.43       12.99
                 C503 |        105        3.99       16.98
                 C504 |        840       31.90       48.88
                 C505 |        126        4.79       53.67
                 C506 |         12        0.46       54.12
                 C508 |        555       21.08       75.20
                 C509 |        653       24.80      100.00
          ------------+-----------------------------------
                Total |      2,633      100.00

          . desc site*

                        storage   display    value
          variable name   type    format     label      variable label
          -------------------------------------------------------------------------------
          site            str4    %4s                   Site in breast
          sitecat         float   %9.0g                 group(site)
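Typing value labels by hand is easy to get wrong here, because `tab site` shows the codes running from C500 to C509 with no C507, so group 8 is C508 and group 9 is C509. A possible shortcut, sketched below, is to let Stata copy the original strings into the value label. The new variable names `sitecat2` and `sitecat3` are hypothetical, and the sketch assumes SCBC2004.v9.dta is already loaded.

```stata
* The label option builds the value label from the strings in site
egen sitecat2 = group(site), label
tab sitecat2
tab sitecat2, nolabel

* encode is an alternative that also creates a labeled numeric variable
encode site, generate(sitecat3)
label list sitecat3
```

Either route keeps the numeric codes and the displayed site codes in step without a separate `label define`.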

  11. Histogram
      • Command line, or Graphics window

          . gr_example sp500: histogram volume

  12. Boxplot

          . graph box age, by(race)

  13. Spikeplot • spikeplot age, title(Age at Diagnosis)

  14. Stem-and-leaf
      • stem age
      • stem age, lines(2)
      • stem age, lines(1)
      • tab age5cat2
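The four displays above can be tried on the course data in one pass. This is only a sketch: it assumes SCBC2004.v9.dta is loaded and contains a numeric `age` and a `race` variable as used on the slides, and the histogram options are illustrative choices.

```stata
* Histogram of age, with counts rather than densities on the y axis
histogram age, frequency title("Age at Diagnosis")

* Side-by-side boxplots of age within a single graph, one box per race group
graph box age, over(race)

* One spike per distinct age value
spikeplot age, title("Age at Diagnosis")

* Text-based stem-and-leaf display, two lines per stem
stem age, lines(2)
```

`over(race)` puts the boxes in one panel, while `by(race)` draws a separate panel per group; which reads better depends on the number of groups.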

  15. Tables for epidemiologists (cc and cci)
      • 2 x 2 tables
      • Requires variables in form of 0,1

          . tab nodecat black
          . tab nodecat black if nodecat <9

                     |      black white
               nodes |         0          1 |     Total
          -----------+----------------------+----------
                   0 |     1,045        278 |     1,323
                   1 |       600        223 |       823
          -----------+----------------------+----------
               Total |     1,645        501 |     2,146

  16. “cc”
      cc is used with case-control and cross-sectional data. Point estimates and confidence intervals for the odds ratio are calculated, along with attributable or prevented fractions for the exposed and total population. cci is the immediate form of cc.

  17. “cc” and “tab”

          . tab nodecat black if nodecat <9, chi
          . tab nodecat black if nodecat <9, chi col
          . tab nodecat black if nodecat <9, chi col nofreq
          . cc black nodecat
          . cc black nodecat if nodecat <9

      Compare results using “tab …, chi” and “cc”

  18. “cc” and “tab”

          . tab nodecat black if nodecat <9, chi col nofreq

                     |  0=White (EA); 1=Black
          0=stage N0 |          (AA)
                1=N1 |         0          1 |     Total
          -----------+----------------------+----------
                   0 |     63.53      55.49 |     61.65
                   1 |     36.47      44.51 |     38.35
          -----------+----------------------+----------
               Total |    100.00     100.00 |    100.00

                    Pearson chi2(1) =  10.4916   Pr = 0.001

          . cc black nodecat if nodecat <9

                                                                   Proportion
                           |   Exposed   Unexposed  |      Total     Exposed
          -----------------+------------------------+------------------------
                     Cases |       223         278  |        501       0.4451
                  Controls |       600        1045  |       1645       0.3647
          -----------------+------------------------+------------------------
                     Total |       823        1323  |       2146       0.3835
                           |                        |
                           |      Point estimate    |    [95% Conf. Interval]
                           |------------------------+------------------------
                Odds ratio |         1.397092       |    1.134208    1.719531 (exact)
           Attr. frac. ex. |         .2842277       |    .1183272    .4184462 (exact)
           Attr. frac. pop |         .1265125       |
                           +-------------------------------------------------
                                         chi2(1) =    10.49  Pr>chi2 = 0.0012
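The same table can be rebuilt without the dataset by using the immediate form `cci`, which takes the four cell counts directly. In the sketch below the counts are copied from the cc output above, in the order exposed cases, unexposed cases, exposed controls, unexposed controls. Note that `cc` treats its first variable as the case indicator and its second as the exposure, so here "cases" means black = 1 and "exposed" means nodecat = 1.

```stata
* Counts from the cc output: exposed cases, unexposed cases,
* exposed controls, unexposed controls
cci 223 278 600 1045

* Cross-product odds ratio by hand, for comparison with the cc point estimate
display (223*1045)/(278*600)
```

The hand calculation should agree with the odds ratio of about 1.397 reported by cc, which is a quick way to confirm that the cell order was entered correctly.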

  19. “ir”
      ir is used with incidence-rate (incidence-density or person-time) data. It calculates point estimates and confidence intervals for the incidence-rate ratio and difference, along with attributable or prevented fractions for the exposed and total population. iri is the immediate form of ir.

  20. “ir”

          . webuse rm
          . list
          . ir deaths male pyears, by(age)
          . ir deaths male pyears, by(age) level(90)
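For a quick calculation without a dataset, the immediate form `iri` takes the counts directly: exposed cases, unexposed cases, exposed person-time, unexposed person-time. The numbers in this sketch are illustrative only and are not taken from the course data.

```stata
* Illustrative counts only (not from the course data):
* 41 exposed cases, 15 unexposed cases, 28010 and 19017 person-years
iri 41 15 28010 19017

* Same table with a 90% confidence level
iri 41 15 28010 19017, level(90)
```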

  21. “mcc”
      • The mcc and mcci commands only allow 1:1 matching.
      • For n:1 matching, you must use clogit (conditional logistic regression) in Stata.
      • mcc is nonparametric and the equivalent for n:1 matching is much more complex. Usually, conditional logistic regression is used for n:1 matching and there is no need to go to nonparametric methods unless your total sample size is small (< ∼30).
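A rough sketch of the two approaches follows. The four pair counts given to `mcci` are hypothetical. The `clogit` call assumes internet access and uses Stata's `lowbirth2` example dataset, on the assumption that it contains the variables `low`, `smoke`, and `pairid`; that dataset is 1:1 matched, so it only illustrates the syntax, which is the same for n:1 matched sets.

```stata
* Immediate form for 1:1 matched pairs; the four pair counts are hypothetical
mcci 8 8 3 8

* Conditional logistic regression on a Stata example dataset
* (assumes internet access and that lowbirth2 has low, smoke, and pairid)
webuse lowbirth2, clear
clogit low smoke, group(pairid) or
```

The `group()` option identifies the matched set, and `or` reports odds ratios instead of coefficients.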
