Advanced Stata Workshop Nealia Khan Tom Tomberlin Learning Technologies Center Harvard University, Graduate School of Education
Contact Information • Located in Gutman Library, 3rd floor • Contact us at: stathelp@gse.harvard.edu • Can make an appointment or have us respond to your request via email
Generating Variables • Generate (gen) – creates a new variable; use replace to change the contents of an existing one. The generate command accepts mathematical functions and conditional statements, and many built-in functions can be used in a gen expression. • Extensions to generate (egen) – creates new variables using egen-specific functions. egen cannot be used interchangeably with gen: you must use one of the egen functions with this command.
Gen and Egen Functions • group • concat • cond
Group • Assign a unique, three-digit numeric value to each district and school name • Code: • sort district • egen group = group(district) • sort group • gen districtid = group+100 • drop group
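The district id steps can be tried on a small toy dataset. This is only a sketch: the district and school names below are invented for illustration, and the helper variable is called dgroup here rather than group.

```stata
* Toy data: district and school names are hypothetical
clear
input str10 district str12 schname
"Adams"   "Oak"
"Adams"   "Pine"
"Baker"   "Elm"
"Baker"   "Elm"
"Clark"   "Birch"
end

* egen group() numbers the distinct values of district 1, 2, 3, ...
egen dgroup = group(district)

* Shift the numbering so that ids start at 101
gen districtid = dgroup + 100
drop dgroup

list district districtid, sepby(district)
```

Every observation in the same district receives the same districtid, so repeated rows do not create new ids.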
Apply Your Knowledge • Generate the variable schoolid that is a three digit number beginning with 301 which uniquely identifies each school in the schname variable. • Hint: Sorting by both district and schname will keep schools within a district together in the id creation.
Concat • Join the two newly generated ids (district and school) into one six-digit identifier that uniquely identifies each school. Note that concat() stores the result as a string variable. • Code: egen id = concat(districtid schoolid)
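A minimal sketch of concat on toy data, assuming districtid and schoolid already exist as numeric variables (the values below are invented). The destring step is an optional extra, not part of the workshop code, for cases where a numeric id is needed.

```stata
* Toy ids: values are hypothetical
clear
input districtid schoolid
101 301
101 302
102 303
end

* concat() joins the values and returns a string variable
egen id = concat(districtid schoolid)

* Optional: make a numeric copy if arithmetic or numeric sorting is needed
destring id, generate(id_num)
format id_num %9.0f

list
```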
Condition (cond) • We can use the cond() function in a generate command to flag duplicate observations of an id number: dup is 0 when an id occurs only once, and otherwise numbers the observations within each id (1, 2, 3, ...). • Code: • sort id race female • quietly by id: gen dup = cond(_N==1,0,_n)
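To see what cond(_N==1,0,_n) produces, here is a small sketch on invented id values. Within each by-group, _N is the number of observations and _n is the position of the current observation.

```stata
* Toy data: id values are hypothetical
clear
input id
101301
101301
101301
102303
end

sort id

* Unique ids get 0; repeated ids are numbered 1, 2, 3, ... within the id
quietly by id: gen dup = cond(_N==1, 0, _n)

list, sepby(id)
tabulate dup
```

The largest value of dup within an id is the number of times that id appears.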
Apply Your Knowledge • Check to see how many duplicates there are for each value of id. • Create the variable student and assign a value of 0 if there are 3 or fewer duplicates of id, a value of 1 if there are 4-6 duplicates. • Generate a new variable, studentid, which joins id and student together. • Drop the id, dup, and student variables from the dataset.
Forming Composites • We can create a new variable, risk, that is the sum of the four dummy variables. • Code: • gen risk = lep + sped + lo_read + frlunch • With gen, risk is missing whenever any of the four dummies is missing. The rowtotal() function of egen avoids this problem by treating missing values as zero. • Code: • egen risk = rowtotal(lep sped lo_read frlunch)
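The difference between the two approaches shows up as soon as one item is missing. The sketch below uses invented values and separate variable names (risk_sum, risk_tot, n_miss) so both versions can be compared side by side; rowmiss() is added here only as a check on how many items were missing.

```stata
* Toy data: the second observation has a missing sped value
clear
input lep sped lo_read frlunch
0 1 0 1
1 . 0 1
0 0 0 0
end

* Plain addition: any missing item makes the sum missing
gen risk_sum = lep + sped + lo_read + frlunch

* rowtotal() treats missing items as zero
egen risk_tot = rowtotal(lep sped lo_read frlunch)

* Count the missing items per observation
egen n_miss = rowmiss(lep sped lo_read frlunch)

list
```

Whether treating a missing item as zero is appropriate depends on the analysis; n_miss makes it easy to see which totals rest on incomplete data.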
Forming Composites • We can use principal components analysis (PCA) to generate weights for each of the items in risk. The predict command then generates a risk score (by default, the score on the first component) for each observation in the dataset. • Code: • pca lep sped lo_read frlunch • predict risk • browse lep sped lo_read frlunch risk
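The same pca and predict sequence can be tried on the auto dataset that ships with Stata. This is a sketch only: the four variables are stand-ins chosen for illustration, not the workshop items, and pc1 is a hypothetical name.

```stata
* Example data shipped with Stata
sysuse auto, clear

* PCA on four stand-in items (uses the correlation matrix by default)
pca mpg weight length displacement

* Score on the first principal component for each observation
predict pc1, score

summarize pc1
```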
Categorical Variables • Our dataset contains the categorical variable race. • We can deal with this variable in one of two ways: • Form dummy variables for each race subgroup. Code: • gen race1=race if race==1 • replace race1=0 if race~=1
Categorical Variables • Our dataset contains the categorical variable race. • We can deal with this variable in one of two ways: • Indicate the categorical nature of the variable in the regression model. Code: regress mathraw i.race
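Both approaches can be sketched with the auto dataset that ships with Stata, using rep78 as a stand-in for race. The tabulate, generate() shortcut and the variable names rep_ and rep3 are illustrative choices, not part of the workshop code, and the i. prefix assumes a Stata version with factor-variable support.

```stata
* Example data shipped with Stata; rep78 stands in for race
sysuse auto, clear

* Way 1a: one indicator per category (rep_1 ... rep_5),
* missing wherever rep78 is missing
tabulate rep78, generate(rep_)

* Way 1b: a single indicator built by hand, keeping missing as missing
gen byte rep3 = (rep78 == 3) if !missing(rep78)

* Way 2: let the regression command expand the categories itself
regress price i.rep78
```

Keeping missing values missing in the hand-built indicator avoids counting observations with unknown category as zeros.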
Interactions with Categorical Variables • Does the effect of risk vary by racial group? • xi3 allows us to form interactions from within the regression model. Code: xi3: regress mathraw i.race*risk
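As a point of comparison, an interaction of this kind can also be written with Stata's built-in factor-variable notation instead of xi3. This is a swapped-in technique, not the workshop's method, and it assumes a Stata version with factor-variable support. The sketch uses the auto dataset, with foreign standing in for race and weight for risk.

```stata
* Example data shipped with Stata
sysuse auto, clear

* ## requests main effects plus the interaction;
* c. marks weight as continuous
regress price i.foreign##c.weight

* Test whether the slope of weight differs between groups
testparm i.foreign#c.weight
```

A small p-value from testparm suggests that the effect of the continuous predictor differs across the categories.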
Apply Your Knowledge • Use the xi3 command to fit a regression model that includes interaction effects between race and class size. Do we see a differential effect of class size for any race groups? • Use the xi3 command to fit a regression model that includes interaction effects between race and both class size and time on the bus.
Creating Regression Tables • Stata has the ability to create formatted tables for regression models. • These tables can be created within the Stata program or exported to a text format. • Both methods of table creation rely on the estimate store (eststo) command in Stata
Creating Regression Tables • There are two methods of storing the estimates of a regression model with eststo: • Invoking eststo immediately after a regression procedure and assigning a name for the stored values. Code: • regress mathraw i.race risk • eststo m1, title(Model 1)
Creating Regression Tables • There are two methods of storing the estimates of a regression model with eststo: • Using eststo: before a regression model. Stata will automatically assign consecutive model numbers to the stored values. Code: eststo: regress mathraw risk (est1 stored)
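Both storing methods can be sketched on the auto dataset that ships with Stata. This assumes the user-written estout package (which provides eststo, estout, and esttab) is already installed, for example via ssc install estout; the models themselves are placeholders.

```stata
* Assumes the estout package is installed: ssc install estout
sysuse auto, clear

* Start with an empty set of stored estimates
eststo clear

* Method 1: fit the model, then store it under a chosen name
regress price weight
eststo m1, title(Model 1)

* Method 2: prefix the model with eststo: and let it pick the name
eststo: regress price weight mpg

* Tabulate everything stored so far
esttab, se r2
```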
Creating Regression Tables • There are two commands that allow you to access the information in the estimate store memory – estout and esttab. • estout is the most flexible in its ability to modify the appearance of the formatted regression table, but it also requires more programming code to achieve APA style tables. • esttab is a “wrapper” for estout and simplifies the coding process.
ESTOUT • Code: • eststo: xi3: regress mathraw i.race (est1 stored) • eststo: xi3: regress mathraw i.race risk class_sz bus_time (est2 stored) • eststo: xi3: regress mathraw i.race*risk class_sz bus_time (est3 stored)
ESTOUT • Code: estout est1 est2 est3 • Now let’s try to format this table into something suitable for a research paper: • Code: estout using models_out.rtf, cells(b(star fmt(3)) se(par fmt(2))) legend label title(Regression Models) mlabels("Model A" "Model B" "Model C") varlabels(_cons INTERCEPT) stats(N r2 df_r, fmt(0 3 0) labels(N R2 DF)) style(fixed) • Here is the result of this code:
ESTTAB • Code: esttab • We can make modifications to the standard esttab table: • Code: esttab using models.rtf, se r2 ar2 label title({\b Table 1.} {\i Hierarchy of Fitted Models}) nonumbers mtitles("Model A" "Model B" "Model C") varlabels(_cons INTERCEPT) order( _Irace_2 _Irace_3 _Irace_4 _Irace_5 class_sz bus_time risk _Ira2Xri _Ira3Xri _Ira4Xri _Ira5Xri) style(fixed) • Here is the result of this code:
ESTTAB • Here is the code for estout to produce the same table we just created in esttab: • estout using `"models1.rtf"' , cells(b(fmt(a3) star) se(fmt(a3) par)) stats(N r2 r2_a, fmt(%18.0g 3 3) labels(`"Observations"' `"{\i R}{\super 2}"' `"Adjusted {\i R}{\super 2}"')) starlevels("{\super *}" 0.05 "{\super **}" 0.01 "{\super ***}" 0.001, label(" {\i p} < ")) varwidth(20) modelwidth(12) begin({\trowd\trgaph108\trleft-108@rtfrowdefbrdr\pard\intbl\ql {) delimiter(}\cell \pard\intbl\qc {) end(}\cell\row}) title({\b Table 1.} {\i Hierarchy of Fitted Models}) prehead(`"{\rtf1\ansi\deff0 {\fonttbl{\f0\fnil Times New Roman;}}"' `"{\info {\author .}{\company .}{\title .}{\creatim\yr2010\mo3\dy31\hr14\min14}}"' `"\deflang1033\plain\fs24"' `"{\footer\pard\qc\plain\f0\fs24\chpgn\par}"' `"{\pard\keepn\ql @title\par}"' {) posthead() prefoot() postfoot(`"{\pard\ql\fs20 Standard errors in parentheses\par}"' `"{\pard\ql\fs20 @starlegend\par}"' } `"{\pard \par}"' `"}"') label varlabels(_cons INTERCEPT) mlabels("Model A" "Model B" "Model C",) nonumbers collabels(, none) eqlabels(, begin("{\trowd\trgaph108\trleft-108@rtfrowdefbrdrt\pard\intbl\ql {") replace nofirst) notype level(95) replace order( _Irace_2 _Irace_3 _Irace_4 _Irace_5 class_sz bus_time risk _Ira2Xri _Ira3Xri _Ira4Xri _Ira5Xri) style(fixed)
Graphing - Scatterplots • Bivariate Scatterplot – mathraw on risk • Code: scatter mathraw risk • Since risk takes only a few discrete values, the points pile up and the graph has a “stacked” appearance. We can lessen this effect with the jitter option, which adds a small amount of random noise to the position of each point. • Code: scatter mathraw risk, jitter(4) • Here is the resulting graph:
Graphing - Scatterplots • Now let’s add a fitted trend line to the scatterplot of mathraw on risk. • Code: twoway scatter mathraw risk, jitter(4) || lfit mathraw risk • The lfit plot type gives us a linear fitted trend line. There are two other plot types for the trend line – qfit (quadratic fit) and fpfit (fractional polynomial fit). • Here are three graphs that illustrate these three fitted line options:
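The three fit types can be compared side by side with a short sketch on the auto dataset that ships with Stata. The variables mpg and weight and the graph names linear, quad, and fp are placeholders chosen for illustration.

```stata
* Example data shipped with Stata
sysuse auto, clear

* Scatterplot overlaid with each of the three fitted lines
twoway (scatter mpg weight, jitter(4)) (lfit mpg weight), name(linear, replace)
twoway (scatter mpg weight, jitter(4)) (qfit mpg weight), name(quad, replace)
twoway (scatter mpg weight, jitter(4)) (fpfit mpg weight), name(fp, replace)

* Put the three graphs in one row with a shared y axis
graph combine linear quad fp, rows(1) ycommon
```

The parenthesized form used here is equivalent to separating the plots with ||.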
Graphing - Scatterplots • We can easily add 95% confidence intervals to the fitted trend line for any of the fitted trend line options. • Code: • twoway scatter mathraw risk, jitter(4) || lfitci mathraw risk, ciplot(rline) • twoway scatter mathraw gpa || qfitci mathraw gpa, ciplot(rline) • twoway scatter mathraw risk, jitter(4) || fpfitci mathraw risk, ciplot(rline) • Here are the three graphs from the code above:
Graphing – Residual Scatterplots • Let’s begin by fitting our regression model: xi3: regress mathraw female class_sz bus_time i.race*risk • Stata has two postestimation commands that allow us to check (raw) residuals against predictors and fitted values. Code: • rvpplot class_sz, yline(0) • rvfplot, yline(0) • Here are the graphs for these two commands:
Graphing – Residual Scatterplots • Suppose we want to plot the studentized residuals against the predictors and fitted values. • We must generate studentized residuals for each observation and also predict fitted values of mathraw for each observation. • Code: • predict student if e(sample), rstudent • predict fitted • scatter student fitted, yline(2) yline(-2) • scatter student class_sz, yline(2) yline(-2)
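A minimal sketch of the same residual checks on the auto dataset that ships with Stata. The model and the variable names rstu and yhat are placeholders; the final list command is an added convenience for inspecting observations outside the +/-2 band.

```stata
* Example data shipped with Stata
sysuse auto, clear

regress price weight mpg foreign

* Studentized residuals and fitted values for the estimation sample
predict rstu if e(sample), rstudent
predict yhat if e(sample), xb

* Residuals against fitted values, with reference lines at -2, 0, and 2
scatter rstu yhat, yline(-2 0 2)

* Observations with unusually large studentized residuals
list make price rstu if abs(rstu) > 2 & !missing(rstu)
```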
Graphing – Regression Lines • We can create fitted regression lines in Stata by using the xi3 command with the regression command. • The graph is generated by the postgr3 command followed by the variable of interest. • Code: • xi3: regress mathraw i.race*risk class_sz bus_time female • postgr3 risk • Here is the graph which results from this code:
Graphing – Regression Lines • We can enhance our graph to show the effect of risk on mathraw by including prototypical values of class_sz in the graph. • Code: • gen class_cat=1 if class_sz<=17 • replace class_cat=2 if class_sz>17 & class_sz<=30 • replace class_cat=3 if class_sz>30 & !missing(class_sz) • xi3: regress mathraw i.race*risk class_cat bus_time female • postgr3 risk, by(class_cat) • Here is the graph from the code:
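When building categories like class_cat, it helps to remember that Stata treats missing values as larger than any number, so a condition such as class_sz>30 is also true for missing class sizes. The sketch below uses invented class sizes and guards the last category with !missing(); the recode line is an alternative shown only for comparison and assumes integer class sizes.

```stata
* Toy data: class sizes are hypothetical, the last one is missing
clear
input class_sz
12
17
18
30
31
.
end

* Three categories; missing class sizes stay missing
gen class_cat = 1 if class_sz <= 17
replace class_cat = 2 if class_sz > 17 & class_sz <= 30
replace class_cat = 3 if class_sz > 30 & !missing(class_sz)

* Alternative for integer-valued class sizes
recode class_sz (min/17 = 1) (18/30 = 2) (31/max = 3), generate(class_cat2)

list
```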
Graphing – Regression Lines • We can also generate a graph of the regression lines which show the interactions between risk and race. • Code: postgr3 risk, by(race) • Here is the graph produced by the code:
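For readers without postgr3, a similar plot of fitted lines by group can be sketched with the built-in margins and marginsplot commands. This is a swapped-in technique, not the workshop's method, and it assumes a Stata version that includes marginsplot. The auto dataset stands in for the workshop data, with foreign in place of race and weight in place of risk.

```stata
* Example data shipped with Stata
sysuse auto, clear

* Model with a group-by-continuous interaction
regress price i.foreign##c.weight mpg

* Predicted price for each group over a range of weights
margins foreign, at(weight = (2000(500)4500))

* One fitted line per group, without confidence intervals
marginsplot, noci
```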
Graphing – Regression Lines • We can use the graph combine command in Stata to join two graphs together. • Code: • xi3: regress mathraw class_sz bus_time female i.race*risk • postgr3 risk, by(race) x(female=1) name(female) • postgr3 risk, by(race) x(female=0) name(male) • graph combine female male, ycommon • Here is the graph for the preceding code:
Questions: • Please complete the evaluation of this workshop • Thank you!