
Automatisation in Stata

Automatisation in Stata. Jan Hagemejer & Joanna Tyrowicz. Plan. Standard solutions. Where do they not work? Usually there is more than one way to estimate: how to choose? Using loops and global macros together. Generating resultssets for atypical estimations.


  1. Automatisation in Stata Jan Hagemejer & Joanna Tyrowicz

  2. Plan
     • Standard solutions
     • Where do they not work?
     • Usually there is more than one way to estimate: how to choose?
     • Using loops and global macros together
     • Generating resultssets for atypical estimations
     • Difficulties with using bootstrap (and obtaining resultssets)
     • Summary comments … and some advice

  3. The standard route
     • Problem: several estimations of a similar form.
     • Need to compare results.
     • Three simple solutions:
       • Solution 1: brute force = sit & type
       • Solution 2: use parmby/parmest (commands developed by Roger Newson) if the estimations run over simple categories in the data (limitations of the „by” command)
       • Solution 3: use loops (see N. Cox's material from the previous SUGM)
     • For the output: outreg/outreg2
       • nicely formatted tables,
       • publication-ready,
       • in many formats, even LaTeX.
     • Note: if you need nice summary statistics, you can use outsum either with by or within loops

  4. Where do the problems come from?
     • The 2nd and 3rd solutions work only with regression-type estimations
     • However, some procedures are incompatible with pre-cooked solutions
     • Examples:
       • Marginal effects
         • Use outreg2 in Stata 10 if used probit/logit instead of probit/logit
         • Use outreg2 in Stata 11 with margins and/or mfx2 (remember the replace option)
       • Nice statistics
         • Use the tempname and postfile syntax
       • Rolling window on any of this type of analysis
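A minimal sketch of the tempname and postfile route mentioned on this slide, using only built-in commands. The auto dataset, the variable list and the names handle and results are illustrative assumptions, not part of the original slides.

```stata
* Sketch: collect statistics from a loop into a dataset with postfile.
* Dataset, variables and macro names are illustrative assumptions.
sysuse auto, clear
tempname handle
tempfile results

* Declare the structure of the results file: one row per variable
postfile `handle' str20 variable double mean double sd long n using `results', replace

foreach v of varlist price mpg weight {
    quietly summarize `v'
    * r(mean), r(sd) and r(N) are left behind by summarize
    post `handle' ("`v'") (r(mean)) (r(sd)) (r(N))
}
postclose `handle'

* The posted rows are now an ordinary dataset
use `results', clear
list, noobs
```

The same pattern works inside any loop: whatever can be read from r() or e() after a command can be posted as one row of the results file.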

  5. Not everything may be solved this way…
     • Reason 1: things are more complex than they seem (to come in a sec…)
     • Reason 2: some things are not listed in the output:
       • Example: various versions of R2 or the sample size in simple regressions
       • outreg/parmest typically do not include them
       • they can be included as additional locals
       • you need to know which locals they are => solution: the family of „return list” commands
         • ret li => results stored in r(), general commands
         • eret li => results stored in e(), estimation commands
         • sret li => results stored in s(), programming commands
     • Practical example
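The „return list” family can be tried out on any built-in dataset. The sketch below assumes the auto data shipped with Stata and an arbitrary regression; it only shows where the stored results live and how to copy them into locals.

```stata
* Sketch: inspecting stored results (dataset and model are illustrative)
sysuse auto, clear

quietly summarize price
return list              // r(N), r(mean), r(sd), ... from a general command

quietly regress price mpg weight
ereturn list             // e(N), e(r2), e(r2_a), ... from an estimation command

* Copy what is needed into locals before the next command overwrites it
local r2 = e(r2)
local n  = e(N)
display "R-squared = " %6.4f `r2' ", N = " `n'
```

Results in r() are replaced by the next r-class command, so they should be copied into locals straight away.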

  6. Cookbook for „simple” problems
     • Run the procedure
     • Check with the „return list” family which statistics you need
     • Add locals that should be generated after the procedure
     • Add these statistics to the outreg2/parmest commands
        forvalues no = 1(1)10 {
            xi: xtreg x y z i.year i.month if g`no'==1, fe robust
            * copy the stored results into locals
            local Between = e(r2_b)
            local Within  = e(r2_w)
            local No_min  = e(g_min)
            local No_max  = e(g_max)
            local No_avg  = e(g_avg)
            * append one column per group, with the extra statistics
            outreg2 using file.xls, bdec(4) title(Title) ctitle(`no') append excel ///
                addstat(R2 between, `Between', R2 within, `Within', ///
                No min, `No_min', No max, `No_max', No average, `No_avg')
        }

  7. Our problem is different: application to PSM
     • Need to report:
       • output of the procedure
       • sample properties after matching
       • balancing properties of matching
     • Problem 1: actually, none of these is in the typical output
     • Problem 2: we need it for many estimations looped over many variables, and each one of them takes a looooong time

  8. Detailed problem description
     • Analyse the effects of privatisation
     • Observe what happens before and after the „event” of privatisation, but time runs:
       • E.g. firm A may be one year before privatisation in 1999 and firm B in 2006, so the „event” is an anchor and time „runs” both ways.
     • Effects may be observed in many spheres:
       • E.g. profits, investments, international competitiveness, employment
     • Effects may be due to self-selection
       • E.g. only better firms are privatised, so the difference in performance is not due to the privatisation
     • Effects may be largely due to self-selection
       • The Heckman correction will tell about the statistical significance but not about the economic relevance
       • Propensity score matching is the best solution

  9. Detailed problem description
     • Run a logistic regression:
       • Dependent variable: Y = 1 if participate; Y = 0 otherwise.
       • Choose appropriate conditioning (instrumental) variables.
       • Obtain the propensity score: predicted probability (p) or log[p/(1 − p)].
     • Match each participant to one or more nonparticipants on the propensity score:
       • Choose an adequate metric
     • Compare outcome variables
       • Example: test the equality of means in the treated and control groups
     • In PSM: obtaining the pscore is irrelevant, but matching is key
     • To verify if matching is ok, need to run some diagnostics
       • Example: compare the balancing properties after matching (so-called bias reduction thanks to matching)
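The first step above (logistic regression, then the propensity score as p or log[p/(1 − p)]) can be sketched with built-in commands only. The auto data, the treatment indicator foreign and the two covariates are stand-ins chosen for illustration; the matching step itself is not shown here.

```stata
* Sketch: estimating a propensity score by hand (illustrative data and model)
sysuse auto, clear

* Y = 1 if "treated" (here: foreign), conditioning variables mpg and weight
logit foreign mpg weight

* Predicted probability p and its log-odds transformation
predict double p, pr
generate double logit_p = ln(p/(1 - p))

* Compare the score between the two groups before any matching
summarize p logit_p
ttest p, by(foreign) unequal
```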

  10. Detailed problem description
     • Thus, in our case:
       • Many time periods (for each „time-to-anchor” a separate estimation)
       • Many variables (for each variable separate outcomes, but within one period the same balancing properties)
       • Two ways of estimating: regular and bootstrapping (especially the latter made things complex)
       • Each estimation: roughly 1.5-3.5 hours
       • Over a hundred estimations
     • Additional pitfalls:
       • We needed some statistics for all estimations and they were not in the return list
       • More precisely: the procedure computes them to be able to produce output, but they were not added to the return list by the authors

  11. Summary of the problems
     Our problem was quite specific… BUT consisted of many general problems:
     • Loops take a lot of time: need to find efficient ways
     • Some things cannot be obtained fast => even more reasons to run it automatically
     • Obtaining datasets of the variables we need (so-called resultssets)
     • Getting visible data if they are not an output
     • Using invisible data
     • Getting around with bootstrap

  12. The structure of our estimations

  13. Using pscore or psmatch?

  14. Event loop: using pscore or psmatch?
     • Typical psmatch syntax:
        psmatch2 treat treatment_determinants, out(outcomes) options
     • Alternative
       • Estimate the pscore first:
        pscore treatment treatment_determinants, pscore(name)
       • Run:
        psmatch2 treatment pscore, out(outcomes) options
     • How to choose?
       • If you want to bootstrap, a pscore estimated once will save you time
       • If you want to introduce a data-fitted caliper into the options, pscore first is a must

  15. How can global macros be useful?

  16. Event loop: using global macros for estimations
     • Our application: observe the same firms back and forth from the moment of the privatisation („event”)
     • „Events” happen in different years
     • But we can only match on one dimension: has or has not had the „event”
     • Conceptual solution: use lags and forwards to get the time dimension
     • Technical problem: many outcome variables and de facto many loops
     • Technical solution: define the matching variables and the output variables separately
        * no equals sign: the lists are stored as typed, not evaluated as string expressions
        global in "cut* remoteness eksporter energia obrot klratio roa ros indebtedness wsk_plynnosci net_income_efficiency klratio_new roa_new indebtedness_new indebtedness_new wsk_plynnosci_new"
        global out "te_new redukcja wzrost_zatr share_export lewar s_eff"
        global outf1 "ff1_te_new ff2_te_new ff3_te_new ff4_te_new ff5_te_new ff1_redukcja ff2_redukcja ff3_redukcja ff4_redukcja ff5_redukcja ff1_wzrost_zatr ff2_wzrost_zatr ff3_wzrost_zatr ff4_wzrost_zatr ff5_wzrost_zatr"
        global outf2 "ff1_share_export ff2_share_export ff3_share_export ff4_share_export ff5_share_export ff1_lewar ff2_lewar ff3_lewar ff4_lewar ff5_lewar ff1_s_eff ff2_s_eff ff3_s_eff ff4_s_eff ff5_s_eff"
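How such globals are consumed in a loop can be seen in a small self-contained sketch. The auto data and the short variable lists below are assumptions made for illustration; only the pattern (one global for the right-hand side, one for the outcomes) follows the slide.

```stata
* Sketch: variable lists in globals, reused inside a loop (illustrative lists)
sysuse auto, clear

global in  "mpg weight length"     // conditioning variables
global out "price turn"            // outcome variables

* One estimation per outcome, always with the same right-hand side
foreach y of global out {
    quietly regress `y' $in
    display "`y': R2 = " %5.3f e(r2) ", N = " e(N)
}
```

Changing a list in one place then changes every estimation that refers to it, which is the point of defining the lists once at the top of the do-file.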

  17. Event loop: the beginning of the estimations, so far
        forvalues d = 6(1)18 {
            use data, clear
            capture log close
            capture drop our_pscore* caliper* mean* diff* ttest* se_after* se_before* treated nontreated
            log using priv_caliper_`d', text replace
            pscore d`d' $in, pscore(our_pscore_`d')
            * caliper = square root of the average of the two group variances of the pscore
            ttest our_pscore_`d', by(d`d') unequal
            capture drop sd_nontreated sd_treated
            gen sd_nontreated = `r(sd_1)'
            gen sd_treated = `r(sd_2)'
            gen caliper_`d' = ((sd_treated^2 + sd_nontreated^2)/2)^0.5
            sum caliper_`d'
            local c_real = `r(mean)'
            hist our_pscore_`d', by(d`d')
            graph export "our_pscore_d`d'.png", replace
            psmatch2 d`d' our_pscore_`d', out($out $outf1 $outf2) common add mahalanobis(nace) caliper(`c_real')

  18. Getting from results to „resultssets”

  19. Why (and what) do we need (in) the resultssets?
     • Why?
       • Most importantly: without resultssets we cannot
         • analyse the changes over time
         • decompose the observed differentials
       • If we do not do it automatically, it would have to be copied manually from logs: many estimations, many variables, etc.
     • What? Step 1: find out the reality
       • Size of each of the three groups: treated, total and control (= matched)
       • Averages in all three groups (medians, etc.)
       • Knowledge whether they are in fact different (= test of statistical significance based on the difference and the standard error of this difference)
     • What? Step 2: find out how good the findings are statistically
       • Balancing properties!

  20. Variables loop: our solution to step 1
        * copy the psmatch2 results first: summarize and ttest below overwrite r()
        foreach out in $out $outf1 $outf2 {
            local se_after_`out' = r(seatt_`out')
            local diff_after_`out' = r(att_`out')
        }
        foreach out in $out $outf1 $outf2 {
            gen se_after_`out' = `se_after_`out''
            gen diff_after_`out' = `diff_after_`out''
            sum `out' if d`d'==0 & _support==1
            local mean_nontreated = r(mean)
            gen mean_nontreated_`out' = `mean_nontreated'
            sum `out' if d`d'==1 & _support==1
            local mean_treated = r(mean)
            gen mean_treated_`out' = `mean_treated'
            ttest `out' if _support==1, by(d`d') unequal
            local se_before = r(se)
            gen se_before_`out' = `se_before'
            local mean_before = r(mu_2) - r(mu_1)
            gen diff_before_`out' = `mean_before'
            gen ttest_before_`out' = diff_before_`out'/se_before_`out'
            gen ttest_after_`out' = diff_after_`out'/se_after_`out'
     (continued on the next slide)

  21. Variables loop: our solution to step 1, continued
            foreach type in before after {
                label var se_`type'_`out' "Standard error of difference `type' matching"
                label var diff_`type'_`out' "Difference `type' matching"
                label var ttest_`type'_`out' "T-test of difference"
            }
            label var mean_treated_`out' "Mean of treated companies"
            label var mean_nontreated_`out' "Mean of non-treated companies (before matching)"
        }
        count if d`d'==1 & _support==1
        local treated = r(N)
        gen treated = `treated'
        label var treated "No of treated companies"
        count if d`d'==0 & _support==1
        local nontreated = r(N)
        gen nontreated = `nontreated'
        label var nontreated "No of control companies"

  22. Variables loop: our solution to step 2
        pstest $in
        foreach in in $in {
            capture local bias_reduction = r(bired_`in')
            capture local pvalue_bef = r(pbef_`in')
            capture local pvalue_after = r(paft_`in')
            capture gen b_red_`in' = `bias_reduction'
            capture gen pval_bef_`in' = `pvalue_bef'
            capture gen pval_aft_`in' = `pvalue_after'
        }
        outsheet b_red* pval* using stats_priv_`d', replace
        psgraph
        graph save priv_support_`d', replace
        graph export priv_support`d'.png, replace
        drop b_red* pval*

  23. „Missing statistics”

  24. Solving the problem of „missing” statistics
     • Look into the „ado” file you are using (the procedure)
     • Throughout the file, there are commands of the form: return scalar x = `somelocal'
     • Sometimes, for clarity, scalars are dropped at the end of the procedure
     • Your preferred statistic (if it is in the output, it has to be at least a local) would simply have to have a line like that too
     • If it does not, you can always generate it based on your preferences and the available locals
     => Modify the original ado file
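What a „return scalar” line does can be seen in a toy r-class program. The program name mymeans, its by() option and the auto data are hypothetical and serve only as an illustration; this is not code from any distributed ado file.

```stata
* Sketch: a toy r-class program that returns statistics it computes
* (program name, option and dataset are illustrative assumptions)
capture program drop mymeans
program define mymeans, rclass
    syntax varname, by(varname)
    tempname m0 m1
    quietly summarize `varlist' if `by' == 0, meanonly
    scalar `m0' = r(mean)
    quietly summarize `varlist' if `by' == 1, meanonly
    scalar `m1' = r(mean)
    * Without these lines the scalars would vanish when the program ends
    return scalar mean0 = `m0'
    return scalar mean1 = `m1'
    return scalar diff  = `m1' - `m0'
end

sysuse auto, clear
mymeans price, by(foreign)
return list          // r(mean0), r(mean1) and r(diff) are now visible
```

Adding a statistic to an existing ado file follows the same logic: find the tempname scalar or local that already holds the value and add one return scalar line for it.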

  25. Solving the problem of „missing” statistics: example 1
     Original ado file, line 380:
        qui foreach v of varlist `varlist' {
            replace _`v' = . if _support==0
            tempname m1t m0t u0u u1u att dif0
            sum `v' if _treated==1, mean
            scalar `u1u' = r(mean)
            sum `v' if _treated==0, mean
            scalar `u0u' = r(mean)
            sum `v' if _treated==1 & _support==1, mean
            scalar `m1t' = r(mean)
            local n1 = r(N)
            sum _`v' if _treated==1 & _support==1, mean
            scalar `m0t' = r(mean)
            scalar `att' = `m1t' - `m0t'
            scalar `dif0' = `u1u' - `u0u'
            return scalar att = `att'
            return scalar att_`v' = `att'
     Modified ado file, line 380:
        qui foreach v of varlist `varlist' {
            replace _`v' = . if _support==0
            tempname m1t m0t u0u u1u att dif0
            * … all the same as earlier, plus:
            return scalar diff = `dif0'
            return scalar diff_`v' = `dif0'
            return scalar mean0 = `u0u'
            return scalar mean0_`v' = `u0u'
            return scalar mean1 = `u1u'
            return scalar mean1_`v' = `u1u'

  26. Solving the problem of „missing” statistics: example 2
     Original ado file, line 440:
        return scalar seatt = `stderr'
        return scalar seatt_`v' = `stderr'
        qui regress `v' _treated
        scalar `ols' = _b[_treated]
        scalar `seols' = _se[_treated]
     Modified ado file, line 440:
        return scalar seatt = `stderr'
        return scalar seatt_`v' = `stderr'
        qui regress `v' _treated
        scalar `ols' = _b[_treated]
        scalar `seols' = _se[_treated]
        return scalar seols = `seols'
        return scalar seols_`v' = `seols'

  27. Problems with bootstrap

  28. Problems with bootstrap
     • Why did we need bootstrap?
       • After estimation, the s.e.'s were relatively large (heterogeneous sample)
       • When we tried bootstrapping, the reduction in the size of the s.e.'s was roughly 50%, while the estimators were essentially unaffected
     • What problems with bootstrap?
       • Need to run it separately for each variable (it bootstraps only one standard error at a time)
       • Output is given in a totally different form
       • It takes a looong time
       • New piece of code for just the BS standard errors => new variable loops within each time loop

  29. Problems with bootstrap
        foreach out in $out $outf1 $outf2 {
            use data, clear
            sum caliper_`d'              // this is where the initial pscore comes in useful
            local c_real = `r(mean)'
            bootstrap r(att): psmatch2 d`d' our_pscore_`d', out(`out') common add mahalanobis(nace) caliper(`c_real')
            matrix mat = e(b), e(se)     // without this, no resultssets
            mat li mat
            svmat mat
            rename mat1 a`d'_diff_after_bs_`out'
            rename mat2 a`d'_se_after_bs_`out'
            gen time_of_event = `d'
            keep se* diff* ttest* mean* time_of_event a*
            drop if _n>1
            save priv_bs_`out'`d', replace
        }
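The step from bootstrap output to a one-row resultsset can be tried without psmatch2. In the sketch below the bootstrapped statistic is a simple difference in means from ttest on the auto data, which is an assumption made so that the example runs with built-in commands; the matrix and svmat part mirrors the slide.

```stata
* Sketch: turning bootstrap results into a dataset (illustrative statistic)
sysuse auto, clear
set seed 12345

* Bootstrap one statistic: the difference in mean price between the groups
bootstrap diff = (r(mu_2) - r(mu_1)), reps(50) nodots: ttest price, by(foreign)

* Point estimate and bootstrap standard error side by side in a 1 x 2 matrix
matrix res = e(b), e(se)
matrix list res

* Matrices survive clear, so the matrix can be turned into a new dataset
clear
svmat double res
rename res1 diff_bs
rename res2 se_bs
list
```

Each such one-row file can then be saved under a name built from the loop macros and combined later with merge or append.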

  30. Final steps
     • Merge the files obtained from bootstrap on „event” (to have a complete resultsset within each „event” period)
     • Merge bootstrap resultssets with
     • Append the files for „event” periods
     • Organise the data
     • Produce tables and graphs (again in loops)
     • Write the paper

  31. The resulting graphs (1)
     • There are 6x3 figures altogether

  32. The resulting graphs (2)
     • There are 6x2 figures altogether

  33. The resulting graphs (3)
     • There are 6x3 figures altogether

  34. Some advice we did not take at the right time
     • Save your computer's time (your wasted time is your problem)
     • Use „sample 10” for testing your procedures: it saves a lot of time
     • Leaving a mess is not useful if you ever want to come back
     • Your memory is shorter than that of saved files: documenting do-files really helps
     • Loops are better than copy & paste, and less messy too
     • Stata is not that complicated: modifying ado-files is really easy if you know what you want

  35. Thank you for your attention!
