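Econometrics and Instrumental Variables
1. What is econometrics used for?
Evaluating economic policies, observing effects, testing economic theories, finding logical relationships, forecasting.
How do we evaluate? How do we test? How do we forecast?
Population: Sample:
2. What do we do in econometrics?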

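A. We are interested in the regression coefficients of one or several variables.
B. Write down the econometric model: model specification and the choice of variables.
C. Obtain unbiased, efficient, and consistent estimators of these coefficients.
3. What can prevent OLS from producing unbiased estimators? Endogeneity.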

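Modern Express (《现代快报》), May 22, 2007: Some educators point out that senior high school students are weak in abstract thinking and argumentation. When writing essays they lapse into emotional exclamations ("ah", "oh") at every turn and find it hard to produce a well-organized, rigorous argument, and this is said to be related to the predominance of women among teachers in basic education.
Youth Daily (《青年报》), May 23, 2007, "How Can Students' Weak Critical Thinking Be Blamed Entirely on Female Teachers": As a secondary school Chinese teacher, I recognize this "emotional at every turn" style of writing all too well, but I cannot agree that "too high a share of female teachers leads to poor abstract thinking." A high share of female teachers during a child's upbringing may produce shortcomings in character and willpower, but it has no necessary connection with the formation of students' abstract thinking and argumentative ability. How strong these abilities are depends, beyond talent and gender, on later influences. The breadth of students' reading and the climate of critical debate in society do a great deal to improve abstract thinking. Our current education system, however, emphasizes imagistic thinking: from primary school through junior high, students are drilled in narrative writing and in expressing emotion. Their extracurricular reading likewise leans toward entertaining, "imagistic" material.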

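More importantly, society has not given children an environment that fosters critical thinking. Film and television, for example, overemphasize gripping stories and emotion; even expository programs rely on imagistic thinking as a medium. CCTV's most popular show, "Lecture Room" (百家讲坛), in effect turns argument into storytelling and entertainment, which to some degree diverts students' attention from rational debate. Improving students' thinking therefore depends on targeted training by educators, especially Chinese teachers, and the formation of a rational climate in society is just as crucial.
In youth I knew not the taste of sorrow, and loved to climb the tower. I loved to climb the tower, and feigned sorrow to compose new verses. Now I have tasted sorrow to the full, I would speak of it but stop. I would speak of it but stop, and say instead: what a cool, fine autumn.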

What is endogeneity?
Gauss-Markov Theorem:
A1: Linear in parameters
A2: Random sampling
A3: Zero conditional mean, E(u|x) = 0
A4: Sample variation in the independent variable (A1 to A4: unbiasedness of OLS)
A5: Homoskedasticity (A1 to A5: OLS is BLUE)
When an independent variable is correlated with the error term, the zero conditional mean assumption does not hold and the variable is said to be endogenous. Sources of endogeneity:
A. Omitted variables
B. Measurement error
C. Simultaneity

Omitting an important factor that is correlated with any of $x_1, \dots, x_k$ causes endogeneity.
What have we learned about endogeneity caused by omitted variables?
A. Proxy variables
B. First differencing (FD)
C. Fixed effects (FE)
FD and FE can be used only when panel data are available, and they estimate the effects of time-varying independent variables in the presence of time-constant omitted variables.
But what if the omitted variables are time-varying, we are interested in the effect of time-constant variables, or we have only cross-sectional data?

Ch15. Instrumental Variables
I. Introduction
1. What is an IV for an endogenous variable?
(1) $\log(wage) = \beta_0 + \beta_1 edu + u$
where ability is omitted and is correlated with education, so edu is an endogenous variable;
(2) $Cov(z, u) = 0$
(3) $Cov(z, edu) \neq 0$
We call z an IV for edu; z is exogenous in equation (1).

A. The IV should have no partial effect on the dependent variable (the IV must not affect y directly);
B. The IV should be related to the endogenous independent variable (the IV must affect the endogenous variable directly);
C. Assumption (2) cannot be tested, but we can test assumption (3) by regressing the endogenous variable (e.g. edu) on the IV in the reduced form equation:
$edu = \pi_0 + \pi_1 z + v$, testing $H_0: \pi_1 = 0$.

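2. Why do we need an IV?
From the equation $y = \beta_0 + \beta_1 x + u$ we get
$Cov(z, y) = \beta_1 Cov(z, x) + Cov(z, u)$.
Then, using $Cov(z, u) = 0$ and $Cov(z, x) \neq 0$, $\beta_1 = Cov(z, y) / Cov(z, x)$, and we get the IV estimator of $\beta_1$, i.e.
$\hat{\beta}_{1,IV} = \dfrac{\sum_{i=1}^{n} (z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n} (z_i - \bar{z})(x_i - \bar{x})}$
If z = x, the IV estimator is simply the OLS estimator.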
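The ratio form of the IV estimator, Cov(z, y)/Cov(z, x), can be checked on simulated data. The sketch below is only an illustration: the data-generating process (coefficients 0.8 and 2.0, standard normal shocks, the name `ability`) is an assumption made up for this example, not something from the slides.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

def simulate(n=20000, beta1=1.0, seed=0):
    """Draw (x, y, z) where x is endogenous because 'ability' is omitted."""
    rng = random.Random(seed)
    x, y, z = [], [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)                   # unobserved, ends up in u
        zi = rng.gauss(0, 1)                        # instrument, independent of ability
        xi = 0.8 * zi + ability + rng.gauss(0, 1)   # x depends on z and on ability
        yi = 1.0 + beta1 * xi + 2.0 * ability + rng.gauss(0, 1)
        x.append(xi)
        y.append(yi)
        z.append(zi)
    return x, y, z

x, y, z = simulate()
beta_ols = cov(x, y) / cov(x, x)   # the special case z = x
beta_iv = cov(z, y) / cov(z, x)    # sample analogue of Cov(z, y) / Cov(z, x)
print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}  (true beta1 = 1.0)")
```

Under these assumed parameters OLS converges to roughly 1.76 rather than 1, because the omitted `ability` term is positively correlated with x, while the IV ratio stays close to the true value.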

In terms of population correlations and standard deviations, the probability limit of the IV estimator is
$plim\, \hat{\beta}_{1,IV} = \beta_1 + \dfrac{Corr(z, u)}{Corr(z, x)} \cdot \dfrac{\sigma_u}{\sigma_x}$
where $\sigma_u$ and $\sigma_x$ are the standard deviations of u and x in the population. From this equation we can see that:
A. Even if Corr(z, u) is small, the inconsistency of the IV estimator can be large if Corr(z, x) is also small;
B. By the law of large numbers, the IV estimator is consistent provided that the IV assumptions are satisfied. In small samples the IV estimator can have a substantial bias, so large samples are preferred when the IV method is used;
C. IV is preferred to OLS on asymptotic bias grounds when |Corr(z, u)/Corr(z, x)| < |Corr(x, u)|, since the plim of the OLS estimator is
$plim\, \hat{\beta}_{1,OLS} = \beta_1 + Corr(x, u) \cdot \dfrac{\sigma_u}{\sigma_x}$
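A small worked example may help with points A and C. The correlation values below are made up for illustration, and $\sigma_u/\sigma_x = 1$ is assumed; the formulas are the standard asymptotic bias expressions for IV and OLS.

```latex
\begin{align*}
\text{asymptotic bias of IV}
  &= \frac{\operatorname{Corr}(z,u)}{\operatorname{Corr}(z,x)}\cdot\frac{\sigma_u}{\sigma_x}
   = \frac{0.02}{0.05}\cdot 1 = 0.40,\\
\text{asymptotic bias of OLS}
  &= \operatorname{Corr}(x,u)\cdot\frac{\sigma_u}{\sigma_x}
   = 0.20\cdot 1 = 0.20 .
\end{align*}
```

Here the instrument is almost exogenous (Corr(z, u) = 0.02), yet because it is weak (Corr(z, x) = 0.05) the IV estimator is asymptotically worse than OLS. With a stronger instrument, say Corr(z, x) = 0.5, the IV bias would fall to 0.02/0.5 = 0.04.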

3. How to find an IV?
Generally, we need an exogenous variable that has no direct effect on the dependent variable, is unrelated to the omitted variable, and is correlated with the endogenous variable.
4. Some examples
IV for education (ability is omitted):
A. The last digit of an individual's identity card number (a bad IV)
B. Mother's or father's education (not good enough)
C. Number of siblings (looks good)

IV for the effect of skipping classes on the final exam score (ability is again omitted):
(4) $score = \beta_0 + \beta_1 skipped + u$
IV: the distance between living quarters and campus; distance is exogenous in equation (4).

5. More examples of IVs:
A. IV for education when ability is omitted: a dummy variable equal to 1 if a man was born in the first quarter of the year (Angrist and Krueger, 1991, QJE, Vol. 106).
B. Acemoglu, Johnson, and Robinson (NBER Working Paper 7771, 2002)
Question: Why have Africa and Australia, New Zealand, Canada, and the US performed so differently?
Omitted variable bias

Using the mortality rate as an IV for institutions:
High mortality rate → not suitable for settlement → extractive state → worse institutions
Low mortality rate → suitable for settlement → Neo-Europes → better institutions
"Stationary bandits" versus "roving bandits"

Hall and Jones (QJE, Vol. 114, 1999): Why does output per worker vary enormously across countries?
They used the distance from the equator and the extent to which the languages of Western Europe are spoken as a mother tongue as IVs for infrastructure.
a. Western European expansion from the sixteenth to the nineteenth century had influence around the world; language (English, Spanish, French, German, Portuguese)
b. Western Europeans were more likely to settle in areas broadly similar in climate to Western Europe, which is far from the equator;
c. Western Europeans were more likely to migrate to and settle in regions that were sparsely populated at the start of the nineteenth century (USA, Canada, New Zealand, Argentina).

6. Self-selection problem:
Angrist (1990) studied the effect of being a Vietnam War veteran on lifetime earnings.
Perhaps the people who get the most out of the military choose to serve, or the decision to serve is correlated with other characteristics that affect earnings.
The draft lottery number is an IV candidate for veteran status:
a. The randomly assigned lottery number is uncorrelated with the error term;
b. Those with a low enough number had to serve in the war, so the probability of being a veteran is correlated with the lottery number.

7. Statistical inference with the IV estimator
Refer to (15.11) to (15.13)
8. R-squared after IV estimation
A. The R-squared from IV estimation can be negative; it has no natural interpretation and cannot be used in the usual way to compute F tests of joint restrictions.
B. Goodness-of-fit is not an important consideration in IV estimation.
9. Stata commands
ivreg y (x1=IV) x2 x3
not: reg y IV x2 x3
xtivreg y (x1=IV) x2 x3

An example of the IV method (1)
Question: Is edu an important determinant of access to bank loans?
Ability is omitted and should be related to education, so edu is potentially endogenous.
Find an IV for edu: the average edu of villagers of the same generation
Peer effect: one takes on the color of one's company (near vermilion one turns red, near ink one turns black)
Regressing edu on average edu: coefficient = 0.4768, standard error = 0.0474, R2 = 0.5476

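Example (2): Agglomeration and Economic Growth
Zhang Yuan et al., "Agglomeration Economies and Economic Growth" (《聚集经济与经济增长》), World Economy (《世界经济》), 2008, No. 3
Issues: model specification, omitted variables, measurement error, simultaneity
pergdp: GDP per capita (production function?)
pgdpgr: growth rate of GDP per capita
agglom: population density (capital density?)
avinv: fixed asset investment per capita (capital stock per capita?)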

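Data: panel data on Chinese prefecture-level cities, 1998-2004
Addressing the endogeneity of agglomeration: IV
agglom = population density
IV for agglom: whether each city had a railway in 1933
Sources: China Railway Handbook (《中国铁道便览》), compiled in 1934 by the Business Department of the Republic-era Ministry of Railways; Bai Shouyi, History of Chinese Transportation (《中国交通史》), 1937
"When the train whistles, gold rolls in by the ton"
1933-1949-1998
xtreg pergdpgr agglom, re
xtreg pergdpgr agglom, fe
xtreg agglom railway33
xtivreg pergdpgr (agglom=railway33)
In fact, the IV method has two steps:
xtreg agglom railway33
predict agglom2
xtreg pergdpgr agglom2
(The standard errors reported by this manual second step are not valid, because they ignore that agglom2 is itself estimated.)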

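II. IV Estimation of the Multiple Regression Model
1. Identification of the structural equation
$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + u_1$
A. $z_1$ cannot be used as an IV for $y_2$, because it already appears in the equation;
B. We need another exogenous variable as an IV for $y_2$, e.g. $z_2$;
Then we have a reduced form equation for $y_2$,
$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + v_2$
Identification condition: $\pi_2 \neq 0$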

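III. 2SLS
1. A single endogenous independent variable with multiple IVs
If two exogenous variables are correlated with an endogenous variable, we could use either one alone as an IV, but this is less efficient than using both.
E.g. in $y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + u_1$ we have two excluded exogenous variables, $z_2$ and $z_3$, that are correlated with $y_2$. The reduced form equation is
$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3 + v_2$, where $E(v_2) = 0$ and $Cov(z_j, v_2) = 0$ for $j = 1, 2, 3$.
Then the best IV for $y_2$ is the linear combination of the exogenous variables,
$y_2^* = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3$
Identification assumption: $\pi_2 \neq 0$ or $\pi_3 \neq 0$ (F test)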

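2. What is 2SLS?
A. First stage: regress $y_2$ on $z_1$, $z_2$, and $z_3$, and obtain the fitted values
$\hat{y}_2 = \hat{\pi}_0 + \hat{\pi}_1 z_1 + \hat{\pi}_2 z_2 + \hat{\pi}_3 z_3$
B. Second stage: regress $y_1$ on $\hat{y}_2$ and $z_1$. Equivalently, use $\hat{y}_2$ as the IV for $y_2$; this gives the 2SLS estimator.
C. Example: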

Why should the other exogenous variables be included in the first-stage regression?
The IV and the endogenous variable must still be correlated after partialling out the effect of the other exogenous variables.
3. 2SLS commands
xtivreg logwage (edu=mothedu fathedu) exp exp2, first
ivreg logwage (edu=mothedu fathedu) exp exp2, first
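The two stages can also be carried out by hand, which makes it visible that the first stage includes the other exogenous variable. The sketch below uses only the Python standard library; the helper names `ols` and `predict`, the variable names, and the simulated data-generating process are assumptions for illustration, not part of the course material. As with any manual second stage, only the point estimate is meaningful here; standard errors would need the usual 2SLS correction.

```python
import random

def ols(y, columns):
    """OLS of y on an intercept plus `columns`; returns [intercept, slopes...]."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(X, y))]
         for r in range(k)]
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

def predict(coefs, columns):
    """Fitted values for coefficients returned by ols()."""
    return [coefs[0] + sum(c * col[i] for c, col in zip(coefs[1:], columns))
            for i in range(len(columns[0]))]

rng = random.Random(1)
n = 5000
z1 = [rng.gauss(0, 1) for _ in range(n)]       # included exogenous variable
z2 = [rng.gauss(0, 1) for _ in range(n)]       # excluded instrument
z3 = [rng.gauss(0, 1) for _ in range(n)]       # excluded instrument
common = [rng.gauss(0, 1) for _ in range(n)]   # shock shared by u1 and v2
y2 = [0.5 * a + 0.7 * b + 0.7 * c + s + rng.gauss(0, 1)
      for a, b, c, s in zip(z1, z2, z3, common)]
y1 = [1.0 + 0.5 * e + 1.0 * a + s + rng.gauss(0, 1)   # true effect of y2 is 0.5
      for e, a, s in zip(y2, z1, common)]

# First stage: y2 on ALL exogenous variables (z1 as well as z2, z3).
y2_hat = predict(ols(y2, [z1, z2, z3]), [z1, z2, z3])
# Second stage: y1 on the fitted values and the included exogenous variable.
b_2sls = ols(y1, [y2_hat, z1])
b_ols = ols(y1, [y2, z1])
print(f"OLS: {b_ols[1]:.3f}  2SLS: {b_2sls[1]:.3f}  (true value 0.5)")
```

In this assumed setup the shared shock makes OLS overstate the coefficient on `y2`, while the two-stage estimate stays near 0.5.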

Example: railway33 and agglom84 as IVs for agglom
xtreg pergdpgr agglom avinvest avfisc avfdi edu privatize east middle year
xtivreg pergdpgr (agglom=railway33 agglom84) avinvest avfisc avfdi edu privatize east middle year
xtivreg pergdpgr (agglom=railway33 agglom84) avinvest avfisc avfdi edu privatize east middle year, first
xtreg agglom railway33 agglom84 avinvest avfisc avfdi edu privatize east middle year
test railway33 agglom84

4. Multiple Endogenous Independent Variables
Order condition for identification of an equation: we need at least as many excluded exogenous variables as there are included endogenous explanatory variables in the structural equation.
xtivreg y (y1 y2=x1 x2) x3 x4 x5

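IV. IV for the Errors-in-Variables Problem
1. IV can also be used to deal with the measurement error problem.
$y = \beta_0 + \beta_1 x_1^* + \beta_2 x_2 + u$, where we observe $x_1 = x_1^* + e_1$ instead of $x_1^*$.
Let's find another measurement of $x_1^*$, say z, with $z = x_1^* + a_1$.
That is to say, z also mismeasures $x_1^*$, but we can assume that $a_1$ and $e_1$ are uncorrelated. So z can be used as an IV to resolve the measurement error problem arising from the mismeasured $x_1$.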

2. Some examples
A. Saving or income information reported by both husband and wife (sometimes problematic)
B. A worker's salary reported by the worker and by the employer (sometimes problematic)
C. Education level reported by the individual and by a brother or sister
Why not use the second measure as a proxy? Biased!

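V. Tests for Endogeneity and Overidentifying Restrictions
1. Test for endogeneity
$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \beta_3 z_2 + u_1$, where $y_2$ may be endogenous.
We have another two exogenous variables, $z_3$ and $z_4$, and the reduced form
$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3 + \pi_4 z_4 + v_2$
$y_2$ is uncorrelated with $u_1$ if and only if $v_2$ and $u_1$ are uncorrelated;
if $v_2$ and $u_1$ are correlated, then $y_2$ and $u_1$ are correlated and $y_2$ is endogenous in the structural equation.
Write $u_1 = \delta_1 v_2 + e_1$ and add the first-stage residuals $\hat{v}_2$ to the structural equation:
$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \beta_3 z_2 + \delta_1 \hat{v}_2 + error$
Estimating this equation by OLS, if $\hat{\delta}_1$ is statistically different from zero, we conclude that $y_2$ is endogenous.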

Example: railway33 as IV for agglomeration
Test for endogeneity:
sort code year
tis year
iis code
xtreg agglom railway33 avinvest avfisc avfdi edu privatize east middle year
predict agglom1
gen u=agglom-agglom1
xtreg pergdpgr agglom u avinvest avfisc avfdi edu privatize east middle year

2. Testing overidentifying restrictions
$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \beta_3 z_2 + u_1$, where $y_2$ may be endogenous.
We have another two exogenous variables, $z_3$ and $z_4$, that can be used as IVs for $y_2$.
We can estimate this equation using only $z_3$ as an IV for $y_2$ and obtain the residuals $\hat{u}_1$.
If $\hat{u}_1$ and $z_4$ are correlated in the sample, $z_4$ is not a valid IV for $y_2$: $z_4$ is correlated with $u_1$, so it is not exogenous and cannot be used as an IV.
We can likewise test whether $z_3$ is correlated with $u_1$, provided that $z_4$ and $u_1$ are uncorrelated.
In other words, we assume that one IV is exogenous and test whether the other one is.

3. Procedure for testing overidentifying restrictions
a. Estimate the structural equation by 2SLS and obtain the residuals, $\hat{u}_1$;
b. Regress $\hat{u}_1$ on all exogenous variables and obtain the R-squared, $R_1^2$;
c. Under the null hypothesis that all IVs are uncorrelated with $u_1$, $nR_1^2 \overset{a}{\sim} \chi_q^2$, where q is the number of IVs from outside the model minus the total number of endogenous explanatory variables;
d. If $nR_1^2$ exceeds (say) the 5% critical value in the $\chi_q^2$ distribution, we reject $H_0$ and conclude that at least one of the IVs is not exogenous.
P821, Table G4: df = 1: 10%: 2.71, 5%: 3.84, 1%: 6.63
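Steps a to d can be sketched on simulated data. Everything specific below is an assumption for illustration: the helper names, the data-generating process, and the parameter `gamma`, which gives the instrument `z3` a direct effect on `y1` and so makes it invalid when nonzero. With two outside IVs and one endogenous regressor, q = 1, so the statistic is compared with the chi-square(1) critical values (3.84 at 5%).

```python
import random

def ols(y, columns):
    """OLS of y on an intercept plus `columns`; returns [intercept, slopes...]."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(X, y))]
         for r in range(k)]
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

def predict(coefs, columns):
    """Fitted values for coefficients returned by ols()."""
    return [coefs[0] + sum(c * col[i] for c, col in zip(coefs[1:], columns))
            for i in range(len(columns[0]))]

def overid_stat(gamma, n=2000, seed=2):
    """n * R^2 from regressing 2SLS residuals on all exogenous variables."""
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rng.gauss(0, 1) for _ in range(n)]
    z3 = [rng.gauss(0, 1) for _ in range(n)]
    common = [rng.gauss(0, 1) for _ in range(n)]   # makes y2 endogenous
    y2 = [0.5 * a + 0.7 * b + 0.7 * c + s + rng.gauss(0, 1)
          for a, b, c, s in zip(z1, z2, z3, common)]
    y1 = [1.0 + 0.5 * e + 1.0 * a + gamma * c + s + rng.gauss(0, 1)
          for e, a, c, s in zip(y2, z1, z3, common)]
    # Step a: 2SLS, then residuals computed with the actual y2 (not fitted values).
    y2_hat = predict(ols(y2, [z1, z2, z3]), [z1, z2, z3])
    b = ols(y1, [y2_hat, z1])
    resid = [yi - (b[0] + b[1] * e + b[2] * a) for yi, e, a in zip(y1, y2, z1)]
    # Step b: regress the residuals on all exogenous variables and get R^2.
    fit = predict(ols(resid, [z1, z2, z3]), [z1, z2, z3])
    mean_r = sum(resid) / n
    ss_tot = sum((r - mean_r) ** 2 for r in resid)
    ss_res = sum((r - f) ** 2 for r, f in zip(resid, fit))
    # Step c: the test statistic n * R^2.
    return n * (1.0 - ss_res / ss_tot)

stat_valid = overid_stat(gamma=0.0)     # both instruments valid
stat_invalid = overid_stat(gamma=1.0)   # z3 affects y1 directly
print(f"valid IVs: {stat_valid:.2f}   invalid IV: {stat_invalid:.2f}")
```

When `gamma` is nonzero the statistic is far above any conventional critical value; when both instruments are valid it behaves like a chi-square(1) draw, so it will still exceed 3.84 in about 5% of samples.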

Example: railway33 and agglom84 as IVs for agglomeration
Test of overidentifying restrictions:
xtivreg pergdpgr (agglom=railway33 agglom84) avinvest avfisc avfdi edu privatize east middle year
predict pergdpgr1
gen u=pergdpgr-pergdpgr1
xtreg u railway33 agglom84 avinvest avfisc avfdi edu privatize east middle year
R2=0.0006, N*R2=0.7854
reg u railway33 agglom84 avinvest avfisc avfdi edu privatize east middle year
N*R2=1309*0.0014=1.83

"Modern Economics Lecture Series", No. 166
Using Genetic Lotteries within Families to Examine the Causal Impact of Poor Health on Academic Achievement
Steven Lehrer, Queen's University and NBER
May 28, 2008, 13:30-15:00, Room 710, School of Economics

4. Hausman Test
H0: the efficient estimator is a consistent and efficient estimator of the true parameters. If it is, there should be no systematic difference between the coefficients of the efficient estimator and those of a comparison estimator that is known to be consistent for the true parameters.
If the p-value is smaller than (say) 10%, we reject H0, conclude that the explanatory variable is endogenous in the structural model, and prefer the IV estimator.
Comparisons:
A. IV to OLS
B. IVFE to FE
C. IVRE to RE
D. IVFE to IVRE
E. FE to RE

Commands for the Hausman test:
1. ivreg borrow age edu edusquare (poverty1= dependentratio) govk govk2 lmoney dummyeast dummywest road
2. est store abc
3. reg borrow age edu edusquare poverty1 govk govk2 lmoney dummyeast dummywest road
4. hausman abc
In small samples the test statistic can turn out negative; in that case refer to suest.

Example of the Hausman test:
xtreg pergdpgr agglom avinvest avfisc avfdi edu privatize east middle year, fe
est store xyz
xtreg pergdpgr agglom avinvest avfisc avfdi edu privatize east middle year, re
hausman xyz
Test: Ho: difference in coefficients not systematic
chi2(5) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 16.59
Prob>chi2 = 0.0053

VI. 2SLS with Heteroskedasticity
1. How to test for heteroskedasticity in the context of 2SLS?
Let $\hat{u}$ denote the 2SLS residuals and let $z_1, \dots, z_m$ denote all the exogenous variables. An asymptotically valid statistic is the F statistic for joint significance in a regression of $\hat{u}^2$ on $z_1, \dots, z_m$. The null hypothesis of homoskedasticity is rejected if the $z_j$ are jointly significant.
2. How to deal with heteroskedasticity in the context of 2SLS?
Weighted 2SLS procedure: divide the dependent variable, the constant, the explanatory variables, and all the IVs by $\sqrt{\hat{h}}$, where $\hat{h}$ denotes the estimated variance.

The Heteroskedasticity Function Must Be Estimated
• FGLS estimator: using the estimate $\hat{h}_i$ instead of $h_i$ in the GLS transformation yields the feasible GLS (FGLS) estimator (model the function h and use the data to estimate its unknown parameters)
• One FGLS:
• Procedure:
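One common way to model h, assumed here rather than taken from the slide, is an exponential variance function. Under that assumption the FGLS steps can be written out as follows; the symbols $\delta_j$ and $\hat{g}_i$ are introduced only for this sketch.

```latex
\begin{align*}
&\text{Assumed model: } \operatorname{Var}(u \mid x) = \sigma^2 h(x),\quad
  h(x) = \exp(\delta_0 + \delta_1 x_1 + \dots + \delta_k x_k). \\
&\text{1. Regress } y \text{ on } x_1, \dots, x_k \text{ and obtain the residuals } \hat{u}_i. \\
&\text{2. Regress } \log(\hat{u}_i^2) \text{ on } x_1, \dots, x_k \text{ and obtain the fitted values } \hat{g}_i. \\
&\text{3. Compute } \hat{h}_i = \exp(\hat{g}_i). \\
&\text{4. Re-estimate the original equation by weighted least squares with weights } 1/\hat{h}_i, \\
&\quad\text{i.e. divide every variable, including the constant, by } \sqrt{\hat{h}_i}.
\end{align*}
```

The exponential form guarantees $\hat{h}_i > 0$, which is why it is a convenient choice; other positive functions of the regressors would serve the same purpose.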

VII. Applying 2SLS to Pooled Cross Sections and Panel Data
1. Applying 2SLS to pooled cross sections
We should usually add time period dummy variables to allow for aggregate time effects.
2. Applying 2SLS to panel data
With panel data at hand, 2SLS can also be combined with first differencing to get consistent estimators in the presence of endogeneity in time-varying explanatory variables.
P512: Example 15.10
The IV must also be time-varying.

Ch16. Simultaneous Equations Models
Simultaneity: when one or more of the explanatory variables is jointly determined with the dependent variable, typically through an equilibrium mechanism, OLS estimators are generally biased and inconsistent.
I. Nature of SEM
1. An example for illustration
Labor supply: (1) $h_s = \alpha_1 w + \beta_1 z_1 + u_1$
Labor demand: (2) $h_d = \alpha_2 w + \beta_2 z_2 + u_2$
Hours worked and the wage are determined simultaneously; we cannot exogenously choose wages and working hours for a random sample of workers.
Equation (1) is the labor supply curve and equation (2) is the labor demand curve, but the data used in either regression are the observed equilibrium hours and wages.

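Equilibrium condition: $h_{is} = h_{id}$
Combining the equilibrium condition with the supply and demand curves, we get
$h_i = \alpha_1 w_i + \beta_1 z_{i1} + u_{i1}$ and $h_i = \alpha_2 w_i + \beta_2 z_{i2} + u_{i2}$
These two equations constitute an SEM with two important features:
A. Given $z_{i1}$, $z_{i2}$, $u_{i1}$, and $u_{i2}$, $h_i$ and $w_i$ are determined by these two equations; $h_i$ and $w_i$ are endogenous variables;
B. $z_{i1}$ and $z_{i2}$ are exogenous variables; without including them in the model, we cannot tell which equation is the supply function and which is the demand function.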

2. More examples:
A. Murder rate and the size of the police force
Question: How much will additional law enforcement decrease the murder rate?
(1) $murdpc = \alpha_1 polpc + \beta_{10} + \beta_{11} incpc + u_1$
The more policemen, the higher the murder rate?
(2) $polpc = \alpha_2 murdpc + \beta_{20} + \text{other factors}$
Equation (1) describes the actions of potential murderers; equation (2) describes the actions of city officials.
B. Housing expenditure and saving (page 529, Example 16.2)
a. Housing and saving are chosen by the same household;
b. This SEM cannot be identified.

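II. Simultaneity Bias in OLS
Let's consider the two-equation structural model:
(1) $y_1 = \alpha_1 y_2 + \beta_1 z_1 + u_1$
(2) $y_2 = \alpha_2 y_1 + \beta_2 z_2 + u_2$
$y_2$ is generally correlated with $u_1$ because of simultaneity, so we say that OLS suffers from simultaneity bias.
Substituting (1) into (2), we get
$(1 - \alpha_2 \alpha_1) y_2 = \alpha_2 \beta_1 z_1 + \beta_2 z_2 + \alpha_2 u_1 + u_2$
A. Only if $\alpha_2 \alpha_1 \neq 1$ can we get the reduced form $y_2 = \pi_{21} z_1 + \pi_{22} z_2 + v_2$;
B. If $\alpha_2 \neq 0$ holds, $y_2$ and $u_1$ are correlated, so $y_2$ is endogenous;
C. If $\alpha_2 = 0$, $y_2$ is not simultaneously determined with $y_1$.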
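The source of the correlation between $y_2$ and $u_1$ can be made explicit. The derivation below assumes the two structural equations take the standard textbook form $y_1 = \alpha_1 y_2 + \beta_1 z_1 + u_1$ and $y_2 = \alpha_2 y_1 + \beta_2 z_2 + u_2$, that $u_1$ and $u_2$ are uncorrelated with each other and with $z_1$, $z_2$, and that $\alpha_2 \alpha_1 \neq 1$; $\sigma_1^2$ denotes $\operatorname{Var}(u_1)$.

```latex
\begin{align*}
y_2 &= \alpha_2(\alpha_1 y_2 + \beta_1 z_1 + u_1) + \beta_2 z_2 + u_2 \\
(1 - \alpha_2\alpha_1)\, y_2 &= \alpha_2\beta_1 z_1 + \beta_2 z_2 + \alpha_2 u_1 + u_2 \\
y_2 &= \pi_{21} z_1 + \pi_{22} z_2 + v_2, \qquad
  v_2 = \frac{\alpha_2 u_1 + u_2}{1 - \alpha_2\alpha_1} \\
\operatorname{Cov}(y_2, u_1) &= \operatorname{Cov}(v_2, u_1)
  = \frac{\alpha_2}{1 - \alpha_2\alpha_1}\,\sigma_1^2
\end{align*}
```

So the covariance is zero exactly when $\alpha_2 = 0$, and otherwise the sign of the OLS asymptotic bias in $\alpha_1$ follows the sign of $\alpha_2/(1 - \alpha_2\alpha_1)$.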

III. Identifying and Estimating SEM
1. Identification in a Two-Equation System
A. A simple supply and demand example
Supply: (1) $q = \alpha_1 price + \beta_1 cattlefoodprice + u_1$
Demand: (2) $q = \alpha_2 price + u_2$
The demand function is identified but the supply function is not: we can use cattlefoodprice as an IV for price in the demand equation, but we have no IV for price in the supply equation.
Cov(cattlefoodprice, price) ≠ 0
Cov(cattlefoodprice, $u_2$) = 0, i.e. cattlefoodprice is uncorrelated with the demand error
Summary: in the system of (1) and (2), it is the presence of an exogenous variable in the supply equation that allows us to estimate the demand equation.

2. A General Two-Equation Example
(3) $y_1 = \beta_{10} + \alpha_1 y_2 + \mathbf{z}_1 \boldsymbol{\beta}_1 + u_1$
(4) $y_2 = \beta_{20} + \alpha_2 y_1 + \mathbf{z}_2 \boldsymbol{\beta}_2 + u_2$
$\mathbf{z}_1$ denotes the set of exogenous variables in equation (3)
$\mathbf{z}_2$ denotes the set of exogenous variables in equation (4)
We assume that certain exogenous variables do not appear in the first equation and others are absent from the second equation, which allows us to distinguish between the two structural equations.
Condition A:
Condition B:
Rank condition: the first equation in a two-equation SEM is identified if, and only if, the second equation contains at least one exogenous variable (with a nonzero coefficient) that is excluded from the first equation.
Order condition for identification of an equation: we need at least as many excluded exogenous variables as there are included endogenous explanatory variables in the structural equation.

3. Other examples
Labor supply of married women (p. 535)
Supply equation:
Demand equation:
A. Identifying the supply equation:
Order condition: two exogenous variables, exp and $exp^2$, are omitted from the supply equation;
Rank condition: at least one of exp and $exp^2$ has a nonzero coefficient in the demand equation;
B. Identifying the demand equation:
Order condition: two exogenous variables, kidslt6 and nwifeinc, are omitted from the demand equation;
Rank condition: at least one of kidslt6 and nwifeinc has a nonzero coefficient in the supply equation.

4. Estimation by 2SLS
Once we have determined that an equation is identified, we can estimate it by 2SLS; the IVs consist of the exogenous variables appearing in either equation.
5. Stata commands
xtivreg logwage (edu = mothedu fathedu) exp exp2
or
reg3 (y1 y2 x1 x2) (y2 y1 x1 x3)

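IV. Complicated SEM
1. Identification
(5) $y_1 = \alpha_{12} y_2 + \alpha_{13} y_3 + \beta_{11} z_1 + u_1$
(6) $y_2 = \alpha_{21} y_1 + \beta_{21} z_1 + \beta_{22} z_2 + \beta_{23} z_3 + u_2$
(7) $y_3 = \alpha_{32} y_2 + \beta_{31} z_1 + \beta_{32} z_2 + \beta_{33} z_3 + \beta_{34} z_4 + u_3$
A. Equation (7) cannot be identified, because it excludes none of the exogenous variables;
B. Order condition: an equation in any SEM satisfies the order condition for identification if the number of exogenous variables excluded from the equation is at least as large as the number of right-hand-side endogenous variables;
C. Rank condition: nonzero coefficients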
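The order condition is a simple counting rule, so it is easy to check mechanically. The sketch below applies it to a hypothetical three-equation system with exogenous variables z1 to z4; the system, the labels `eq5` to `eq7`, and the function name are assumptions for illustration. The order condition is only necessary, so passing this check does not by itself establish identification.

```python
def order_condition(all_exogenous, included_exogenous, rhs_endogenous):
    """True if the excluded exogenous variables are at least as numerous as
    the right-hand-side endogenous variables (necessary, not sufficient)."""
    excluded = set(all_exogenous) - set(included_exogenous)
    return len(excluded) >= len(set(rhs_endogenous))

# Hypothetical system, chosen so that the last equation contains every
# exogenous variable in the model.
exogenous = {"z1", "z2", "z3", "z4"}
system = {
    "eq5": {"rhs_endogenous": {"y2", "y3"}, "included_exogenous": {"z1"}},
    "eq6": {"rhs_endogenous": {"y1"}, "included_exogenous": {"z1", "z2", "z3"}},
    "eq7": {"rhs_endogenous": {"y2"},
            "included_exogenous": {"z1", "z2", "z3", "z4"}},
}

results = {
    name: order_condition(exogenous, eq["included_exogenous"], eq["rhs_endogenous"])
    for name, eq in system.items()
}
print(results)  # {'eq5': True, 'eq6': True, 'eq7': False}
```

The last equation fails because no exogenous variable is left over to serve as an instrument for its endogenous regressor.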

2. Estimation
When a system with two or more equations is correctly specified and certain additional assumptions hold, system estimation methods (3SLS) are generally more efficient than estimating each equation by 2SLS. See Wooldridge (2002), the advanced text.
3. SEM commands:
reg3 (y1 x1 x2) (y2 x1 x3)
reg3 (y1 y2 x1 x2 x3) (y2 y1 x1 x4 x5 x6)
reg3 (y1 x1 x2 x3) (y2 x2 x3 x4 x5) (y3 x4 x5 x6 x7 x8)

V. SEM with Panel Data
Basic approach:
First step: eliminate the unobserved effects from the equation (FD or FE)
Second step: find IVs for the endogenous variables in the transformed equation
An example: "Growth and Poverty in Rural China: the Role of Public Investment", Shenggen Fan, Linxiu Zhang, and Xiaobo Zhang
The purpose of this study is to investigate the causes of the decline in rural poverty in China, and particularly to quantify the specific role that government investments may have played. The authors seek to quantify the effectiveness of different types of government expenditures in contributing to poverty alleviation.
