Chapter 1 Introduction

  1. Chapter 1 Introduction • What are longitudinal and panel data? • Benefits and drawbacks of longitudinal data • Longitudinal data models • Historical notes

  2. 1.1 What are longitudinal and panel data? • With regression data, we collect a cross-section of subjects. • The interest is in comparing characteristics of the subjects, that is, investigating relationships among the variables. • In contrast, with time series data, we identify one or more subjects and observe them over time. • This allows us to study relationships over time, the so-called dynamic aspect of a problem. • Longitudinal/panel data represent a marriage of regression and time series data. • As with regression, we collect a cross-section of subjects. • With panel data, we observe each subject over time. • The descriptor panel data comes from surveys of individuals; a panel is a group of individuals surveyed repeatedly over time.

  3. Example 1.1 - Divorce rates • Figure 1.1 shows the 1965 divorce rates versus AFDC (Aid to Families with Dependent Children) for the fifty states. • The correlation is -0.37. • Counter-intuitive? - we might expect a positive relationship between welfare payments (AFDC) and divorce rates.

  4. Example 1.1 - Divorce rates • A similar figure shows a negative relationship for 1975 (the correlation is -0.425) • Figure 1.2 shows both 1965 and 1975 data, with a line connecting each state • The line represents a change over time (dynamic), not a cross-sectional relationship. • Each line displays a positive relationship - as welfare payments increase so do divorce rates. • This is not to argue for a causal relationship between welfare payments and divorce rates. • The data are still observational. • The dynamic relationship between divorce and AFDC is different from the cross-sectional relationship.

  5. Figure 1.2 1965 and 1975 Divorce rates versus AFDC

  6. Some notation • Longitudinal/panel data - regression data with “double subscripts.” • Let $y_{it}$ be the response for the $i$th subject during the $t$th time period. • We observe the $i$th subject over $t = 1, \ldots, T_i$ time periods, for each of $i = 1, \ldots, n$ subjects. • First subject - $(y_{11}, y_{12}, \ldots, y_{1T_1})$ • Second subject - $(y_{21}, y_{22}, \ldots, y_{2T_2})$ • . . . • The $n$th subject - $(y_{n1}, y_{n2}, \ldots, y_{nT_n})$
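As a minimal sketch of the double-subscript notation, the following Python snippet stores an unbalanced panel as a mapping from each subject to its own list of responses. The numbers and the helper name `response` are hypothetical and serve only as illustration.

```python
# Hypothetical unbalanced panel: subject i contributes T_i observations.
panel = {
    1: [2.1, 2.4, 2.9],       # y_11, y_12, y_13       (T_1 = 3)
    2: [3.0, 3.2],            # y_21, y_22             (T_2 = 2)
    3: [1.5, 1.7, 1.6, 2.0],  # y_31, y_32, y_33, y_34 (T_3 = 4)
}


def response(panel, i, t):
    """Return y_it using the 1-based subject and time indices of the notation."""
    return panel[i][t - 1]


n = len(panel)                               # number of subjects
T = {i: len(ys) for i, ys in panel.items()}  # T_i for each subject
N = sum(T.values())                          # total number of observations
```

Storing one list per subject keeps the layout valid when the $T_i$ differ, which a rectangular array would not allow without padding.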

  7. Prevalence of panel data analysis • Importance in the literature • Panel data are also known as “cross-section time series” data in the social sciences • Referred to as “longitudinal data analysis” in the biological sciences • ABI/INFORM - 326 articles in 2002 and 2003. • The ISI Web of Science - 879 articles in 2002 and 2003. • Important panel databases • Historically, we have: • Panel Study of Income Dynamics (PSID) • National Longitudinal Survey of Labor Market Experience (NLS) • Financial and Accounting • Compustat, CRSP, NAIC • Market scanner databases • See Appendix F

  8. Appendix F. Selected Longitudinal and Panel Data Sets • Table F.1 – 20 International Household Panel Studies • Table F.2 – 5 Studies focused on youth and education • Table F.3 – 4 Studies focused on the elderly and retirement • Table F.4 – 7 miscellaneous studies, including • election data, • manufacturing data, • medical expenditure data and • insurance company data

  9. 1.2 Benefits and drawbacks of longitudinal data • Longitudinal data have several advantages compared to • data that are either purely cross-sectional (regression) or • purely time series data. • Having longitudinal data allows us to: • Study dynamic relationships • Study heterogeneity • Reduce omitted variable bias • With longitudinal data, one can also argue that • Estimators are more efficient • The design addresses the causal nature of relationships • Main drawback - attrition

  10. Dynamic relationships • Static versus dynamic relationships • Figure 1.1 showed a cross-sectional (static) relationship. • We estimate a decrease of 0.95% in divorce rates for each $100 increase in AFDC payments. • Figure 1.2 showed a temporal (dynamic) relationship. • We estimate an increase of 2.9% in divorce rates for each $100 increase in AFDC payments. • From 1965 to 1975, AFDC payments increased an average of $59 and divorce rates increased 2.5%.
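To illustrate how a static and a dynamic slope can carry opposite signs, here is a small sketch on synthetic, noise-free data. These are not the AFDC figures; every number and the helper `ols_slope` are assumptions made for the example. Each subject's response rises with its own x over time, yet subjects with larger x start from lower levels.

```python
def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Synthetic data (NOT the AFDC data): five subjects observed in two periods.
subjects = range(5)
x1 = [100 + 50 * i for i in subjects]                 # period-1 explanatory variable
x2 = [x + 40 + 10 * i for i, x in zip(subjects, x1)]  # period-2 values, all higher
level = [30 - 0.1 * x for x in x1]                    # subject-specific level
y1 = [a + 0.03 * x for a, x in zip(level, x1)]        # within a subject, y rises with x
y2 = [a + 0.03 * x for a, x in zip(level, x2)]

# Static view: one cross-sectional regression per period.
static_slope_1 = ols_slope(x1, y1)
static_slope_2 = ols_slope(x2, y2)

# Dynamic view: regress each subject's change in y on its change in x.
dx = [b - a for a, b in zip(x1, x2)]
dy = [b - a for a, b in zip(y1, y2)]
dynamic_slope = ols_slope(dx, dy)
```

In this construction both cross-sectional slopes are negative while the slope based on changes is positive, which mirrors the qualitative pattern the slide describes.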

  11. Historical approach • In early panel data studies, pooled cross-sectional data were analyzed by • estimating cross-sectional parameters using regression and • using time series methods to model the regression parameter estimates, treating the estimates as known with certainty. • Theil and Goldberger (1961) provide an early discussion on the advantages of estimating these two aspects simultaneously.

  12. Dynamic relationships and time series analysis • When studying dynamic relationships, univariate time series methods are the best developed. • However, these methods do not account for relationships among different subjects. • Multivariate time series methods account for relationships among a limited number of different subjects. • Time series methods require a fair number of observations (generally, at least 30) to make reliable inferences.

  13. Panel data as repeated time series • With panel data, we observe several (repeated) subjects for each time period. • By taking averages over subjects, • our statistics are more reliable • we require fewer time series observations to estimate dynamic patterns. • For repeated subjects, the model is $y_{it} = \alpha + \varepsilon_{it}$, $t = 1, \ldots, T_i$, $i = 1, \ldots, n$. • Here, $\alpha$ is the overall mean and $\varepsilon_{it}$ represents subject-specific dynamic patterns. • “Unfortunately,” we don’t get identical repeated looks. • We hope to control for differences among subjects by introducing explanatory variables, or covariates. • A basic model is $y_{it} = \alpha + x_{it}'\beta + \varepsilon_{it}$, where $x_{it}$ is the explanatory variable. • Introducing explanatory variables leaves us with only subject-specific dynamic patterns, that is, $y_{it} - (\alpha + x_{it}'\beta) = \varepsilon_{it}$.
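The gain from averaging over subjects can be sketched with a short simulation. It assumes an overall mean of 5, independent standard normal disturbances, and a balanced panel of 100 subjects over 200 periods. All of these are hypothetical choices, and real panels typically show dependence over time within a subject.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
n_subjects, n_periods = 100, 200
alpha = 5.0             # overall mean (hypothetical value)

# y[i][t] = alpha + eps_it, with independent standard normal disturbances
y = [[alpha + rng.gauss(0.0, 1.0) for _ in range(n_periods)]
     for _ in range(n_subjects)]

# One subject's series versus the period-by-period average over all subjects
single_series = y[0]
averaged_series = [statistics.fmean(y[i][t] for i in range(n_subjects))
                   for t in range(n_periods)]

sd_single = statistics.stdev(single_series)
sd_averaged = statistics.stdev(averaged_series)
```

Under these assumptions the averaged series should fluctuate with a standard deviation near $1/\sqrt{100}$ of the single-subject value, which is why fewer time points are needed to see a common pattern.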

  14. Heterogeneity • Subjects are unique. • In cross-sectional analysis, we use $y_{it} = \alpha + x_{it}'\beta + \varepsilon_{it}$ • and ascribe the uniqueness to “$\varepsilon_{it}$”. • In panel data, we have an opportunity to model this uniqueness. • The model $y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}$ is • unidentifiable in cross-sectional regression. • In panel data, we can estimate $\beta$ and $\alpha_1, \ldots, \alpha_n$. • Subject-specific parameters, such as $\alpha_i$, provide an important mechanism for controlling heterogeneity of individuals. • Vocabulary: • When $\{\alpha_i\}$ are fixed, unknown parameters to be estimated, we call this a fixed effects model. • When $\{\alpha_i\}$ are drawn from an unknown population, that is, random variables, we call this a model with random effects.
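One common way to estimate the slope together with the subject-specific intercepts is the within (fixed effects) estimator, sketched below for a single explanatory variable. The function name `fixed_effects_fit` and the noise-free data are hypothetical. The slides do not say which estimator they have in mind, so this is one standard choice rather than the author's method.

```python
def fixed_effects_fit(x, y):
    """Within estimator for y_it = alpha_i + beta * x_it + eps_it (scalar x).

    x and y map each subject to its list of observations.
    Returns (beta_hat, {subject: alpha_hat}).
    """
    x_bar = {i: sum(v) / len(v) for i, v in x.items()}
    y_bar = {i: sum(v) / len(v) for i, v in y.items()}
    # Deviations from subject means sweep out the intercepts alpha_i.
    sxy = sum((xv - x_bar[i]) * (yv - y_bar[i])
              for i in x for xv, yv in zip(x[i], y[i]))
    sxx = sum((xv - x_bar[i]) ** 2 for i in x for xv in x[i])
    beta_hat = sxy / sxx
    # Each intercept is recovered from that subject's means.
    alpha_hat = {i: y_bar[i] - beta_hat * x_bar[i] for i in x}
    return beta_hat, alpha_hat


# Hypothetical noise-free panel with true beta = 2 and known intercepts.
x = {1: [1.0, 2.0, 3.0], 2: [4.0, 6.0], 3: [0.0, 1.0, 5.0, 2.0]}
true_alpha = {1: 10.0, 2: -5.0, 3: 0.0}
y = {i: [true_alpha[i] + 2.0 * xv for xv in x[i]] for i in x}

beta_hat, alpha_hat = fixed_effects_fit(x, y)
```

With a single cross-section there is only one observation per subject, so the deviations from subject means are all zero and the slope cannot be computed. That is the sense in which the model is unidentifiable without repeated observations.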

  15. Heterogeneity bias • Suppose that a data analyst mistakenly uses the model $y_{it} = \alpha + x_{it}'\beta + \varepsilon_{it}$ when $y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}$ is the true model. • This is an example of heterogeneity bias, or a problem with data aggregation. • Similarly, one could have different (heterogeneous) slopes $y_{it} = \alpha + x_{it}'\beta_i + \varepsilon_{it}$ • or different intercepts and slopes $y_{it} = \alpha_i + x_{it}'\beta_i + \varepsilon_{it}$
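A small numerical sketch of heterogeneity bias: the data below follow a model with subject-specific intercepts and a common slope of 2, but the analyst pools all observations and fits a single intercept. The data and the helper `ols_fit` are hypothetical. The intercepts are deliberately chosen to fall as a subject's typical x rises.

```python
def ols_fit(x, y):
    """Intercept and slope from a simple least-squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Hypothetical noise-free data: true model y_it = alpha_i + 2 * x_it.
true_beta = 2.0
x = {1: [1.0, 2.0, 3.0], 2: [4.0, 5.0, 6.0], 3: [7.0, 8.0, 9.0]}
alpha = {1: 30.0, 2: 15.0, 3: 0.0}
y = {i: [alpha[i] + true_beta * xv for xv in x[i]] for i in x}

# Misspecified analysis: ignore the subject labels and pool everything.
pooled_x = [xv for i in x for xv in x[i]]
pooled_y = [yv for i in x for yv in y[i]]
pooled_intercept, pooled_slope = ols_fit(pooled_x, pooled_y)
```

Here the pooled slope comes out negative even though every subject's own slope is 2, because the omitted intercepts are correlated with x.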

  16. Omitted variables • Panel data serve to reduce omitted variable bias. • When omitted variables are time constant, we can still get reliable estimates. • Consider the “true” model $y_{it} = \alpha + x_{it}'\beta + z_i'\gamma + \varepsilon_{it}$. • Unfortunately, we cannot measure (or have not thought to measure) $z_i$. • It is “lurking” or “latent.” • By considering the changes $y_{it}^* = y_{it} - y_{i,t-1} = (\alpha + x_{it}'\beta + z_i'\gamma + \varepsilon_{it}) - (\alpha + x_{i,t-1}'\beta + z_i'\gamma + \varepsilon_{i,t-1}) = (x_{it} - x_{i,t-1})'\beta + (\varepsilon_{it} - \varepsilon_{i,t-1}) = x_{it}^{*\prime}\beta + \varepsilon_{it}^*$, • we do not need to worry about the bias that ordinarily arises from the latent variable, $z_i$. • Introducing the subject-specific variable $\alpha_i$ accounts for the presence of many types of latent variables.
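The differencing argument can be checked numerically. The sketch below builds noise-free data from a model with a time-constant latent variable, then estimates the slope from first differences without ever using that variable. The parameter values, data, and the helper `first_differences` are hypothetical.

```python
def first_differences(series):
    """Return the changes series[t] - series[t-1] for t = 1, ..., len - 1."""
    return [b - a for a, b in zip(series, series[1:])]


# Hypothetical noise-free model: y_it = alpha + beta * x_it + gamma * z_i
alpha, beta, gamma = 4.0, 1.5, 3.0
x = {1: [1.0, 2.0, 4.0], 2: [2.0, 5.0, 6.0], 3: [0.0, 3.0, 4.0]}
z = {1: 10.0, 2: -2.0, 3: 7.0}  # latent and time constant; unseen by the analyst
y = {i: [alpha + beta * xv + gamma * z[i] for xv in x[i]] for i in x}

# Differencing within each subject removes both alpha and gamma * z_i.
dx = [d for i in x for d in first_differences(x[i])]
dy = [d for i in x for d in first_differences(y[i])]

# The differenced model has no intercept, so regress through the origin.
beta_hat = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
```

The estimate uses only `x` and `y`, yet it recovers the slope exactly in this noise-free setting because the latent term cancels in every difference.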

  17. Efficiency of Estimators • Subject-specific variables $\alpha_i$ also account for a large portion of the variability in many data sets • This reduces the mean square error • and increases the efficiency (or reduces the standard errors) of our parameter estimators. • With panel data, we generally have more observations than with time series or regression. • A longitudinal data design may yield more efficient estimators than estimators based on a comparable amount of data from alternative designs. • Suppose that the interest is in assessing the average change in a response over time, such as the divorce rate. • A repeated cross-section yields • Longitudinal data design yields
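As a hedged sketch of the comparison the last two bullets point to, assume (this is an assumption, not taken from the slide) that the average change is estimated by $\bar{y}_2 - \bar{y}_1$, the difference between the sample averages of the response in two periods. The standard variance identity for a difference then gives:

```latex
% Repeated cross-section: independent samples are drawn in each period
\operatorname{Var}(\bar{y}_2 - \bar{y}_1)
  = \operatorname{Var}(\bar{y}_1) + \operatorname{Var}(\bar{y}_2)

% Longitudinal design: the same subjects are observed in both periods
\operatorname{Var}(\bar{y}_2 - \bar{y}_1)
  = \operatorname{Var}(\bar{y}_1) + \operatorname{Var}(\bar{y}_2)
    - 2\operatorname{Cov}(\bar{y}_1, \bar{y}_2)
```

When a subject's responses are positively correlated over time, the covariance term is positive, so the longitudinal estimate of the average change has the smaller variance.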

  18. Causality and correlation • Three ingredients necessary for establishing causality, taken from the sociology literature: • A statistically significant relationship is required. • The association between two variables must not be due to another, omitted, variable. • The “causal” variable must precede the other variable in time. • Longitudinal data are based on measurements taken over time and thus address the third requirement of a temporal ordering of events. • Moreover, longitudinal data models provide additional strategies for accommodating omitted variables that are not available in purely cross-sectional data.

  19. Drawbacks: Sampling Design (attrition) • Selection bias • may occur when a rule other than simple random sampling is used to select observational units • Example – “endogenous” decisions by agents to join a labor pool or participate in a social program. • Missing data • Because we follow the same subjects over time, nonresponse typically increases through time. • Example: US Panel Study of Income Dynamics (PSID): • In the first year (1968), the nonresponse rate was 24%. • By 1985, the nonresponse rate was about 50%.

  20. 1.3 Longitudinal data models • Types of inference • Primary. We are interested in the effect that an (exogenous) explanatory variable has on a response, controlling for other variables (including omitted variables). • Forecasting. We would like to predict future values of the response from a specific subject. • Conditional means. • We would like to predict the expected value of a future response from a specific subject. • Here, the conditioning is on latent (unobserved) characteristics associated with the subject. • Types of applications - many

  21. Social science statistical modeling • A model based on data characteristics is known as a sampling-based model. The model arises from a data generating process. • In contrast, a structural model is a statistical model that represents causal relationships, as opposed to relationships that simply capture statistical associations. • Why bother with an extra layer of theory when considering statistical models? Manski (1992) offers three reasons: • Interpretation - the primary purpose of many statistical analyses is to assess relationships generated by theory from a scientific field. • Explanation - structural models utilize additional information from an underlying functional field. If this information is utilized correctly, then in some sense the structural model should provide a better representation than a model without this information. • Extrapolation - particularly for public policy analysis, the goal of a statistical analysis is to infer the likely behavior of data outside of those realized.

  22. Modeling issues • With subject-specific parameters, there can be many parameters that describe the model • “Fixed” versus “random” effects models • Incorporating dynamic structure is important • Econometric “dynamic” models (lagged endogenous) versus serial correlation approach • Linear versus nonlinear (generalized linear) models • Marginal versus hierarchical estimation approaches • Parametric versus semiparametric models • We wish to separate the effects of: • the mean • the cross-sectional variance and • serial correlation structure

  23. 1.4 Historical notes • The term ‘panel study’ was coined in a marketing context when Lazarsfeld and Fiske (1938) considered the effect of radio advertising on product sales. • People who buy a product could be more likely to hear the advertisement, or vice versa. • They proposed repeatedly interviewing a set of people (the ‘panel’) to clarify the issue. • Econometrics • Early economics applications include Kuh (1959), Johnson (1960), Mundlak (1961) and Hoch (1962). • Biostatistics • Wishart (1938), Rao (1959, 1965), Potthoff and Roy (1964) – used multivariate analysis to consider the problem of polynomial growth curves of serial measurements from a single group of subjects. • Grizzle and Allen (1969) – introduced covariates
