Using R and Longitudinal Data to Answer Policy Questions

Jared Knowles, Wisconsin Department of Public Instruction. NCES DC-STATS 2012, Washington DC, July 12, 2012.

Presentation Transcript


  1. Using R and Longitudinal Data to Answer Policy Questions Jared Knowles Wisconsin Department of Public Instruction NCES DC-STATS 2012 Washington DC, July 12, 2012

  2. Overview • Why R? • Examples of R Analyses • Share R code across states • Develop joint methods • Produce reports

  3. Why R? • R is free • R is open source • R is best in class and state of the art • R is free

  4. R works for government • R is transparent • R is reproducible • R is accurate • R works! Today!

  5. R

  6. Google Scholar Hits R has recently passed Stata on Google Scholar hits and it is catching up to the two major players SPSS and SAS

  7. Google Scholar Hits These links come from the explosion of add-on packages to R

  8. R has an Active Web presence Usage of the R listserv for help has really exploded recently

  9. R Examples Read in data
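
The code for this slide did not survive in the transcript. As a minimal sketch of reading data into R, the example below writes a small made-up CSV to a temporary file and reads it back with `read.csv`. The column names (`stuid`, `grade`, `year`, `readSS`) and values are hypothetical and are not taken from the presentation's data.

```r
# Hypothetical data: write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "stuid,grade,year,readSS",
  "1001,3,2010,412",
  "1002,3,2010,398",
  "1003,4,2010,455"
), csv_path)

# Read the file into a data frame; keep text columns as character, not factor.
students <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure: one row per student record, one column per field.
str(students)
```

In practice the path would point at an extract from the data warehouse rather than a temporary file.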

  10. Simple Diagnostics (graphic)

  11. Advanced Diagnostics (graphic)

  12. More advanced Grade 3 Score Cohort Trends

  13. Can this generate information? • Graphics help explain, but are still descriptive • R can help on two fronts: • R can do advanced analytics that provide insight • R can graphically depict those analytics in simple ways that are intuitive to policy makers • Oh yeah…? Prove it. • BLBC study in Wisconsin • Regression Trees • Machine Learning Algorithms

  14. BLBC in Wisconsin • We fit statistical models to all the ESL students in Wisconsin, modeling their future test scores as well as possible through each strategy described above (on all three major assessments) • We also include an indicator of whether the student was eligible for BLBC instruction or not • We compare to see if the average student receiving BLBC did better than the average student receiving other ESL services, all else equal • We report the range of uncertainty around this difference between student groups and compare them to see if any meaningful differences emerge • Because our sample is large, even very small differences are likely to be statistically significant, so we also examine the magnitude of findings to see if they are substantively significant; i.e., is the difference big enough to matter in the lives of students?
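
The general idea of an outcome model with a program indicator and a 95% interval around its coefficient can be sketched in base R. This is not the model used in the study: the data are simulated, the variable names (`prior_score`, `blbc`, `score`) are hypothetical, and a plain linear regression stands in for the estimation strategies the slide refers to.

```r
set.seed(42)
n <- 2000

# Simulated ESL students (illustration only, not Wisconsin data).
esl <- data.frame(
  prior_score = rnorm(n, mean = 400, sd = 40),
  blbc        = rbinom(n, 1, 0.3)   # 1 = eligible for BLBC, 0 = other ESL services
)

# Outcome with a small negative BLBC difference built in by assumption.
esl$score <- 50 + 0.9 * esl$prior_score - 2 * esl$blbc + rnorm(n, sd = 25)

# Model the future score, holding prior achievement constant.
fit <- lm(score ~ prior_score + blbc, data = esl)

# Point estimate and 95% confidence interval for the BLBC indicator.
blbc_est <- coef(fit)["blbc"]
blbc_ci  <- confint(fit, "blbc", level = 0.95)
print(blbc_est)
print(blbc_ci)
```

An interval that excludes 0 indicates statistical significance, but with a sample this large the size of the estimate matters more than whether it clears that bar.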

  15. Evaluations of Policy • Results are presented in effect sizes, or standard deviation units of change in test scores. • 0.1 is small, 0.2 to 0.4 is reasonable and is about a year of education in most cases. Bigger than 0.4 is huge.
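
One common way to express a group difference in standard deviation units is to divide the difference in means by the pooled standard deviation. The sketch below assumes that definition; the slides do not say exactly how the effect sizes were computed, and the function name `effect_size` is hypothetical.

```r
# Standardized mean difference: (mean of treated - mean of comparison) / pooled SD.
effect_size <- function(treated, comparison) {
  n1 <- length(treated)
  n2 <- length(comparison)
  pooled_sd <- sqrt(((n1 - 1) * var(treated) + (n2 - 1) * var(comparison)) /
                      (n1 + n2 - 2))
  (mean(treated) - mean(comparison)) / pooled_sd
}

# Made-up scores for two small groups of students.
d <- effect_size(c(410, 395, 420, 405), c(400, 390, 415, 398))
print(round(d, 2))
```

A value near 0.1 would count as small on the scale described in the slide.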

  16. Results I • The bars represent the 95% confidence interval around the estimate. The VAM model is consistently statistically significant, not overlapping 0, and negative. • The length of the bars represents the uncertainty about the estimate. But the mean effect size is quite small, less than 0.1 standard deviations in most cases. • This represents a year-to-year change in a student’s score between BLBC and non-BLBC instruction.

  17. Results I (graphic)

  18. Results II Language is different. Wisconsin has a large sample of both Hmong and Spanish speakers, and they have different results when analyzed separately.

  19. Conclusions and Next Steps • In Wisconsin BLBC does not have the negative effects found in recent research on California, though a slight negative effect may exist in some cases • Unlike other states where BLBC has been studied, Wisconsin has substantially different results between language groups on mathematics and possibly on English proficiency achievement • There is still a lot of imprecision in the estimates used here and more precision would be helpful, but effects are not substantively large in terms of relative student performance, even in the upper and lower bounds

  20. Next Steps • Get more data over more years and use a more precise estimation technique to reduce uncertainty about effects • Explore the variation across BLBC programs in addition to the mean effect • Learn more about the non-cognitive, non-academic outcomes for BLBC in order to understand the costs and benefits of BLBC programs more fully • Estimate a “treatment-on-the-treated” parameter to more directly compare to prior research • Survey teachers and merge teacher/program practice data with student outcomes to begin exploring the effective components of BLBC and ESL programs

  21. Inference Trees

  22. Inference Tree Example (graphic)
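
The tree from this slide is not preserved in the transcript. As a rough illustration of fitting a tree to student data, the sketch below uses the `rpart` package, which ships with standard R installations. This is a swapped-in technique: `rpart` fits CART-style regression trees, which may differ from the inference-tree method shown in the talk. The data and variable names are simulated and hypothetical.

```r
library(rpart)

set.seed(1)
n <- 500

# Simulated student records (illustration only).
kids <- data.frame(
  prior_score = rnorm(n, mean = 400, sd = 40),
  frl         = factor(sample(c("yes", "no"), n, replace = TRUE)),
  attendance  = runif(n, min = 0.7, max = 1)
)
kids$score <- 0.8 * kids$prior_score +
  ifelse(kids$frl == "yes", -10, 0) +
  50 * kids$attendance +
  rnorm(n, sd = 15)

# Fit a regression tree: recursively split students into groups with similar scores.
tree <- rpart(score ~ prior_score + frl + attendance, data = kids, method = "anova")

# Text summary of the splits; plot(tree); text(tree) would draw the diagram.
print(tree)

# Each student's prediction is the mean score of the leaf they fall into.
pred <- predict(tree, newdata = kids)
```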

  23. R is a powerful platform • Can Standardize and Share / Compare Results • Execute the same code on each other’s data • Compare results • Compare methods of analysis and improve them • Build a professional community • Collaborate

  24. Some code sharing exists • DPI has begun working with the Strategic Data Project at Harvard to prepare their toolkit using R • Currently written in Stata • Goal is to package the toolkit into R functions that can be applied to any dataset that has the required elements • This work has begun with the creation of a few functions in R and some documentation • Can be found online at GitHub [github.com/jknowles/SDP-Toolkit-for-R] http://www.gse.harvard.edu/~pfpie/index.php/sdp/strategic-data-project-the-vision

  25. Race/Ethnicity Example • Consider this example data from the Strategic Data Project Toolkit: • Student 1 and Student 2 in this data have different races in different years • This doesn’t happen often in most of our data systems, but it does happen, especially across different datasets • For research this can cause problems and requires different business rules

  26. What business rules do we use? • Ad hoc and up to the researcher • Need standards • Need commonality • Need consistency • Need tools that make those things easy to do!

  27. What to do • Should this student be declared H, the modal race? • Should this student be declared B, the first occurring race? • Should this student be flagged as inconsistent? • Should this student be coded as multi-racial?

  28. Fix the data • We can do the modal category easily in R using a simple function

  29. Fixing the data in a few simple steps

  31. What happened • We implemented two business rules on over 59,000 observations in a few seconds with a few lines of code • First, the modal race is chosen for students with multiple race categories • If a tie exists (more than 1 mode), we use the most recent race • These business rules can be readily changed, e.g., we could use the first race or a multi-race code for students with multiple modes
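
The two business rules described here (modal race, ties broken by the most recent year) can be sketched in a few lines of base R. This is not the SDP Toolkit code: the function name `modal_race`, the column names, and the records are hypothetical, and missing values are not handled.

```r
# Return the modal race for one student; break ties with the most recent year.
modal_race <- function(race, year) {
  counts <- table(race)
  modes  <- names(counts)[counts == max(counts)]
  if (length(modes) == 1) {
    return(modes)
  }
  # Tie: among the tied categories, keep the one observed most recently.
  tied <- race %in% modes
  race[tied][which.max(year[tied])]
}

# Made-up records: student 1 has a clear mode, student 2 has a tie.
records <- data.frame(
  sid  = c(1, 1, 1, 2, 2, 3),
  year = c(2004, 2005, 2006, 2004, 2005, 2004),
  race = c("B", "H", "H", "W", "B", "A"),
  stringsAsFactors = FALSE
)

# Apply the rule per student, then map the result back onto every row.
clean <- vapply(split(records, records$sid),
                function(d) modal_race(d$race, d$year),
                character(1))
records$race_clean <- unname(clean[as.character(records$sid)])
print(records)
```

Switching to a different rule, such as the first occurring race, only requires replacing the body of `modal_race`.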

  32. What happened (continued) • This script can be run every time data is extracted from the warehouse for analysis • It can be run by every analyst on every machine because R is free and easy to deploy: consistency and repeatability • It can be swapped out for a different business rule, implemented just as easily, for different purposes: flexibility • It can be mapped and translated onto CEDS elements

  33. Next Steps • Once we clean up the data, analytics can be shared • Doing analytics is a simple next step in R • R has best in class machine learning algorithms used to classify data and predict • R is the tool of choice for data science algorithms

  34. A Data Mining Example • If we are interested in pure predictive analytics, R provides hundreds of best in class algorithms and methods to evaluate them • This is done primarily through the caret package, which provides an easy to use framework for comparing these algorithms • These models can be used to predict “classes” of students, predict student scores, or predict anything else of interest
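
The model-comparison idea can be illustrated without any add-on packages. The sketch below is a base R stand-in, not caret itself: it compares two candidate models by 5-fold cross-validated RMSE on simulated data. The helper `cv_rmse` and all variable names are hypothetical; caret's `train` function wraps this kind of resampling behind a common interface for many algorithms.

```r
set.seed(7)
n <- 300

# Simulated data where both predictors matter (illustration only).
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

# Assign each row to one of k folds at random.
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Cross-validated RMSE: fit on k-1 folds, score on the held-out fold, average.
cv_rmse <- function(formula, data, fold) {
  outcome <- all.vars(formula)[1]
  errs <- vapply(sort(unique(fold)), function(i) {
    fit  <- lm(formula, data = data[fold != i, ])
    pred <- predict(fit, newdata = data[fold == i, ])
    sqrt(mean((data[[outcome]][fold == i] - pred)^2))
  }, numeric(1))
  mean(errs)
}

rmse_small <- cv_rmse(y ~ x1, dat, fold)
rmse_full  <- cv_rmse(y ~ x1 + x2, dat, fold)
print(c(small = rmse_small, full = rmse_full))
```

The model with the lower held-out error is preferred; the same loop works for any model that has a `predict` method.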

  35. Do analytics on fixed data

  37. Machine Learning

  38. Predictions

  39. Deviance

  40. Deviance (II)

  41. The best part • R is a programming language and can be used to produce reports • R can produce HTML, PDF, or other formats of reports • Examples: • Dropout risk reports for each high school • NSC reports by school district

  42. The best part (II) • R can do this by simply building a template and running analytics on the appropriate data subset, automatically • R can even e-mail reports, seriously • Once they understand R, your data analysts have access to a full programming language, with flexibility far above and beyond other analytics packages in exchange for a slightly steeper learning curve
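
The template-plus-subset pattern can be sketched with base R alone. The example below writes one small HTML file per school from made-up dropout-risk data. It is only an illustration: the function `write_report`, the column names, and the numbers are hypothetical, and plain `writeLines` stands in for a real reporting tool such as knitr.

```r
# Write reports into a temporary directory so the example leaves no files behind.
out_dir <- file.path(tempdir(), "reports")
dir.create(out_dir, showWarnings = FALSE)

# Made-up predicted dropout risk per student.
risk <- data.frame(
  school       = c("North HS", "North HS", "South HS", "South HS", "South HS"),
  dropout_risk = c(0.10, 0.30, 0.05, 0.20, 0.50),
  stringsAsFactors = FALSE
)

# The "template": fill in school-specific values and write an HTML file.
write_report <- function(school_name, d, out_dir) {
  html <- c(
    "<html><body>",
    sprintf("<h1>Dropout risk report: %s</h1>", school_name),
    sprintf("<p>Students: %d</p>", nrow(d)),
    sprintf("<p>Mean predicted risk: %.2f</p>", mean(d$dropout_risk)),
    "</body></html>"
  )
  file_name <- paste0(gsub("[^A-Za-z0-9]+", "_", school_name), ".html")
  path <- file.path(out_dir, file_name)
  writeLines(html, path)
  path
}

# Run the template once per school on that school's subset of the data.
paths <- vapply(split(risk, risk$school),
                function(d) write_report(d$school[1], d, out_dir),
                character(1))
print(paths)
```

Adding a school to the data adds a report with no change to the code.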

  43. How to learn • Online with tutorials • DPI R Bootcamp in August • PD workshops elsewhere

  44. Online Tutorials • Visit this presentation online and look at these great resources for learning that are free and open source as well • R Features List [www.revolutionanalytics.com/what-is-open-source-r/r-language-features/] • Video Tutorials [http://www.twotorials.com/] • R Tutorials from Around the World [pairach.com/2012/02/26/r-tutorials-from-universities-around-the-world/] • R for SPSS/SAS Users [r4stats.com/add-ons]

  45. DPI R Bootcamp • DPI is offering a bootcamp on R August 2nd and 3rd. • Slots are limited for these two full days of R training. • Training materials will be made available online. As they are developed, they can be viewed at https://github.com/jknowles/r_tutorial_ed. • For more information, visit the website https://sites.google.com/a/dpi.wi.gov/rbootcamp/. • A few slots are still available for this bootcamp, and we will probably do another one soon. Interested? See me.

  46. Session Info This document is produced with knitr version 0.6.3. Here is my session info:

  47. Jared E. Knowles Wisconsin Department of Public Instruction Phone 608-266-3569 eMail jared.knowles@dpi.wi.gov
