
Replicating Results: Procedures and Pitfalls

June 1, 2005

The JMCB Data Storage and Evaluation Project

  • Project summary

    • Part 1: In July 1982, the JMCB began requesting programs and data from authors

    • Part 2: Attempted replication of published results using the submitted programs and data

  • Results from Part 2 are reviewed in:

    Dewald, Thursby, and Anderson, “Replication in Empirical Economics: The Journal of Money, Credit and Banking Project,” The American Economic Review, September 1986

The JMCB Data Storage and Evaluation Project/ Dewald et al

  • The paper focuses on Part 2

    • How people responded to the request

    • Quality of the data that was submitted

    • The actual success (or lack thereof) of replication efforts

The JMCB Data Storage and Evaluation Project/ Dewald et al

  • Three groups:

    • Group 1: Papers submitted and published prior to 1982. These authors did not know upon submission that they would be subsequently asked for programs/data.

    • Group 2: Authors whose papers were accepted for publication beginning July, 1982

    • Group 3: Authors whose papers were under review beginning July, 1982

Summary of Responses/Datasets Submitted (Dewald et al., p. 591)

Summary of Examined Datasets (Dewald et al., pp. 591-592)

“Our findings suggest that inadvertent errors in published empirical articles are a commonplace rather than a rare occurrence.” – Dewald et al, page 587-588

“We found that the very process of authors compiling their programs and data for submission reveals to them ambiguities, errors, and oversights which otherwise would be undetected.” – Dewald et al, page 589

Raw data to finished product

Raw data -> Analysis data -> Finished product

Raw Data -> Analysis Data

  • Always keep two distinct data files: the raw data and the analysis data

  • A program should completely re-create the analysis data from the raw data

  • NO interactive changes! Every change to the data, including the final ones, must go in a program!
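
As a minimal sketch of the rule above, the following Python snippet rebuilds the analysis data from the raw data entirely in code and records each change it makes. The data, the variable names, and the rule that a negative income is a missing-value code are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical raw data; in a real project this would be a read-only file under raw/.
RAW_CSV = """id,income,age
1,52000,34
2,-9,41
3,61000,29
"""

def build_analysis_rows(raw_text):
    """Re-create the analysis data from the raw data, recording every change."""
    rows, changes = [], []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        income = float(rec["income"])
        if income < 0:
            # Assumption for this sketch: a negative income is a missing-value code.
            changes.append(f"id {rec['id']}: income {income:g} recoded to missing")
            income = None
        rows.append({"id": int(rec["id"]), "income": income, "age": int(rec["age"])})
    return rows, changes

analysis_rows, change_log = build_analysis_rows(RAW_CSV)
for line in change_log:
    print(line)
```

Because the raw text is never modified and every change lives in the function, rerunning the program always yields the same analysis data.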

Raw Data -> Analysis Data

  • Document all of the following:

    • Outliers?

    • Errors?

    • Missing data?

    • Changes to the data?

  • Remember to check:

    • Consistency across variables

    • Duplicates

    • Individual records, not just summary stats

    • “Smell tests”
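
The checks listed above could be scripted along the following lines. This is only a sketch in Python: the records, the variable names, and the one-year tolerance in the consistency check are hypothetical.

```python
from collections import Counter

# Hypothetical analysis records; None marks a missing value.
records = [
    {"id": 1, "birth_year": 1970, "survey_year": 2004, "age": 34},
    {"id": 2, "birth_year": 1963, "survey_year": 2004, "age": 14},    # age disagrees with birth year
    {"id": 2, "birth_year": 1963, "survey_year": 2004, "age": 14},    # duplicate id
    {"id": 3, "birth_year": 1975, "survey_year": 2004, "age": None},  # missing age
]

def check_records(rows):
    """Return ids that fail the duplicate, missing-data, and consistency checks."""
    id_counts = Counter(r["id"] for r in rows)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    missing_age = sorted({r["id"] for r in rows if r["age"] is None})
    # Consistency across variables: reported age should be within one year
    # of survey_year - birth_year (the tolerance is an assumption).
    inconsistent = sorted({
        r["id"] for r in rows
        if r["age"] is not None
        and abs((r["survey_year"] - r["birth_year"]) - r["age"]) > 1
    })
    return {"duplicates": duplicates, "missing_age": missing_age, "inconsistent": inconsistent}

report = check_records(records)
for check, ids in report.items():
    print(f"{check}: ids {ids}")
```

The report lists individual record ids rather than summary statistics, so each flagged record can be inspected by hand.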

Analysis Data -> Results

  • All results should be produced by a program

  • Program should use analysis data (not raw)

  • Have a “translation” of raw variable names -> analysis variable names -> publication variable names
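
One way to keep such a translation is a single table that programs and documentation both read from. The sketch below uses Python, and every variable name in it is invented for illustration.

```python
# (raw name, analysis name, publication name); all names are hypothetical.
VARIABLE_MAP = [
    ("Q12", "hh_income", "Household income"),
    ("Q03", "age_head", "Age of household head"),
]

RAW_TO_ANALYSIS = {raw: ana for raw, ana, _ in VARIABLE_MAP}
ANALYSIS_TO_PUBLICATION = {ana: pub for _, ana, pub in VARIABLE_MAP}

def publication_name(raw_name):
    """Follow a raw variable name through to the label used in the paper."""
    return ANALYSIS_TO_PUBLICATION[RAW_TO_ANALYSIS[raw_name]]

# Print the full translation so it can be pasted into the documentation.
for raw, ana, pub in VARIABLE_MAP:
    print(f"{raw:<6} -> {ana:<10} -> {pub}")
```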

Analysis Data -> Results

  • Document:

    • How were variances estimated? Why?

    • What algorithms were used and why? Were results robust?

    • What starting values were used? Was convergence sensitive?

    • Did you perform diagnostics? Include in programs/documentation.
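
To illustrate why starting values are worth documenting, the Python sketch below runs a plain gradient descent from several starting values on a toy objective with two local minima. The objective, step size, and tolerance are assumptions chosen for the example; they stand in for whatever estimator and optimizer a project actually uses.

```python
def gradient(x):
    # Derivative of the toy objective x**4 - 3*x**2 + x, which has two local minima.
    return 4 * x**3 - 6 * x + 1

def minimize(start, step=0.01, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent; returns a record suitable for the log file."""
    x = start
    for iteration in range(1, max_iter + 1):
        g = gradient(x)
        if abs(g) < tol:
            return {"start": start, "solution": x, "converged": True, "iterations": iteration}
        x -= step * g
    return {"start": start, "solution": x, "converged": False, "iterations": max_iter}

STARTING_VALUES = [-2.0, 0.0, 0.5, 2.0]
runs = [minimize(s) for s in STARTING_VALUES]

for run in runs:
    print(f"start={run['start']:5.1f}  solution={run['solution']:8.4f}  "
          f"converged={run['converged']}  iterations={run['iterations']}")

# More than one distinct solution means the result depends on the starting value.
distinct_solutions = sorted({round(r["solution"], 2) for r in runs})
print("distinct solutions:", distinct_solutions)
```

Here the runs settle on two different solutions depending on where they start, which is exactly the kind of sensitivity the documentation should report.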

Thinking ahead

  • Delete or archive old files as you go

  • Use a meaningful directory structure

    (/raw, /data, /programs, /logfiles, /graphs etc.)

  • Use relative pathnames

  • Use meaningful variable names

  • Use a script to sequentially run programs
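
A small Python sketch of the directory layout and relative-path advice follows. The subdirectory names come from the list above; the helper functions and the file name are hypothetical.

```python
import tempfile
from pathlib import Path

SUBDIRS = ["raw", "data", "programs", "logfiles", "graphs"]

def make_project_tree(root):
    """Create the standard subdirectories under the project root."""
    root = Path(root)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root

def project_path(root, *parts):
    """Build every path from the project root so the project can be moved intact."""
    return Path(root).joinpath(*parts)

with tempfile.TemporaryDirectory() as tmp:
    root = make_project_tree(Path(tmp) / "project")
    created = sorted(p.name for p in root.iterdir())
    # Relative to the root, this path is the same wherever the project lives.
    relative_data_file = project_path(root, "data", "analysis.csv").relative_to(root)
    print(created)
    print(relative_data_file.as_posix())
```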

Example script to sequentially run programs

#!/bin/csh
# File location: /u/machine/username/project/scripts/myproj.csh
# Author: your name
# Date: 9/21/04
# This script runs a do-file in Stata, which produces and saves a .dta file
# in the data directory. Stat/Transfer converts the .dta file to .sas7bdat
# and saves it in the data directory. The SAS program is then run on
# the new SAS data file.
cd /u/machine/username/project/
stata -b do programs/
# $file is assumed to have been set earlier; it is not defined in this script
st data/H00x_B.dta data/$file.sas7bdat
sas programs/
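
The same idea can be sketched in Python for projects that do not use csh. In this hypothetical driver, each step is a placeholder command that only prints a message; in practice the commands would be the Stata, Stat/Transfer, and SAS calls. The driver stops at the first step that fails, so later programs never run on stale data.

```python
import subprocess
import sys

# Hypothetical pipeline steps; each command is a stand-in for a real program call.
STEPS = [
    ("build analysis data", [sys.executable, "-c", "print('step 1 done')"]),
    ("estimate models", [sys.executable, "-c", "print('step 2 done')"]),
]

def run_pipeline(steps):
    """Run each step in order and stop at the first failure."""
    completed = []
    for name, cmd in steps:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"step '{name}' failed: {result.stderr.strip()}")
        completed.append((name, result.stdout.strip()))
    return completed

completed = run_pipeline(STEPS)
for name, output in completed:
    print(f"{name}: {output}")
```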

Log files

  • Your log file should tell a story to the reader.

  • As you print results to the log file, include words explaining the results

  • Don’t output everything to the log file; use quietly and noisily in a meaningful way.

  • Include not only what your code is doing, but your reasoning and thought process
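
As a rough Python analogue of these points, the sketch below uses the standard logging module: DEBUG messages are suppressed (loosely comparable to quietly in Stata), while INFO messages carry both the result and the reasoning. The data and the wording of the messages are invented for illustration.

```python
import io
import logging

log_stream = io.StringIO()  # stands in for a file under logfiles/
logger = logging.getLogger("replication_log_demo")
logger.setLevel(logging.INFO)  # DEBUG output is suppressed
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)
logger.propagate = False

incomes = [52000.0, 61000.0, 48000.0]  # hypothetical values
mean_income = sum(incomes) / len(incomes)

logger.debug("raw values: %s", incomes)  # too detailed for the story, so not shown
logger.info("Computing mean income over the %d households with non-missing income.",
            len(incomes))
logger.info("Mean income = %.1f. Households with missing income were dropped "
            "rather than imputed, to match the sample definition.", mean_income)

log_text = log_stream.getvalue()
print(log_text)
```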

Project Clean-up

  • Create a zip file that contains everything necessary for complete replication

  • Delete/archive unused or old files

  • Include any referenced files in zip

  • When you have a final zip archive containing everything:

    • Extract it into its own directory and run the script

    • Check that all the results match
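
Part of this check can be automated. The Python sketch below builds a zip archive from a small stand-in project, extracts it into a fresh directory, and compares file hashes. The file names and contents are hypothetical, and the sketch only verifies that the archive reproduces the files; rerunning the script in the fresh copy and comparing the results is still a separate step.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path

def file_hashes(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*")) if p.is_file()}

def make_archive(project_dir, archive_path):
    """Zip every file under project_dir using paths relative to the project root."""
    project_dir = Path(project_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in sorted(project_dir.rglob("*")):
            if p.is_file():
                zf.write(p, p.relative_to(project_dir).as_posix())

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    project = tmp / "project"
    (project / "programs").mkdir(parents=True)
    (project / "data").mkdir()
    (project / "programs" / "run_all.txt").write_text("placeholder for the run script\n")
    (project / "data" / "analysis.csv").write_text("id,income\n1,52000\n")

    archive = tmp / "replication.zip"
    make_archive(project, archive)

    fresh = tmp / "fresh_copy"  # the archive is opened in its own directory
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(fresh)

    archive_matches = file_hashes(project) == file_hashes(fresh)
    print("archive reproduces project files:", archive_matches)
```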

When there are data restrictions…

  • Consider releasing:

    • the subset of the raw data used

    • your analysis data as opposed to raw data

    • (at a minimum) notes on process from raw to analysis data PLUS everything pertaining to the data analysis

  • Consider keeping “internal” and “external” versions of your log file:

    • Do this via a variable at the top of your programs:

      local internal = 1

      list if `internal' == 1
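
The same switch can be sketched in Python. Here a hypothetical internal flag decides whether record-level listings, which may be restricted, are written alongside the summary results; the records are invented for illustration.

```python
# Hypothetical records; record-level values are assumed to be restricted.
records = [{"id": 1, "income": 52000}, {"id": 2, "income": 61000}]

def log_lines(rows, internal):
    """Summary lines always appear; record listings only in the internal version."""
    mean_income = sum(r["income"] for r in rows) / len(rows)
    lines = [f"N = {len(rows)}", f"mean income = {mean_income:.1f}"]
    if internal:
        lines += [f"id={r['id']} income={r['income']}" for r in rows]
    return lines

external_log = log_lines(records, internal=False)
internal_log = log_lines(records, internal=True)

print("\n".join(external_log))
```

Both versions come from the same program, so the external log cannot drift out of step with the internal one.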

Ethical Issues

  • All authors are responsible for proper clean-up of the project

  • Extremely important whether or not you plan on releasing data and programs

  • Motivation

    • self-interest

    • honest research

    • the scientific method

    • allowing others to be critical of your methods/results

    • furthering your field

Ethical Issues – for discussion

  • What if third-party redistribution of data is not allowed?

  • Solutions for releasing data while protecting your time investment in data collection

  • Is it unfair to ask people to release data after a huge time investment in the collection?
