
Discipline and Data Management: Good practice

Professor Vernon Gayle, University of Stirling. Overall message: working with a programming “syntax” is essential; do NOT use menus to operate data management software.


Presentation Transcript


  1. Discipline and Data Management: Good practice • Professor Vernon Gayle, University of Stirling

  2. Overall Message • Working with a programming “syntax” is essential • Do NOT use menus to operate data management software

  3. Some Practical Thoughts… “The best habit you can get into is to get into good habits”

  4. The Paper Trail • Ensure that all serious work can be reproduced, i.e. have a clear ‘paper trail’ in place • You might have to reproduce work, or carry out additional work, much later (months or even years), for example after referees’ comments or external examiners’ remarks

  5. The Paper Trail • The platinum standard is that if a research assistant/fellow were killed in a freak accident, the professor could complete the project • See Long (2009), Section 1.1, “Replication: The guiding principle for workflow”

  6. The Paper Trail • The gold standard is that all files and notes are correctly and clearly set out so that they can be passed on to someone without much explanation • This will mean that you and the other members of the research team can follow the paper trail and therefore subsequently reproduce and augment material if required • This is particularly important as referees can often ask for minor, and in the case of some of my work major, amendments to statistical analysis

  7. Why Bother? The payoff from discipline… • Work can be replicated • Work can be transferred (e.g. personnel changes) • Work ultimately becomes quicker • Getting into a muddle less often • Getting out of a muddle quicker • Increasingly important in research governance, openness and ethics

  8. Making A Start • IT IS ESSENTIAL TO KNOW YOUR DATA • This includes understanding how concepts have been operationalised (e.g. via the survey instrument). It is worth thinking about how the survey instrument has been applied. Think about all the tiny nuts and bolts, for example the rubric of questions and how the routing has been worked out. These minor issues may have a major impact on your data • Understanding how variables have been measured and coded is OBVIOUSLY essential. It is also worth getting to know the distribution of variables and some simple measures of central tendency (e.g. means and modes)

  9. Making A Start • Make sure that you are working with the best data available. In the case of the BHPS this will be the most recent release of the data • ALWAYS MAKE BACK-UP FILES - Work with as clean a set of data as possible • Always start with exploratory analysis • EVERY recode, compute, re-labelling task should be documented and be traceable in the paper trail • DON’T START ANALYSES TOO SOON!
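
The point about documenting every recode can be illustrated with a short do-file. This is a minimal sketch, not the author's own workflow: it assumes Stata's shipped example dataset (auto.dta) rather than the BHPS, and the derived variable name rep3 and the log file name are hypothetical.

```stata
* Sketch: make every recode traceable in the paper trail.
* Assumptions: example data auto.dta; rep3 and papertrail_demo.log are hypothetical names.
clear all
set more off
capture log close                                // close any log left open by an earlier run
log using "papertrail_demo.log", replace text    // record everything this do-file does

sysuse auto, clear
tabulate rep78, missing                          // exploratory look before recoding

* Collapse the repair record into three groups; the original variable is left untouched
recode rep78 (1 2 = 1 "Poor") (3 = 2 "Average") (4 5 = 3 "Good"), generate(rep3)
label variable rep3 "Repair record 1978, 3 groups"
notes rep3: derived from rep78 (1-2, 3, 4-5); missing values left as missing

tabulate rep78 rep3, missing                     // check the recode against the original
log close
```

Generating a new variable instead of overwriting the original keeps the source data clean, and the cross-tabulation at the end is a cheap check that the recode did what was intended.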

  10. Some Little Tricks • Always “guesstimate” the output before you formally estimate (i.e. fit) your model. This will help trap errors or indicate when your data is “behaving badly” • Always have a notebook handy (or use notepad or your word processor) to help with the paper trail • Keep a calculator handy • If a job is incomplete keep a record. For example I frequently e-mail myself at the end of the day so that I am reminded the next time I log on

  11. A ‘Take-Home’ Message • REMEMBER – REAL DATA IS MUCH MORE MESSY, BADLY BEHAVED, HARD TO INTERPRET ETC. THAN THE DATA USED IN BOOKS AND AT WORKSHOPS

  12. Which Software For Data Handling? • You need software that can: • Help you keep track, with a clear and concise programming syntax • Merge data sets (e.g. waves of a panel survey) • Match data (e.g. individuals to households) • Easily recode variables (often repeatedly) • Provide exploratory data analysis facilities • Be flexible, and store, report and compare results • Allow you to re-run jobs later (sometimes months later)
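
Matching individuals to households, as listed above, might look like the following in Stata. This is a sketch under stated assumptions: the toy data are typed in below, the variable names hid, pid, region and age are hypothetical, and the `merge m:1` syntax requires Stata 11 or later (older releases used a different merge syntax).

```stata
* Sketch: match individual records to household records.
* Assumptions: toy data; hid, pid, region and age are hypothetical names.
clear all
set more off

* Household-level file: one row per household
input hid region
1 1
2 2
3 1
end
tempfile households
save "`households'"

* Individual-level file: several people can share a household
clear
input pid hid age
101 1 34
102 1 36
201 2 58
401 4 22
end

* Many individuals to one household, matched on the household identifier
merge m:1 hid using "`households'"
tabulate _merge                 // always inspect which records matched
list, sepby(hid)
```

Here person 401 has no household record and household 3 has no individuals, so `_merge` flags both as unmatched. Inspecting `_merge` before dropping anything is the kind of check that belongs in the paper trail.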

  13. Which Software For Data Management? • Stata: Very good • SPSS: Suitable (when syntax is used) • Excel: Inadequate for large-scale datasets • R: Needs too much programming • SAS: Needs too much programming • MLwiN: Difficult to do data management • Mplus: Difficult to do data management

  14. Which Software For Data Management & Survey Research? • In our view, Stata has excellent data management facilities, and the majority of mainstream social science data analysis techniques can be undertaken in Stata

  15. Which Software For Data Management? • If you need to use a specialist data analysis package (e.g. SABRE, MLwiN) you might need to do your data preparation (i.e. data management) in Stata first • This is a common approach but it can still be problematic because of the usual iterative cycle of data analysis • Many successful researchers are happy to settle for the functionality of Stata to avoid this problem

  16. Main Window – Where main activity takes place, and output is reported

  17. Review Window – We do not recommend using Stata in an ‘interactive mode’ so this window will not report too much

  18. Variables Window – This window contains information about your dataset

  19. Viewing data – Data is in a spreadsheet format • Some people prefer to use the list command (list in 1/10) • However, others view the data in a spreadsheet using the data edit or data browse buttons

  20. The Stata Do-File Editor – Some people prefer to use a text editor or programming tool
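
As an illustration of viewing data from syntax rather than buttons, the commands below list the first ten rows. This is a small sketch that assumes Stata's shipped example dataset (auto.dta); the variables named are from that dataset, not from the course files.

```stata
* Sketch: look at the data from a do-file.
* Assumption: example data auto.dta stands in for the survey data.
sysuse auto, clear
describe, short                      // number of observations and variables
list make price mpg in 1/10          // first ten rows, selected variables
list make price if price > 10000, clean noobs
* browse                             // read-only spreadsheet view (interactive sessions only)
```

Keeping `list` commands in the do-file means the same rows are shown again whenever the job is re-run.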

  21. Example of a (.do) syntax file – This file has good clear annotation

      *******************************************************.
      *** Longitudinal Data Analysis for Social Science Researchers
      **
      ** ESRC Researcher Development Initiative training programme:
      **
      ** Training materials lab 1:
      ** INTRODUCTORY LONGITUDINAL DATA ANALYSIS AND DATA MANAGEMENT -
      ** 5 APPROACHES TO QUANTITATIVE LONGITUDINAL DATA ANALYSIS .
      **
      ** www.longitudinal.stir.ac.uk
      ** Paul Lambert / Vernon Gayle, 3 August 2008
      *******************************************************.
      **** Stata VERSION
      *******************************************************.

      *******************************************************.
      ** The file below covers introductory examples of five approaches to
      ** quantitative longitudinal data analysis:
      **
      *** Section 1: Repeated cross-sectional survey data
      *** Section 2: Panel survey data
      *** Section 3: Cohort study survey data
      *** Section 4: Event history survey data
      *** Section 5: Time series statistical data
      **
      *******************************************************.

      *******************************************************.
      ** GENERAL INSTRUCTIONS ON THESE FILES
      **
      ** Work through this file in the interactive do-file editor, replicating
      ** the Stata do-file commands. Further help on working with Stata is
      ** available from the LDA web site.
      **
      *** This lab file assumes you have a number of files downloaded to your
      ** machine. You will need the following:
      *
      ** 1) Downloadable from the LDA site :
      *   - gb91soc2000.dat (this is used during variable constructions for the LFS exercise)
      *
      ** 2) Downloadable from the UK Data Archive:
      *   - ssa02.dta, ssa01.dta, ssa00.dta and ssa99.dta
      *     (Scottish social attitudes 2002, 2001, 2000, 1999,
      *     Stata datasets for study numbers 4808, 4804, 4503, 4346, Stata format files)
      *
      *   - lfs1991.dta, qlfsja96.dta, qlfsja01.dta
      *     (Labour Force Surveys mid 1991, 96, 2001 respectively,
      *     Stata datasets from study numbers 2875, 3647, 4448)
      *     (in previous editions of this exercise, the 1991 data was called f87511.dta)
      *
      *   - All BHPS Waves 1-15 component files in Stata format (UK Data Archive Study number
      *     5151 (June 2007 release) (extracted from the zip file 5151Stata8.ZIP)
      *     (warning - these are a large volume of files, ~153 different files, ~ 600MB)
      *
      *   - The six Stata format 'episode' files from the BHPS Derived life history files
      *     (UKDA study number 3954, 5th Edition) (covering waves 1-14 only) (you want to access
      *     the 3 data files on the top directory of the 3954*.zip archive,
      *     called newpan.dta, xlempe.dta, and xljobe.dta, plus the 3 files in the
      *     'episode' subfolder of the 3954*.zip archive, called l*.dta)
      *
      *   - 2364a.dta (National Child Development Study teaching dataset 1958-1981,
      *     UKDA study number 2364, Stata data files from the zip archive 2364Stata6.ZIP)
      *
      *******************************************************.

  22. Stata – Some more general points

  23. Stata Software – good points • Does all the simple stuff (like SPSS) • Fits many more models than standard software (esp. longitudinal) • Specialist survey analysis functions (svy) • Increasingly important for complex datasets (e.g. UKHLS) • You can get started easily (menus and help) • Strong documentation • There is a growing user community (lists etc.) • New features emerge almost daily • There are good labour market opportunities (UK little known; USA well known)

  24. Stata Software – bad points • People are more used to the look of SPSS • Stata syntax has some quirks (e.g. set more off, defining labels, missing values etc.) and a few unexpected limitations (e.g. constraints on display of tables of means) • There are some esoteric models that can’t be fitted (also some critiques of estimation procedures and their speed) • There is a growing user community, but they are generally GEEKBOYS (like myself!) • New features emerge almost daily; these are sometimes tricky to get to grips with

  25. Using Stata – some user tips (see Treiman (2009) p.84) • Session settings: set more off; set mem 64M • ‘Capture’ command: capture drop varname before gen varname • See values + labels: numlabel _all, add • File info: codebook • Data overwrite: use data1.dta, clear; save data2.dta, replace • Processing commands: run a whole batch file with do (don’t double-click ‘do’ files); use the interactive do-file editor (like the SPSS syntax window) • Line breaks with /// • Define file locations: global path1 "c:\data\lda\"; use $path1\file1.dta, clear • Browsing data: list in ???
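
The tips above can be combined into the opening lines of a do-file. The following is a hedged sketch, not the slide author's own template: it uses Stata's shipped example dataset (auto.dta) and a temporary file in place of the data1.dta and data2.dta named on the slide, and the variable name heavy is hypothetical. The `set mem` line is left out because recent Stata releases manage memory automatically.

```stata
* Sketch: session settings and housekeeping at the top of a do-file.
* Assumptions: example data auto.dta; heavy is a hypothetical derived variable.
clear all
set more off                         // stop output pausing at --more--

sysuse auto, clear                   // 'clear' discards any data already in memory
numlabel _all, add                   // show numeric codes alongside value labels
codebook foreign                     // file and variable information

capture drop heavy                   // drop if it exists; no error if it does not
generate heavy = weight > 3000 ///
    if !missing(weight)              // one command continued over two lines with ///

tempfile data2
save "`data2'", replace              // 'replace' overwrites an existing file of that name
list make weight heavy in 1/5
```

Using `capture drop` before `generate` lets the whole do-file be re-run from the top without stopping on a "variable already defined" error.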

  26. Taking Stata further • Online resources • Stata website for FAQs, manual, training • Net use and update • Specialist modelling suites • XT – Cross sectional panel • ST – Survival data • SV – Survey data • Xtmixed - Multilevel models (v9) • GLLAMM • Programming: .do; .ado; macros
