
For the e-Stat meeting of 27 Sept 2010

These slides describe tools for metadata collection and for the storage and organisation of data resources in a generic way, together with a fusion tool for merging data and recoding/standardising variables. They also cover integration with e-Stat for model-building and pre-analysis adjustments, and collaboration between e-Stat and the DAMES Node services.


Presentation Transcript


  1. For the e-Stat meeting of 27 Sept 2010 Paul Lambert / DAMES Node inputs

  2. 1) Progress updates • DAMES Node services of a hopefully generic/transferable nature • GESDE services on occupations, educational qualifications and ethnicity (www.dames.org.uk) • Data curation tool • Data fusion tool for merging data and recoding/standardising variables

  3. GESDE: online services for data coordination/organisation • Tools for handling variables in social science data • Recoding measures; standardisation/harmonisation; linking; curating • DIR workshop: Handling Social Science Data

  4. The data curation tool The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way

  5. Fusion tool (invoking R) - scenarios

  6. Currently: Expected inputs to e-Stat, Autumn 2010 First applications in integrating DAMES data preparation tools with e-Stat model-building systems • {Coordination/planning on WP1.6 workflow tools for pre-analysis} (?De Roure, McDonald, Michaelides, Lambert, Goldstein, Southampton RA?) • Template construction with applications using variable recodes and other pre-analysis adjustments from DAMES systems with view to generating generic template facilities • Preparation of some ‘typical’ example survey data/models (e.g. 10k+ cases, 50+ variables) and their implementation in e-Stat e.g. Cross-national/longitudinal comparability examples • Possible e-Stat inputs to DAMES workshops (Nov 24-5/Jan 25-6)

  7. 7a) Links with DAMES • DAMES Node core funding period is Feb 2008 – Jan 2011 • Further discussion of integrating pre-analysis services from DAMES into e-Stat facilities and templates • Appetite for other application-oriented contributions? • Alternative measures for the ‘changing circumstances during childhood’ application? • Preparation of illustrative application(s) with complex survey data? • Would need data spec. and broad analytical plan

  8. Pre-analysis options associated with DAMES Things that could be facilitated by the fusion tool (R scripts) in combination with the curation tool and if relevant specialist data (e.g. from GESDE) • Alternative measures/derived data • [via deterministic matches/variable transformation routines] • Using GESDE: Occupations, educational quals, ethnicity • (?Health oriented measures using Obesity e-Lab dropbox facility?) • Generic routines: Arithmetic standardisation tools • Replicability of measurement construction (e.g. syntax log of tasks) • Other possible data/review possibilities • [new but easy] Routine for summarizing data (see wish list) • [new, probably not easy] Weighting data options; routine for identifying values with high leverage / high residuals • (?provided elsewhere) Probabilistic matching routines

  9. Model for data locations? • ‘Curation tool’ can be used to attach variable names and metadata to facilitate variable processing • We then have a model of storing the data in a secure remote location (an iRODS server), from where jobs can be run on it (e.g. in R) • Is this a suitable model for e-Stat? • Is there another data location model? • Or better to supply scripts to run on files in an unspecified location?

  10. Fusion tool (invoking R) - scenarios

  11. Mechanism 1: Deterministic link • Here information is joined on the basis of exact matching values • Example Condor job:

      universe = vanilla
      executable = /usr/bin/R
      arguments = --slave --vanilla --file=bhps_test.R --args /home/pl3/condor/condor_5/wave1.dta /home/pl3/condor/condor_5/wave17.dta /home/pl3/condor/condor_5/bhps_combined.dat pid wave file pid wave file
      notification = Never
      log = test1.log
      output = test1.out
      error = test1.err
      queue

  12. The input files here are Stata-format data • The output is plain-text data • There are 3 linking variables, which happen to have the same names on both files • i.e. ‘pid wave file’ on file 1, and also ‘pid wave file’ on file 2 • Different names would be fine, but the same number of linking variables on both files is essential • A different total number of linking variables is fine (most often there is only one) • Different R templates can be used to read data in different formats (e.g. Stata, SPSS, plain text), though exported data can only readily be supplied in plain text

  13. The R template being run in the above application is:

      args <- as.factor(commandArgs(trailingOnly = TRUE))
      options(useFancyQuotes = TRUE)
      # First three arguments: input file A, input file B, output file
      fileAinp <- as.character(args[1])
      fileBinp <- as.character(args[2])
      fileCout <- as.character(args[3])
      # Read the Stata-format input files
      library(foreign)
      fileA <- read.dta(fileAinp, convert.factors = F)
      fileB <- read.dta(fileBinp, convert.factors = F)
      # Remaining arguments are the linking variables:
      # file A's names come first, then file B's names
      nargs <- sum(!is.na(args))
      allvars <- args[4:nargs]
      nargs2 <- sum(!is.na(allvars))
      first_vars <- as.character(allvars[1:(nargs2/2)])
      second_vars <- as.character(allvars[((nargs2/2) + 1):nargs2])
      # Left join: keep all records of file A, attach matches from file B
      combined2 <- merge(fileA, fileB, by.x = c(first_vars), by.y = c(second_vars),
                         all.x = T, all.y = F, sort = F, suffixes = c(".x", ".y"))
      # Export the combined data as comma-separated plain text
      write.table(combined2, file = fileCout, col.names = TRUE, sep = ",")
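
  For testing outside Condor, the same template can be run directly from the command line; this sketch simply mirrors the Condor ‘arguments’ line above, with the directory paths dropped for readability:

      R --slave --vanilla --file=bhps_test.R --args wave1.dta wave17.dta bhps_combined.dat pid wave file pid wave file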

  14. Mechanism 2: Probabilistic link • This is when data from different files are linked on criteria which are not just an exact match of values, but include some probabilistic algorithm • E.g. for each person in data 1, select a random person from the pool of people in data 2 who share the same characteristics (age 35-40, male, education = high, marital status = married), and link their voting preference data to the person in data 1 • Other implementation requirements are equivalent to deterministic matching, so long as the criteria for the matching algorithm are determined • Status: We don’t yet have a pool of probabilistic matching algorithms; we have one so far, namely random matching as in the above example (a minimal sketch follows below)
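
  A minimal R sketch of the random-matching mechanism described above, assuming both files carry the same grouping variables; the function name and the grouping/donor variable names are illustrative, not part of the DAMES tools:

      # For each record of data1, draw one random donor from the pool of
      # data2 records sharing the same values on the grouping variables,
      # and copy the donor's value (e.g. voting preference) across
      random_match <- function(data1, data2, group_vars, donor_var) {
        key1 <- as.character(interaction(data1[group_vars]))
        key2 <- as.character(interaction(data2[group_vars]))
        data1[[donor_var]] <- sapply(key1, function(k) {
          pool <- data2[[donor_var]][key2 == k]
          if (length(pool) == 0) NA else pool[sample.int(length(pool), 1)]
        })
        data1
      }
      # e.g. random_match(data1, data2,
      #                   c("ageband", "sex", "educ", "marstat"), "votepref")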

  15. Mechanism 3: Recoding/Transforming • Here the scenario is the application of an externally provided data recode, or other externally instructed arithmetic operation, onto a variable within data 1 • E.g. take the educational qualifications measure which is coded 1 to 20 in data 1; recode 1 thru 5 to the value 1, 6 thru 10 to the value 2, and all others to the value 3 (this is statistically equivalent to a deterministic match, but some recode inputs may not list every possible value) • E.g. take the measure of income and calculate its mean-standardised values within subgroups defined by regions (i.e. minus the regional mean, divided by the regional standard deviation) (a sketch of both examples follows below) • Status/Requirement: We need to develop a suitable mechanism to take recode-style information/instructions from relevant external sources and convert it into a suitable format for applying either a ‘recode’ or ‘merge’ routine in R • We’d like to support: • Recode information supplied via SPSS and Stata syntax specifications; data file matrices; and, potentially, manual specifications • Other transformation procedures supplied in advance from a small range of possibilities (e.g. mean standardisation, log transformation, cropping of extreme values) plus a small set of related arguments (e.g. category variables)
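
  A minimal R sketch of the two examples above, with hypothetical variable names (educqual, income, region) standing in for real survey measures:

      # Recode educational qualifications: 1-5 -> 1, 6-10 -> 2, all others -> 3
      data1$educ3 <- ifelse(data1$educqual %in% 1:5, 1,
                            ifelse(data1$educqual %in% 6:10, 2, 3))
      # Mean-standardise income within regions: subtract the regional mean,
      # then divide by the regional standard deviation
      reg_mean <- ave(data1$income, data1$region)           # FUN defaults to mean
      reg_sd   <- ave(data1$income, data1$region, FUN = sd)
      data1$inc_std <- (data1$income - reg_mean) / reg_sd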

  16. Recode examples: • Stata syntax: recode var1 1/5=1 6/10=2 *=3, generate(var2) • SPSS syntax: recode var1 (1 thru 5=1) (6 thru 10=2) (else=3) /into=var2. • Data matrix format and manual entry interface (SPSS example) [shown as screenshots on the original slide]
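
  As a sketch of how the ‘data matrix format’ might be applied, the same recode can be expressed as a lookup table and attached with a merge, i.e. treated as the deterministic match that slide 15 notes it is equivalent to (the var1/var2 names follow the syntax examples above; the table contents are assumptions):

      # Lookup table mapping old values (var1) to new values (var2)
      lookup <- data.frame(var1 = 1:10, var2 = c(rep(1, 5), rep(2, 5)))
      data1 <- merge(data1, lookup, by = "var1", all.x = TRUE, sort = FALSE)
      data1$var2[is.na(data1$var2)] <- 3   # the '*=3' / '(else=3)' clause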

  17. => Linking data management services into the e-Stat template Add ‘data review’ and ‘data construction’ elements, plus possible additional requests for modelling options • Data review: single script with minor variations on data • Data construction: as above, these involve variable operations and linkages with other files/resources • Derive measures on occupations, educational qualifications or ethnicity given information on the character of existing data • Collected via the curation tool, or, more realistically, from a short range of pre-supplied alternatives? • Distributional transformations including standardisation; numeric transformation; review of variable distributions • Model extensions: weight cases options; leverage review

  18. 8d) Wish lists/Suggestions • Include tools for describing/summarizing data • Outputs from generic ‘summarize’ commands in R linked to all templates • Tool for reviewing model results/leverage, feeding back into model respecification • Tools for applying survey weight variables to analysis(?) • User notes for models constructed (‘What was that?’) • Of benefit to novice and advanced practitioners • Potentially a part of the e-notebook, but could be a linked online guide (static) • e-Stat commands to provide documentation for replication • Terminologies used for the model/other user notes • Software equivalents or near-equivalents (including estimator specs) • Algebraic expression and model abstract • Tools for storing/compiling multiple model results • (mentioned previously, cf. ‘est table’ in Stata; a sketch follows below)
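
  On the last item, a minimal sketch of an ‘est table’-style compilation in R; the model formulas and the dat data frame are placeholders:

      # Store several fitted models, then compile their coefficients side by side
      models <- list(m1 = lm(y ~ x1, data = dat),
                     m2 = lm(y ~ x1 + x2, data = dat))
      coefs <- lapply(models, coef)
      terms <- unique(unlist(lapply(coefs, names)))
      est_table <- sapply(coefs, function(cc) cc[terms])   # NA where term absent
      rownames(est_table) <- terms
      print(round(est_table, 3))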

  19. Possible components of ‘model description’ user notes

  20. Est store demo here
