
For the e-Stat meeting of 27 Sept 2010

These slides describe tools for metadata collection and for the storage and organisation of data resources in a generic way, together with a fusion tool for merging data and recoding/standardising variables. They also cover integration with e-Stat for model-building and pre-analysis adjustments, and collaboration between e-Stat and the DAMES Node services.


Presentation Transcript


  1. For the e-Stat meeting of 27 Sept 2010 Paul Lambert / DAMES Node inputs

  2. 1) Progress updates • DAMES Node services of a hopefully generic/transferable nature • GESDE services on occupations, educational qualifications and ethnicity (www.dames.org.uk) • Data curation tool • Data fusion tool for merging data and recoding/standardising variables

  3. GESDE: online services for data coordination/organisation • Tools for handling variables in social science data • Recoding measures; standardisation/harmonisation; linking; curating • DIR workshop: Handling Social Science Data

  4. The data curation tool The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way

  5. Fusion tool (invoking R) - scenarios

  6. Currently: Expected inputs to e-Stat, Autumn 2010 First applications in integrating DAMES data preparation tools with e-Stat model-building systems • {Coordination/planning on WP1.6 workflow tools for pre-analysis} (?De Roure, McDonald, Michaelides, Lambert, Goldstein, Southampton RA?) • Template construction with applications using variable recodes and other pre-analysis adjustments from DAMES systems with view to generating generic template facilities • Preparation of some ‘typical’ example survey data/models (e.g. 10k+ cases, 50+ variables) and their implementation in e-Stat e.g. Cross-national/longitudinal comparability examples • Possible e-Stat inputs to DAMES workshops (Nov 24-5/Jan 25-6)

  7. 7a) Links with DAMES • DAMES Node core funding period is Feb 2008 – Jan 2011 • Further discussion of integrating pre-analysis services from DAMES into e-Stat facilities and templates • Appetite for other application-oriented contributions? • Alternative measures for the ‘changing circumstances during childhood’ application? • Preparation of illustrative application(s) with complex survey data? • Would need data spec. and broad analytical plan

  8. Pre-analysis options associated with DAMES Things that could be facilitated by the fusion tool (R scripts) in combination with the curation tool and if relevant specialist data (e.g. from GESDE) • Alternative measures/derived data • [via deterministic matches/variable transformation routines] • Using GESDE: Occupations, educational quals, ethnicity • (?Health oriented measures using Obesity e-Lab dropbox facility?) • Generic routines: Arithmetic standardisation tools • Replicability of measurement construction (e.g. syntax log of tasks) • Other possible data/review possibilities • [new but easy] Routine for summarizing data (see wish list) • [new, probably not easy] Weighting data options; routine for identifying values with high leverage / high residuals • (?provided elsewhere) Probabilistic matching routines

  9. Model for data locations? • ‘Curation tool’ can be used to attach variable names and metadata to facilitate variable processing • We then have a model of storing the data in a secure remote location (an iRODS server), from where jobs can be run on it (e.g. in R) • Is this a suitable model for e-Stat? • Is there another data location model? • Or better to supply scripts to run on files in an unspecified location?

  10. Fusion tool (invoking R) - scenarios

  11. Mechanism 1: Deterministic link • Here information is joined on the basis of exact matching values • Example Condor job:

      universe = vanilla
      executable = /usr/bin/R
      arguments = --slave --vanilla --file=bhps_test.R --args /home/pl3/condor/condor_5/wave1.dta /home/pl3/condor/condor_5/wave17.dta /home/pl3/condor/condor_5/bhps_combined.dat pid wave file pid wave file
      notification = Never
      log = test1.log
      output = test1.out
      error = test1.err
      queue

  12. The input files here are Stata-format data • The output is plain-text data • There are 3 linking variables, which happen to have the same names on both files • i.e. ‘pid wave file’ on file 1, and also ‘pid wave file’ on file 2 • Different names would be fine, but the same number of linking variables on both files is essential • A different total number of linking variables is fine (most often there is only one) • Different R templates can be used to read data in different formats (e.g. Stata, SPSS, plain text), though exported data can only readily be supplied in plain text

  13. The R template being run in the above application is:

      args <- as.factor(commandArgs(trailingOnly = TRUE))
      options(useFancyQuotes = TRUE)
      # First three arguments: input file A, input file B, output file
      fileAinp <- as.character(args[1])
      fileBinp <- as.character(args[2])
      fileCout <- as.character(args[3])
      # Read the Stata-format input files
      library(foreign)
      fileA <- read.dta(fileAinp, convert.factors = F)
      fileB <- read.dta(fileBinp, convert.factors = F)
      # Remaining arguments are the linking variables:
      # file A's names come first, then file B's names
      nargs <- sum(!is.na(args))
      allvars <- args[4:nargs]
      nargs2 <- sum(!is.na(allvars))
      first_vars <- as.character(allvars[1:(nargs2/2)])
      second_vars <- as.character(allvars[((nargs2/2) + 1):nargs2])
      # Left join: keep all records of file A, attach matches from file B
      combined2 <- merge(fileA, fileB, by.x = c(first_vars), by.y = c(second_vars),
                         all.x = T, all.y = F, sort = F, suffixes = c(".x", ".y"))
      # Export the combined data as comma-separated plain text
      write.table(combined2, file = fileCout, col.names = TRUE, sep = ",")
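
  For testing outside Condor, the same template can be run directly from the command line; this sketch simply mirrors the Condor ‘arguments’ line above, with the directory paths dropped for readability:

      R --slave --vanilla --file=bhps_test.R --args wave1.dta wave17.dta bhps_combined.dat pid wave file pid wave file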

  14. Mechanism 2: Probabilistic link • This is when data from different files are linked on criteria which are not just an exact match of values, but include some probabilistic algorithm • E.g. for each person in data 1, select a random person from the pool of people in data 2 who share the same characteristics (age 35-40, male, education = high, marital status = married), and link their voting preference data to the person in data 1 • Other implementation requirements are equivalent to deterministic matching, so long as the criteria for the matching algorithm are determined • Status: We don’t yet have a pool of probabilistic matching algorithms; we have one so far, namely random matching as in the above example (a minimal sketch follows below)
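
  A minimal R sketch of the random-matching mechanism described above, assuming both files carry the same grouping variables; the function name and the grouping/donor variable names are illustrative, not part of the DAMES tools:

      # For each record of data1, draw one random donor from the pool of
      # data2 records sharing the same values on the grouping variables,
      # and copy the donor's value (e.g. voting preference) across
      random_match <- function(data1, data2, group_vars, donor_var) {
        key1 <- as.character(interaction(data1[group_vars]))
        key2 <- as.character(interaction(data2[group_vars]))
        data1[[donor_var]] <- sapply(key1, function(k) {
          pool <- data2[[donor_var]][key2 == k]
          if (length(pool) == 0) NA else pool[sample.int(length(pool), 1)]
        })
        data1
      }
      # e.g. random_match(data1, data2,
      #                   c("ageband", "sex", "educ", "marstat"), "votepref")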

  15. Mechanism 3: Recoding/Transforming • Here the scenario is the application of an externally provided data recode, or other externally instructed arithmetic operation, onto a variable within data 1 • E.g. take the educational qualifications measure which is coded 1 to 20 in data 1; recode 1 thru 5 to the value 1, 6 thru 10 to the value 2, and all others to the value 3 (this is statistically equivalent to a deterministic match, but some recode inputs may not list every possible value) • E.g. take the measure of income and calculate its mean-standardised values within subgroups defined by regions (i.e. minus the regional mean, divided by the regional standard deviation) (a sketch of both examples follows below) • Status/Requirement: We need to develop a suitable mechanism to take recode-style information/instructions from relevant external sources and convert it into a suitable format for applying either a ‘recode’ or ‘merge’ routine in R • We’d like to support: • Recode information supplied via SPSS and Stata syntax specifications; data file matrices; and, potentially, manual specifications • Other transformation procedures supplied in advance from a small range of possibilities (e.g. mean standardisation, log transformation, cropping of extreme values) plus a small set of related arguments (e.g. category variables)
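
  A minimal R sketch of the two examples above, with hypothetical variable names (educqual, income, region) standing in for real survey measures:

      # Recode educational qualifications: 1-5 -> 1, 6-10 -> 2, all others -> 3
      data1$educ3 <- ifelse(data1$educqual %in% 1:5, 1,
                            ifelse(data1$educqual %in% 6:10, 2, 3))
      # Mean-standardise income within regions: subtract the regional mean,
      # then divide by the regional standard deviation
      reg_mean <- ave(data1$income, data1$region)           # FUN defaults to mean
      reg_sd   <- ave(data1$income, data1$region, FUN = sd)
      data1$inc_std <- (data1$income - reg_mean) / reg_sd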

  16. Recode examples: • Stata syntax: recode var1 1/5=1 6/10=2 *=3, generate(var2) • SPSS syntax: recode var1 (1 thru 5=1) (6 thru 10=2) (else=3) /into=var2. • Data matrix format and manual entry interface (SPSS example) [shown as screenshots on the original slide]
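
  As a sketch of how the ‘data matrix format’ might be applied, the same recode can be expressed as a lookup table and attached with a merge, i.e. treated as the deterministic match that slide 15 notes it is equivalent to (the var1/var2 names follow the syntax examples above; the table contents are assumptions):

      # Lookup table mapping old values (var1) to new values (var2)
      lookup <- data.frame(var1 = 1:10, var2 = c(rep(1, 5), rep(2, 5)))
      data1 <- merge(data1, lookup, by = "var1", all.x = TRUE, sort = FALSE)
      data1$var2[is.na(data1$var2)] <- 3   # the '*=3' / '(else=3)' clause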

  17. => Linking data management services into the e-Stat template Add ‘data review’ and ‘data construction’ elements, plus possible additional requests for modelling options • Data review: single script with minor variations on data • Data construction: as above, these involve variable operations and linkages with other files/resources • Derive measures on occupations, educational qualifications or ethnicity given information on the character of existing data • Collected via the curation tool, or, more realistically, from a short range of pre-supplied alternatives? • Distributional transformations including standardisation; numeric transformation; review of variable distributions • Model extensions: weight cases options; leverage review

  18. 8d) Wish lists/Suggestions • Include tools for describing/summarizing data • Outputs from generic ‘summarize’ commands in R linked to all templates • Tool for reviewing model results/leverage, feeding back into model respecification • Tools for applying survey weight variables to analysis(?) • User notes for models constructed (‘What was that?’) • Of benefit to novice and advanced practitioners • Potentially a part of the e-notebook, but could be a linked online guide (static) • e-Stat commands to provide documentation for replication • Terminologies used for the model/other user notes • Software equivalents or near-equivalents (including estimator specs) • Algebraic expression and model abstract • Tools for storing/compiling multiple model results • (mentioned previously, cf. ‘est table’ in Stata; a sketch follows below)
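
  On the last item, a minimal sketch of an ‘est table’-style compilation in R; the model formulas and the dat data frame are placeholders:

      # Store several fitted models, then compile their coefficients side by side
      models <- list(m1 = lm(y ~ x1, data = dat),
                     m2 = lm(y ~ x1 + x2, data = dat))
      coefs <- lapply(models, coef)
      terms <- unique(unlist(lapply(coefs, names)))
      est_table <- sapply(coefs, function(cc) cc[terms])   # NA where term absent
      rownames(est_table) <- terms
      print(round(est_table, 3))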

  19. Possible components of ‘model description’ user notes

  20. Est store demo here
