160 likes | 230 Views
Explore the challenges, solutions, and benefits of switching from tabfiles to multiple databases for longitudinal data management. Learn about the process, advantages, and considerations for this transformation.
E N D
Multiple Databases or Tabfiles? Frances Williams ISER, University of Essex SIR UK Conference London, June 2007
Introduction • Overview of problem • Previous solution • Proposed solution • Reluctance to Change • Why we took the plunge • How things went • Are tabfiles redundant? • Conclusion SIR UK Conference London, June 2007
Overview of problem • British Household Panel Study (BHPS) • Longitudinal study • Interview same people every year • 16th wave • Two separate database structures • Survey Database for cleaning • User database for release to research community • Used SIR2K SIR UK Conference London, June 2007
Overview of problem (cont) • Survey database (SDB) • One for each wave • Organised for cleaning purposes • Highly structured • Reflects questionnaire • Variable names reflect question numbers SIR UK Conference London, June 2007
Overview of problem (cont) • User database (UDB) • One database incorporating all waves • Flattened structure • Easier for users • Mnemonic variable names • Consistent across time • Year prefix indicates the wave • A – wave 1, B - wave 2 etc • Lots of non-questionnaire variables • Derived variables • Imputed variables • Weighting variables SIR UK Conference London, June 2007
Overview of problem (cont) • Need to convert Survey database into User database • Done on annual basis • When data are clean • Large number of variables to transpose • 40% variables change each year • Some variables merged • Numeric and DK/Ref variables • ~1500 variables on 40+ record types -> ~1300 variables on 7 record types • Some values changed • DK, Refused, Missing and Inapplicable given negative values SIR UK Conference London, June 2007
Previous solution • Developed 15 years ago in SIR2.2 • Uses intermediate tabfile • Stage 1 • Variables from SDB converted to UDB vars (minus wave prefix) • Saved in table equivalent to UDB record • Stage 2 • Tabfile rows converted to UDB database records SIR UK Conference London, June 2007
Previous solution (cont) • Disadvantages • Had to be done in 2 stages • Stage 1 using SDB • Stage 2 building UDB • Time consuming • Each process took over an hour to run • If SDB changed, both process had to be re-run • Error prone • Two places for errors to occur • Extra maintenance • Annual changes to be done in two places SIR UK Conference London, June 2007
Proposed solution • Change to SIR2002 • Allows multiple databases in a retrieval • Would allow one-stage transposition • Quicker • Easier • Less error prone • Less annual maintenance • Less space intensive SIR UK Conference London, June 2007
Reluctance to Change • Lots of work to convert system • 15 databases • 28 tabfiles • SIR 2K doesn’t export correctly • Numerics can get split over two lines • Need to hand-edit export files • Tedious process to fix • Would I be able to get the multiple database code to work? • Solaris • Caseless databases • Will existing code compile? • Conversion from SIR 2.2 to SIR 2K required lots of code revision SIR UK Conference London, June 2007
Why we took the plunge • SIR Oz would not fix SIR 2K export problem • Either stick with SIR 2K indefinitely or deal with problem • The longer we left it the more databases and tabfiles there would be to convert • Multiple databases and other new features in SIR2002 would be very useful in other areas of our systems SIR UK Conference London, June 2007
How things went • It was very long and tedious converting all the databases and tabfiles • Got there in the end • There were some compilation problems • Not as many as when we went from SIR 2.2 to SIR 2K • Do repeat var = A to B doesn’t work when A and B are the same SIR UK Conference London, June 2007
How things went (cont) • Some problems in getting multiple databases to work • Lots of help from SIR Oz • Resultant code much better • One stage process • More maintainable • Much faster • Well worth the effort • Wouldn’t want to do it again • But glad we did do it! SIR UK Conference London, June 2007
Are tabfiles redundant? • Three common situations where I have used tabfiles • Transposition of data from one database to another • Worth updating to SIR 2002 or SIR XS and rewriting code • Static file containing data to be accessed in several different ways by different applications • In the BHPS we have a large number of these • In a new system I would use databases • Not worth rewriting SIR UK Conference London, June 2007
Are tabfiles redundant? (cont) • Temporary file to hold external data • Read in from text file – e.g .csv file • Stored in tabfile and indexed as required • Dropped at end of application • Much easier to use tabfile than database • No schema required • Can be created and dropped ‘on the fly’ SIR UK Conference London, June 2007
Conclusion • Multiple databases are great • Remove need for intermediate tabfiles when accessing data from more than one database • Tabfiles have their uses • Temporary store for external data • Horses for courses • SIR Oz - don’t ever get rid of tabfiles! SIR UK Conference London, June 2007