
Creating Something from Nothing: Synthetic and Dummy files

Creating Something from Nothing: Synthetic and Dummy files. Bo Wandschneider University of Guelph Chuck Humphrey University of Alberta. DLI Training: Ottawa, May, 2003. Outline. Types of data Files Implications for analysis Where do we get access Which file is appropriate



Presentation Transcript


  1. Creating Something from Nothing: Synthetic and Dummy Files • Bo Wandschneider, University of Guelph • Chuck Humphrey, University of Alberta • DLI Training: Ottawa, May 2003

  2. Outline • Types of data files • Implications for analysis • Where do we get access? • Which file is appropriate? • Providing service with synthetic files • NPHS: an exercise • SLID: an exercise

  3. Types of Data Files • Microdata • Confidential Microdata Products • Master Files • Share Files • Public Access Microdata Products • Public Use Anonymized Microdata Files (PUMFs) • Synthetic Files

  4. Microdata Products Microdata • Raw data organized in a file in which the records (lines) are observations of a specific unit of analysis and the information on each line is the values of variables • Requires some form of processing or analysis to be used

  5. Microdata - SCF Example 000011031000+025607+000000+025607+000337+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+025944+006481+0194632331000000000090922201200000000000222+0232111000+000000+0000003000000000000000002228233411412190638749500575211004600132 000021031000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+0000001663000000000060824432200000000000632+0000000000+000000+0000000000000000000000003116121111435481500777500570033004300110 000031031000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+0000001663000000000040521112200000000000432+0206261110+000636+0000003000000000000000002228213411436491600778500570033004200085 000041031000+002080+000000+002080+000000+000575+000522+000000+000000+002574+000000+000000+003671+003149+000522+000000+000000+005751+000000+0057514551000000000060824432200000000000532+0220101021+000575+0005223000000000000000002240223411431251000774500571622361600065 000051031000+018050+000000+018050+000000+000288+000261+000000+000000+000000+000000+000000+000549+000288+000261+000000+001179+019778+002463+0173152221000000000050522201200000000000432+0000001011+000288+0002611000000000000000001246123411411440748739500575011021600046 000061031000+001500+000000+001500+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+001500+000000+0015002551000000000101024501200000000000631+0000000000+000000+0000000000000000000000003123263411431071300773500571612004300094 000071031000+000000+000000+000000+000000+000000+000000+002540+000000+000000+000000+000000+002540+002540+000000+000000+000000+002540+000000+0025404152000000000010340201200000000000222+0121134000+000000+0000003000000000000000002269233411436491600778500570033004200041 
000081031000+008400+000000+008400+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+008400+000858+0075422551000000000080823301200000000000332+0000000000+000000+0000000000000000000000003118133411411210848739500575211004600055 000091031000+026000+000000+026000+000000+000287+000156+000000+000000+000879+000000+000000+001322+001166+000156+000000+000000+027322+004335+0229872231000000000070823422200000000000642+0000001012+000287+0001561000000000000000001248113411431400300774500564512071600060 000101031000+000000+000000+000000+000157+000000+000000+005043+000000+000000+000000+000000+005043+002541+002502+000000+000000+005200+000000+0052004652000000000040622312200000000000642+0000000000+000000+0000002000000000000000004376213411436491600778500570033004400076 000111031000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+0000001663000000000020341213100000000000462+0000000000+000000+0000000000000000000000003119213411435481500777500570033004500040 000121031000+000991+000000+000991+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000000+000991+000000+0009912551000000000020343322100000000000433+0000000000+000000+0000000000000000000000003117121311432231400773500571222004300244 000131031000+027716+000000+027716+000000+000288+000000+000000+000000+000000+000000+000000+000288+000288+000000+000000+000000+028004+006243+0217612221000000000070722201200000000000331+0034071100+000288+0000001000000000000000001226163411411431138739500575211004600156 000141031000+010000+000000+010000+000000+000600+000000+000000+000000+000000+000000+000000+000600+000600+000000+000000+000000+010600+000686+0099142331000000000040422201200000000000433+0077001011+000600+0005221000000000000000001260123411411440636719500573012221600148 
000151031000+000750+000000+000750+000000+000000+000370+000000+000000+000000+000000+000000+000370+000000+000370+000000+000000+001120+000000+0011202551000000000080823313200000000000633+0323511032+001126+0003703000000000000000002245223411411261318529500575222004600132 000161031000+007012+000000+007012+000165+000000+000000+000000+000000+003082+000000+000000+003082+003082+000000+000000+000000+010259+001356+0089032541000000000070824432200000000000531+0000000000+000000+0000000000000000000000003118123411421320320439500573522171600111 000171031000+002027+000000+002027+000000+000000+000000+000000+000000+000000+000000+000000+ Microdata Products
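The records above are fixed-width: every value sits at a known column position, and nothing can be read from the file without a record layout. The following is a minimal sketch of that kind of processing. The field names and widths are guesses made for illustration only; they are not the documented SCF record layout.

```python
# Hypothetical layout: these names and widths are assumptions for
# illustration, not the documented SCF record layout.
ID_WIDTH = 5        # assumed sequence number, e.g. "00001"
CODE_WIDTH = 7      # assumed block of coded digits, e.g. "1031000"
AMOUNT_WIDTH = 7    # sign plus six digits, e.g. "+025607"


def parse_record(line, n_amounts=3):
    """Split one fixed-width record into an id, a code block, and signed amounts."""
    seq = int(line[:ID_WIDTH])
    code = line[ID_WIDTH:ID_WIDTH + CODE_WIDTH]
    start = ID_WIDTH + CODE_WIDTH
    amounts = []
    for i in range(n_amounts):
        field = line[start + i * AMOUNT_WIDTH:start + (i + 1) * AMOUNT_WIDTH]
        amounts.append(int(field))  # int() accepts a leading "+" or "-"
    return {"seq": seq, "code": code, "amounts": amounts}


# The first 33 characters of the first record shown above.
sample = "000011031000+025607+000000+025607"
record = parse_record(sample)
print(record)
```

In practice the column positions come from the record layout distributed with the file, and the same information is what a SAS or SPSS setup encodes.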

  6. Confidential Microdata Master Files • These files contain the full detail captured about the unit of observation • Because the information in them can identify the individual who provided it, the files are considered confidential

  7. Confidential Microdata Master File – Example

  8. Confidential Microdata Master File - Personal identifiers

  9. Confidential Microdata Master File – Geography (SLID)

  10. Confidential Microdata Master File - Fullness of Data (NPHS)

  11. Confidential Microdata Master File - Fullness of Data

  12. Confidential Microdata Master File - Fullness of Data (SLID)

  13. Confidential Microdata Master File - Fullness of Data

  14. Confidential Microdata Share Files • Confidential files for which the respondents have signed a consent form permitting Statistics Canada to allow access to their information for approved research • Used with NPHS and NLSCY

  15. Public Access Microdata Anonymized Microdata • These microdata are specially prepared to minimize the possibility of disclosing or identifying any of the cases or observations • The original data from the master file are edited to create a public use microdata file

  16. Public Access Microdata Steps in Anonymizing Microdata • Remove all personal identification information (names, addresses, etc.) • Include only gross levels of geography • Collapse detailed information into a smaller number of general categories • Suppress the values of a variable
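The four steps on this slide can be sketched on a single record. This is an illustration only, not Statistics Canada's procedure: the field names, the age cut-offs, and the sample values are all hypothetical.

```python
# All field names, cut-offs, and sample values below are hypothetical.
IDENTIFIERS = {"name", "address"}          # step 1: dropped outright
FINE_GEOGRAPHY = {"postal_code", "city"}   # step 2: only province is kept
SUPPRESSED = {"occupation_detail"}         # step 4: value blanked


def age_group(age):
    """Step 3: collapse exact age into a few broad categories."""
    if age < 25:
        return "under 25"
    if age < 65:
        return "25-64"
    return "65 and over"


def anonymize(record):
    """Apply the four steps to one master-file-style record (a dict)."""
    out = {}
    for key, value in record.items():
        if key in IDENTIFIERS or key in FINE_GEOGRAPHY:
            continue                          # steps 1 and 2
        if key in SUPPRESSED:
            out[key] = None                   # step 4
        elif key == "age":
            out["age_group"] = age_group(value)   # step 3
        else:
            out[key] = value
    return out


master = {
    "name": "A. Respondent",
    "address": "1 Example St",
    "postal_code": "X0X 0X0",
    "city": "Exampleville",
    "province": "ON",
    "age": 37,
    "occupation_detail": "example occupation",
    "income": 25607,
}
public = anonymize(master)
print(public)
```

Each step throws information away, which is why a PUMF supports less detailed analysis than the master file it was derived from.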

  17. Public Access Microdata Statistics Canada PUMFs • Only available for select social surveys that undergo a review by the Data Release Committee, an internal Statistics Canada committee • No ‘enterprise’ public use microdata

  18. Public Access Microdata Statistics Canada PUMFs • Almost all are cross-sectional, that is, they represent data collected at one point in time • Longitudinal data are difficult to anonymize while maintaining any useful information

  19. Public Access Microdata PUMFs – personal identifiers

  20. Public Access Microdata PUMFs – gross geography

  21. Public Access Microdata PUMFs – collapsed data

  22. Public Access Microdata PUMFs – suppressed data

  23. Public Access Microdata Synthetic Files • These microdata do not contain actual ‘real’ cases but are pseudo-cases that provide aggregate results close to the ‘real’ cases

  24. Public Access Microdata Synthetic Files • They are prepared so that analysis runs for the master file can be developed without any risk of disclosing or identifying any of the cases

  25. Public Access Microdata Synthetic Files • Results from these files are not to be reported; the files are strictly for preparing analyses of master files • Usually associated with longitudinal files

  26. Public Access Microdata Steps in creating Synthetic Files • Observations are transformed • No records actually exist • Keep fullness of detail
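The slides do not say how the observations are transformed. As one possible illustration of the idea, the sketch below shuffles each variable's column independently, so every variable keeps its full detail and its overall distribution while the rows become recombined pseudo-cases. This is an assumed technique chosen for simplicity, not the method used for the NPHS or any other Statistics Canada file, and the sample values are invented.

```python
import random


def synthesize(records, seed=0):
    """Build pseudo-cases by shuffling each variable's column independently."""
    rng = random.Random(seed)
    columns = {}
    for name in records[0]:
        values = [r[name] for r in records]
        rng.shuffle(values)          # breaks the link between variables
        columns[name] = values
    return [
        {name: columns[name][i] for name in columns}
        for i in range(len(records))
    ]


def mean(records, name):
    """Simple unweighted mean of one variable."""
    return sum(r[name] for r in records) / len(records)


# Invented sample values standing in for master-file records.
real = [
    {"age": 23, "income": 18000},
    {"age": 37, "income": 25607},
    {"age": 52, "income": 61000},
    {"age": 68, "income": 30500},
]
synthetic = synthesize(real)

# Single-variable aggregates match the real file exactly.
print(mean(real, "income"), mean(synthetic, "income"))
```

A shuffle like this is crude: a recombined row can coincide with a real one by chance, and relationships between variables are lost, which is one reason results from any synthetic file are suitable only for testing setups.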

  27. Public Access Microdata Synthetic Files – NPHS example

  28. Public Access Microdata Synthetic Files – NPHS 1999 general file

  29. Public Access Microdata Synthetic Files – NPHS 1999

  30. Public Access Microdata Synthetic Files – NPHS 1999

  31. Implications for Analysis What are the implications in doing analysis with these different types of microdata files?

  32. Implications for Analysis Master File • All observations • Has the most variables with the most detail • Lots of geography and personal characteristics • Little grouping or capping of categories

  33. Implications for Analysis Master File • Restricted access: only available to authorized Statistics Canada employees, which includes ‘deemed employees’

  34. Implications for Analysis Master File • Includes linkage variables across files within a study, e.g., NLSCY linkage among the files for different units of analysis (kids, parents, teachers)

  35. Implications for Analysis Public Use Microdata (PUMF) • Suppressed observations • Suppressed variables: removed from the file • Suppressed content • Gross geography • Collapsed categories • Capped values
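Capped values are worth a small worked example, because they change summary statistics directly. The sketch below top-codes a list of incomes at a ceiling; the ceiling and the income values are invented for illustration and do not come from any PUMF.

```python
def cap(values, ceiling):
    """Top-code: replace every value above the ceiling with the ceiling."""
    return [min(v, ceiling) for v in values]


def mean(values):
    return sum(values) / len(values)


# Invented incomes; the ceiling of 100000 is an assumption for illustration.
incomes = [18000, 25607, 250000]
capped = cap(incomes, 100000)

print(capped)
print(mean(incomes), mean(capped))
```

The capped mean is lower than the true mean, so an analysis of high values on a PUMF can understate what the master file would show.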

  36. Implications for Analysis Public Use Microdata (PUMF) • Licensed product: agree to certain terms of use • No linkage to multiple units of analysis, with a few exceptions (GSS Time Use and Family)

  37. Implications for Analysis Synthetic Files “Looks like a duck and quacks like a duck”, but it isn’t a duck or any other type of fowl.

  38. Implications for Analysis Synthetic Files • Looks like master files • Lots of observations • Lots of variables • Little grouping or capping of categories • Lots of geographic detail

  39. Synthetic Files Precautions • Results not authentic – but close in the aggregate • Use for testing analysis setups only • Still need the master files for publishable results

  40. Where do we get Access? Master File • Restricted access governed under the Statistics Act • Remote Job Submission • Research Data Centres • Apply to SSHRC for peer review of the proposal and to STC for security clearance

  41. Where do we get Access? Public Use Microdata Files (PUMF) • Get from DLI • Analyze wherever is convenient • Can use a variety of analysis software, including SAS, SPSS, Stata, HLM, LISREL, etc. • SLIDRET sans data

  42. Where do we get Access? Synthetic Files • Author Divisions ‘may’ create it • Most relevant when dealing with new Panel Data, but not necessarily, e.g., the Census has potential • NPHS synthetic files on DLI FTP site

  43. Where do we get Access? Synthetic Files • SLID, WES, YITS coming? • Do we need to encourage them? • Work with them locally • Build SAS and SPSS setups

  44. Which File is Appropriate? • 1st stop is still the PUMF • This file has the easiest access for us • Probably meets the needs of most clients • Not as administratively burdensome as synthetic or master file • Perfect for clients just looking for ‘data’ – courses in quantitative analysis

  45. Which File is Appropriate? • If more detail is needed, refer to the Master File Documentation (similar to Synthetic File Documentation) • Make them aware that the cost of use is higher, both in terms of accessibility and analytical requirements • Interest most likely to come from grad students and ‘experienced’ researchers

  46. Which File is Appropriate? • Download the Synthetic files from DLI • Make them aware of problems with synthetic files – RESULTS ARE NOT PUBLISHABLE • Encourage them to submit an application for RDC access – there is a time lag

  47. Which File is Appropriate? RDC

  48. Which File is Appropriate? • Some of you may work with a client using synthetic files before passing her/him off to an RDC

  49. Services for Synthetic Files DLI Contacts can provide four basic services with synthetic files. • Build SPSS and SAS system files from the raw synthetic data files that are distributed through DLI; • Provide information about the use of Remote Job Submission (a.k.a. Remote Access) and RDCs;

  50. Services for Synthetic Files • Assist with finding variables in the synthetic files; • Provide instruction about ways of capturing SPSS or SAS code from “dummy” analysis runs with the synthetic files. It is this code that is then submitted to STC through remote job submission.
