UNECE Work Session on Statistical Data Editing
Topic (v): New and emerging methods
Working Paper #32
Further Improvements to an Edit and Imputation System for the 2007 United States Census of Agriculture
U.S. Department of Agriculture, National Agricultural Statistics Service
Presented by Dale Atkinson, Bonn, Germany, September 27, 2006
Slide 1
Slide 2 • The U.S. Census of Agriculture: Reflecting on the Past, but Looking Ahead
• Scope and Background
• Successes of the 2002 Census
• Less Successful Aspects (LSAs) of the 2002 Census
• Progress in Addressing these LSAs for 2007
Slide 3 • Scope and Background
• Quinquennial census of all U.S. farms
• Responsibility of the Bureau of the Census (BOC) before 1997; USDA/NASS since 1997
• NASS implemented sweeping changes in 2002
Slide 4 • Successes of the 2002 Census
• National Processing Center
• Scanning for image capture
• Decision logic tables (DLTs) / authoring system
• Data Review System
• Analysis System
Slide 5 • Less Successful Aspects (LSAs) of the 2002 Census
• Scanning for data capture (optical character recognition, OCR)
• Database model
• Implementation of edit code
• Imputation / donor-pool creation and maintenance
• Testing
Slide 6 • Progress in Addressing these LSAs for 2007
• Eliminated scanning for data capture: the system for intelligent mark recognition (IMR) with key-from-image is basically in place and largely tested. This change alone will eliminate numerous editing problems.
• Improved the database model: the 2002 dual-database model (Sybase and RedBrick) has been revamped to use Sybase alone, eliminating most of the synchronization issues that plagued 2002 processing.
Slide 7 • Progress in Addressing these LSAs for 2007
• Dramatically improved the edit system
• Built on the final, production-tuned decision logic tables (DLTs) from 2002, so there was no need to reinvent the wheel this time
• Redesigned the edit architecture: in early testing, the new version runs about 60 times faster, owing to software improvements and more efficient use of hardware
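The edit system above is built on decision logic tables: rows pairing a condition on a reported record with an action when the condition fires. As a rough illustration only (the rule names, fields, and thresholds below are hypothetical, not NASS's actual edits), a DLT-style check can be sketched as:

```python
def run_dlt(record, rules):
    """Apply decision-logic-table style rules to one record.

    Each rule is a (name, condition) pair; a rule 'fires' when its
    condition is true, flagging the record for correction or imputation.
    Returns the list of failed rule names.
    """
    return [name for name, condition in rules if condition(record)]

# Illustrative rules; real census DLTs encode many such consistency checks.
RULES = [
    ("harvested_exceeds_planted",
     lambda r: r.get("harvested_acres", 0) > r.get("planted_acres", 0)),
    ("yield_out_of_range",
     lambda r: not (0 <= r.get("yield_per_acre", 0) <= 300)),
]

record = {"planted_acres": 100, "harvested_acres": 120, "yield_per_acre": 45}
print(run_dlt(record, RULES))  # ['harvested_exceeds_planted']
```

Keeping each rule as data rather than hard-coded logic is what lets an authoring system maintain and tune the tables between censuses.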
Slide 8 • Progress in Addressing these LSAs for 2007
• Similarly improved the imputation methodology and donor-pool design/creation:
• Start with a Census Data Repository (CDR) seeded with 2002 census data and 2006 census content data (~1.5 million records)
• Update the CDR each night with current "clean" 2007 census records
• Use a fast dynamic clustering algorithm to create donor pools (of approximately 100 records each) from the CDR, using farm-similarity parameters
• Use matching variables, highly correlated with the specific item requiring imputation, for quick, effective extraction of a nearest neighbor from within the appropriate pool, based on a predetermined distance measure
• Require all imputations to pass the edits
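The nearest-neighbor step above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: records are dicts of numeric items, the distance measure is Euclidean over the matching variables, and the field names and edit check are invented for the example, not the census system's actual schema or edits.

```python
def impute(recipient, pool, item, matching_vars, passes_edits):
    """Fill `item` on the recipient from the nearest donor in the pool,
    trying donors in order of distance until the result passes the edits."""
    candidates = sorted(
        pool,
        key=lambda d: sum((recipient[v] - d[v]) ** 2 for v in matching_vars),
    )
    for donor in candidates:
        trial = dict(recipient, **{item: donor[item]})
        if passes_edits(trial):   # all imputations must pass the edit system
            return trial
    return recipient              # no acceptable donor; refer for review

# Tiny donor pool standing in for one ~100-record cluster from the CDR
pool = [
    {"planted_acres": 90,  "harvested_acres": 85,  "sales": 40000},
    {"planted_acres": 400, "harvested_acres": 390, "sales": 210000},
]
recipient = {"planted_acres": 100, "harvested_acres": 95, "sales": None}
edits_ok = lambda r: r["sales"] is not None and r["sales"] > 0

result = impute(recipient, pool, "sales",
                ["planted_acres", "harvested_acres"], edits_ok)
print(result["sales"])  # 40000, taken from the more similar (smaller) farm
```

Restricting the search to a pre-clustered pool of ~100 similar farms is what keeps the nearest-neighbor lookup fast at census scale.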
Slide 9 • Progress in Addressing these LSAs for 2007
• Developing and implementing QC/QA processes: a dashboard approach will let us monitor data quality and system performance in real time
• And most importantly, allowing adequate development and testing time:
• System development for 2007 is more than a year ahead of where it was leading up to 2002
• Individual module testing is well underway
• Integrated testing will start next month, allowing a full 15 months of stress-testing the system
• No more "beta-duction"
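The dashboard idea above amounts to computing rolling quality metrics over each processed batch. A minimal sketch, assuming per-record counters of edit failures and imputed items (field names are illustrative, not the census system's actual schema):

```python
def quality_snapshot(batch):
    """Summarize simple data-quality metrics for one batch of records."""
    n = len(batch)
    failed = sum(1 for r in batch if r["edit_failures"] > 0)
    imputed = sum(1 for r in batch if r["imputed_items"] > 0)
    return {
        "records": n,
        "edit_failure_rate": failed / n,
        "imputation_rate": imputed / n,
    }

# One nightly batch's worth of (toy) per-record counters
batch = [
    {"edit_failures": 0, "imputed_items": 0},
    {"edit_failures": 2, "imputed_items": 1},
    {"edit_failures": 0, "imputed_items": 1},
    {"edit_failures": 1, "imputed_items": 0},
]
print(quality_snapshot(batch))
# {'records': 4, 'edit_failure_rate': 0.5, 'imputation_rate': 0.5}
```

Trending these rates across nightly batches is what would surface a processing problem in near real time rather than at the end of the census.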