
U.S. Department of Agriculture National Agricultural Statistics Service

UNECE Work Session on Statistical Data Editing, Topic (v): New and emerging methods. Working Paper #32: Further Improvements to an Edit and Imputation System for the 2007 United States Census of Agriculture.


Presentation Transcript


  1. UNECE Work Session on Statistical Data Editing
     Topic (v): New and emerging methods
     Working Paper #32
     Further Improvements to an Edit and Imputation System for the 2007 United States Census of Agriculture
     U.S. Department of Agriculture, National Agricultural Statistics Service
     Presented by Dale Atkinson in Bonn, Germany, September 27, 2006

  2. U.S. Census of Agriculture: Reflecting on the Past, but Looking Ahead
     • Scope and Background
     • Successes of the 2002 Census
     • Less Successful Aspects (LSAs) of the 2002 Census
     • Progress in Addressing these LSAs for 2007

  3. Scope and Background
     • Quinquennial census of all U.S. farms
     • Responsibility of the Bureau of the Census (BOC) through 1997; USDA/NASS since 1997
     • NASS implemented sweeping changes in 2002

  4. Successes of the 2002 Census
     • National Processing Center
     • Scanning for Image
     • DLTs/Authoring System
     • Data Review System
     • Analysis System

  5. Less Successful Aspects (LSAs) of the 2002 Census
     • Scanning for data capture (OCR)
     • Database model
     • Implementation of edit code
     • Imputation/donor pool creation and maintenance
     • Testing

  6. Progress in Addressing these LSAs for 2007
     • Have eliminated scanning for data capture
        • System is basically in place and largely tested for intelligent mark recognition (IMR) with key from image. This change alone will eliminate numerous editing problems.
     • Have improved the database model
        • The 2002 dual database model (Sybase and RedBrick) has been revamped to use only Sybase, which will eliminate most of the synchronization issues that plagued 2002 processing.

  7. Progress in Addressing these LSAs for 2007
     • Have dramatically improved the edit system
        • Have been able to build on the final, production-tuned DLTs from 2002. Didn’t have to reinvent the wheel this time!
        • Have redesigned the edit architecture. In early testing, the new version is running about 60 times faster as a result of software improvements and more efficient use of hardware!

  8. Progress in Addressing these LSAs for 2007
     • Have similarly improved imputation methodology and donor pool design/creation
        • Start with a Census Data Repository (CDR) seeded with 2002 census data and 2006 census content data (~1.5 million records)
        • Update the CDR each night with current “clean” 2007 census records
        • Use a fast dynamic clustering algorithm to create “donor pools” (of approximately 100 records each) from the CDR using farm similarity parameters
        • Use matching variables, chosen because they are highly correlated with the specific item requiring imputation, for quick, effective extraction of a nearest neighbor from within the appropriate pool, based on a predetermined distance measure
        • Require all imputations to pass edits

  9. Progress in Addressing these LSAs for 2007
     • Are developing and implementing QC/QA processes
        • By implementing a dashboard approach, we’ll be able to monitor data quality and system performance in real time
     • And most importantly: are allowing adequate development and testing time
        • System development for 2007 is more than a year ahead of where it was leading up to 2002
        • Individual module testing is well underway
        • Integrated testing will start next month, allowing us a full 15 months of stress testing the system
        • No more “beta-duction”
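
The imputation flow outlined on slide 8 (donor pools drawn from the CDR, nearest-neighbor extraction within the appropriate pool, and the requirement that every imputation pass edits) can be illustrated with a minimal Python sketch. This is not the NASS implementation. The slides do not describe the clustering algorithm, the distance measure, or the edit rules, so the sketch makes loudly stated assumptions: pools are formed by a simple grouping key rather than dynamic clustering, distance is Euclidean, and the field names (`farm_type`, `total_acres`, `cropland_acres`, `corn_acres`) and the single edit rule are hypothetical.

```python
import math
from collections import defaultdict


def build_donor_pools(clean_records, pool_key):
    """Group clean records into donor pools.

    Assumption: a plain grouping key stands in for the clustering step,
    which the slides mention but do not specify.
    """
    pools = defaultdict(list)
    for rec in clean_records:
        pools[pool_key(rec)].append(rec)
    return pools


def distance(a, b, match_vars):
    """Euclidean distance over the matching variables (assumed measure)."""
    return math.dist([a[v] for v in match_vars], [b[v] for v in match_vars])


def impute(record, item, pools, pool_key, match_vars, passes_edits):
    """Copy `item` from the nearest donor in the record's pool whose value
    yields a record that passes edits; return None if no donor qualifies."""
    donors = pools.get(pool_key(record), [])
    for donor in sorted(donors, key=lambda d: distance(record, d, match_vars)):
        candidate = dict(record, **{item: donor[item]})
        if passes_edits(candidate):
            return candidate
    return None


# Hypothetical stand-in for the CDR: a handful of clean records.
cdr = [
    {"farm_type": "grain", "total_acres": 500, "cropland_acres": 450, "corn_acres": 200},
    {"farm_type": "grain", "total_acres": 120, "cropland_acres": 100, "corn_acres": 90},
    {"farm_type": "grain", "total_acres": 150, "cropland_acres": 110, "corn_acres": 70},
    {"farm_type": "dairy", "total_acres": 130, "cropland_acres": 60, "corn_acres": 40},
]


def pool_key(rec):
    # Hypothetical farm similarity parameter.
    return rec["farm_type"]


def passes_edits(rec):
    # Hypothetical edit: corn acreage cannot exceed cropland acreage.
    return 0 <= rec["corn_acres"] <= rec["cropland_acres"]


pools = build_donor_pools(cdr, pool_key)
incomplete = {"farm_type": "grain", "total_acres": 130,
              "cropland_acres": 80, "corn_acres": None}
result = impute(incomplete, "corn_acres", pools, pool_key,
                ["total_acres", "cropland_acres"], passes_edits)
print(result["corn_acres"])
```

In this toy data the nearest grain donor reports 90 corn acres, which would exceed the recipient's 80 cropland acres and so fails the edit; the sketch falls through to the next-nearest donor in the pool and imputes 70. The dairy record is numerically closer still but sits in a different pool, so it is never considered.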
