
Overview of ONS Coding Tools Project

ONS Classification Coding Tools Project. Occupation Classification Workshop, RSS, London, 21 June 2004. Nigel Swier. Aim: to select and ‘operationalise’ a standard tool for assigning classification codes to verbatim text responses given in answer to a question.


Presentation Transcript


  1. ONS Classification Coding Tools Project • Occupation Classification Workshop • RSS, London, 21 June 2004 • Nigel Swier

  2. Overview of ONS Coding Tools Project • Aim: To select and ‘operationalise’ a standard tool for assigning classification codes to verbatim text responses given in answer to a question • Scope: • For all classifications (except ICD10 for cause of death coding), including occupation (SOC) and industry (SIC) • Both automatic and interactive coding functionality • Development of the selected tool into a component so that it can be used within the new ONS technical architecture • Context: Part of the ONS Statistical Infrastructure Development Project (itself part of the ONS Statistical Modernisation Programme).

  3. ONS formed in 1996 • Central Statistical Office • Office of Population Censuses and Surveys (OPCS) • Employment Department => Office for National Statistics

  4. Statistical Modernisation Programme (SMP) • Inherited Infrastructure: • Multiple databases • Multiple development tools • Proliferation of statistical tools and methods • Poor metadata • Paper-based dissemination • Risky statistical systems • ONS vision: • Single repository (Oracle) • Java (J2EE) • Standard statistical tools and methods (e.g. coding tool) • Corporate metadata system • Web-based dissemination • Robust statistical systems • £75 million to deliver SMP (2003-2006)

  5. Statistical Value Chain • Data Collection • Survey design • Survey case management • Operations on Unit Data • Editing • Imputation • Coding • Operations on Aggregate Data • Time series • Tabulation • Disclosure Control • Weighting • Estimation • Dissemination • Common ONS Statistical Tools • Corporate ONS Repository for Data (CORD) • ONS Metadata Repository

  6. Benefits of Statistical Modernisation • Robust statistical systems • Automated workflow: • More rapid publishing of statistical outputs • Improved efficiency • Improved job satisfaction • Data will be a corporate resource; along with improved metadata, this will allow ONS to leverage greater value from its data holdings • Reduced licensing and IT support costs • Reduced staff training costs and easier transferability of staff

  7. Evaluation criteria • Functionality • Automatic and interactive coding • Able to handle simple and complex classifications • Dependent coding • Performance (coding/agreement rates) • Technical (fit with new ONS technical environment) • Supplier support • Impact on ONS outputs
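
The slide names coding and agreement rates as the performance measures but does not define them. The sketch below assumes the common reading: the coding rate is the share of records that the tool codes automatically, and the agreement rate is the share of those automatically coded records whose code matches a reference (for example, manually assigned) code. The function names and sample codes are hypothetical.

```python
from typing import Optional, Sequence


def coding_rate(auto_codes: Sequence[Optional[str]]) -> float:
    """Share of records that received an automatic code (None = not coded)."""
    if not auto_codes:
        return 0.0
    coded = sum(1 for code in auto_codes if code is not None)
    return coded / len(auto_codes)


def agreement_rate(auto_codes: Sequence[Optional[str]],
                   reference_codes: Sequence[str]) -> float:
    """Among automatically coded records, share that agree with the reference."""
    pairs = [(a, r) for a, r in zip(auto_codes, reference_codes) if a is not None]
    if not pairs:
        return 0.0
    agreed = sum(1 for a, r in pairs if a == r)
    return agreed / len(pairs)


# Hypothetical test batch: four records, one left uncoded by the tool.
auto = ["A", None, "B", "C"]
reference = ["A", "B", "B", "D"]
print(coding_rate(auto), agreement_rate(auto, reference))
```

Reporting the two rates together matters because loosening the match thresholds usually raises the coding rate while lowering agreement.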

  8. Evaluating and selecting the tool • Started (in earnest) January 2003 • Establish detailed evaluation criteria • Investigate tools and identify a shortlist (ACTR, PDC) • Obtain software, prepare knowledge bases for testing, prepare test data • Testing (automatic coding performance) • Analysis of results • Evaluate supplier comments and tool functionality • Compilation of scores • Final Report (completed December 2003) => recommendation to select ACTR

  9. ACTR - the selected tool • Automated Coding by Text Recognition • Developed by Statistics Canada • Used by Lockheed Martin for the Census 2001 Processing System • Automatic and interactive coding • Consists of coding engine and maintenance tools; customer builds and tunes the coding index • Generic: Can code a range of classifications • Flexible: Allows different coding strategies, thresholds • Has API and has been ported to UNIX/Windows • Multiple coding databases • Dependent coding using filters • Powerful parsing capabilities
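
Dependent coding means the code assigned to one response depends on the value of another variable, for example an occupation description interpreted in the light of the respondent's industry. The slides do not show how ACTR filters are configured, so the following is only a sketch of the idea: the index layout, the `industry` filter field, and all codes are hypothetical.

```python
from typing import Optional

# Hypothetical index: the same parsed text maps to different codes
# depending on a filter variable. A filter of None marks the generic entry.
FILTERED_INDEX = [
    {"text": "fitter", "industry": "gas", "code": "X1"},
    {"text": "fitter", "industry": "carpet", "code": "X2"},
    {"text": "fitter", "industry": None, "code": "X0"},
]


def code_with_filter(text: str, industry: Optional[str] = None) -> Optional[str]:
    """Prefer an entry whose filter matches; otherwise use the generic entry."""
    candidates = [e for e in FILTERED_INDEX if e["text"] == text]
    for entry in candidates:
        if industry is not None and entry["industry"] == industry:
            return entry["code"]
    for entry in candidates:
        if entry["industry"] is None:
            return entry["code"]
    return None  # no entry for this text at all


print(code_with_filter("fitter", "gas"))
print(code_with_filter("fitter"))
```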

  10. Parsing • Manipulation of text using global rules • Normalise, or reduce variation in, text • Tune the coding application • Examples: • Replace/delete string • Replace/delete word (synonym list) • Delete clause • Applied to both reference files (i.e. the coding index) and input files • Parsing data + coding index = knowledge base
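
The three rule types on this slide can be sketched as a single normalisation function applied identically to index entries and to input responses. This is an illustration only, not ACTR's parsing file format: the rule tables, the `parse` function, and the sample texts are all assumptions.

```python
import re

# Hypothetical rule tables (not taken from any real ACTR parsing file).
CLAUSE_PATTERNS = [re.compile(r"\([^)]*\)")]           # delete clause: bracketed text
STRING_REPLACEMENTS = {"&": " and ", "-": " "}         # replace/delete string
WORD_REPLACEMENTS = {                                   # replace/delete word
    "mgr": "manager",       # synonym list entry
    "asst": "assistant",    # synonym list entry
    "the": "",              # empty replacement deletes the word
    "of": "",
}


def parse(text: str) -> str:
    """Normalise text with global rules to reduce variation before matching."""
    text = text.lower()
    for pattern in CLAUSE_PATTERNS:
        text = pattern.sub(" ", text)
    for old, new in STRING_REPLACEMENTS.items():
        text = text.replace(old, new)
    words = []
    for word in re.findall(r"[a-z0-9]+", text):
        word = WORD_REPLACEMENTS.get(word, word)
        if word:
            words.append(word)
    return " ".join(words)


# The same rules run over the coding index and over incoming responses,
# so both sides of the later match are in the same normalised form.
raw_index = {"Assistant Manager": "M1"}  # hypothetical code
knowledge_base = {parse(entry): code for entry, code in raw_index.items()}
print(knowledge_base.get(parse("Asst. Mgr (retail)")))
```

Because both the index entry and the response reduce to the same string, the abbreviated response finds the entry without any fuzzy matching.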

  11. ACTR matching algorithm • Matching always follows parsing. • Step 1: Find direct matches and assign codes • Step 2: Find indirect matches (using Hellerman algorithm) • match scores based on word frequencies across index • unmatched words ignored (although more unmatched words lower the score) • no fuzzy matching (except through parsing rules) • Step 3: Assign codes based on user-defined match parameters.
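
The three steps can be sketched as follows. The scoring here is a simple inverse-frequency word weighting used as a stand-in; it is not the Hellerman formula, which the slides do not give. The index, the codes, and the `min_score` and `min_gap` parameters are hypothetical, and the input is assumed to have been parsed already.

```python
import math
from collections import Counter

# Hypothetical parsed index: phrase -> code.
INDEX = {
    "council clerk": "C1",
    "bank clerk": "C2",
    "primary school teacher": "T1",
    "secondary school teacher": "T2",
}


def word_weights(index):
    """Words that are rare across the index carry more weight."""
    n = len(index)
    counts = Counter(w for phrase in index for w in set(phrase.split()))
    return {w: math.log(1 + n / c) for w, c in counts.items()}


def score(input_words, entry_words, weights, unseen_weight):
    """Weight of shared words over weight of all words; unmatched words lower it."""
    shared = input_words & entry_words
    matched = sum(weights[w] for w in shared)
    total = sum(weights.get(w, unseen_weight) for w in input_words | entry_words)
    return matched / total if total else 0.0


def code_text(text, index=INDEX, min_score=0.6, min_gap=0.1):
    # Step 1: direct match on the whole parsed string.
    if text in index:
        return index[text], 1.0, "direct"
    # Step 2: indirect match, scoring every index entry by weighted word overlap.
    weights = word_weights(index)
    unseen_weight = math.log(1 + len(index))  # words absent from the index
    words = set(text.split())
    ranked = sorted(
        ((score(words, set(p.split()), weights, unseen_weight), c)
         for p, c in index.items()),
        reverse=True,
    )
    if not ranked:
        return None, 0.0, "uncoded"
    best_score, best_code = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    # Step 3: accept only a clear winner under the user-defined parameters.
    if best_score >= min_score and best_score - runner_up >= min_gap:
        return best_code, best_score, "indirect"
    return None, best_score, "uncoded"


for response in ["council clerk", "clerk council", "school teacher"]:
    print(response, code_text(response))
```

With this toy index, "clerk council" is coded indirectly to the same entry as "council clerk" because word order is ignored, while "school teacher" stays uncoded because two entries tie. Uncoded responses would be the natural candidates for interactive coding.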

  12. Building knowledge base for SOC 2000 • Based on SOC 2000 index • Obtain test/tuning data (Census 1991 recoded descriptions) • Development of parsing strategy • Iterative development • Index partitioned into 2 ‘contexts’ • Main index entries • Default index

  13. ACTR shortcomings • Non-linguistic, ignores word order (e.g. “Clerk to the Council” is not equivalent to “Council Clerk”) • No “fuzzy matching” (although particular cases of missing spaces and misspellings can be handled through parsing) • Longer text strings difficult to code automatically • No classifications mapping facility

  14. Next steps? • Short term: Building knowledge bases • Medium term: Implementing ACTR in individual business areas: • ASHE (Earnings) for coding occupation in April 2005 • IDBR (Industry) • Medium/Long term: “Operationalising” ACTR in the new ONS environment, including CORD etc.

  15. The End
