using ocr for census data capture in china national bureau of statistics of china n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Using OCR for Census Data Capture in China National Bureau of Statistics of China PowerPoint Presentation
Download Presentation
Using OCR for Census Data Capture in China National Bureau of Statistics of China

Loading in 2 Seconds...

play fullscreen
1 / 17

Using OCR for Census Data Capture in China National Bureau of Statistics of China - PowerPoint PPT Presentation


  • 91 Views
  • Uploaded on

Using OCR for Census Data Capture in China National Bureau of Statistics of China. Background. 5 population censuses have been conducted in 1953, 1964, 1982, 1990, 2000 respectively 1953,1964 census: manual tabulation Since 1982 census, using computer for data process.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Using OCR for Census Data Capture in China National Bureau of Statistics of China' - colt-nelson


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
background
Background
  • 5 population censuses have been conducted in 1953, 1964, 1982, 1990, 2000 respectively
  • 1953,1964 census: manual tabulation
  • Since 1982 census, using computer for data process.
  • 1982,1990 census, manual data entry
  • 2000 census, using OCR for data capture
  • 2006, the second Agriculture census: also use OCR for data capture

2000 population census and 2006 agriculture census are the two cases of OCR use for large-volume data capture

two cases of ocr for large volume census data capture
Two cases of OCR for large-volume Census Data Capture
  • The data capture of 2000 Population Census
    • Census reference time: Nov. 1,2000
    • Data Capture cycle:Jan. -- June., 2001.(6 months)
    • Scale:
      • Types of Census Form:4
        • Short Form: 49 Census items, 90% HH, 360 million

A4 size double sheets in total

        • Long Form: 95 Census items, 10% HH, 40 million

A3 double sheets in total

        • Other Forms: Death pop, temporary residents. 10

million A4 double sheets in total

      • Original Census data:About 64 GB
      • Image volume: 5.5TB
two cases of ocr for large volumes census data capture
Two cases of OCR for large-volumes Census Data Capture
  • The Second National Agricultural Census
    • Census reference time: end of 2006
    • Data Capture cycle :April to mid-July, 2007,100 days
    • Scale:
      • Types of Census Form:8
      • Total census items:541
      • Total agricultural Families:250 millions
      • Total Census Forms:about 500 million pieces of paper
      • Original Census data :about 300GB
      • Image data :40TB
organizational structure for data process
Organizational Structure for data process

Data process

Data

Capture

editing

NBS

Province

Coding

31

Checked

& packed

Prefecture

Checked

& packed

340

Checked

& packed

1

3

1

County

Data capture was decentralized at prefecture offices

Town

2847

Village

40000

0.9 million

EA

5 million

function framework of ocr data capture 2006 agriculture census
Function framework of OCR data capture(2006 agriculture census)

User management

System Initialization

System management

Log management

Space management

Sys management

Address base

Client management

task management

Archiving management

process management

QA

Scanner self-inspection

Forms check

scanning

Batchform scan

Add scan

Alternative scan

Repeat scan

System Functions

Check scan

Chinese character

numeric data

English character

OCR

Special character

Chinese characteredit

Generation census form management ID

checkup

numeric data edit

Editing and checkup

English character edit

Special character edit

Generation image management ID

Backup

Restore

Delete

Data management

Browse

Input

Output

Backup

Restore

Delete

Enquiry

Browse

Image management

Import

export

Image merger

Image reported

Receiving file

Statistics summary

Progress monitoring

Information display

the process of ocr data capture
The Process of OCR data capture
  • The scanning module generates image files and transmits them to image management module and also transmits the status information to task management module.
  • The task management module executes task distribution according to the state of vacancy of each OCR clients.
slide8

The Process of OCR data capture

  • The OCR module performs recognition of numerical data and Chinese characters and transmits the data and Chinese characters to data management module and transmits the status information to task management module.
slide9

The Process of OCR data capture

  • The task management dispatches the data to edit module for editing. If original image is needed, corresponding image is fetched by image management module for comparison, the cleansed data after edit are returned back to data management module. when data capture work is all finished, report upward the data.
quality control
Quality Control

To ensure the quality of captured data, quality control is executed in three stages: scanning, recognizing and data editing.

  • During the process of scanning, recognizing batch cover data and scanner count, the system checks if the total page count, total household count for each batch are consistent with the results of scanning; Comparing the actual address code with address code repository, ensure that the address codes are validity, uniqueness and correctness.
  • During the recognition, collecting real time statistics for rejection ratio and suspect ratio. If rejection ratio and suspect ratio is too high, the task administrator checks the reason.
quality control1
Quality Control
  • During the process of editing, checking the consistency between recognized record count and the record count in controller document; Checking the basic logic relationship and value range; indicate the items which have mistakes in logic relationships or value ranges, recognition results and corresponding items from original scanned images are displayed comparatively in parallel windows, and convenient modification means are provided for those which need get modified.
  • After the whole set of data has been captured, quality is assured through executing sampling quality check through all phases
main problems and solutions
Main Problems and Solutions

In large-scale census data capture projects, there’re three aspects of problems we regard as the most outstanding: 1. How to enhance OCR’s recognition capability. 2. Availability and reliability of the system. 3. Project management. What we have done are:

1. Improve the capability of recognizing numeric characters

Two kinds of recognition algorithms and two kinds of recognition engines based on the two algorithms were developed, after a series of onsite test, which better suites the census project is chosen.

2. Improve the recognition capability for Chinese characters

By collecting large number of actual samples and training the recognizer, recognition capability for Chinese names is improved.

main problems and solutions1
Main Problems and Solutions

3. Improve orientation capability

Aiming at print deviation and filling deviation, smart locating algorithm has been developed which has minimized the impact of the print deviation and filling deviation.

4. Enhance efficiency of recognition

Improve the fundamental software of scanner, to achieve the best match between hardware drivers and OCR software and improve the efficiency of recognition.

5. Improve the quality of forms filling

Prescribe the filling standards for form filling so that OCR error rate will be reduced, meanwhile rejection rate could also be reduced.

main problems and solutions2
Main Problems and Solutions

6. Establish regulation, working guidance and processes to make every data entry site to execute work following uniform regulations, processes and standards.

7. Strengthen the training. we organized centralized training and on-site training for the users. Lecturing and actual operations are combined during centralized training, through the combination of these two ways, the familiarity with the system has get deepened.

8. Organize multi-target pilot. We organized multiple pilots in many locations aiming at different targets.

lessons learned
Lessons Learned
  • Using advanced technology to raise efficiency
  • Combining technical and administrative methods to resolve quality problems and security issues
  • Choose partners with the higher capability of system development and service
  • Early project preparation
  • Manage project with partners
  • Training, pilot projects and management is the key to success
  • Control the printing quality of the census forms and census data filling quality
  • Project change control
prospect of the 2010 population census
Prospect of the 2010 Population Census

Census time: Nov. 1, 2010

  • Short form and long form, death population form
  • Foreigners living in China are considered to be enumerated

Data capture in 2011

  • OCR data capture will be the main data entry method
  • Modifying the existing system of agricultural census and make some innovate
  • Adding more OCR equipments