
Presentation Transcript


  1. Census Bureau DRIS Date: 01/23/2007

  2. Oracle conversion example: Initial approaches for UCM source data

  3. Index • Data Modeling • Current Datafile • Current Dataload • Data Overview • Two Approaches • First Approach • Data Distribution • Advantages • Disadvantages

  4. Second Approach • Basic Modeling • Advantages • Advance Work • Care Needed • Our Recommendation • Tasks • Next Steps • Questions?

  5. Data modeling • Conversion of data from legacy (Fortran) to an RDBMS (Oracle) • Hardware/software: Sun E6900, Solaris 5.10, 12 CPUs, 96 GB RAM • Database: Oracle 10g • Modeling tools: Oracle Designer / ERwin

  6. Current datafile [Flow diagram; labels: Geo Base Data, big datafile, Census legacy process, reports, data feeds, data updates, data modeling, Oracle DB; tooling: PL/SQL, shell, C, ETL tool]

  7. Current Dataload • UCM data • Fortran format • One big file with 180 M records • Record length is 1543 bytes • Most of the fields are varchar2 • Many fields are blank/no data • Performance in Oracle is inadequate without a schema redesign that leverages RDBMS capabilities
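
  For loading a fixed-length file like this into a staging table, Oracle's SQL*Loader is the usual tool. The control file below is a minimal sketch only: the table name ucm_staging and the field names and byte positions are hypothetical stand-ins, since the real 1543-byte record layout is not shown here.

      LOAD DATA
      INFILE 'ucm_source.dat' "fix 1543"
      APPEND
      INTO TABLE ucm_staging
      (
        mafid     POSITION(1:12)   CHAR,  -- hypothetical: unique address ID
        state_cd  POSITION(13:14)  CHAR,  -- hypothetical: state FIPS code
        addr_line POSITION(15:114) CHAR   -- hypothetical: one of many text fields
        -- remaining fields continue out to byte 1543
      )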

  8. Data Overview (approx.) • State of NY: 20 M records, 31 GB • State of CA: 34 M records, 52 GB • State of TX: 25 M records, 38 GB • District of Columbia: 500 K records, 750 MB • Delaware: 1 M records, 1.5 GB • Connecticut: 1 M records, 1.5 GB

  9. Two approaches • First Approach: break the datafile on the basis of the data • E.g. RO level (12) • State level (54-56, including DC, Puerto Rico, etc.) • Second Approach: break the datafile into multiple tables with changed field definitions, using a relational model

  10. First approach: break the datafile on the basis of the data [Diagram: current datafile split into one table per state: Table_CA, Table_NY, Table_XX, Table_YY, Table_54]
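
  Under this approach the split itself is straightforward, e.g. one CREATE TABLE ... AS SELECT per state. A minimal sketch, assuming the ucm_staging table and state_cd column from the earlier load example (both hypothetical names):

      -- One table per state; repeated 54-56 times with different codes
      CREATE TABLE table_ca AS
        SELECT * FROM ucm_staging WHERE state_cd = '06';  -- '06' is the FIPS code for CA

      CREATE TABLE table_ny AS
        SELECT * FROM ucm_staging WHERE state_cd = '36';  -- '36' is the FIPS code for NY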

  11. Data distribution • Uneven data distribution • Big data tables will be 30+ GB • Small data tables will be under 1 GB

  12. Advantages of this kind of segmenting/partitioning: • State-level queries will be faster than they are currently • If the data is separated by RO instead, it will be more evenly distributed across fewer tables (about 12 instead of 54-56)
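
  Oracle 10g can also deliver this kind of per-state segmenting inside a single table via LIST partitioning, which prunes queries to one partition without multiplying tables. A sketch under the same hypothetical column names as above:

      CREATE TABLE ucm_by_state (
        mafid     NUMBER(12),
        state_cd  VARCHAR2(2),
        addr_line VARCHAR2(100)
      )
      PARTITION BY LIST (state_cd) (
        PARTITION p_ca    VALUES ('06'),
        PARTITION p_ny    VALUES ('36'),
        PARTITION p_other VALUES (DEFAULT)  -- catch-all for the remaining states
      );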

  13. Disadvantages • Too many tables • Many fields are empty yet declared varchar2(100) • No normalization: queries on the small tables will run fast, but the big tables carry a lot of overhead • Existing queries need extensive changes • Operational tables will remain the same in number • Queries become too complicated; joining the main and operational tables may confuse users

  14. Second approach: break the datafile into a few relational tables with changed field definitions [Diagram: current datafile split into Table1, Table2, Table3, and Table4, all related through the MAFID key]

  15. Basic Modeling • Database design, logical and physical • Relations will be defined on a primary key • In this case it will be MAFID, which is unique • The varchar2(100) fields can be converted to smaller fields based on actual field lengths • Every source field will be mapped to at least one field in the new tables • Data will be inserted into multiple efficient tables based on an updated data model that follows relational database design principles
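
  As a concrete illustration of this relational split, the DDL below keys every table on MAFID, per the slide; the table and column names (and the smaller field sizes) are hypothetical examples, not the actual UCM schema:

      CREATE TABLE address_base (
        mafid     NUMBER(12)  PRIMARY KEY,  -- the unique key named on the slide
        state_cd  VARCHAR2(2) NOT NULL,
        zip       VARCHAR2(5)               -- sized to actual data, not varchar2(100)
      );

      CREATE TABLE address_detail (
        mafid      NUMBER(12) REFERENCES address_base (mafid),
        house_num  VARCHAR2(10),
        street     VARCHAR2(40)
      );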

  16. Advantages • Faster queries, updates, deletes, and additions • Less maintenance • The same approach can be used for transactional/operational data

  17. Advance work • Identify each and every field of the UCM data • Check/define the length of each field • Map every field to the new schema • Decide whether some fields can be merged • Identify and remove duplicate data elements in the model • Define tables and relationships and create the new schema • Break up and load the data into these tables
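
  Checking actual field lengths can be done with simple profiling queries once the staging table is loaded; a sketch, again with hypothetical names (street, ucm_staging):

      -- Longest value actually stored, and how many rows leave the field blank
      -- (in Oracle an empty string is NULL, so COUNT(*) - COUNT(col) counts blanks);
      -- the results drive the column sizes in the new schema
      SELECT MAX(LENGTH(street))      AS max_len_street,
             COUNT(*) - COUNT(street) AS blank_street_rows
        FROM ucm_staging;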

  18. Care needed • The current datafile will be broken into multiple datafiles for data processing • Load the datafiles into the tables one by one • Test and demonstrate the completeness of the new model • Craft a comparison to prove that the source and the new schema both properly include all Census data
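
  One way to craft that comparison is with row counts plus a MINUS check in both directions; a sketch against the hypothetical tables above:

      -- Counts must agree between the source staging table and the new model
      SELECT (SELECT COUNT(*) FROM ucm_staging)  AS src_rows,
             (SELECT COUNT(*) FROM address_base) AS new_rows
        FROM dual;

      -- Any row returned here was lost or altered in the conversion
      -- (swap the operands to find rows the new model added)
      SELECT mafid, state_cd FROM ucm_staging
      MINUS
      SELECT mafid, state_cd FROM address_base;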

  19. Our Recommendation • ** Second Approach ** • Why? • Data distribution will be uniform • Unwanted data is moved out into separate tables • This reduces the query overhead of any updates • Existing queries can be used with few modifications • Ongoing data maintenance will be more efficient in an RDBMS • Additional data such as RPS can easily be uploaded using the same queries

  20. Tasks • Design the database using a data modeling tool (Oracle Designer, ERwin, etc.) • Create test data from the original datafile • Load the test data into database tables • Create test scripts to check data consistency • Check indexes for the required queries • Test old data vs. new data
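
  Checking indexes for the required queries can lean on EXPLAIN PLAN; a sketch using the hypothetical address_base table:

      CREATE INDEX ix_address_state ON address_base (state_cd);

      -- Confirm the optimizer actually uses the index for a typical state query
      EXPLAIN PLAN FOR
        SELECT COUNT(*) FROM address_base WHERE state_cd = '36';

      SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);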

  21. Continued… • Break the data into small files • Load the full data into the tables • Unit-test the data for consistency • Run queries on the database • If needed, fine-tune the database • Use the same approach for transactional data such as RPS data

  22. Next steps… • Continued collaboration with Census team to improve domain understanding for new team members • Access to Oracle database tools on team’s workstations • Access to operational Oracle instance to begin development of approach

  23. Questions?
