
Presentation Transcript


  1. Census Bureau DRIS Date: 01/23/2007

  2. Oracle conversion example: Initial approaches for UCM source data

  3. Index • Data Modeling • Current Datafile • Current Dataload • Data Overview • Two Approaches • First Approach • Data Distribution • Advantages • Disadvantages

  4. Second Approach • Basic Modeling • Advantages • Advance Work • Care Needed • Our Recommendation • Tasks • Next Steps • Questions?

  5. Data modeling • Conversion of data from legacy (Fortran) to an RDBMS (Oracle) • Hardware/software: Sun E6900, Solaris 5.10, 12 CPUs, 96 GB RAM • Database: Oracle 10g • Modeling tools: Oracle Designer / ERwin

  6. Current datafile [Flow diagram; labels: Geo Base Data, big datafile, Census legacy process, reports, data feeds, data updates, data modeling, Oracle DB; tooling: PL/SQL, shell, C, ETL tool]

  7. Current Dataload • UCM data • Fortran format • One big file with 180 M records • Record length is 1543 bytes • Most of the fields are varchar2 • Many fields are blank/no data • Performance in Oracle is inadequate without a schema redesign that leverages RDBMS capabilities
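
  For loading a fixed-length file like this into a staging table, Oracle's SQL*Loader is the usual tool. The control file below is a minimal sketch only: the table name ucm_staging and the field names and byte positions are hypothetical stand-ins, since the real 1543-byte record layout is not shown here.

      LOAD DATA
      INFILE 'ucm_source.dat' "fix 1543"
      APPEND
      INTO TABLE ucm_staging
      (
        mafid     POSITION(1:12)   CHAR,  -- hypothetical: unique address ID
        state_cd  POSITION(13:14)  CHAR,  -- hypothetical: state FIPS code
        addr_line POSITION(15:114) CHAR   -- hypothetical: one of many text fields
        -- remaining fields continue out to byte 1543
      )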

  8. Data Overview (approx.) • State of NY: 20 M records, 31 GB • State of CA: 34 M records, 52 GB • State of TX: 25 M records, 38 GB • District of Columbia: 500 K records, 750 MB • Delaware: 1 M records, 1.5 GB • Connecticut: 1 M records, 1.5 GB

  9. Two approaches • First Approach: break the datafile on the basis of the data • E.g. RO level (12) • State level (54-56, including DC, Puerto Rico, etc.) • Second Approach: break the datafile into multiple tables with changed field definitions, using a relational model

  10. First approach: break the datafile on the basis of the data [Diagram: current datafile split into one table per state: Table_CA, Table_NY, Table_XX, Table_YY, Table_54]
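
  Under this approach the split itself is straightforward, e.g. one CREATE TABLE ... AS SELECT per state. A minimal sketch, assuming the ucm_staging table and state_cd column from the earlier load example (both hypothetical names):

      -- One table per state; repeated 54-56 times with different codes
      CREATE TABLE table_ca AS
        SELECT * FROM ucm_staging WHERE state_cd = '06';  -- '06' is the FIPS code for CA

      CREATE TABLE table_ny AS
        SELECT * FROM ucm_staging WHERE state_cd = '36';  -- '36' is the FIPS code for NY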

  11. Data distribution • Uneven data distribution • Big data tables will be 30+ GB • Small data tables will be under 1 GB

  12. Advantages of this kind of segmenting/partitioning: • State-level queries will be faster than they are currently • If the data is separated by RO instead, it will be more evenly distributed across fewer tables (about 12 instead of 54-56)
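
  Oracle 10g can also deliver this kind of per-state segmenting inside a single table via LIST partitioning, which prunes queries to one partition without multiplying tables. A sketch under the same hypothetical column names as above:

      CREATE TABLE ucm_by_state (
        mafid     NUMBER(12),
        state_cd  VARCHAR2(2),
        addr_line VARCHAR2(100)
      )
      PARTITION BY LIST (state_cd) (
        PARTITION p_ca    VALUES ('06'),
        PARTITION p_ny    VALUES ('36'),
        PARTITION p_other VALUES (DEFAULT)  -- catch-all for the remaining states
      );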

  13. Disadvantages • Too many tables • Many fields are empty yet declared varchar2(100) • No normalization: queries on the small tables will run fast, but the big tables carry a lot of overhead • Existing queries need extensive changes • Operational tables will remain the same in number • Queries become too complicated; joining the main and operational tables may confuse users

  14. Second approach: break the datafile into a few relational tables with changed field definitions [Diagram: current datafile split into Table1, Table2, Table3, and Table4, all related through the MAFID key]

  15. Basic Modeling • Database design, logical and physical • Relations will be defined on a primary key • In this case it will be MAFID, which is unique • The varchar2(100) fields can be converted to smaller fields based on actual field lengths • Every source field will be mapped to at least one field in the new tables • Data will be inserted into multiple efficient tables based on an updated data model that follows relational database design principles
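
  As a concrete illustration of this relational split, the DDL below keys every table on MAFID, per the slide; the table and column names (and the smaller field sizes) are hypothetical examples, not the actual UCM schema:

      CREATE TABLE address_base (
        mafid     NUMBER(12)  PRIMARY KEY,  -- the unique key named on the slide
        state_cd  VARCHAR2(2) NOT NULL,
        zip       VARCHAR2(5)               -- sized to actual data, not varchar2(100)
      );

      CREATE TABLE address_detail (
        mafid      NUMBER(12) REFERENCES address_base (mafid),
        house_num  VARCHAR2(10),
        street     VARCHAR2(40)
      );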

  16. Advantages • Faster queries, updates, deletes, and additions • Less maintenance • The same approach can be used for transactional/operational data

  17. Advance work • Identify each and every field of the UCM data • Check/define the length of each field • Map every field to the new schema • Decide whether some fields can be merged • Identify and remove duplicate data elements in the model • Define tables and relationships and create the new schema • Break up and load the data into these tables
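
  Checking actual field lengths can be done with simple profiling queries once the staging table is loaded; a sketch, again with hypothetical names (street, ucm_staging):

      -- Longest value actually stored, and how many rows leave the field blank
      -- (in Oracle an empty string is NULL, so COUNT(*) - COUNT(col) counts blanks);
      -- the results drive the column sizes in the new schema
      SELECT MAX(LENGTH(street))      AS max_len_street,
             COUNT(*) - COUNT(street) AS blank_street_rows
        FROM ucm_staging;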

  18. Care needed • The current datafile will be broken into multiple datafiles for data processing • Load the datafiles into the tables one by one • Test and demonstrate the completeness of the new model • Craft a comparison to prove that the source and the new schema both properly include all Census data
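
  One way to craft that comparison is with row counts plus a MINUS check in both directions; a sketch against the hypothetical tables above:

      -- Counts must agree between the source staging table and the new model
      SELECT (SELECT COUNT(*) FROM ucm_staging)  AS src_rows,
             (SELECT COUNT(*) FROM address_base) AS new_rows
        FROM dual;

      -- Any row returned here was lost or altered in the conversion
      -- (swap the operands to find rows the new model added)
      SELECT mafid, state_cd FROM ucm_staging
      MINUS
      SELECT mafid, state_cd FROM address_base;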

  19. Our Recommendation • ** Second Approach ** • Why? • Data distribution will be uniform • Unwanted data is moved out into separate tables • This reduces the query overhead of any updates • Existing queries can be used with few modifications • Ongoing data maintenance will be more efficient in an RDBMS • Additional data such as RPS can easily be uploaded using the same queries

  20. Tasks • Design the database using a data modeling tool (Oracle Designer, ERwin, etc.) • Create test data from the original datafile • Load the test data into database tables • Create test scripts to check data consistency • Check indexes for the required queries • Test old data vs. new data
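
  Checking indexes for the required queries can lean on EXPLAIN PLAN; a sketch using the hypothetical address_base table:

      CREATE INDEX ix_address_state ON address_base (state_cd);

      -- Confirm the optimizer actually uses the index for a typical state query
      EXPLAIN PLAN FOR
        SELECT COUNT(*) FROM address_base WHERE state_cd = '36';

      SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);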

  21. Continued… • Break the data into small files • Load the full data into the tables • Unit-test the data for consistency • Run queries on the database • If needed, fine-tune the database • Use the same approach for transactional data such as RPS data

  22. Next steps… • Continued collaboration with Census team to improve domain understanding for new team members • Access to Oracle database tools on team’s workstations • Access to operational Oracle instance to begin development of approach

  23. Questions?
