The Statistical Administrative Records System and Administrative Records Experiment 2000: System Design, Successes, and Challenges Dean H. Judson Planning, Research and Evaluation Division U.S. Census Bureau
Outline of Presentation • General principles for using administrative records properly • Overview of StARS/AREX history, goals and design • Applications and evaluations: StARS 1999 and StARS 2000 versus Census 2000
How Administrative Records Are Created and Used
• Policy changes which change the definition of events and objects
• “Ontologies” and thresholds for observation
• Data collection
• Data entry errors and coding schemes
• Data management issues
• Query structure and spurious structure
Some Important Principles • Database ≠ Population • Database ≠ Truth • The “true” Data exist in the “real world”, as does the “true” Population. • But the database gives us information that points to the Truth, and points to the Population.
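[Figure: two diagrams contrasting a target population with a database population.
(1) Resident U.S. Population on April 1, 2000 versus Population in StARS Database; the database also holds accidental duplication, non-U.S. residents, and deceased persons.
(2) “Current” employees of Company X, October 1, 2001 versus Population in Employee Database; the database also holds accidental duplication, terminated employees not yet entered in the database, and accidentally included contractors.]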
Ontologies and Data Quality
[Figure: four panels, each mapping “real world” states to database states: Proper Representation, Incomplete Representation, Ambiguous Representation, Meaningless States.]
Data Quality: the function that maps from the “real world” to the database allows one to reconstruct the “real world” from the database values. Source: Wand and Wang, 1996:90
Coverage versus Intensity/Content: How can we get the best of both?
[Figure: two-axis chart. Horizontal axis: Coverage of Target Population (Low to High). Vertical axis: Intensity/Content of Data Collection (Low to High). Plotted items: a careful, well-done sample survey, and Administrative Records/Data Warehouse.]
A Model for “Borrowing Strength”
• Original DW Database (X)
• Representative sample of X, with “ground truth” carefully collected data (Y)
• Estimated model: Y = f(X)
• Augmented DW Database, with X and estimated Y’s
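The “borrowing strength” idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the warehouse values, the simulated “ground truth” relationship (Y = 2 + 3X plus noise), and the choice of a straight line for f are all hypothetical and do not come from the presentation.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (a stand-in for f)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

random.seed(0)

# Hypothetical data warehouse: every record has X, none has Y.
warehouse_x = [random.uniform(0, 10) for _ in range(1000)]

# Draw a simple random sample and "carefully collect" Y for it.
# The ground truth is simulated here; in practice it would come from fieldwork.
sample_idx = random.sample(range(len(warehouse_x)), 100)
sample_x = [warehouse_x[i] for i in sample_idx]
sample_y = [2.0 + 3.0 * x + random.gauss(0, 0.5) for x in sample_x]

# Estimate Y = f(X) on the sample only.
intercept, slope = fit_line(sample_x, sample_y)

# Augment every warehouse record with an estimated Y.
augmented = [(x, intercept + slope * x) for x in warehouse_x]
```

The sample carries the expensive, high-content measurement; the model spreads it over the high-coverage database. Any real application would need a richer model and an assessment of its error.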
Statistical Administrative Records System and Administrative Records Experiment
Background and History • Statistical Administrative Records System • Six large Federal input files: IRS 1040, IRS 1099, Selective Service, Medicare, Indian Health Service, HUD-TRACS/MTCS • One lookup file: SSA/Census NUMIDENT • AREX 2000 • Attempt to use StARS data to simulate administrative records census
What Was the Purpose of StARS 1999 and AREX 2000? • Test the feasibility of an administrative records census • StARS: Nationwide • AREX: two counties in Maryland, three in Colorado • MD: 1.4M persons in 558K households • CO: 1.2M persons in 459K households • Test two methods for conducting an administrative records census • top-down method • bottom-up method (match to address list, additional operations)
Can We Do This? • Title 13, U.S. Code (§6(a)-(c), abridged): • “The Secretary…may call upon any other department…of the Federal Government…for information pertinent to the work provided for in this title…To the maximum extent possible, the Secretary…shall use [such] information instead of conducting direct inquiries” • Privacy Act, 1974 (Title 5 §552a, abridged): • “No agency shall disclose any record…unless…to the Bureau of the Census for purposes of planning or carrying out a census or survey or related [title 13] activity” • “Each agency that maintains a system of records shall…publish in the Federal Register upon establishment…the existence and character of the system of records” (StARS notice published in the Federal Register, January 1999)
The Statistical Administrative Records System-1999 (processing flow, record counts)
• Input files: TY98 IRS 1040 119,946,193; TY98 IRS 1099 598,075,971; Medicare 56,837,022; Selective Service 13,176,234; HUD TRACS 3,342,234; Indian Health Service 3,106,821
• Edited files: Edited IRS 1040 243,260,776; Edited IRS 1099; Edited Medicare; Edited Selective Service; Edited HUD TRACS; Edited Indian Health Service (no counts shown for the last five)
• Lookup file: NUMIDENT 676,589,439 → Person Characteristics File (PCF) 396,185,872 (Census NUMIDENT 396,185,872)
• Address Processing 795,742,702 → Hygiene & Unduplication 136,154,293 → Geocoding (TIGER, Code 1, ABI): 102,965,122 coded (75.6%), 33,189,171 uncoded (24.4%)
• Person Processing 875,750,973 → SSN Validation (PVS): 844,945,296 valid (96.5%), 30,805,677 invalid SSNs (3.5%) → Unduplication 279,601,038 → Remove Deceased/Create Composite Record 257,764,909
• Models applied: Gender Model, Mortality Model, Race Model
• Outputs: Research; Extraction of AREX Test Site Records (1,459,760 in Baltimore Site; 1,229,274 in Colorado Site)
Statistical Administrative Records System-2000 (DRAFT) (processing flow, record counts)
• Input files → edited files: TY99 IRS IMF 124,729,862 → Edited IRS IMF 253,825,653; TY99 IRS IRMF 583,642,950 → Edited IRS IRMF 568,109,788; Medicare 59,198,432 → Edited Medicare 59,197,759; Selective Service 13,370,053 → Edited SSS 14,538,895; HUD TRACS 1,991,672 → Edited HUD TRACS 1,991,655; HUD MTCS 6,232,562 → Edited MTCS 6,208,615; Indian Health Service 2,730,407 → Edited IHS 2,728,548
• Lookup file: NUMIDENT 721,228,119 → Person Characteristics File (PCF) 408,447,131 (Census NUMIDENT 408,447,131)
• Address Processing 725,230,009 → Hygiene & Unduplication 158,593,956 → Geocoding (TIGER/MAF, Code 1, ABI) 125,647,359
• Person Processing 905,432,071 → SSN Validation 895,196,891 (Invalid SSNs 10,235,180) → Unduplication 289,968,449 → Remove Deceased/Create Composite Record 265,950,850
• Models applied: Race Model, Mortality Model, Gender Model
Administrative Records Experiment in 2000 (AREX 2000) • Five selected sites in Maryland and Colorado • MD: Baltimore city, Baltimore county • CO: El Paso county, Douglas county, Jefferson county • Attempt to simulate an Administrative Records Census • Not all aspects of an Administrative Records Census are simulated • Group Quarters survey • Coverage measurement survey • Special operations not included in StARS • Request for physical address (PO boxes/Rural Routes) • Clerical hand geocoding • Field verification of addresses not matched to DMAF
AREX 2000 Evaluations • Process: Analyzing selected components of the AREX implementation processing • Outcomes: Block level analysis: Age/Race/Sex/Hispanicity comparisons to Census 2000 • Household level analysis: • Comparing household distributions for matched addresses • Assessing the feasibility of using administrative records in lieu of a field interview to obtain data on nonresponding households • Available at www.census.gov/pred/www/rpts.html#AREX • (Synthesis of results from the Administrative Records Experiment in 2000)
Characteristics of Files Included in the StARS System • IRS Individual Master 1040 File: • Tax year data; April, 2000 refers to “tax year” 1999 • TY ’99 file arrives October, 2000 • Business entities, estates, other institutions included • ~120 million return records/year; maximum of six person records per return • Households below the filing threshold do not need to file • Late filers differ systematically from early filers • Tax Filing Unit ≠ Housing Unit: 10-20% of addresses are PO Boxes, business addresses, tax preparers (Czajka, 2000) • TY95+: SSNs of dependents requested, recorded • 0.5% of primary filer, 1.6% of secondary filer, 3.4% of dependents’ SSNs in error (Czajka, 1987) • Age, race, sex, Hispanic origin microdata not available
Characteristics of Files Included in the StARS System, cont. • IRS Information Returns Master File: • Tax year data; April, 2000 refers to “tax year” 1999 • TY ’99 file arrives October, 2000 • Business entities, estates, other institutions included • ~700 million records/year • Recipient address ≠ Housing Unit • 10-20% of addresses are PO Boxes, business addresses, tax preparers • Extremely limited microdata content: Age, race, sex, Hispanic origin microdata not available; name information often truncated • Possible source of information on undocumented persons
Characteristics of Files Included in the StARS System, cont. • Selective Service File: • Requested 4/1/99(00) file “cut date” • ~13 million records • Registration required in 1940, suspended in 1975, resumed in 1980 • Presumably, males 18-25 are required to inform SSS when they move • Females, non-immigrant aliens, hospitalized, incarcerated, and institutionalized males, and members of the armed forces are exempt • Limited microdata content: Race, Hispanic origin microdata not available • Address information may not be current
Characteristics of Files Included in the StARS System, cont. • Medicare Enrollment Database (EDB): • Requested 4/1/99(00) file “cut date”: current and historical Medicare enrollment (“Active” and “Inactive” cases) • ~40 million records at any one point in time • Recipient Address ≠ Housing Unit • Proxy recipients listed on the file (e.g., John Doe’s benefits c/o Jane Doe; John Doe’s benefits c/o nursing home) • Used in population estimates system for 65+ household population estimates • A small portion of records at any point in time almost certainly belong to deceased persons (Kim and Sater, 2000) • Coverage is high (93-102%) but not perfect and unevenly distributed geographically • “Snowbird” states appear to have lower ratios of Medicare to 65+ population than “non-snowbird” states (Kim and Sater, 2000)
Characteristics of Files Included in the StARS System, cont. • Indian Health Service patient file: • Requested 4/1/99(00) file “cut date” • ~10 million patient/transaction records • Transaction record ≠ person record • Unduplication • about 10 million patient records, 2 million unduplicated SSNs • Many missing SSNs (about 20%) • Integral part of our race model
Characteristics of Files Included in the StARS System, cont. • Housing and Urban Development Tenant Rental Assistance Certification System (HUD-TRACS/MTCS): • Requested 4/1/99(00) file “cut date” • HUD subsidy payments • TRACS 1999: ~ 3.3 million records • TRACS 2000: ~ 2 million records • Short form data for all members of household (Race/Hispanic only for head of household) • Address information may represent project or landlord address
Characteristics of Files Included in the StARS System, cont. • Census NUMIDENT File: • ~700 million transaction records → 400 million individual SSN records • Post 1985: Enumeration at birth • For each SSN: Date of birth, gender, race, place of birth • About 50-60 million persons on the file are deceased but not identified as such • No current residence information on the file • Taxpayer ID Numbers (TINs) not on the file • Demographic properties: • About 35% of SSNs on file have alternate names (marriage, divorce, etc.) • About 6% missing gender • Race coding has changed (prior to 1980, 3 races: White, Black, Other); 20% either “unknown” or “other” • About 25% of SSNs have transactions with different race codes
Creating Final StARS Database • Select best address and demographics based on • geocodability • currency • quality • Impute missing demographics (from NUMIDENT/PERSON CHARACTERISTICS FILE) • Flag records for deceased people • Final database is like the census
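One way to read the “select best address” step is as a ranking over the candidate addresses held for a person. The sketch below is hypothetical: the slide names the three criteria (geocodability, currency, quality) but not how they are combined, so the lexicographic ordering, the source-quality ranks, and the dates are assumptions made for illustration only.

```python
from datetime import date

# Hypothetical source-quality ranks (higher is better); not taken from the slides.
SOURCE_QUALITY = {"IRS 1040": 3, "Medicare": 2, "IRS 1099": 1}

def best_address(candidates):
    """Pick one address: geocodable first, then most recent, then best source."""
    return max(
        candidates,
        key=lambda c: (c["geocoded"], c["as_of"], SOURCE_QUALITY.get(c["source"], 0)),
    )

# Illustrative candidate addresses for a single person (dates are made up).
candidates = [
    {"source": "Medicare", "address": "BOX 2 RURAL ROUTE 37",
     "geocoded": False, "as_of": date(1998, 10, 14)},
    {"source": "IRS 1040", "address": "486 MAIN STREET",
     "geocoded": True, "as_of": date(1998, 4, 15)},
    {"source": "IRS 1099", "address": "486 MAIN ST",
     "geocoded": True, "as_of": date(1998, 1, 31)},
]

chosen = best_address(candidates)
print(chosen["source"], chosen["address"])
```

With this ordering a geocodable address always beats a more recent one that cannot be geocoded. That is a design choice, not something the slides specify; a production rule could weigh the criteria differently.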
Address Processing Results (StARS 1999) • Almost 800 million addresses at start • About 6 percent identified as potential businesses • 136 million address records after unduplication • About 75 percent geocoded • 85 percent geocoding rate for city-style addresses
Person Processing Results (StARS 1999) • 875 million records at start • 845 million have valid SSN record (96.5%) • 280 million after unduplication by SSN • 261 million after removal of known deceased • 257 million after removal of known deceased and persons residing in outlying territories • StARS 2000: 266 million after removal of known deceased before April 1, 2000 and persons residing in outlying territories
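The rounded figures above can be recomputed from the exact counts given on the StARS 1999 processing flow slide. The sketch below only redoes that arithmetic; the stage labels and the “records per person” ratio are added here for illustration and are not quantities reported in the presentation.

```python
# Exact counts as given on the StARS 1999 processing flow slide.
FUNNEL_1999 = [
    ("person records at start", 875_750_973),
    ("records with a valid SSN", 844_945_296),
    ("persons after unduplication by SSN", 279_601_038),
    ("persons after removing deceased (composite records)", 257_764_909),
]

def step_rates(funnel):
    """Return (label, count, share of the previous stage) for each later stage."""
    return [
        (label, count, count / previous)
        for (_, previous), (label, count) in zip(funnel, funnel[1:])
    ]

rates = step_rates(FUNNEL_1999)
valid_share = rates[0][2]

# Illustrative ratio: how many valid-SSN records collapse into one person.
records_per_person = FUNNEL_1999[1][1] / FUNNEL_1999[2][1]

for label, count, share in rates:
    print(f"{label}: {count:,} ({share:.1%} of previous stage)")
print(f"valid-SSN records per unduplicated person: {records_per_person:.2f}")
```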
Additional Operations of AREX 2000 • Clerical geocoding • Request for physical address (for P.O. Boxes, Etc.) • Match to Decennial Master Address File • Field address verification
Major Analytic Issues with StARS Processing
• Ontologies: the way in which an administrative agency “defines” the world may not match the way the Census Bureau “defines” the world, e.g.:
  • A delivery address suitable for receiving a payment check may not suffice for putting individuals at a street address
  • Difficult to distinguish individual units within the Basic Street Address
  • Race coding: Hispanic Origin is a separate race on NUMIDENT
  • Transaction data ≠ person data
  • How many names does a person have (and in what order)?
• Proxies: IRS & Medicare records
    JOHN WILSON
    C/O MARY SMITH
    1004 LAUREL LANE
    ROCKMONT, MD 22345
  The address is (presumably) for Mary Smith. John Wilson may or may not live there.
Major Analytic Issues with StARS Processing, cont.
• Addresses that are difficult to place on the ground
  • About 10% of addresses are rural style
  • PO Boxes: 45% for IHS, 9.5% for Medicare, 7.5% for IRS 1040, 6.8% for SSS, 3.8% for IRS 1099, 0.4% for HUD-TRACS (Huang and Kim, 2000)
  • 1995 IRS/CPS match: 86.5% of tax return cases had the same address as residence address, 94% coded to same county (Sater, 1995)
    John Smith
    H&R BLOCK
    P.O. BOX 12
    GREENWAY, MD 29752
• Addresses with both business and residential components
    Dean H. Judson
    JUDSON OLD GROWTH LOGGING SERVICES
    45850 BACKWOODS HIGHWAY
    BOONDOCKS, OR 96432
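A first-pass screen for hard-to-place addresses can be written with a few regular expressions. This is a rough sketch, not the StARS address-hygiene logic (which the slides do not describe): the patterns and flag names are assumptions, and the sample address blocks are the fictional examples used in this presentation.

```python
import re

# Hypothetical patterns; real address hygiene would need many more variants.
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\b", re.IGNORECASE)
RURAL_ROUTE = re.compile(r"\b(?:RURAL\s+ROUTE|RR)\s*\d+", re.IGNORECASE)
CARE_OF = re.compile(r"\bC/O\b", re.IGNORECASE)

def address_flags(address_block):
    """Return the list of warning flags raised by one multi-line address block."""
    flags = []
    if PO_BOX.search(address_block):
        flags.append("po_box")
    if RURAL_ROUTE.search(address_block):
        flags.append("rural_route")
    if CARE_OF.search(address_block):
        flags.append("proxy")
    return flags

tax_preparer = "John Smith\nH&R BLOCK\nP.O. BOX 12\nGREENWAY, MD 29752"
proxy = "JOHN WILSON\nC/O MARY SMITH\n1004 LAUREL LANE\nROCKMONT, MD 22345"
rural = "SAM SMITH\nBOX 2 RURAL ROUTE 37\nWESTPORT, VA 32784"
mixed_use = ("Dean H. Judson\nJUDSON OLD GROWTH LOGGING SERVICES\n"
             "45850 BACKWOODS HIGHWAY\nBOONDOCKS, OR 96432")

for block in (tax_preparer, proxy, rural, mixed_use):
    print(address_flags(block))
```

The mixed business/residential address raises no flag at all, which shows the limit of pattern matching: recognizing it requires a business-name list or similar outside information.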
Major Analytic Issues with StARS Processing, cont.
• Unduplication and matching
  • Addresses and personal characteristics are measured with substantial variation
  • Often not obvious whether a particular pair of records represents a duplicate or not. Yet, with multiple files, unduplication decisions must be made.
  • Address matching:
      101 Elm Rd, # 1 97132 / 101 Elm St, apt 1 97701
    versus
      101 Elm Rd, #1 97132 / 101 Elm St, apt 1 97132
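The two address pairs above differ only in whether the ZIP codes agree. A field-by-field comparison makes that explicit. The sketch below is a toy: the parser handles only the address layout used on this slide, and the simple count of agreeing fields is an assumption for illustration, not the matching method used in StARS.

```python
import re

FIELDS = ("number", "name", "suffix", "unit", "zip")

# Toy pattern for addresses shaped like "101 Elm Rd, # 1 97132".
ADDRESS = re.compile(
    r"^(\d+)\s+([A-Za-z ]+?)\s+(Rd|St|Ave|Ln)\s*,\s*(?:#|apt)\s*(\w+)\s+(\d{5})$",
    re.IGNORECASE,
)

def parse(address):
    """Split an address string into upper-cased, comparable fields."""
    match = ADDRESS.match(address.strip())
    if match is None:
        raise ValueError(f"unparseable address: {address!r}")
    return dict(zip(FIELDS, (group.upper() for group in match.groups())))

def agreement(a, b):
    """Map each field to True when the two addresses agree on it."""
    pa, pb = parse(a), parse(b)
    return {field: pa[field] == pb[field] for field in FIELDS}

def score(a, b):
    """Count agreeing fields (0 to 5)."""
    return sum(agreement(a, b).values())

pair_different_zip = ("101 Elm Rd, # 1 97132", "101 Elm St, apt 1 97701")
pair_same_zip = ("101 Elm Rd, #1 97132", "101 Elm St, apt 1 97132")

print(score(*pair_different_zip), agreement(*pair_different_zip))
print(score(*pair_same_zip), agreement(*pair_same_zip))
```

Normalizing “# 1”, “#1”, and “apt 1” to the same unit value removes cosmetic variation, leaving the street suffix and ZIP code as the substantive disagreements. Whether four agreeing fields out of five is enough to declare a duplicate is exactly the judgment the slide says must be made.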
Major Analytic Issues with StARS Processing, cont.
• Variations in data from different sources
  • Of the 50% of SSNs found on multiple files: about 1% have more than one gender recorded; about 32% have multiple addresses; about 2% have multiple races (Huang and Kim, 2000)
• “Imputation” from the NUMIDENT
  • Many files have limited microdata. For persons found on the NUMIDENT, we can “impute” microdata from the approximately equivalent NUMIDENT fields.
  • Race Model (Bye, 1998, 1999)
  • Gender Model (Thompson, 1999)
  • Mortality Model (Falkenstein, Resnick, and Judson, 2000)
• StARS 2002 “NUMIDENT Race Enhancement”
  • Match NUMIDENT to Census 2000
  • Use Census 2000 race response to improve imputation model
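In its simplest form, “imputing” from the NUMIDENT is a lookup by SSN that fills only the fields a source record lacks. The sketch below shows that direct lookup under stated assumptions: the field names, the record layout, and the SSN value are hypothetical, and the race, gender, and mortality models cited above are statistical models that this sketch does not attempt to reproduce.

```python
DEMOGRAPHIC_FIELDS = ("date_of_birth", "gender", "race")

def impute_from_numident(record, numident):
    """Fill missing demographic fields from the NUMIDENT entry with the same SSN.

    Returns the completed record and the list of fields that were imputed.
    """
    filled = dict(record)
    imputed = []
    lookup = numident.get(record.get("ssn"))
    if lookup is None:
        return filled, imputed  # SSN not found: nothing can be imputed
    for field in DEMOGRAPHIC_FIELDS:
        if filled.get(field) is None and lookup.get(field) is not None:
            filled[field] = lookup[field]
            imputed.append(field)
    return filled, imputed

# Hypothetical lookup table keyed by a made-up, invalid SSN.
numident = {
    "000-00-0001": {"date_of_birth": "1950-03-02", "gender": "M", "race": None},
}

# A record from a file with limited microdata: gender reported, the rest missing.
source_record = {"ssn": "000-00-0001", "gender": "M",
                 "date_of_birth": None, "race": None}

completed, imputed_fields = impute_from_numident(source_record, numident)
print(completed, imputed_fields)
```

Reported values are never overwritten, and a field missing on both files stays missing; that residual gap is where a model-based imputation would take over.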
Major Analytic Issues with StARS Processing, cont.
• Changing information states
  • Distinct problem from “point in time” data collection
  • Information states change over time/over databases
• Address information ages over time and varies over databases
    SAM SMITH, BOX 2 RURAL ROUTE 37, WESTPORT, VA 32784 (dated 10/14/98, from Medicare)
    SAM SMITH, 486 MAIN STREET, FAIRFIELD, VA 33412 (from TY97 IRS file, filed sometime in 1998)
• Mortality information ages over time and varies over databases
• One database provides information about the other, provided that matching can be performed
• Data processing requires complex, and substantively important, decision logic at each step
Applications • SSN search and validation with GEOkey • Earlier: 90% found in validation step, 5% in search step • 2001 Evaluation: 92% found in search (with GEOkey) alone • Apparently, our computer search outperforms the SSA manual system • CPS/NHIS/ACS to Census matching evaluations • Compare different race responses • Compare survey and Census coverage • Compare variations in Poverty estimates • Evaluation of synthetic estimation methods (Popoff, Judson and Fadali, 2001) • Multiple-system Estimation for coverage evaluation • Additional information to aid dual-system estimation (Asher and Fienberg, 2001) • Erroneous enumerations (Biemer, Brown, Wiesen, and Judson, 2001)
Applications • Nonresponse follow up (NRFU) substitution (’04 simulation test) • Imputation methods improvement (’04 simulation test) • Master Address File (MAF) targeting • Census unduplication confirmation • Population estimation (postcensal estimates) • Survey improvement (noninterview adjustments)
Evaluations • Numident/PCF 1998 versus 1998 National estimates (Miller, Judson and Sater, 2000) • State level comparisons of StARS 2000 versus Census 2000 • County StARS-synthetic methods versus county ratio estimates and Census 2000 • Detailed comparison by (fully crossed) age, race, sex, and Hispanic origin counts versus Census 2000, at the county level • AREX tract, block, household evaluations on February 19th
County StARS-synthetic methods versus 1999 Estimates versus Census 2000
[Figure: bar chart of % Hispanic (StARS 99 vs. 99 Estimates vs. Census 2000) for selected Colorado counties where StARS and Estimates deviate by more than 4 percentage points. Vertical axis: 0 to 90. Series: StARS 99, Census 2000, 99 Estimates. Counties shown: Bent, Otero, Kiowa, Pueblo, Chaffee, Morgan, Lincoln, Garfield, Costilla, Mineral, Phillips, Conejos, Crowley, Fremont, La Plata, Huerfano, Alamosa, San Juan, Archuleta, Saguache, Las Animas. Counties in which StARS 99 is closer to Census 2000 are marked with a star.]
Fully crossed age, race, sex, and Hispanic Origin array (ARSH array) • For every county in the U.S., count the number of nondeceased persons by: • Single year of age (0 to 101+) • Race (four groups) • Sex (two groups) • Hispanic origin (Hispanic/non) • Potentially 102 x 4 x 2 x 2 = 1632 cells per county, 3141 x 1632 = 5,126,112 in the U.S. • Error Measures: • Simple difference (C-S) • Algebraic percent error (S-C)/C
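The cell arithmetic and the two error measures can be checked with a short Python sketch. It assumes that C is the Census 2000 count and S the StARS count for a cell (the slide does not define the symbols), uses placeholder race labels, and returns the algebraic percent error as a proportion, exactly as the formula is written. The example counts are invented.

```python
from itertools import product

AGES = range(0, 102)  # single years 0 to 100 plus one top-coded 101+ cell
RACES = ("race_1", "race_2", "race_3", "race_4")  # placeholder labels
SEXES = ("female", "male")
HISPANIC = ("Hispanic", "non-Hispanic")

cells = list(product(AGES, RACES, SEXES, HISPANIC))
cells_per_county = len(cells)
cells_in_us = 3141 * cells_per_county  # 3141 counties, as on the slide

def simple_difference(census, stars):
    """C - S for one ARSH cell."""
    return census - stars

def algebraic_percent_error(census, stars):
    """(S - C) / C for one ARSH cell; undefined (None) when C is zero."""
    if census == 0:
        return None
    return (stars - census) / census

# Invented counts for a single county cell.
census_count, stars_count = 200, 190
print(cells_per_county, cells_in_us)
print(simple_difference(census_count, stars_count))
print(algebraic_percent_error(census_count, stars_count))
```

Note the opposite sign conventions: when StARS undercounts a cell, the simple difference is positive and the algebraic percent error is negative. Cells with a census count of zero need separate handling, and with 1632 potential cells per county many small counties will have such cells.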
Note: Each data point is a single county’s ARSH cell.
Age/Sex distributions, selected counties in Texas
[Figure: four panels: Anderson County (N of Houston); Andrews County (far west, NM border); Brazos County (W of Houston); Atascosa County (southern part of state).]
Concluding Thoughts
• Historians of science will say that there was an “explosion” of research into Administrative Records and Data Warehousing in the late 20th/early 21st century
• Using these databases in a statistically principled way requires a new statistical paradigm:
  • Not survey sampling per se
  • Not econometric modeling per se
  • Not coverage measurement per se
  • Something new
• These databases share some data quality issues with the usual survey or census data, but many of their issues are different
• We are attacking these issues with real Census applications
For Further Reading
Alvey, W., and Scheuren, F. (1982). Background for an Administrative Records Census. Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association.
Asher, J., and Fienberg, S. (2001). Statistical Variations on an Administrative Records Census. Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association.
Biemer, P., Brown, G., Wiesen, C., and Judson, D.H. (2001). Triple system estimation in the presence of erroneous enumerations. Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association. Under review at the Journal of Official Statistics.
Bye, B. (1997). Administrative Record Census for 2010 Design Proposal, Final Report. Rockville, MD: Westat, Inc.
Bye, B. (1998). Race and ethnicity modeling with SSA Numident Data: Interim report: File development and tabulations. Unpublished document available from the U.S. Bureau of the Census.
Bryant, C. (1995). Comparing the LUCA address list to “local records.” Paper presented at the 1995 State Data Center Meeting, San Francisco, CA, April 4, 1995.
Czajka, J., Moreno, L., and Schirm, A.L. (1997). On the Feasibility of Using Internal Revenue Service Records to Count the U.S. Population. Washington, DC: Mathematica Policy Research, Inc.
Czajka, J. (1999). Can we count on administrative records in future U.S. Censuses? Presentation at the Bureau of the Census, December 15, 1999.
Falkenstein, M., Resnick, D.R., and Judson, D.H. (2000). The Mortality Module of the Statistical Administrative Records System. Administrative Records Memorandum Series, U.S. Census Bureau.
Farber, J., and Shaw, K.M. (2002). Dual System Estimates of Housing Units Based on Administrative Records. To appear in the 2002 Proceedings of the American Statistical Association, Government Statistics Section [CD-ROM]. Alexandria, VA: American Statistical Association.
Heimovitz, H.K. (2002). Administrative Records Experiment 2000: Outcomes. To appear in the 2002 Proceedings of the American Statistical Association, Government Statistics Section [CD-ROM]. Alexandria, VA: American Statistical Association.
Huang, E., and Kim, J. (2000). One Percent Sample Study Report (SRD-DRAFT). Unpublished document available from the U.S. Bureau of the Census, February 10, 2000.
Judson, D.H., and Popoff, C.L. (2000). Research Use of Administrative Records. University of Nevada: Nevada State Demographer’s Office.
Judson, D.H. (2000). The Statistical Administrative Records System: System Design, Successes, and Challenges. Paper presented at the 2000 Data Quality Workshop, Morristown, NJ, Nov 30-Dec 1.
Judson, D.H., Popoff, C.L., and Batutis, M. (2001). An Evaluation of the Accuracy of U.S. Census Bureau County Population Estimation Methods. Statistics in Transition, 5:185-215.
Judson, D.H. (2001). A Partial Order Approach to Record Linkage. Paper presented at the Federal Committee on Statistical Methodology, Washington, DC, November 14, 2001.
Judson, D.H. (2002). Adventures in Bayesian Record Linkage. Paper presented at the Classification Society of North America, June 11, 2002.
Judson, D.H. (2002). Merging Administrative Records Databases in the Absence of a Register: Data Quality Concerns and Outcomes of an Experiment in Administrative Records Use. Paper presented at the UNECE-EUROSTAT work session on registers and administrative records in social and demographic statistics, Geneva, Switzerland, 9-11 December 2002.
Kim, M.O., and Sater, D. (2000). Defining the Medicare Data Universe for the U.S. Census Bureau's Population Estimates Program. Paper presented at the Southern Demographic Association meetings, New Orleans, LA, August 29, 2000.
Leggieri, C., and Prevost, R. (1999). Expansion of Administrative Records Uses at the Census Bureau: A Long-Range Research Plan. Paper presented at the November 1999 Meeting of the Federal Committee on Statistical Methodology, Washington, DC.
Miller, E., Judson, D.H., and Sater, D. (2000). The 100% Census NUMIDENT: Demographic Analysis of Modeled Race and Hispanic Origin Estimates Based Exclusively on Administrative Records Data. Paper presented at the Southern Demographic Association meetings, New Orleans, LA, August 29, 2000.
Popoff, C.L., Judson, D.H., and Fadali, B. (2001). Measuring the Number of People Without Health Insurance: A Test of a Synthetic Estimates Approach for Small Area Estimates using SIPP Microdata. Paper presented at the Federal Committee on Statistical Methodology, Washington, DC, November 14, 2001.
Sailer, P., Weber, M., and Yau, E. (1993). How Well Can IRS Count the Population? 1993 Proceedings of the Survey Research Methods Section. Alexandria, VA: American Statistical Association.
Sater, D. (1995). Differences in Location of Households and Tax Filing Units. Paper presented at the 1995 meeting of the Population Association of America, San Francisco, CA, April 6, 1995.
Stuart, E., and Zaslavsky, A.M. (2002). Using administrative records to predict census day residency. In C. Gatsonis, R.E. Kass, A. Carriquiry, A. Gelman, D. Higdon, D.K. Pauler, and I. Verdinelli (Eds.), Case Studies in Bayesian Statistics, Volume VI. New York, NY: Springer.
Thompson, H. (1999). The Development of a Gender Model with SSA Numident Data. Administrative Records Research Memorandum Series #32, U.S. Census Bureau.
Wand, Y., and Wang, R.Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39: 86-95.
Zanutto, E., and Zaslavsky, A.M. (2001). Using Administrative Records to Impute for Nonresponse. In R. Groves, R.J.A. Little, and J. Eltinge (Eds.), Survey Nonresponse. New York: John Wiley.