
Geocoding from Postal Codes: Data Service Providers' Guide to PCCF and PCCF+

Learn about the benefits, methods, and limitations of geocoding with postal codes using PCCF and PCCF+. Discover how small area geography data can be used in various fields including health, social sciences, education, and more.



  1. Geocoding from postal codes: what data service providers need to know about PCCF and PCCF+ • Russell Wilkins, Health Analysis Division, Statistics Canada, Ottawa, and Department of Epidemiology and Community Medicine, University of Ottawa • ACCOLEDS 2011, Kwantlen Polytechnic University, Richmond (BC) Campus, Wednesday 29 November 2011

  2. Introductory remarks • Postal codes are part of nearly every administrative and research data set, and postal code conversion using PCCF or PCCF+ and related tools is now the usual way of exploiting their rather impressive potential. • The resulting small area geography and/or latitude-longitude coordinates have a wide variety of possible uses, even where individual measures of SES are also available on a data set. • Familiarity with the methods (tools and techniques), as well as the strengths and limitations of dealing with postal coded data, will allow data service providers to help users to more meaningfully exploit their potential.

  3. Publications using PCCF & PCCF+ (adapted from Peter Peller, U Calgary, forthcoming) • Health and related: Public Health, Epidemiology, Health Information, Health Policy, Environmental Health, Nursing, Gerontology, Medicine, Psychiatry, Mental Health, Women’s Health, Early Childhood Development, Pharmaceutical Sciences, Veterinary Medicine • Social Sciences and Economics: Geography, Sociology, Demography, Social Work, Criminology, Political Science, Economics • Education, Data Access, Statistics: Education, Adult Training, Physical Education, Data Librarianship, Research Data Centres, Statistics • Other: Earth Sciences, Agriculture, Forestry, Finance, Business, Law, Engineering, Transportation Studies

  4. Possible uses of small-area data • Add policy relevance by aggregating to administrative areas, health planning units, school districts, etc. • Deal with changes over time: newly created geographic units and revised boundaries (amalgamations, splits, etc.) • Assign neighbourhood SES and other characteristics (as determinants or confounders), often using census data • Analyse by community characteristics (not necessarily from census): water supply, air pollution, UV radiation, social cohesion, access to services, parks, urban-rural-MIZ, segregation, etc. • Determine point-to-point distance, road distance, travel time • Permit studies of migration over time (for exposure or SES histories, or for better access to services, etc.) when longitudinal files are available • Help impute missing data for income, ethnicity • Provide additional identifiers for record linkage purposes

  5. Supplementary uses of automated coding from postal codes • To identify records where the postal code does not refer to a place of usual residence (commercial buildings; children’s aid society offices; public trustees; social services headquarters, etc.) • To help identify residents of hospitals and long-term care facilities, prisons, university residences, etc., who may need to be treated as special cases

  6. Examples from four studies • Trends in mortality by neighbourhood income in urban Canada • Distance to nearest school, and university participation • Environmental exposures – Sydney Tar Ponds, etc. (Cape Breton) • Predicting auto parts requirements

  7. Remaining life expectancy at age 25, by income adequacy quintile, Canada, 1991-2001, compared to urban Canada by neighbourhood income quintile, 1996 • [Figure: separate panels for males and females] • Source: Wilkins R, Tjepkema M, Mustard CM, Choinière R, Health Reports, 2008.

  8. Distance to post-secondary education • Marc Frenette. Too far to go on? Distance to school and university participation. Research Paper Series, Analytical Studies No. 191. Ottawa: Statistics Canada catalogue 11F0019 No. 191, 2004. • http://www.statcan.ca/english/research/11F0019MIE/11F0019MIE2002191.pdf

  9. Data / Methods / Findings • Data and methods: Survey of Labour and Income Dynamics (SLID) 1993-1998 (postal codes while in high school); list of university postal codes; PCCF+ • Findings: After controlling for family income, parental education, and other factors associated with university participation, students living ‘out-of-commuting distance’ were far less likely to attend university than students living within commuting distance (<40 km). Dose-response by distance.

  10. Sydney Tar Ponds environmental health study • Geographic links made directly from addresses, giving increased resolution for a small urban area where block face coding was not available on PCCF • Illustrates GIS-based approach • Events assigned to latitude and longitude • Street network and pollution overlays • Air photo and satellite images integrated

  11. Predicting auto parts requirements in each area – a business use of geocoding • Vehicle Identification Numbers (VIN): make, model and year of each car, linked to postal code of current owner (& km?) • A private consulting firm obtained all 7+ million Canadian VINs + postal codes from provinces • Used PCCF+ to generate LL and census geocodes, to determine vehicle fleet mix in each area • Used repair histories to predict component failures in local vehicle fleets • Goal: to inform dealers of parts they need to stock

  12. Typical cases • The user has access to a data file containing records of individuals (students / clients / patients / survey participants) together with the postal codes of their place of residence (or place of business) • The user wants to exploit some or all of the potential uses of postal codes and small-area geography, as described above.

  13. Fitness for a given use • The user’s file may or may not already contain geographic coding of some sort. • If so, are those variables suitable for his or her use? • If not, can new variables be generated which are suitable for his or her use?

  14. Limitations of the existing coding which may already be present • Insufficient documentation available • Vintage of coding standard not specified (don’t assume it) • Where multiple links possible, the method of assignment not documented (SLI, probabilistic?) • Diagnostic codes (warning flags) not included • Problematic geographic coding not identified (business or government addresses, PO boxes, rural postal codes) • Potentially troublesome cases not identified (school or university residences, hospitals, nursing homes, prisons, shelters) • Geographic coding available not suitable: not available at level needed; not of correct vintage; too imprecise or inaccurate for intended use

  15. Geocoding strategies • Only postal codes available: use PCCF or PCCF+ to assign geocodes, etc. • Full street addresses available (not concession road + lot number): look up postal code (if missing or to validate); GIS may be able to find LL, etc. from street address • Telephone numbers available: reverse lookup to get address and postal code; 911 system has LL point location and maps • Township, range, meridian and section: rural landowners on the Prairies usually know this; would require GIS to convert to census geocodes

  16. Unambiguous naming convention: geoYYuid • geo => geographic level in census hierarchy (DA, CT, CSD, CMA, etc.) • YY => vintage of census geography required (DA01uid ≠ DA06uid; ≈ 30% changed) • uid => unique identifier: higher levels always needed with ‘geo’, e.g. DA = PR(2) + CD(2) + DA(4) = 8 digits, not just the last 4
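A minimal Python sketch of the naming convention on this slide, not part of PCCF+ itself. The function names and the sample code values are hypothetical; only the PR(2) + CD(2) + DA(4) layout and the geoYYuid pattern come from the slide.

```python
def build_dauid(pr: str, cd: str, da: str) -> str:
    """Concatenate PR(2) + CD(2) + DA(4) into an 8-digit DA identifier."""
    parts = (("PR", pr, 2), ("CD", cd, 2), ("DA", da, 4))
    for name, value, width in parts:
        # Each component must be numeric and fit in its fixed width.
        if not (value.isdigit() and len(value) <= width):
            raise ValueError(f"{name} must be 1-{width} digits, got {value!r}")
    # Zero-pad so the identifier is always exactly 8 characters long.
    return pr.zfill(2) + cd.zfill(2) + da.zfill(4)


def vintage_name(geo: str, census_year: int) -> str:
    """Build a variable name in the geoYYuid pattern from level and census year."""
    return f"{geo.upper()}{census_year % 100:02d}uid"


# Hypothetical component values, for illustration only.
dauid = build_dauid("35", "6", "123")
column = vintage_name("da", 2006)
```

Keeping the vintage in the variable name makes it hard to merge, say, 2001 DA codes against a 2006 census profile by accident.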

  17. Tools and reference files available • PCCF, PCCF+ documentation (+ .ppt) • Geographic relationships (GTFyy) • GeoSuite 2006, 2001, 1996 (aka GeoRef 1996) • Geographic Attributes File 1991, 1986 • Geography Tape File 1981, 1976, 1971 • EA or DA + CT profiles for each census • Neighbourhood income quintiles 1981-2006 • Immigrant terciles, 2006 • Inter-censal EA/DA translations 1981-2006

  18. Why PCCF+? Canadian postal codes can be tricky • Population weights • Diagnostics • Imputations • Supplemental codes • Reproducible, documented processing

  19. Major problems which are dealt with by PCCF+ • Postal codes serving several DAs or blocks (especially in rural areas) • Postal codes used by businesses or public institutions • Postal codes which the regular PCCF only links to post office geography (rather than place of residence or business)

  20. Documentation • Many choices are required when geocoding records based on postal codes • SLI or WCF? Business codes OK? What vintage of census geographic codes, of health region definitions, etc.? • PCCF+ fully documents each of those choices (in the User Guide), and writes the version (eg “R5J”) and problem/diagnostic codes to the output file (so retain it)

  21. Coded output files (HLTHOUT + GEOPROB): geographic coding and diagnostics • Variables shown in the diagram: ID (<=12), PCODE; PR, CD, CSD, CCSD; CMA, CT, MIZ, ER; DA, BLK; DA06UID; BLKURB*, DPL*; LAT, LONG; HR, SUB, AHR, ASUB; QAIPPE, IMMTER; CSIZE, NSREL, AIRLIFT, AR; EA81UID-EA96UID, DA01UID; DMT, DMTDIFF; LINK (PROB); SOURCE; NCSD, NCD; RPF, SERV, PREC; BLDG NAME+ADR; CSDNAME+TYPE; CPCCODE; RESFLG, INSTFLG • * Poorly coded, not recommended. • Geocodes within rectangle also produced by regular PCCF.

  22. Pitfalls of automated coding: some examples (1) • Problem: In a study of psychiatric problems among Manitoba children, dozens of children had the same downtown Winnipeg postal code. • Diagnosis: Examination of the building name and address showed the postal code referred to the office of the provincial trustee responsible for minor children in provincial care. Use of the geography and neighbourhood characteristics associated with that postal code would have seriously biased the study results. • Solution: Most non-residential postal codes, including those for government and institutions, can be identified by looking at the building / organization name and address in the PCCF+ problem output. Then either find the postal code for the true place of residence (if appropriate to the study aims) or set geography to missing (as was done for this study).

  23. Pitfalls of automated coding (2) • Problem: In an early study of Quebec births, many births were for mothers with the same few urban postal codes. The delivery mode type of those postal codes was not B (for large apartment buildings). • Diagnosis: It was determined that missing postal codes were being administratively assigned the postal code of the hospital of birth, so that health region could be assigned even though the mother’s postal code was unknown. Use of the associated small-area geography and/or neighbourhood characteristics would have systematically biased the results. • Solution: The PCCF+ problem output helps you to identify postal codes for hospitals, which should not be accepted as the place of residence of the mother. Then either use the address information (if available) to find the mother’s own postal code or set geography to missing (as was done for this study).

  24. Pitfalls of automated coding (3) • Problem: In an early study using BC vital statistics data with nearly 100% presence of full postal codes, we were coding many deaths as residents of Montreal, Quebec, although the decedents had been born in other provinces or countries, and the provincial municipal coding showed BC place of residence. • Diagnosis: The non-existent postal code H0H0H0 (ho-ho-ho!) was being assigned when no postal code was reported. PCCF+ imputed geography from partial postal codes, although error codes were also assigned. • Solution: The full address was used to find a real postal code, or to assign geography manually if no postal code could be found.
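Placeholder and malformed postal codes like the one in this example can be screened before geocoding. The following Python sketch is an illustration only and is not how PCCF+ assigns its error codes. The function name and the returned labels are hypothetical; the placeholder set contains only the H0H0H0 value mentioned on the slide and would need extending for a real file. The pattern relies on the standard Canadian format (letter-digit-letter digit-letter-digit, with D, F, I, O, Q and U never used and W and Z not used in the first position).

```python
import re

# Letter-digit-letter-digit-letter-digit, using only letters Canada Post assigns.
_POSTAL_PATTERN = re.compile(
    r"^[ABCEGHJ-NPRSTVXY][0-9][ABCEGHJ-NPRSTV-Z][0-9][ABCEGHJ-NPRSTV-Z][0-9]$"
)

# Known filler values; only H0H0H0 comes from the slide.
PLACEHOLDERS = {"H0H0H0"}


def screen_postal_code(raw: str) -> str:
    """Return 'missing', 'placeholder', 'invalid' or 'ok' for a reported postal code."""
    code = raw.strip().upper().replace(" ", "")
    if not code:
        return "missing"
    if code in PLACEHOLDERS:
        # Well-formed, but known to be used when no postal code was reported.
        return "placeholder"
    if not _POSTAL_PATTERN.match(code):
        return "invalid"
    return "ok"


# Tabulate the screening result for a few reported values.
reported = ["v5k 0a1", "H0H 0H0", "K1A OT6", ""]
summary = {code: screen_postal_code(code) for code in reported}
```

A format check of this kind only catches impossible codes; a well-formed code can still be retired, non-residential or simply wrong, which is where the PCCF+ diagnostics come in.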

  25. Pitfalls of automated coding (4) • Problem: In a study set in the Kingston area, many health events were for relatively few postal codes, which were not known to be hospitals or long-term health care facilities. • Diagnosis: Closer examination showed them to be for prisons and university residences. • Solution: Use the PCCF+ problem output to systematically identify such cases, and depending on the purposes of the study, decide whether or not to use such cases in the analysis. (Note: The smaller the study area, the greater the potential impact of such problems.)

  26. Pitfalls of automated coding (5) • Problem: In various studies, postal codes for businesses keep appearing in the field for place of residence, apparently not due to keying errors. • Diagnosis: Likely a small but non-negligible proportion of persons either prefer to receive correspondence at their place of work, or mistakenly report the wrong postal code. • Solution: The PCCF+ problem output helps to identify postal codes for non-residential addresses. Try to recode based on street address, or based on postal code reported on other records for the same person.

  27. Pitfalls of automated coding (6) • Problem: In a Nova Scotia study of socio-economic differentials in mental health based on person-oriented hospital data, the neighbourhood SES of the mentally ill, as determined from their current postal code, tended to decline over time. • Diagnosis: Use of current postal code to assign neighbourhood SES would risk confusing cause with effect. • Solution: In person-oriented analysis, assign neighbourhood SES based on postal code at initial hospitalization or diagnosis.
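The solution on this slide amounts to selecting each person's earliest record before geocoding. A minimal Python sketch follows, assuming a hypothetical record layout of (person ID, event date, postal code); the IDs, dates and postal codes are invented for illustration.

```python
from datetime import date


def baseline_postal_codes(records):
    """Map each person ID to the postal code on that person's earliest record.

    `records` is an iterable of (person_id, event_date, postal_code) tuples.
    """
    earliest = {}
    for person_id, event_date, postal_code in records:
        current = earliest.get(person_id)
        # Keep the record with the oldest date seen so far for this person.
        if current is None or event_date < current[0]:
            earliest[person_id] = (event_date, postal_code)
    return {pid: code for pid, (_, code) in earliest.items()}


# Hypothetical hospitalization records, deliberately out of date order.
records = [
    ("P1", date(2003, 5, 1), "B3K2A2"),
    ("P1", date(1998, 2, 14), "B2Y1A1"),
    ("P2", date(2001, 9, 30), "B4A3B3"),
]
baseline = baseline_postal_codes(records)
```

Neighbourhood SES would then be assigned from `baseline` rather than from the most recent postal code on file.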

  28. Pitfalls of automated coding (7) • Problem: Some studies require geographic coding of business and industrial locations, including mines, manufacturing establishments and dumpsites. • Diagnosis: The locations of such sites could be anywhere in the service area of the postal code, unrelated to population distribution. • Solution: The population-based assumptions that PCCF+ uses to resolve multiple matches are simply not appropriate for coding in such cases, nor is SLI-based coding. Consider alternate coding methods based on nearest road intersection, retrieval of latitude and longitude information from other files, or use of GPS.

  29. SLI vs Population-weighting • Almost all rural postal codes and several categories of “urban” postal codes (DMT H, J, K, T, X) provide service to multiple DAs and CSDs, etc. • SLI=1 forces any occurrence of a particular postal code to match to only 1 set of geocodes. • Population weighting assigns each record with a particular postal code probabilistically, using population weights derived from the census, to one of the possible DAs, etc.
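The difference between the two approaches can be sketched in a few lines of Python. This is an illustration of the idea only, not the PCCF+ implementation: the link table, postal code, DA identifiers and population counts are all hypothetical.

```python
import random

# Hypothetical link table: one rural postal code serving three DAs.
LINKS = {
    "X0X0X0": [
        {"dauid": "99990001", "pop": 600, "sli": 1},
        {"dauid": "99990002", "pop": 300, "sli": 0},
        {"dauid": "99990003", "pop": 100, "sli": 0},
    ]
}


def assign_sli(postal_code: str) -> str:
    """Forced 1:1 coding: always return the DA flagged as the single link."""
    for link in LINKS[postal_code]:
        if link["sli"] == 1:
            return link["dauid"]
    raise LookupError(f"no single link flagged for {postal_code}")


def assign_weighted(postal_code: str, rng: random.Random) -> str:
    """Population weighting: draw one DA with probability proportional to population."""
    links = LINKS[postal_code]
    weights = [link["pop"] for link in links]
    return rng.choices(links, weights=weights, k=1)[0]["dauid"]


# Code 1,000 records that all report the same postal code, both ways.
rng = random.Random(2011)  # fixed seed so the coding is reproducible
sli_coded = [assign_sli("X0X0X0") for _ in range(1000)]
weighted_coded = [assign_weighted("X0X0X0", rng) for _ in range(1000)]
```

With the single link, all 1,000 records land in one DA and the other two can never be coded. With population weighting, the records spread across the three DAs roughly in proportion to their populations, although any individual record may still be assigned to the wrong DA.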

  30. Use of SLI for residential coding introduces systematic bias • Most DAs in rural postal coded areas can never be coded • Many CSDs in rural areas can never be coded • A high proportion of the population in rural areas will be systematically miscoded (to wherever the SLI is situated)

  31. Implications of such systematic biases introduced by use of SLI • Serious numerator-denominator mismatch whenever census population (denominator) data are required • “Hot spots” surrounded by “cold spots” • Over-coding of UARA classification of “urban” (BLKURB, based on block-level density in rural village centres)

  32. When is forced 1:1 coding from postal codes acceptable? • For distance calculations, where all you really need is a single representative average location in the service area of the postal code. • For calculations of rates based on denominators derived from the same file as the numerators, so that the coding errors will be in balance (systematically biased by the same amount in both the numerator and denominator). Example: for birth outcomes other than fertility rates. • For calculation of rates based on denominators derived from another postal coded file which was processed in the same way, such as a provincial health insurance master beneficiary file. • But you always need to check for non-residential (business-only) postal codes, and perhaps impute for partially incorrect codes, etc.

  33. Code your data only once, but analyse them many times • Be sure to correct all serious problems identified by the PCCF+ automated coding. It usually takes a couple of iterations to get the whole file clean. • The importance of the problems identified by the PCCF+ diagnostic codes depends on the data set and on the analyses to be done. Retain the diagnostic codes! • Once coded, the same dataset can be used for various kinds of studies (eg SES disparities, access to services, environmental health).

  34. Dealing with different vintages of census geography • Example: to compare age-specific death rates for 1985-87 vs 1995-97 vs 1999-2002 • PCCF+ Version 5x automatically codes to EA81uid, EA86uid, EA91uid, EA96uid, DA01uid, DA06uid • Problem: You need other levels of census geography (CMA, CT, CD) • Solution: Use GTF86, GTF96, GAF01 to get higher levels of geography; use DLI to get earlier census population age-sex distributions

  35. “Linking” to other geographically-coded data using SAS merge; by GEOuid • Select a level: DA, CT, CMACA, etc. • Choose the census/other variables of interest • census profiles available at all geographic levels • Sort both files by GEOYYuid, then • Merge by GEOYYuid • Rewrite your file with the new variables appended • Also for linking to geographically-oriented administrative data
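The slide describes a SAS sort-and-merge. The same linkage can be sketched in Python with a dictionary lookup, which makes the sort step unnecessary. This is an illustration under assumed names: the function, the `area_matched` flag, the DA identifiers and the quintile values are all hypothetical.

```python
def merge_by_geouid(records, area_profile, key="DA06uid"):
    """Append area-level variables to each record, matching on a geoYYuid key.

    `records` is a list of dicts; `area_profile` maps identifier -> dict of
    area-level variables. Unmatched records are kept and flagged.
    """
    merged = []
    for record in records:
        area_vars = area_profile.get(record.get(key))
        row = dict(record)  # copy so the input records are not modified
        row["area_matched"] = area_vars is not None
        if area_vars is not None:
            row.update(area_vars)
        merged.append(row)
    return merged


# Hypothetical geocoded records and a hypothetical DA-level profile.
records = [
    {"id": "A1", "DA06uid": "99990001"},
    {"id": "A2", "DA06uid": "99990009"},  # no profile for this DA
]
profile = {"99990001": {"income_quintile": 2}}
linked = merge_by_geouid(records, profile)
```

Keeping unmatched records with an explicit flag, rather than dropping them, makes it easier to spot a vintage mismatch between the geocodes and the profile file.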

  36. Calculating distances • Latitude & longitude for each record • Sample distance calculation shown in GEORES5x.SAS (find “calculating”) • Supplemental program DIST5x.SAS for many-to-many distances • GIS programs also do this, but with many more options

  37. DIST5x.SAS • /* CALCULATE DISTANCES FROM EACH OF MANY EVENTS (E) */ • /* TO THE NEAREST SERVICES (H) BY SPECIALTY */ • /* READS IN A FILE OF EVENTS CODED BY PCCF+ (GEORES5x) */ • /* AND A FILE OF SERVICES CODED BY PCCF+ (GEOINS5x) */ • /* OUTPUTS A FILE OF EVENTS WITH APPENDED DISTANCES */ • /* TO THE NEAREST SERVICES BY SPECIALTY */ • /* NOTE: */ • /* EVENTS FILE ASSUMED TO BE OUTPUT OF GEORES5x */ • /* WITH SPECIALTY CODE SOMEWHERE IN FILE */ • Distance to nearest hospital with obstetrician, variable for study of birth outcomes in BC (Luo, Kierans et al, Epidemiology 2004); • Distance to school and university participation (Frenette, ASB 2002) • Distance to nearest hospital, distance to nearest MD (Ng et al, Amankwah)
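The idea behind a nearest-service distance calculation can be sketched in Python as follows. This is not a translation of DIST5x.SAS: the great-circle (haversine) formula, the field names and the coordinates are assumptions made for illustration, and the formula used by the SAS program may differ.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude-longitude points (degrees)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_by_specialty(event, services):
    """Return {specialty: distance in km to the nearest service of that specialty}."""
    nearest = {}
    for service in services:
        d = haversine_km(event["lat"], event["long"], service["lat"], service["long"])
        specialty = service["specialty"]
        if specialty not in nearest or d < nearest[specialty]:
            nearest[specialty] = d
    return nearest


# Hypothetical event and service locations.
event = {"lat": 49.0, "long": -123.0}
services = [
    {"specialty": "obstetrics", "lat": 49.0, "long": -122.0},
    {"specialty": "obstetrics", "lat": 49.0, "long": -123.5},
    {"specialty": "general", "lat": 50.0, "long": -123.0},
]
distances = nearest_by_specialty(event, services)
```

These are straight-line distances between representative points. Road distance or travel time would need a street network, which is where a GIS offers the additional options the previous slide mentions.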

  38. Misclassification • In rural areas (and urban fringe) only, DA is assigned probabilistically, leading to random misclassification of DA and associated neighbourhood income quintile (QAIPPE). • The result is reduced ability to detect effects in rural areas (lower RRs, RDs), but almost no impact in urban areas. • So be very careful in interpreting the expected lower effect estimates for rural vs urban areas. Such results may disagree with individual measures of SES. • A working paper shows the extent of misclassification and its impact on RRs, plus correction factors which could be applied to help compensate for the misclassification.
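Why random misclassification lowers relative risks can be shown with a small worked calculation in Python. This is a deliberately simplified sketch, not the method or the correction factors from the working paper: it assumes two equal-sized groups (for example, lowest versus highest income quintile) and the same misclassification probability in both directions, and the risks used are hypothetical.

```python
def observed_rr(risk_low_income, risk_high_income, p_misclassified):
    """Relative risk seen after symmetric random misclassification.

    Assumes two equal-sized groups and that a fraction `p_misclassified`
    of each group is wrongly coded to the other group.
    """
    keep = 1.0 - p_misclassified
    # Each observed group is a mix of correctly and wrongly coded people.
    obs_low = keep * risk_low_income + p_misclassified * risk_high_income
    obs_high = keep * risk_high_income + p_misclassified * risk_low_income
    return obs_low / obs_high


# Hypothetical risks: true relative risk is 0.20 / 0.10 = 2.0.
rr_no_error = observed_rr(0.20, 0.10, 0.0)
rr_some_error = observed_rr(0.20, 0.10, 0.2)
rr_coin_flip = observed_rr(0.20, 0.10, 0.5)
```

Under these assumptions, 20% misclassification shrinks a true relative risk of 2.0 to about 1.5, and at 50% (assignment no better than a coin flip) the relative risk collapses to 1.0. The bias is always toward the null, which is why lower rural estimates need careful interpretation.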

  39. Misclassification of income quintile in rural areas • Neighbourhood income quintiles derived from Canadian postal codes are apt to be misclassified in rural but not urban areas. • The extent of the misclassification has been evaluated, and a method of correction developed. • The correction is of little effect in urban areas, but of considerable effect in rural areas. • Wilkins R. HAMG working paper, 2004-08-25 Draft.

  40. Movement of postal codes • Many technical changes to address ranges • Usually no change of blockface or block LL • Very little change at higher levels (DA, CT etc) • Movement always within same FSA service area • Some reuse of retired postal codes within same FSA; if so, DMT may also change • However, two complete FSAs in BC moved by Canada Post during mid-1990s* • Moral: Code data as received; retain results

  41. Concluding remarks • Small area geography and/or latitude-longitude coordinates are increasingly becoming a part of most administrative and research data sets and are useful to at least some extent in many types of studies, even where individual measures of SES are available. • Familiarity with the methods (tools and techniques), as well as the strengths and limitations of dealing with such data, will allow data service providers to help their clients to more meaningfully exploit the potential of the data. • But as with other methods, it’s not enough to just use the methods mechanically. Data users need to think through what they’re doing and why.

  42. Tools / Technical References • Wilkins R. PCCF+ Version 5J User’s Guide. Statistics Canada, 2011. • Gonthier D et al. Merging area-level census data with survey data in STC RDCs. ITB: the Research Data Centres Information and Technical Bulletin (12-002). • Wilkins R. Neighbourhood income quintiles derived from Canadian postal codes are apt to be misclassified in rural but not urban areas. HAMG internal report, 2004.

  43. Russell Wilkins • Health Analysis Division, Statistics Canada, RHC-24A, 100 Tunney’s Pasture Driveway, Ottawa ON K1A 0T6 • Tel: 1-613-951-5305 • Fax: 1-613-951-3959 • Email: russell.wilkins@statcan.gc.ca
