
Sharing Seminar: Address Matching



Presentation Transcript


  1. Sharing Seminar: Address Matching GSS Best Practice and Impact (BPI) Sli.do #T935

  2. Welcome • About BPI • What are sharing seminars? • What is address matching? • Agenda • Questions and technical difficulties via Sli.do #T935

  3. Ross Bowen – Valuation Office Agency: Really basic address matching

  4. Purpose – personalisation to increase response rates
     Before: The Occupier, 1 Business Building, London, SW11. Dear Occupier, …
     After: Mr Bowen, 1 Business Building, London, SW11. Dear Mr Bowen, …

  5. Our method is primitive • Written in SAS; the code is dreadful to interpret • Painfully slow (> 24 hrs) • Lots of manual “edge-casing”: REPLACE “FST” with “FIRST”; REPLACE “1ST” with “FIRST”; etc. • Luck more than skill? • But it gets us somewhere (85% matches)…

  6. R/O GND FLR Flat 1, Fownes ST, LONDON, SW11 2TJ
     -- Capitalise the string and extract the postcode using a regular expression
     R/O GND FLR FLAT 1, FOWNES ST, LONDON [SW112TJ]
     -- Lots of replacement of common terms and abbreviations
     REAR 0TH FLOOR FLAT 1, FOWNES STREET, LONDON [SW112TJ]
     -- Remove punctuation and whitespace
     REAR0THFLOORFLAT1FOWNESSTREETLONDON [SW112TJ]
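
The cleaning steps on this slide can be sketched in Python (the original was written in SAS, which is not shown in the deck). This is a minimal sketch under stated assumptions: the postcode pattern is a simplified one, and the replacement table contains only the abbreviations visible on slides 5 and 6, not the full list used in practice. The names `POSTCODE_RE`, `REPLACEMENTS` and `normalise` are illustrative.

```python
import re

# Simplified UK postcode pattern (an assumption): outward code, optional space, inward code.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b")

# Illustrative replacements taken from the slides only; the real list is much longer.
REPLACEMENTS = [
    (r"\bR/O\b", "REAR"),
    (r"\bGND\b", "0TH"),
    (r"\bFLR\b", "FLOOR"),
    (r"\bST\b", "STREET"),
    (r"\bFST\b", "FIRST"),
    (r"\b1ST\b", "FIRST"),
]


def normalise(address):
    """Return (matching key, postcode) for a free-text address."""
    # Step 1: capitalise and pull the postcode out of the string.
    text = address.upper()
    postcode = ""
    found = POSTCODE_RE.search(text)
    if found:
        postcode = found.group(1) + found.group(2)
        text = text[:found.start()] + text[found.end():]
    # Step 2: expand common terms and abbreviations.
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    # Step 3: remove punctuation and whitespace.
    key = re.sub(r"[^A-Z0-9]", "", text)
    return key, postcode


print(normalise("R/O GND FLR Flat 1, Fownes ST, LONDON, SW11 2TJ"))
```

Note that a blanket rule such as ST to STREET will also rewrite "ST" where it means "Saint", which is one source of the manual edge-casing mentioned on slide 5.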

  7. -- If you get a perfect match, keep it.
     -- Otherwise, match on postcode and use Levenshtein distance:
     0THFLOORFLAT1FOWNESSTREETLONDON [SW112TJ]
     REAR0THFLOORFLAT1FOWNESSTREETLONDON [SW112TJ]
     Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one string into the other. In this case, L = 4 (the four inserted characters of “REAR”). We look at this as a proportion of the length of the longer string:
     4 / MAX(LENGTH(0THFLOORFLAT1FOWNESSTREETLONDON), LENGTH(REAR0THFLOORFLAT1FOWNESSTREETLONDON)) = 4 / 35 = 0.1142857
     -- Ignore matches with a distance proportion over a tolerance of 0.4.
     -- Match everything you can, using each record only once.
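
The distance calculation and the "use each record once" rule can be sketched as follows. The Levenshtein function is the standard dynamic-programming algorithm. The greedy best-first assignment in `match_within_postcode` is an assumption: the slide does not say in which order competing candidates are resolved. All function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def distance_proportion(a, b):
    """Edit distance as a proportion of the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def match_within_postcode(inputs, references, tolerance=0.4):
    """Pair normalised keys that share a postcode, best pairs first (assumed order).

    Returns {input index: reference index}; each record is used at most once.
    """
    candidates = sorted(
        (distance_proportion(a, b), i, j)
        for i, a in enumerate(inputs)
        for j, b in enumerate(references)
    )
    matches, used_references = {}, set()
    for proportion, i, j in candidates:
        if proportion > tolerance:
            break  # everything after this point is worse than the tolerance
        if i in matches or j in used_references:
            continue
        matches[i] = j
        used_references.add(j)
    return matches


short = "0THFLOORFLAT1FOWNESSTREETLONDON"
long_ = "REAR0THFLOORFLAT1FOWNESSTREETLONDON"
print(levenshtein(short, long_), round(distance_proportion(short, long_), 7))
```

Perfect matches have a proportion of 0, so they sort first and are always kept, which agrees with the first rule on the slide.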

  8. Questions

  9. Data science for address matching Iva Spakulova ONS Methodology 14 May 2018

  10. What is the address index? Service that matches an input address string to a validated address and Unique Property Reference Number (UPRN) from Address Base (AB)

  11. Address Index matching process • 1. Parsing: Before attempting field-by-field linking, one needs to parse the free-text input into tokens such as building name and number, street name, town, locality, and postcode. For this purpose, we train a classification algorithm (Conditional Random Fields), and we also implement strategies to handle cases that are not perfectly parsed. • 2. Candidate address retrieval: A combination of structured and unstructured search is then deployed using Elasticsearch to quickly compare the parsed input against 26 million addresses. • 3. Ranking and scoring: The service returns a short ordered list of candidate addresses (and UPRNs), including a measure of confidence in the presented results.

  12. 1. Parsing method. Example: “10 CHURCH HOUSE PARK ST BRIDES WENTLOOGE NEWPORT GWENT NP10 8SP” • Rules based: could use regular expressions or look-ups (for town names, for example) • Machine learning: Conditional Random Fields algorithm (Discriminative Undirected Probabilistic Graphical Model)
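
A rules-based parser of the kind mentioned on this slide might look like the sketch below. It is not the Address Index code: the regular expressions, the tiny `KNOWN_TOWNS` look-up and the output field names are all assumptions made for illustration.

```python
import re

# Simplified patterns (assumptions): a postcode at the end, a house number at the start.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s+([0-9][A-Z]{2})$")
HOUSE_NUMBER_RE = re.compile(r"^(\d+[A-Z]?)\b")
KNOWN_TOWNS = {"NEWPORT", "CARDIFF", "LONDON"}  # hypothetical look-up table


def rules_based_parse(address):
    """Split an address using a regex for the postcode and a look-up for the town."""
    text = address.upper().strip()
    parsed = {}
    found = POSTCODE_RE.search(text)
    if found:
        parsed["postcode"] = f"{found.group(1)} {found.group(2)}"
        text = text[:found.start()].strip()
    found = HOUSE_NUMBER_RE.match(text)
    if found:
        parsed["house_number"] = found.group(1)
        text = text[found.end():].strip()
    tokens = text.split()
    for index, token in enumerate(tokens):
        if token in KNOWN_TOWNS:
            parsed["town"] = token
            parsed["before_town"] = " ".join(tokens[:index])
            parsed["after_town"] = " ".join(tokens[index + 1:])
            break
    else:
        parsed["unparsed"] = text
    return parsed


print(rules_based_parse("10 CHURCH HOUSE PARK ST BRIDES WENTLOOGE NEWPORT GWENT NP10 8SP"))
```

The rules cannot say where the building name ends and the street or locality begins inside "CHURCH HOUSE PARK ST BRIDES WENTLOOGE" (is "ST" Street or Saint?), which is the kind of ambiguity the machine-learning approach is meant to resolve.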

  13. Features (those in red relevant to example) 1. Digits - all, some, none 2. Word - word unless digits then false 3. Length - digits length / word length 4. Ends in Punctuation 5. Directional (e.g. South, N, NW) 6. Outcode/Incode (e.g. RH1) 7. Post Town (e.g. Newport) 8. Flat (e.g. appt, flat) 9. Company (e.g. CIC, CIO, LTD) 10. Road (e.g. road, rd, street, park, ffordd) 11. Residential (e.g. house, lodge, cottage, mews) 12. Business (e.g. office, hospital, care, bank) 13. Locational (e.g. basement, ground, top, lower) 14. Ordinal (e.g. first, 2nd) 15. Number of Hyphenations 16. Has Vowels 17. Word is at the start / end of the string
     Example: “10 CHURCH HOUSE PARK ST BRIDES WENTLOOGE NEWPORT GWENT NP10 8SP”
     Labels: House number, Building name, Street name, Locality, Town name, County, Postcode
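
A few of the listed features can be computed per token as in the sketch below. The word lists contain only the examples given on the slide, "length" is taken to be the token length (the slide's wording is ambiguous), and the feature names are illustrative rather than those used in Address Index. A list of such dictionaries, one per token, is a common input shape for CRF toolkits.

```python
import string

# Word lists limited to the examples shown on the slide.
ROAD = {"ROAD", "RD", "STREET", "PARK", "FFORDD"}
RESIDENTIAL = {"HOUSE", "LODGE", "COTTAGE", "MEWS"}
FLAT = {"APPT", "FLAT"}
COMPANY = {"CIC", "CIO", "LTD"}


def token_features(tokens, index):
    """Return a feature dictionary for tokens[index] (tokens must be non-empty strings)."""
    token = tokens[index]
    bare = token.upper().strip(string.punctuation)
    digit_count = sum(ch.isdigit() for ch in token)
    if digit_count == 0:
        digits = "none"
    elif digit_count == len(token):
        digits = "all"
    else:
        digits = "some"
    return {
        "digits": digits,
        "word": bare if digit_count == 0 else False,  # word unless digits, then False
        "length": len(token),                         # assumed meaning of feature 3
        "ends_in_punctuation": token[-1] in string.punctuation,
        "flat": bare in FLAT,
        "company": bare in COMPANY,
        "road": bare in ROAD,
        "residential": bare in RESIDENTIAL,
        "hyphenations": token.count("-"),
        "has_vowels": any(ch in "AEIOU" for ch in bare),
        "is_first": index == 0,
        "is_last": index == len(tokens) - 1,
    }


tokens = "10 CHURCH HOUSE PARK ST BRIDES WENTLOOGE NEWPORT GWENT NP10 8SP".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[3])
```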

  14. 2. Candidate address retrieval Elasticsearch is a fast, highly scalable open-source full-text search and analytics engine. It allows for complex search features and requirements. • Each parsed token of the input address is compared against relevant AB address fields • Matches allow for synonyms and fuzziness • Scores from individual matched fields are combined using custom query logic and boosting • Fall-back query on full address and bigrams
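
The shape of such a query can be illustrated with a request body built as a plain Python dictionary. This is only a sketch: the field names (`postcode`, `town_name`, `full_address` and so on) and the boost values are hypothetical, not the Address Index mapping, and nothing here is sent to a server. It uses standard Elasticsearch Query DSL elements: a `bool` query with `should` clauses, and `match` queries with `fuzziness` and `boost`. Synonyms and bigrams are normally configured in the index analyzers and are not shown.

```python
# Hypothetical per-field boosts: more reliable fields contribute more to the score.
FIELD_BOOSTS = {"postcode": 3.0, "house_number": 2.0, "street_name": 1.5}


def build_query(parsed, full_address, size=5):
    """Build a search body from parsed tokens, with a fall-back on the full address."""
    should = []
    for field, value in parsed.items():
        should.append({
            "match": {
                field: {
                    "query": value,
                    "fuzziness": "AUTO",  # tolerate small spelling differences
                    "boost": FIELD_BOOSTS.get(field, 1.0),
                }
            }
        })
    # Fall-back clause: compare the whole input string against a full-address field.
    should.append({"match": {"full_address": {"query": full_address, "boost": 0.5}}})
    return {
        "size": size,
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
    }


body = build_query(
    {"postcode": "NP10 8SP", "town_name": "NEWPORT"},
    "10 CHURCH HOUSE PARK ST BRIDES WENTLOOGE NEWPORT GWENT NP10 8SP",
)
print(body)
```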

  15. 3. Ranking and scoring • The last step in the process is to evaluate the quality of the match between the input and the retrieved candidate address (its UPRN) and convey it to users. • A single confidence score is calculated using currently available information such as the Elasticsearch score, a bespoke rule-based score, parsing properties and the difference/ratio of scores between candidates. • We report the confidence score as a percentage, because it combines an intuitive measure (people understand how good 65% is) with something that allows automatic filtering to cut off the end of a results set. • The threshold value varies depending on the use case. For example, if an individual match is requested and reviewed by a human, a good threshold would be 5%, because it allows more candidates to be displayed if there is any ambiguity. For automated matching, only very good matches should be returned, and therefore the recommended threshold is 60%-80%.
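
The effect of the threshold can be shown with a small filter. How the confidence score itself is computed is not given in the slides, so the sketch below takes the scores as given; the candidate records and their identifiers are made up for illustration.

```python
def filter_candidates(candidates, threshold_percent):
    """Keep candidates at or above the threshold, best first."""
    kept = [c for c in candidates if c["confidence"] >= threshold_percent]
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)


# Made-up candidates; real results would carry a UPRN and a formatted address.
candidates = [
    {"id": "candidate-B", "confidence": 12.0},
    {"id": "candidate-A", "confidence": 91.5},
    {"id": "candidate-C", "confidence": 3.0},
]

# Human review: a low threshold (5%) leaves several candidates to choose from.
print([c["id"] for c in filter_candidates(candidates, 5)])

# Automated matching: a high threshold (here 80%) returns only very good matches.
print([c["id"] for c in filter_candidates(candidates, 80)])
```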

  16. Address Index performance • Control over the input is limited and the performance depends strongly on the input quality • Goal: maximise the number of addresses for which the UPRN returned by Address Index matches the baseline reference while keeping the false positives (wrong matches) acceptably low • On baseline datasets created by the subject expert or provided by users a correct match rate of 97.5% has been achieved

  17. Address Index services • RESTful API • User web interface • Bulk matching service. The code is publicly available on GitHub: https://github.com/ONSdigital/address-index-data and https://github.com/ONSdigital/address-index-api

  18. Questions

  19. Address matching: Connecting data sources with fuzzy addresses. Peter Hufton (Data Science), Anna Carlsson-Hyslop (Statistics & English Housing Survey)

  20. MHCLG single departmental plan “Fixing our broken housing market” (Housing white paper, Feb 2017) “Our objectives: 1. Deliver the homes the country needs. 2. Make the vision of a place you call home a reality.” (MHCLG Single Departmental Plan, May 2018)

  21. Objective: Connecting data sources

  22. Connecting different data sources • English Housing Survey, Energy Performance Certificate, Land Registry and Zoopla listing data are each identified by address. • The address matching project links each address to its Unique Property Reference Number (UPRN). • We have better data available to: move closer to understanding how the housing market operates in real time; carry out better predictive modelling of policy; better monitor the direct consequences of changes.

  23. Challenge: Addresses are fallible

  24. Addresses are fallible

  25. Addresses are fallible • Common problems include: spelling mistakes; abbreviations “Co; over-specification; incomplete information; outdated information; errors. • An address typically arrives as an unstructured string. How do we determine which parts are most important? • Example: “Flat 2, Green House, New Road, Neath, SA10 9XX” vs “Fflat 2, Ty Wyrdd, Heol Newydd, Castell Nedd, SA10 9XX”

  26. Address Matching outside of MHCLG Unique considerations: • Data sources: the solution must be tailored to our data sources and adaptable as our needs change • Scalability: the Zoopla monthly update has a size of 120 GB • Licensing/sensitivity: the solution must be developed in-house

  27. Solution: An innovative approach to the problem of address matching

  28. A solution to address matching by

  29. How do we connect fuzzy addresses to the appropriate reference address?

  30. Machine-learning methods • What makes a well-matched address? “4 Kings Court, 123a Lordship Lane, East Dulwich SE22 1AB” vs “Flat 4 King’s Court, 123A Lordship Lane, London SE22 1AB” • Machine-learning programs ×1000s

  31. Okay: So how well do we do?

  32. How well do we do? Consider a sample of addresses from NG18 English Housing Survey 3600 addresses Land Registry 12000 addresses Energy Performance Certificate Zoopla listing data Address matching project 98% with a single match (UPRN) 97% with a single match (UPRN) 88% Zoopla records can be connected to EPC

  33. Use cases for address matching for the English Housing Survey

  34. Conclusions • Although addresses are fallible, we now have a robust solution for connecting ‘fuzzy’ addresses to their UPRN. • We can now align data from disparate sources. • Address matching has specific uses in the English Housing Survey for: matching to council tax band; matching to energy efficiency and usage; reducing bias in the leasehold estimate.

  35. Questions

  36. Next Sharing Seminar: Power BI. Please contact Elizabeth.Brankley@ons.gov.uk with any questions
