Proposals for linking Big Data and statistical registers

Daniela Fusco*, Tiziana Tuoto*, Antony Rizzi**. * Istat, Italian National Institute of Statistics; ** Consiglio di Stato. Summary: Introduction to the statistical use of Big Data; the proposed record linkage methods

Presentation Transcript


  1. Proposals for linking Big Data and statistical registers. Daniela Fusco*, Tiziana Tuoto*, Antony Rizzi**. *Istat, Italian National Institute of Statistics; **Consiglio di Stato

  2. Summary • Introduction to the statistical use of Big Data • The proposed record linkage methods • A case study • First results • Concluding remarks. Proposal for linking Big Data and statistical registers, Daniela Fusco, Bruxelles, 14 March 2017

  3. Introduction • Big Data: Volume, Velocity, Variety • Statistical registers: Volume, Velocity, Variety • Possible solution: cost reduction, enlarged content, timeliness

  4. [Figure: statistical registers and Big Data]

  5. Record linkage as we know it • Record linkage is a classification problem • The aim is to recognise the same units located in different sources, even if they are represented in non-homogeneous ways • Statistical methods for record linkage (RL), i.e. probabilistic RL, follow the classical approach of Fellegi and Sunter (1969) and are now well established (Herzog, Scheuren and Winkler 2007) • Software and tools for tackling linkage problems: • FEBRL (http://datamining.anu.edu.au/projects/linkage.html) • RELAIS (Record Linkage At Istat) http://www.istat.it/it/strumenti/metodi-e-strumenti-it/strumenti-di-elaborazione/relais
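
The Fellegi and Sunter approach cited on this slide can be sketched roughly as follows: each candidate pair is compared variable by variable, and every agreement or disagreement contributes a log-likelihood-ratio weight. The m and u probabilities, the thresholds and the function names below are illustrative assumptions, not values or code from the presentation.

```python
import math

# Assumed, purely illustrative probabilities (not from the presentation):
# m = P(variable agrees | pair is a true match)
# u = P(variable agrees | pair is a non-match)
M_PROBS = {"name": 0.95, "address": 0.85, "postal_code": 0.90}
U_PROBS = {"name": 0.01, "address": 0.05, "postal_code": 0.10}


def match_weight(agreement):
    """Sum the log2 agreement/disagreement weights for one record pair.

    `agreement` maps each matching variable to True (agree) or False (disagree).
    """
    total = 0.0
    for var, agrees in agreement.items():
        m, u = M_PROBS[var], U_PROBS[var]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, upper=5.0, lower=0.0):
    """Three-way decision; the thresholds here are arbitrary placeholders."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"  # would go to clerical review
```

In practice the m and u probabilities are estimated from the data (typically with the EM algorithm), and the two thresholds are set according to the acceptable error rates.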

  6. Record linkage as we know it: the phases. An ONS report (Gill et al., 2001) describes three phases: 1) Pre-processing 2) Record linkage 3) Analysis
     • Pre-processing operations: conversion of upper/lower case • handling of null strings • standardization • parsing • coding • construction of derived variables
     • Select matching and blocking variables
     • Edit and parse the variables
     • Block and sort the files: blocking, sorted neighbourhood, Simhash, canopy cluster, hierarchical grouping
     • Select the method: Fellegi & Sunter, Bayesian, deterministic
     • Select the model/rules
     • Evaluate the model
     • Set the thresholds
     • Select the matching output: 1 to 1, 1 to many, many to many
     • Clerical check
     • Evaluate linkage errors
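
As a minimal sketch of two of the steps listed on this slide, the snippet below standardizes a string (case conversion, null handling, accent and punctuation removal) and then applies plain blocking on a shared key. The function names and the record layout (dictionaries with a `postal_code` field) are assumptions made for illustration; the other blocking methods on the slide are not shown.

```python
import re
import unicodedata
from collections import defaultdict


def standardize(text):
    """Upper-case, strip accents and punctuation, collapse whitespace."""
    if text is None:  # treat null strings as empty
        return ""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit or space with a space.
    text = re.sub(r"[^A-Z0-9 ]+", " ", text.upper())
    return re.sub(r"\s+", " ", text).strip()


def block_pairs(register, scraped, key):
    """Return candidate pairs (i, j) whose records share the same blocking key."""
    index = defaultdict(list)
    for j, rec in enumerate(scraped):
        index[rec[key]].append(j)
    return [
        (i, j)
        for i, rec in enumerate(register)
        for j in index.get(rec[key], [])
    ]
```

Blocking reduces the number of comparisons from every possible pair to only those pairs that agree on the key, at the cost of missing true matches whose key values differ.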

  7. Georeferenced approach
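
The slide does not spell out how the georeferenced approach works. One plausible reading, sketched here under that assumption, is to link each register unit to the nearest scraped unit within a distance threshold, using the haversine great-circle distance. The 0.5 km threshold and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_within(register_point, scraped_points, max_km=0.5):
    """Index of the closest scraped point within max_km, or None if there is none.

    Points are (latitude, longitude) tuples; max_km is an assumed threshold.
    """
    best, best_d = None, max_km
    for j, (lat, lon) in enumerate(scraped_points):
        d = haversine_km(register_point[0], register_point[1], lat, lon)
        if d <= best_d:
            best, best_d = j, d
    return best
```

Coordinates can also be used as ordinary matching variables alongside name and address, rather than as the sole linkage criterion.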

  8. Combined methods approach: the case study. Aim: using Big Data to update the Farm Register, permitting the production and periodic dissemination of statistics on the activities and services offered by agritourism farms, at minimum cost. Specifically, at the end of the integration process it will be possible to: • Validate the addresses in the SFR and identify them if they are missing; • Estimate the variables available on the net (e-commerce, price, etc.) to add further information to the SFR; • Check and integrate the information in the SFR (telephone number, e-mail, web site, etc.). Sources: Italian Farm Register and Internet-scraped data. Target: "hub", a website hosting and describing a plurality of units. Size: 13,000 units in the FR; 7,000 units scraped from 3 hubs

  9. Review: something about the Farm Register • Administrative sources: • Integrated Administration and Control System (IACS) • Animal register • Tax declaration on agricultural land • Land cadastre • Chambers of Commerce • Value Added Tax on agricultural income • Statistical sources: • Business Register • Agricultural Census • Survey on rural tourism accommodations • Survey on quality products

  10. [Table: Number of AFs by main sectorial hub] [Table: Variables scraped from the internet, by topic]

  11. Combined methods approach: the case study. Linkage models • Linking variables: denomination, address, longitude, latitude, postal code • Comparison functions: Simhash, Jaro, 3-grams, 3-grams weighted by frequency • Linkage model: EM binomial and multinomial (5 and 8 classes)
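
To illustrate one of the comparison functions named on this slide, here is a minimal sketch of an unweighted 3-gram comparator. It uses the Dice coefficient on padded character trigrams, which is one common definition and an assumption here; the slide does not say which variant was used, and the frequency-weighted version, Jaro and Simhash are not shown.

```python
def trigrams(text):
    """Return the set of character 3-grams of `text`, padded with '#'."""
    padded = "##" + text + "##"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Dice coefficient between the 3-gram sets of two strings (0.0 to 1.0).

    Empty strings are treated as non-comparable and score 0.0.
    """
    if not a or not b:
        return 0.0
    ga, gb = trigrams(a), trigrams(b)
    return 2.0 * len(ga & gb) / (len(ga) + len(gb))
```

A comparator like this returns a continuous score, which can either be cut at a single threshold (agree/disagree) or binned into several classes for a multinomial model.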

  12. Combined methods approach: the case study. Multinomial EM algorithm. Traditionally, the EM algorithm is applied to maximize the likelihood with two categories, agree/disagree, for each matching variable. Here we define k categories (k = 5, 8), where each category represents a class based on an interval of the string comparator values, in this case quantiles. The EM algorithm under the multinomial distribution is used to estimate the match parameters for each variable q in class k
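
The multinomial EM described on this slide can be sketched as a two-class latent mixture (match / non-match) in which each matching variable q takes one of k ordered similarity classes. The sketch assumes conditional independence between variables and that comparator values have already been binned into classes 0..k-1; the function name, the initial values and the smoothing constant are assumptions, not the authors' implementation.

```python
def em_multinomial(gamma, n_classes, n_iter=50, p_init=0.1, eps=1e-6):
    """Estimate match parameters from comparison vectors.

    gamma: list of rows, one per candidate pair; row[q] is the similarity
           class (0 = lowest, n_classes - 1 = highest) of variable q.
    Returns (p, m, u, g): match proportion, m[q][k] = P(class k | match),
    u[q][k] = P(class k | non-match), and per-pair match probabilities
    from the last E-step.
    """
    n_vars = len(gamma[0])
    # Assumed starting point: matches favour high classes, non-matches low ones.
    ramp = [k + 1.0 for k in range(n_classes)]
    total = sum(ramp)
    m = [[w / total for w in ramp] for _ in range(n_vars)]
    u = [[w / total for w in reversed(ramp)] for _ in range(n_vars)]
    p = p_init
    g = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for row in gamma:
            like_m, like_u = p, 1.0 - p
            for q, k in enumerate(row):
                like_m *= m[q][k]
                like_u *= u[q][k]
            g.append(like_m / (like_m + like_u))
        # M-step: re-estimate p, m and u from the posteriors.
        sum_g = sum(g)
        sum_ng = len(g) - sum_g
        p = sum_g / len(g)
        for q in range(n_vars):
            for k in range(n_classes):
                hit = sum(gi for gi, row in zip(g, gamma) if row[q] == k)
                miss = sum(1.0 - gi for gi, row in zip(g, gamma) if row[q] == k)
                # eps smoothing keeps every probability strictly positive.
                m[q][k] = (hit + eps) / (sum_g + n_classes * eps)
                u[q][k] = (miss + eps) / (sum_ng + n_classes * eps)
    return p, m, u, g
```

With k = 2 this reduces to the traditional agree/disagree model; using more classes lets partial agreement on a string comparator carry its own weight.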

  13. The result of the combined approach [Figures: Denomination; Address]

  14. Comparison Evaluation

  15. The result of the georeferenced approach in the Emilia Romagna Region

  16. Lesson learnt. The first evidence highlights the role played by the pre-processing phase and by the data cleaning/reconciliation activity. It is well known in official statistics that preparing the input files is the first phase and requires 75% of the whole effort to implement a record linkage procedure. In this case the pre-processing step was particularly large and expensive, requiring almost 95% of the whole time. Ignoring this task may compromise the effectiveness of the following steps.

  17. Conclusions • The agricultural field is the most challenging area for evaluating the performance of new linkage methodologies, due to the well-known difficulties in recognising the statistical units of this field as well as rural addresses. • Dealing with new sources of data requires new methodologies for linking data. However, due attention should be devoted to evaluating output quality, to better understand the benefits and risks of the integration and to allow analysts to take potential integration errors into account in subsequent analyses. • In this paper, we experiment with and compare the use of GPS coordinates both as matching variables and for spatial linkage. Moreover, we introduce some machine learning algorithms in order to test their effectiveness in dealing with unstructured data, and their advantages over traditional standardization and parsing of linkage variables. In addition, these features are compared with some innovations in the traditional approach to probabilistic record linkage.

  18. Next steps • We will explore new solutions • We will assess the validation of the linkage results and the measurement of output quality

  19. Thank you for your kind attention