Proposals for linking Big Data and statistical registers
Daniela Fusco*, Tiziana Tuoto*, Antony Rizzi**
* Istat, Italian National Institute of Statistics
** Consiglio di stato
Summary
• Introduction to the statistical use of Big Data
• The proposed record linkage methods
• A case study
• First results
• Concluding remarks
Proposal for linking Big Data and statistical registers, Daniela Fusco, Bruxelles, 14 March 2017
Introduction
• Big Data: Volume, Velocity, Variety
• Statistical Registers: Volume, Velocity, Variety
• Possible solution: cost reduction, enlarged contents, timeliness
[Figure: Statistical registers and Big Data]
Record linkage as we know it
• Record linkage is a classification problem
• The aim is to recognise the same units located in different sources, even if they are represented in non-homogeneous ways
• Statistical methods for RL (probabilistic RL) follow the classical approach due to Fellegi and Sunter (1969) and are now well established (Herzog, Scheuren and Winkler 2007)
• Software and tools for tackling linkage problems:
  - FEBRL (http://datamining.anu.edu.au/projects/linkage.html)
  - RELAIS (Record Linkage At Istat) (http://www.istat.it/it/strumenti/metodi-e-strumenti-it/strumenti-di-elaborazione/relais)
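To make the classification view concrete, here is a minimal sketch of the Fellegi and Sunter decision rule. The m- and u-probabilities, the variable names and the two thresholds are hypothetical values chosen for illustration; they are not taken from the slides or from RELAIS.

```python
import math

# Hypothetical m- and u-probabilities for three matching variables:
# m = P(agreement | true match), u = P(agreement | non-match).
PARAMS = {
    "denomination": (0.90, 0.05),
    "address": (0.85, 0.10),
    "postal_code": (0.95, 0.20),
}


def match_weight(agreements):
    """Sum the log2 likelihood ratios over all matching variables."""
    total = 0.0
    for var, (m, u) in PARAMS.items():
        if agreements[var]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=5.0):
    """Assign a pair to match, possible match or non-match (thresholds are assumptions)."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


pair = {"denomination": True, "address": True, "postal_code": False}
print(classify(match_weight(pair)))
```

Pairs falling between the two thresholds are the ones that would typically be sent to clerical review.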
Record linkage as we know it: the phases
An ONS report (Gill et al, 2001) describes three phases:
1) Pre-elaborations
• Conversion upper/lower case
• Standardization
• Parsing
• Coding
• Construction of derived variables
2) Record linkage
• Select matching and blocking variables
• Edit and parse the variables
• Block and sort the files: Blocking, Sorted neighbourhood, Simhash, Canopy cluster, Hierarchical grouping
• Select the method: Fellegi & Sunter, Bayesian, Deterministic
• Select the model/rules
• Evaluate the model
• Set the thresholds
• Select the matching output: 1 to 1, 1 to many, Many to Many
3) Analysis
• Check by clerical review
• Evaluating linkage errors
PRE-PROCESSING
• Upper/lower case conversion
• Handling of null strings
• Standardization
• Parsing
• …
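The pre-processing and blocking steps listed above can be sketched as follows. This is an illustrative example only: the record layout, the `postal_code` blocking key and the function names are assumptions, and standard blocking is just one of the reduction methods named on the slide.

```python
import re
from collections import defaultdict
from itertools import product


def standardize(text):
    """Upper-case, replace punctuation with spaces, collapse whitespace; None becomes ''."""
    if text is None:
        return ""
    text = re.sub(r"[^\w\s]", " ", text.upper())
    return " ".join(text.split())


def block_by(records, key):
    """Group records by the value of a blocking key (hypothetical field name)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks


def candidate_pairs(file_a, file_b, key="postal_code"):
    """Yield only the cross-file pairs that share the same blocking key."""
    blocks_a, blocks_b = block_by(file_a, key), block_by(file_b, key)
    for value in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[value], blocks_b[value])


register = [{"id": 1, "postal_code": "40121"}, {"id": 2, "postal_code": "00100"}]
scraped = [{"id": "a", "postal_code": "40121"}, {"id": "b", "postal_code": "20100"}]
print([(r["id"], s["id"]) for r, s in candidate_pairs(register, scraped)])
```

Blocking trades a smaller comparison space for the risk of missing true matches whose blocking key is wrong or missing, which is one reason the pre-processing of that key matters.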
Georeferenced approach
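The slide does not say how coordinates are compared. As one plausible sketch, assuming that a scraped unit is linked to the nearest register unit within a distance tolerance, the great-circle (haversine) distance could be used. The 0.5 km tolerance, the tuple layout and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_within(point, candidates, max_km=0.5):
    """Return the id of the closest candidate within max_km, or None.

    point is (lat, lon); candidates is a list of (id, lat, lon) tuples.
    """
    best_id, best_dist = None, max_km
    for cand_id, lat, lon in candidates:
        dist = haversine_km(point[0], point[1], lat, lon)
        if dist <= best_dist:
            best_id, best_dist = cand_id, dist
    return best_id
```

A distance tolerance is needed because geocoded rural addresses and scraped coordinates rarely coincide exactly.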
Combined methods approach: the case study
Aim: using Big Data to update the Farm Register, permitting the production and periodical dissemination of statistics on the activities and services offered by agritourism farms, at minimum cost.
Specifically, at the end of the integration process, it will be possible to:
• Validate the addresses in the SFR and identify them if they are missing;
• Estimate the variables available on the net (e-commerce, price, etc.) to add other information to the SFR;
• Check and integrate information in the SFR (telephone number, e-mail, web site, etc.).
Sources: Italian Farm Register, Internet-scraped data
Target: “hub”, a website hosting and describing a plurality of units
Size: 13,000 units in the FR, 7,000 units scraped from 3 hubs
Review: something about the Farm Register
• Administrative sources:
  - Integrated Administration and Control System (IACS)
  - Animal register
  - Tax declaration on agricultural land
  - Land cadastre
  - Chambers of Commerce
  - Value Added Tax on agricultural income
• Statistical sources:
  - Business Register
  - Agricultural Census
  - Survey on rural tourism accommodations
  - Survey on quality products
[Figure: Number of AFs by main sectorial hubs]
[Figure: Variables scraped from the internet, by topic]
Combined methods approach: the case study
Linkage models
• Linking variables: denomination, address, longitude, latitude, postal code
• Comparison functions: Simhash, Jaro, 3-grams, 3-grams weighted by frequency
• Linkage model: EM binomial and multinomial (5 and 8 classes)
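As an illustration of one of the comparison functions listed above, the sketch below computes a 3-gram similarity between two strings. The slides do not specify the exact variant, so the padding character and the Dice coefficient over trigram multisets are assumptions; the frequency-weighted version is not shown.

```python
from collections import Counter


def trigrams(text):
    """Return the list of 3-character substrings of the padded, lower-cased text."""
    padded = "##" + text.lower() + "##"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_similarity(a, b):
    """Dice coefficient on trigram multisets: 1.0 for identical strings, 0.0 for disjoint ones."""
    count_a, count_b = Counter(trigrams(a)), Counter(trigrams(b))
    shared = sum((count_a & count_b).values())
    return 2 * shared / (sum(count_a.values()) + sum(count_b.values()))


print(round(trigram_similarity("Podere Rosso", "podere rossi"), 3))
```

A score of this kind is continuous in [0, 1], which is what makes it possible to cut it into several agreement classes instead of a single agree/disagree outcome.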
Combined methods approach: the case study
Multinomial EM algorithm
• Traditionally, the EM algorithm is applied to maximize the likelihood with two categories, agree/disagree, for each matching variable.
• Here, we define k categories (k = 5 or 8), where each category represents a class based on an interval of the string comparator values, in this case quantiles.
• The EM algorithm under the multinomial distribution is used to estimate the match parameters for each variable q in class k.
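The idea can be sketched as a two-component mixture (matches and non-matches) in which each matching variable takes one of k classes. This is a minimal illustration under the usual conditional independence assumption, not the authors' implementation; the function names, the starting values, the smoothing constant `eps` and the fixed number of iterations are all assumptions.

```python
import bisect
import math
import statistics


def to_quantile_classes(scores, n_classes):
    """Map each similarity score to a class 0..n_classes-1 using quantile cut points."""
    cuts = statistics.quantiles(scores, n=n_classes)
    return [bisect.bisect_right(cuts, s) for s in scores]


def em_multinomial(patterns, n_classes, n_iter=100, p_init=0.1, eps=1e-6):
    """Estimate p = P(match) and the class probabilities m[q][k], u[q][k].

    patterns: list of tuples, one comparison class per matching variable.
    Returns p, m, u and the match posteriors from the last E-step.
    """
    n_pairs, n_vars = len(patterns), len(patterns[0])
    # Starting values (assumption): matches favour high classes, non-matches low ones.
    weights = [k + 1 for k in range(n_classes)]
    m = [[w / sum(weights) for w in weights] for _ in range(n_vars)]
    u = [row[::-1] for row in m]
    p = p_init
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for g in patterns:
            like_m = p * math.prod(m[q][g[q]] for q in range(n_vars))
            like_u = (1 - p) * math.prod(u[q][g[q]] for q in range(n_vars))
            post.append(like_m / (like_m + like_u))
        # M-step: update p and the per-variable class probabilities (lightly smoothed).
        total_m = sum(post)
        total_u = n_pairs - total_m
        p = total_m / n_pairs
        for q in range(n_vars):
            for k in range(n_classes):
                w_m = sum(w for w, g in zip(post, patterns) if g[q] == k)
                w_u = sum(1 - w for w, g in zip(post, patterns) if g[q] == k)
                m[q][k] = (w_m + eps) / (total_m + eps * n_classes)
                u[q][k] = (w_u + eps) / (total_u + eps * n_classes)
    return p, m, u, post
```

With k = 2 this reduces to the traditional agree/disagree model; larger k lets partial string agreement contribute intermediate evidence.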
The result of the combined approach
[Figures: Denomination; Address]
Comparison
Evaluation
The result of the georeferenced approach in Emilia Romagna Region
Lesson learnt
The first evidence highlights the role played by the pre-processing phase and the data cleaning/reconciliation activity. It is well known in official statistics that preparing the input files is the first phase and requires 75% of the whole effort to implement a record linkage procedure. In this case the pre-processing step was particularly large and expensive, requiring almost 95% of the whole time. Ignoring this task may compromise the effectiveness of the following steps.
Conclusions
• The agricultural field is the most challenging area for evaluating the performance of new linkage methodologies, due to the well-known difficulties in recognising statistical units in this field, as well as rural addresses.
• Dealing with new sources of data requires new methodologies for linking data. However, due attention should be devoted to evaluating output quality, to better understand the benefits and risks of the integration and to allow analysts to take potential integration errors into account in subsequent analyses.
• In this paper, we experiment with and compare the use of GPS coordinates as matching variables and for spatial linkage. Moreover, we introduce some machine learning algorithms in order to test their effectiveness in dealing with unstructured data and their advantages over traditional standardization and parsing of linkage variables. In addition, these features are compared with some innovations in the traditional approach to probabilistic record linkage.
Next steps
• We will explore new solutions
• We will assess the validation of the linkage results and the measurement of output quality