Daniela Ichim, Giulio Perani, Giovanni Seri Italian National Statistical Institute (Istat)

Designing Linkage between Patents and Business Registers: the Italian Experience
Daniela Ichim, Giulio Perani, Giovanni Seri
Italian National Statistical Institute (Istat)
{ichim,perani,seri}@istat.it
EESW European Establishment Statistics Workshop 2011
Neuchatel, 12 – 14 September 2011

Outline
• Project description
• Data sets
• Linkage approach
  - Pre-processing of the input files
  - Choice of the matching variables
  - Choice of the similarity function
  - Creation of the search space of link candidate pairs
  - Choice of the decision model and selection of unique links
  - Record linkage evaluation
• Preliminary results
• Future work

Project description
• Aim: profiling the Italian patenting enterprises
• Linking economic data and technological information on patenting enterprises in order to identify the key drivers of patenting propensity
• Evaluating the economic impact of the patenting activity
• Identifying and collecting additional information on enterprises to be surveyed as R&D performers
• Investigating specific sub-populations of enterprises (e.g. biotech enterprises)

Project description
• Source of data: PATSTAT - EPO Worldwide Patent Statistical Database
• Target data: applicants based in Italy
• Period: patent applications from 1985 to 2010
• Subject classification criterion:
  - A) individuals
  - B) establishments: business enterprises, public institutions, non-profit institutions, universities

Data sets: patents
PATSTAT (1), applications: 299769
• Application number (by year)
• International Patent Classification (IPC) code (each application can be classified under several IPC codes)
PATSTAT (2), applications: 72034
• Application number (by year)
• Applicant name
• Applicant code
• Postal/Zip Code
• Applicant Country (=IT)
Additional information to be retrieved from the above database:
• Year of first/last application by applicant
• Number of patent applications filed by applicant
• Region of residence of the applicants
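As a minimal sketch of how the first two derived variables could be computed, the snippet below aggregates application records per applicant. The record layout `(applicant_code, year, application_number)` and the sample values are assumptions made for illustration; they are not PATSTAT data or PATSTAT field names. Region of residence is left out because it would need a postal-code-to-region lookup that the slides do not describe.

```python
from collections import defaultdict

def summarise_applicants(applications):
    """Aggregate (applicant_code, year, application_number) tuples per applicant."""
    numbers = defaultdict(set)   # distinct application numbers per applicant
    years = defaultdict(list)    # application years per applicant
    for applicant_code, year, application_number in applications:
        numbers[applicant_code].add(application_number)
        years[applicant_code].append(year)
    return {
        code: {
            "first_year": min(years[code]),
            "last_year": max(years[code]),
            "n_applications": len(numbers[code]),
        }
        for code in numbers
    }

# Made-up illustrative records (hypothetical codes and numbers).
sample = [
    ("A1", 1999, "EP0001"),
    ("A1", 2004, "EP0002"),
    ("A2", 2010, "EP0003"),
]
summary = summarise_applicants(sample)
```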

Data sets: enterprises
Italian business register: ASIA (Archivio Statistico Imprese Attive)
It is the frame for Istat surveys, built as a logical and physical combination of data from both surveys and administrative sources (Tax Register, Register of Enterprises and Local Units, Social Security Register, Work Accident Insurance Register, Register of the Electric Power Board).
ASIA
• Enterprise identification number
• Enterprise name
• Postal/Zip Code
• NACE code
• Address, municipality, province, region
• Legal form
• Fiscal code
• Enterprise size variables: number of employees, turnover
ASIA 1998-2008 (size 2008 ~ 4.5 million records)

Data sets: linkage output
[Diagram labels: Shared variables: Name, Postal/Zip Code; Enterprise identification number; Applicant identification number; Surveys]

Pre-processing of the input files
Standardisation:
• Accents
• Symbols & special characters
• Double spaces
• Dots (e.g. L.T.D. to LTD), punctuation
• Known abbreviations (about 150 ways to say “in short”)
• Most frequent words (more than 1000 and 100)
• Lower/upper case letters
• Deduplication of words
• Known legal forms (reduced to 6 main categories)
• Universities/public administrations dropped
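The standardisation steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the lookup tables `ABBREVIATIONS` and `LEGAL_FORMS` are tiny hypothetical stand-ins (the slides mention about 150 abbreviations and 6 legal-form categories without listing them), and pulling the legal form out of the name into a separate value is an assumption.

```python
import re
import unicodedata

# Hypothetical, tiny lookup tables used only for illustration.
ABBREVIATIONS = {"FLLI": "FRATELLI"}
LEGAL_FORMS = {"SPA": "SPA", "SRL": "SRL", "SNC": "SNC"}

def standardise_name(raw):
    """Return (standardised name, legal form or None)."""
    # Accents: decompose characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lower/upper case: work in upper case throughout.
    text = text.upper()
    # Dots are removed so that S.P.A. becomes SPA.
    text = text.replace(".", "")
    # Remaining symbols and punctuation become spaces.
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    words = []
    legal_form = None
    for word in text.split():  # split() also collapses double spaces
        word = ABBREVIATIONS.get(word, word)
        if word in LEGAL_FORMS:
            legal_form = LEGAL_FORMS[word]  # assumption: kept apart from the name
            continue
        if word not in words:  # deduplication of words
            words.append(word)
    return " ".join(words), legal_form

result = standardise_name("F.lli  Rossì & Rossi S.p.A.")
```

Removing dots before the abbreviation lookup means the table only needs the dot-free spelling of each abbreviation.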

Choice of the matching variables
• Standardised name in upper case, with words in alphabetical order
• Postal/Zip code
• Legal form
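A minimal sketch of how the three matching variables might be combined into a comparable key is shown below. The function name `matching_key` and the sample values are hypothetical, and comparing keys for exact equality is only one option; the outline also lists the choice of a similarity function, which this sketch does not cover.

```python
def matching_key(std_name, postal_code, legal_form):
    """Build a key from the name (upper case, words sorted), postal code and legal form."""
    words = sorted(std_name.upper().split())
    return (" ".join(words), postal_code, legal_form)

# Word order in the source registers no longer matters once words are sorted.
key_a = matching_key("Rossi Fratelli", "00184", "SPA")
key_b = matching_key("FRATELLI ROSSI", "00184", "SPA")
```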

Search space reduction
Patent applicants: Establishments (Enterprises) - Individuals
• Several words in a name (OK only for enterprises, not for individuals)
• Individuals: the standardised applicant name does not contain
  - a legal form
  - a name not included in the database of Italian first names (“List of italian first names”*)
  - special terms: “enterprise”, “construction”, “hotel”, “systems”, “group”, … (63 values)
*(http://www.nomix.it/nomi-italiani-maschili-e-femminili.php)
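One possible reading of the rule for separating individuals from enterprises is sketched below. The condition on first names is ambiguous in the slide; the sketch assumes it means that at least one word of the name appears in the list of Italian first names. `FIRST_NAMES` is a two-entry hypothetical stand-in for that list, and `SPECIAL_TERMS` holds only the five terms quoted on the slide out of the 63.

```python
# Hypothetical stand-in for the list of Italian first names.
FIRST_NAMES = {"MARIO", "GIULIA"}
# The five special terms quoted on the slide (63 values in total).
SPECIAL_TERMS = {"ENTERPRISE", "CONSTRUCTION", "HOTEL", "SYSTEMS", "GROUP"}

def looks_like_individual(std_name, legal_form):
    """Flag an applicant as an individual under the assumed reading of the rule."""
    words = set(std_name.upper().split())
    if legal_form is not None:   # a legal form points to an enterprise
        return False
    if words & SPECIAL_TERMS:    # a special term points to an enterprise
        return False
    # Assumption: an individual's name contains at least one known first name.
    return bool(words & FIRST_NAMES)
```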

Search space reduction
• Blocking by year of application (reduces only the size of the patent applicants archive: ineffective)
• Blocking by Postal/Zip Code-Region (ineffective)
• Partition of ASIA 2008 (more than 10 employees, 1 employee with legal form)
• ASIA 2007-1998 (recursively removing the enterprises included in more recent ASIA archives)
• R&D survey frame (as a subset of the ASIA archive)
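The recursive removal across ASIA years could look roughly like the sketch below: going from the most recent archive backwards, each year keeps only the enterprises not already present in a more recent archive. The data structure (a dict from year to a set of enterprise identifiers) and the identifiers are assumptions for illustration.

```python
def residual_archives(archives):
    """archives: dict mapping year -> set of enterprise ids.

    Returns a dict mapping year -> ids that appear in no more recent archive.
    """
    seen = set()
    residual = {}
    for year in sorted(archives, reverse=True):  # most recent year first
        residual[year] = archives[year] - seen
        seen |= archives[year]
    return residual

# Made-up identifiers, three years only.
archives = {
    2008: {"E1", "E2"},
    2007: {"E2", "E3"},
    2006: {"E3", "E4"},
}
residual = residual_archives(archives)
```

Each applicant is then compared against a year's residual set only, so an enterprise present in several archives enters the search space once.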

Search space reduction
Neighbourhoods of words: the set of ASIA enterprises having at least one word in common with the patent applicant name
Huge number of small problems!

Search space reduction
Neighbourhoods of words
Hypotheses:
  - assumes that at least one word of a name is registered in the same manner in both registers
Problems:
  - very short words (1-2 letters) generate huge neighbourhoods
  - very common words generate huge neighbourhoods
  - names without neighbourhood
  - not applicable in a probabilistic approach
* 23338 Patent applicants ~ ASIA 2008 (10+ employees)
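A neighbourhood of words can be sketched with an inverted index from words to enterprise identifiers, as below. This is an illustration, not the authors' code: the function names, the sample names and the `min_length` filter are assumptions. The filter is one simple way to sidestep the very-short-word problem; the slides list that problem but do not say how it was handled.

```python
from collections import defaultdict

def build_word_index(enterprises, min_length=3):
    """Map each word (of at least min_length letters) to the ids of enterprises using it."""
    index = defaultdict(set)
    for ent_id, name in enterprises.items():
        for word in set(name.split()):
            if len(word) >= min_length:  # assumption: skip 1-2 letter words
                index[word].add(ent_id)
    return index

def neighbourhood(applicant_name, index):
    """Enterprises sharing at least one indexed word with the applicant name."""
    candidates = set()
    for word in set(applicant_name.split()):
        candidates |= index.get(word, set())
    return candidates

# Made-up standardised names.
enterprises = {
    "E1": "FRATELLI ROSSI",
    "E2": "ROSSI MECCANICA",
    "E3": "BIANCHI TESSILE",
}
index = build_word_index(enterprises)
```

An applicant whose words appear nowhere in the index gets an empty set, which corresponds to the "names without neighbourhood" case listed above.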

Preliminary results
• Still under expert clerical check (~hundreds)
• No duplicated enterprise codes

Preliminary results
[Figure: Patent applicants by year: lost and found (black and red)]

Preliminary results
[Figure: Patenting enterprises in ASIA 2008 by economic activity (NACE 2007). The 5 most frequent NACE divisions]

Future Work
• Methods
  - Neighbourhood based on similarity instead of equality
  - Probabilistic approach (using the R&D survey frame)
• Units
  - Names containing only 2-letter words
  - Individuals (names without legal form)
  - List of companies’ owners and partners
  - List of University Professors/Researchers
  - No neighbourhood names
• Analyses
  - Produce analytical evidence on specific technological areas (e.g. Biotech) using IPC codes
  - Overall classification of patent applicants

Thank you for your attention!
