Crime Section, Central Statistics Office.

Case Study - Matching Criminal Justice Administrative Datasets in the absence of common unique identifiers. Crime Section, Central Statistics Office.

Acknowledgments • The Crime Section would like to acknowledge the assistance provided by the Probation Service in this project. • In particular, we would like to thank Michael Donnellan and Aidan Gormley.

Areas of Discussion • Connectivity between the various Criminal Justice Database Systems • The Challenge - Absence of unique identifier • The Solution – CSO statistical matching. • Results of matching exercise • Future Goals

Connectivity between the various Criminal Justice Database Systems • Robust links between PULSE and CCTS. • Tenuous link between PULSE/CCTS and Probation • Need to make these links into strong links - but how?

The Challenge – Absence of common unique identifier. • Common unique identifier allows rapid integration of datasets. • The common identifiers between PULSE and CCTS include Charge No., Summons No. • These are linked to the Person PULSE ID in PULSE, to allow linking by individual. • Result: Able to produce statistics combining police and court outcome data. • However, there is a problem....

The Challenge – Linking Probation and PULSE data • No such common identifier between CCTS/PULSE and Probation • Probation Service uses its own unique identifiers. • No linking between this and PULSE identifiers such as Person PULSE ID and Court Outcome number. • Cannot link the datasets and cannot produce statistics.

The Challenge, and its solution • But a solution exists. • If persons in the separate systems can be matched across variables that exist in both systems, • then a table linking unique identifiers can be produced. • Variables such as first name, surname, date of birth and address exist in both systems. • These can be used to link the two systems. • This is the basis of the CSO solution.

The Solution – CSO statistical matching. • The CSO received a test dataset from the Probation Service, for years 2007 and 2008. • Over 8700 data orders with corresponding info. • First, a manual matching exercise was carried out to test feasibility. • Matching by first name, surnames, addresses, dates of birth on over 7800 probation records. • A random sample of 800 records was processed. • It took 8.5 person-days to process this 10% sample. • At this rate, it would have taken over 90 days to process the entire dataset.

The Solution – CSO statistical matching. • The next step was to automate the matching process for the entire dataset. • A fully automated matching solution is not really possible. • A mixed-model method incorporating automatic and manual matching was used, to achieve 99% matching. • 70% of records were matched automatically, without human involvement. • This match was on first name, surname and date of birth.

The Solution – CSO statistical matching. • Additional sorting/matching algorithms to simplify manual matching of remaining 28%. • There were four additional stages, with progressively increasing human involvement. • These were to identify cases where age or address data does not match, for example. • Processes still mainly automated and algorithm based, so fast to process. • The entire process was completed in 2 man-days. 99% of all the records (7,800+) matched. • This compares with the projected 90+ man-days.

The Solution – CSO statistical matching. • Step one. • Both datasets sorted by names, addresses and dates of birth. NB All datasets shown are merely representations, not actual data

The Solution • These are large datasets.

The Solution

The Solution – CSO statistical matching. • Step Two. • The probation and PULSE records are matched automatically by names and date of birth, using SAS. • 70% of entries are matched automatically this way. • For each probation ID, the corresponding PULSE IDs are listed. • A person may have multiple PULSE IDs for a single probation ID.
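The exact-match stage described in Step Two can be sketched as follows. This is a minimal illustration in Python, not the SAS code used by the CSO: the field names (`probation_id`, `pulse_id`, `first_name`, `surname`, `dob`), the case-insensitive comparison and all records are assumptions invented for the example.

```python
from collections import defaultdict


def match_key(record):
    """Build the exact-match key: first name, surname, date of birth.

    Upper-casing and trimming are assumptions; the source does not say
    how the names were standardised before matching.
    """
    return (
        record["first_name"].strip().upper(),
        record["surname"].strip().upper(),
        record["dob"],
    )


def exact_match(probation_records, pulse_records):
    """Return {probation_id: [pulse_id, ...]} for records that agree on the key."""
    pulse_index = defaultdict(list)
    for rec in pulse_records:
        pulse_index[match_key(rec)].append(rec["pulse_id"])

    links = {}
    for rec in probation_records:
        pulse_ids = pulse_index.get(match_key(rec))
        if pulse_ids:
            # One probation ID may correspond to several PULSE IDs.
            links[rec["probation_id"]] = sorted(pulse_ids)
    return links


# Invented illustrative records, not real data.
probation = [
    {"probation_id": "PR1", "first_name": "Mary", "surname": "Byrne", "dob": "1985-03-12"},
    {"probation_id": "PR2", "first_name": "John", "surname": "O'Brien", "dob": "1979-11-02"},
]
pulse = [
    {"pulse_id": "PU10", "first_name": "MARY", "surname": "Byrne", "dob": "1985-03-12"},
    {"pulse_id": "PU11", "first_name": "Mary", "surname": "BYRNE", "dob": "1985-03-12"},
    {"pulse_id": "PU12", "first_name": "John", "surname": "OBrien", "dob": "1979-11-02"},
]

links = exact_match(probation, pulse)
unmatched = [r["probation_id"] for r in probation if r["probation_id"] not in links]
```

In this sketch `links` is the linking table of identifiers, and `unmatched` holds the probation IDs that would pass to the later, more manual stages.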

The Solution – CSO statistical matching (ctd.) • Step Three. • The next step is to ensure that surnames with the prefix “O’” are recorded in the same manner in both datasets. • This step has minimal human involvement. • One dataset records “O’ ” as “O”. • This is not detected or matched in the initial stage. • This can be corrected with an automatic software “Replace” function. • When the automatic matching (Step Two) is run again: • Now 85% of records match automatically.
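The “Replace” operation in Step Three might look like the sketch below. The function name `normalise_surname` and the choice to drop the apostrophe (rather than add it) are assumptions; the source says only that the two datasets must record the prefix in the same manner.

```python
import re

# Leading "O" followed by a straight or curly apostrophe and optional spaces.
_O_PREFIX = re.compile(r"^O['\u2019]\s*")


def normalise_surname(surname):
    """Upper-case a surname and rewrite a leading O' / O’ prefix as plain O."""
    cleaned = surname.strip().upper()
    return _O_PREFIX.sub("O", cleaned)


# Invented examples: all spellings of the same surname collapse to one form.
variants = ["O'Brien", "O\u2019Brien", "O' Brien", "OBrien"]
normalised = {normalise_surname(v) for v in variants}
```

Applying the same function to both datasets before re-running the exact match means surnames that differ only in how the prefix was recorded now compare as equal.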

The Solution – CSO statistical matching (ctd.) • Step Four. • The next step is to match cases where the surname and date of birth match and the first names are closely related. • This step has more human involvement. Geographical info is used as a further check. This allows us to find aliases. • Example shown here: • It is clear that although “Liz” and “Elizabeth”, and “Alex” and “Lex”, differ, they refer to the same person.
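One way to shortlist the Step Four cases for a human reviewer is sketched below. The containment rule, the small `NICKNAME_GROUPS` table, the `county` field and all records are hypothetical; the source does not describe how closely related first names were actually detected.

```python
# Hypothetical nickname table for pairs that simple containment cannot catch.
NICKNAME_GROUPS = [
    {"WILLIAM", "BILL"},
    {"MARGARET", "PEGGY"},
]


def names_related(a, b):
    """True if two first names are equal, one contains the other, or both share a group."""
    a, b = a.strip().upper(), b.strip().upper()
    if a == b:
        return True
    shorter, longer = sorted((a, b), key=len)
    if len(shorter) >= 3 and shorter in longer:
        return True  # e.g. LIZ in ELIZABETH, LEX in ALEX
    return any(a in group and b in group for group in NICKNAME_GROUPS)


def alias_candidates(probation_records, pulse_records):
    """List (probation_id, pulse_id, same_county) pairs for manual review."""
    candidates = []
    for p in probation_records:
        for q in pulse_records:
            same_surname_dob = (
                p["surname"].upper() == q["surname"].upper() and p["dob"] == q["dob"]
            )
            if same_surname_dob and names_related(p["first_name"], q["first_name"]):
                # Geography is reported as supporting evidence, not used to decide.
                candidates.append((p["probation_id"], q["pulse_id"], p["county"] == q["county"]))
    return candidates


# Invented illustrative records, not real data.
probation = [
    {"probation_id": "PR1", "first_name": "Liz", "surname": "Murphy", "dob": "1988-05-01", "county": "Cork"},
    {"probation_id": "PR2", "first_name": "Alex", "surname": "Kelly", "dob": "1990-07-19", "county": "Dublin"},
]
pulse = [
    {"pulse_id": "PU1", "first_name": "Elizabeth", "surname": "Murphy", "dob": "1988-05-01", "county": "Cork"},
    {"pulse_id": "PU2", "first_name": "Lex", "surname": "Kelly", "dob": "1990-07-19", "county": "Dublin"},
    {"pulse_id": "PU3", "first_name": "Tom", "surname": "Kelly", "dob": "1990-07-19", "county": "Dublin"},
]

review_list = alias_candidates(probation, pulse)
```

The pairwise loop is adequate for a sketch; on a full dataset the records would first be grouped by surname and date of birth so that only records within the same group are compared.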

The Solution – CSO statistical matching (ctd.) • Step Five. • Additional matching steps are then carried out. • One is to check for matching first names, surnames and geographical info, but where dates of birth differ. • Special checks can identify matching cases here. • Another set of checks involves searching for matching first name and date of birth but slightly different surnames. • All these steps lead to a match rate of over 95%. • The final step is a fully manual operation to match the remaining 5%.
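The check for slightly different surnames in Step Five could be approximated with a string-similarity score, as in the sketch below. This swaps in `difflib.SequenceMatcher` from the Python standard library purely for illustration; the 0.75 threshold, the function names and the records are assumptions, and the source does not specify which “special checks” were used.

```python
from difflib import SequenceMatcher


def surnames_similar(a, b, threshold=0.75):
    """True if two different surnames score at or above the similarity threshold."""
    a, b = a.strip().upper(), b.strip().upper()
    if a == b:
        return False  # identical surnames are handled by the exact-match stage
    return SequenceMatcher(None, a, b).ratio() >= threshold


def near_surname_candidates(probation_records, pulse_records):
    """Pairs agreeing on first name and date of birth whose surnames nearly match."""
    pairs = []
    for p in probation_records:
        for q in pulse_records:
            same_first_dob = (
                p["first_name"].upper() == q["first_name"].upper() and p["dob"] == q["dob"]
            )
            if same_first_dob and surnames_similar(p["surname"], q["surname"]):
                pairs.append((p["probation_id"], q["pulse_id"]))
    return pairs


# Invented illustrative records, not real data.
probation = [
    {"probation_id": "PR1", "first_name": "Anne", "surname": "Byrne", "dob": "1982-09-30"},
    {"probation_id": "PR2", "first_name": "Paul", "surname": "Smith", "dob": "1975-01-15"},
]
pulse = [
    {"pulse_id": "PU1", "first_name": "Anne", "surname": "Byrnes", "dob": "1982-09-30"},
    {"pulse_id": "PU2", "first_name": "Paul", "surname": "Smyth", "dob": "1975-01-15"},
    {"pulse_id": "PU3", "first_name": "Paul", "surname": "Kelly", "dob": "1975-01-15"},
]

pairs = near_surname_candidates(probation, pulse)
```

Pairs returned this way are candidates only; consistent with the increasing human involvement described for the later stages, each would still be confirmed by a reviewer.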

Results • The CSO produced detailed results from this linkage. • Tables were produced showing: • Table A: Number of subsequent First Offences (recidivism), during the period 2008-11, by individuals with probation orders issued in 2007-08 • Table B: Subsequent First Offences (recidivism), during the period 2008-11, by individuals with probation orders issued in 2007-08, as a percentage of the Original Primary Offence • Table C: Subsequent First Offence (recidivism) by individuals, during the period 2008-11, with probation orders issued in 2007-08 as a percentage of total original primary offences • Table D: Subsequent First Offence (recidivism) during the period 2008-11 of individuals with probation orders issued in 2007-08 as a % of total subsequent First Offences • Unfortunately, we can show only sample data here.

Future Goals • Further development of matching model. • To incorporate text analysis, fuzzy matching. • To develop a fully automatic process to match to 99%.

Conclusion • This project shows a simple, effective solution to integrating datasets in the absence of a common identifier. • This project doesn’t invalidate the importance of development of unique identifiers. • But it does allow matching of records where it is not feasible to retroactively apply any planned common identifier. • This method is not limited to Criminal Justice Administrative Data. • It can be applied to any datasets with common information on names, dates of birth etc.
