
Experiences from extracting large data sets from Swedish public offices Fredrik Liljeros

Presentation Transcript


  1. Experiences from extracting large data sets from Swedish public offices Fredrik Liljeros

  2. Outline • Why use data sets from public offices? • Three examples of available Swedish datasets • Workplace and household data • In-patient data • Data of suspected criminals • Problems with Swedish public office data

  3. Sociological data • Expensive to collect • Time-consuming (especially time series) • Low response rate • Network data are associated with special problems

  4. Sampling of Network Data

  5. We can’t use a random sample

  6. Extracting data from existing databases!

  7. Sweden may be seen as an outlier when it comes to available public data • 1686: All priests were ordered to keep track of all people living in their parishes (Sweden had a state church until 2000) • 1749: First census • 1756: Foundation of the governmental office ”Tabell kommisionen” (Sweden and Finland) • 1858: Foundation of Statistics Sweden, SCB (www.SCB.SE)

  8. All individuals officially living in Sweden have a unique identifier, the ”personnummer”, for example 700209-0960

  9. Example 1 The Sweden database

  10. The network • Individuals 8,861,392 • Families 4,641,829 • Workplaces 437,936

  11. • Giant component: 5 942 389 • Average path distance: 8.5 • Diameter: 22
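The three figures on this slide can be computed with breadth-first search. The following is a minimal sketch on toy data, assuming people are linked whenever they share a family or a workplace; the names `memberships`, `build_graph`, `giant_component` and `path_stats` are illustrative and not taken from the presentation. Exact all-pairs search as shown here would likely be too slow for millions of individuals, where sampling source nodes is a common workaround.

```python
from collections import deque

def build_graph(memberships):
    """Turn (person, group) pairs into a person-to-person adjacency dict."""
    adj, groups = {}, {}
    for person, group in memberships:
        adj.setdefault(person, set())
        groups.setdefault(group, set()).add(person)
    for members in groups.values():
        for person in members:
            adj[person] |= members - {person}
    return adj

def bfs_distances(adj, source):
    """Shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def giant_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for node in adj:
        if node not in seen:
            component = set(bfs_distances(adj, node))
            seen |= component
            if len(component) > len(best):
                best = component
    return best

def path_stats(adj, component):
    """Average shortest path length and diameter within one component."""
    total = pairs = diameter = 0
    for source in component:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return (total / pairs if pairs else 0.0), diameter

# Toy data (hypothetical): families "fam*" and one workplace "work1".
memberships = [
    ("ann", "fam1"), ("bob", "fam1"), ("bob", "work1"), ("cia", "work1"),
    ("cia", "fam2"), ("dan", "fam2"), ("eve", "fam3"),
]
adj = build_graph(memberships)
giant = giant_component(adj)
average_distance, diameter = path_stats(adj, giant)
```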

  12. Send home (or vaccinate) everyone except max size of workplace

  13. Send home people randomly

  14. Average path distance
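The two interventions on the preceding slides can be compared on a toy network. This is a sketch under stated assumptions: slide 12 is read as closing every workplace above a maximum size, a person sent home keeps family contacts but loses workplace contacts, and the size of the largest connected component is used as the outcome. The function names and the toy data are hypothetical.

```python
import random
from collections import deque

def link(adj, members):
    """Connect every pair of people in one family or workplace."""
    for person in members:
        adj[person] |= members - {person}

def contact_graph(families, workplaces, stay_home=frozenset(), max_workplace_size=None):
    """Build person-to-person contacts under an optional intervention."""
    people = set().union(*families.values(), *workplaces.values())
    adj = {p: set() for p in people}
    for members in families.values():
        link(adj, members)                  # family contacts always remain
    for members in workplaces.values():
        if max_workplace_size is not None and len(members) > max_workplace_size:
            continue                        # assumed reading: large workplace is closed
        link(adj, members - stay_home)      # people sent home drop out of work contacts
    return adj

def largest_component_size(adj):
    """Size of the largest connected component, found by breadth-first search."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best

def random_stay_home(people, fraction, rng):
    """Pick a random fraction of people to send home."""
    k = round(fraction * len(people))
    return frozenset(rng.sample(sorted(people), k))

# Toy data (hypothetical).
families = {"f1": {"a", "b"}, "f2": {"c", "d"}, "f3": {"e", "f"}}
workplaces = {"big": {"a", "c", "e"}, "small": {"b", "d"}}

baseline = largest_component_size(contact_graph(families, workplaces))
capped = largest_component_size(
    contact_graph(families, workplaces, max_workplace_size=2))
sent_home = random_stay_home({"a", "b", "c", "d", "e", "f"}, 0.5, random.Random(0))
randomised = largest_component_size(
    contact_graph(families, workplaces, stay_home=sent_home))
```

In this toy case closing the single large workplace splits the network, while sending one arbitrary person home may leave it connected through the remaining ties.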

  15. Example 2 Data about suspected criminals

  16. The data • All individuals registered as suspected of having committed a criminal act, for every year between 1997 and 2005 • Total number of suspected individuals: 348 402 • Types of crimes: 144 • Total number of reported individual crimes: 924 783 • Average number of suspected crime types per individual: 2.65 • Standard deviation of number of suspected crime types per individual: 3.3

  17. Purpose • Can social network visualization tools help us to give a better sense of how different crimes are related to each other?

  18. Basic concepts • Node: A specific type of crime, for example: • “Assault, outdoors, against child 0-6 years of age, unacquainted with the victim” • “Trafficking for sexual purposes” • Link: Exists between two types of crimes if at least one individual has been suspected of both crimes in different years
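The node and link definitions above can be sketched in a few lines. This is a minimal illustration on invented records, assuming the data arrive as `(person, crime_type, year)` tuples; the record layout and the function name `crime_links` are assumptions, not the format of the actual register.

```python
from collections import defaultdict
from itertools import combinations

def crime_links(records):
    """Map each linked pair of crime types to the set of people linking them.

    Two crime types are linked if at least one person was suspected of
    both, in different years.
    """
    years = defaultdict(lambda: defaultdict(set))   # person -> crime -> years
    for person, crime, year in records:
        years[person][crime].add(year)

    links = defaultdict(set)                        # (crime_a, crime_b) -> people
    for person, by_crime in years.items():
        for a, b in combinations(sorted(by_crime), 2):
            # Some year for a differs from some year for b exactly when
            # the two year sets together hold more than one year.
            if len(by_crime[a] | by_crime[b]) > 1:
                links[(a, b)].add(person)
    return links

BANK = "Robbery, with firearm, (Bank)"
POST = "Robbery, with firearm, (Post)"
ASSAULT = "Assault"

# Invented records for illustration only.
records = [
    (1, BANK, 2002), (1, POST, 2005),        # different years: link
    (2, BANK, 2003), (2, ASSAULT, 2003),     # same year only: no link
    (3, ASSAULT, 2001), (3, POST, 2004),     # different years: link
]
links = crime_links(records)
```

The size of each value in `links` is the number of individuals shared by the two crime types.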

  19. Example • Bank, 2002: “Robbery, with firearm, (Bank)” • Post, 2005: “Robbery, with firearm, (Post)”

  20. The mess of all violent crimes

  21. A minimum spanning tree

  22. What is a minimum spanning tree? (Figure: example graph with edge weights 1 to 6)
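A minimum spanning tree connects all nodes of a weighted graph with the smallest possible total edge weight and no cycles. The sketch below uses Kruskal's algorithm on a small invented graph; it is one standard way to compute such a tree and is not necessarily the procedure used for the pictures in this presentation.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm. edges is a list of (weight, u, v) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for weight, u, v in sorted(edges):       # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                 # edge joins two separate trees
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree

# Invented example graph.
nodes = ["A", "B", "C", "D"]
edges = [(1, "A", "B"), (3, "A", "C"), (2, "B", "C"), (4, "B", "D"), (5, "C", "D")]
tree = minimum_spanning_tree(nodes, edges)
total_weight = sum(weight for weight, _, _ in tree)
```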

  23. Number of mutual links A B

  24. Number of mutual links may not be a good measure

  25. Highly correlated A B

  26. Weak correlation A B

  27. A simple measure of correlation between crimes

  28. A simple Example A B
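The transcript does not preserve the formula for the correlation measure. As an assumed stand-in only, the sketch below uses the Jaccard index: the number of individuals suspected of both crimes divided by the number suspected of at least one. Unlike the raw count of mutual links, it does not automatically grow with how common each crime is. The sets `A` and `B` are invented.

```python
def jaccard(suspects_a, suspects_b):
    """Share of individuals suspected of both crimes among those suspected of either."""
    union = suspects_a | suspects_b
    if not union:
        return 0.0
    return len(suspects_a & suspects_b) / len(union)

# Invented sets of person identifiers for two crime types.
A = {1, 2, 3, 4}
B = {3, 4, 5}
similarity = jaccard(A, B)      # 2 shared individuals out of 5 in total
```

To feed such a similarity into a minimum spanning tree, one option is to use `1 - similarity` as the edge weight, so that strongly associated crimes end up close together.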

  29. A minimum spanning tree based on crime correlation

  30. A minimum spanning tree based on crime correlation with a lower threshold of 0.01

  31. The “mess” of sex buyers

  32. A minimum spanning tree of suspected crimes of suspected sex buyers based on crime correlation

  33. Conclusion • Playing with different graphs may give a good first picture of how different crimes are associated with each other • We still need traditional statistical techniques to test hypotheses • Existing software packages are not very user-friendly (three different programs were needed to produce these pictures: Windows SQL Server, Mathcad and Pajek)

  34. Example 3 Data about inpatients in a hospital system

  35. The hospital network

  36. The network • All hospitalizations of individuals in Stockholm 2001-2002 • 295,108 individuals • 570,382 institutional, healthcare occasions • 702 wards located at different hospitals • The mean number of patients admitted to the wards, per day, varied between one and 69 (mean 10.05 and standard deviation 9.44)

  37. Degree distributions
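A degree distribution counts how many nodes have each number of neighbours. The transcript does not say how the hospital network was built, so the sketch below assumes wards are nodes and two wards are linked when the same patient was admitted to both; the admission records are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations

def ward_degrees(admissions):
    """Number of other wards each ward shares at least one patient with."""
    wards_by_patient = defaultdict(set)
    for patient, ward in admissions:
        wards_by_patient[patient].add(ward)

    neighbours = defaultdict(set)
    for wards in wards_by_patient.values():
        for ward in wards:
            neighbours[ward]                 # register wards that have no links
        for a, b in combinations(sorted(wards), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {ward: len(linked) for ward, linked in neighbours.items()}

def degree_distribution(degrees):
    """Map each degree to the number of wards with that degree."""
    return Counter(degrees.values())

# Invented (patient, ward) admission records.
admissions = [("p1", "w1"), ("p1", "w2"), ("p2", "w2"), ("p2", "w3"), ("p3", "w4")]
degrees = ward_degrees(admissions)
distribution = degree_distribution(degrees)
```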

  38. Duration of hospital stays

  39. Problems with Swedish public office data • You usually have to pay for the data • You are only allowed to use the data for the purpose you bought it for • You can’t share the data for free • Swedish data may not be of general interest

  40. A last animation

  41. Relevant publications
