
Olav ten Bosch 23 March 2016, ESSnet big data WP2, Rome




  1. Webscraping at Statistics Netherlands. Olav ten Bosch, 23 March 2016, ESSnet big data WP2, Rome

  2. Content
     • Internet as a datasource (IAD): motivation
     • Some IAD projects over past years
     • Technologies used
     • Summary / trends
     • Observations / thoughts
     • Legal
     • The Dutch Business Register

  3. The why
     [Diagram labels, in slide order:]
     • Internet sources
     • Faster, better, more efficient
     • Administrative sources:
         • Tax, social security services
         • Municipalities / Provinces
         • Supermarkets
         • …
     • Surveys
     • New indicators
     • Less!!!

  4. Fuel prices (2009)
     • Daily fuel prices from the website of unmanned petrol stations (tinq.nl)
     • Regional prices (per station) every day
     Now (2016):
     • A direct weekly data feed from a travelcard company
     • Fuel prices per day and all transactions of that week
     • Publication on the website: prices per month

  5. Airline tickets (2010)
     • Pilot: 3 robots on 6 airline companies
     • 2 robots built by external companies, 1 by SN
     • Scraped prices are consistent with manual collection
     • Quite expensive; negative business case
     • 2016: airline ticket prices are still collected manually

  6. Housing market (from 2011)
     • Discussions with an external company (iWoz) for > 1 year
     • We scraped 5 sites, about 250,000 observations / week, for 2 years
     From 2013:
     • Direct feed from one of the sites (Jaap.nl)
     • StatLine tables: Bestaande woningen in verkoop (existing dwellings for sale)
     • “based on 80-90 percent of the market”

  7. Bulk price collection for CPI (1) (from 2012)
     • Mainly clothing
     • Software scrapes all prices and product data (id, name, description, category, colour, size, …)
     2016:
     • About 500,000 price observations daily from 10 sites
     • Data from 3 sites used in production of the Dutch CPI
     • Price collection process embedded in the organisation
     • Plans to extend to > 20 sites and to other domains

  8. Bulk price collection for CPI (2)
     [Diagram: Processing bulk data from the Internet. Labels: Big Data; Data collection & Feature extraction; Structured data; Index methods; Index based on internet data. Example features: Fine-knit Jumper, Dark blue, Striped, Cotton edges.]

  9. Robot-assisted price collection
     • Robot tool for detecting price changes on (parts of) websites
     • Traffic light indicates status:
         • Green: nothing changed, price is saved in the database
         • Red: something changed, needs the attention of a statistician
     • Two clicks to keep the old price or store a new one
     • In production since 2014
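The traffic-light check on this slide can be illustrated with a small Python sketch. It is not the Statistics Netherlands tool: the names `PriceCheck`, `normalise` and `check_price` are hypothetical, and the sketch assumes the robot keeps the previously seen text of the watched page fragment next to the last stored price.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PriceCheck:
    status: str            # "green" or "red"
    price: Optional[str]   # price to store automatically; None if a statistician must decide


def normalise(fragment: str) -> str:
    """Collapse whitespace so that pure layout changes do not raise a red light."""
    return " ".join(fragment.split())


def check_price(previous_fragment: str, current_fragment: str, previous_price: str) -> PriceCheck:
    """Compare the watched page fragment with the copy saved at the last visit."""
    if normalise(previous_fragment) == normalise(current_fragment):
        # Green: nothing changed, the old price can be saved again.
        return PriceCheck("green", previous_price)
    # Red: something changed, leave the decision to a statistician.
    return PriceCheck("red", None)


unchanged = check_price("Jumper  EUR 19,95", "Jumper EUR 19,95 ", "19.95")
changed = check_price("Jumper EUR 19,95", "Jumper EUR 17,95", "19.95")
print(unchanged.status, changed.status)
```

Comparing the whole fragment rather than only the price is a deliberate choice in this sketch: a changed product name or package size would also turn the light red.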

  10. Collect data on enterprises for EGR (2013)
     • Pilot: find data about EGR enterprises on the web
     • We scraped semi-structured data from Wikipedia
     • Multiple Wikipedia languages (NL, EN, DE, FR)
     • 2016: something similar in ESSnet BD WP2?

  11. Search product descriptions for classifying business activities (from 2014)
     • Search for product descriptions on the web
     • First time we used automated search with the Google Search API for statistics
     • Pilot, no production
     • Some doubts about the Google results

  12. Twitter-LinkedIn (1)
     • LinkedIn-Twitter for profiling (2015)
     • Automated search on LinkedIn based on a sample of Twitter users
     • Very specific and experimental
     • “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch

  13. Scraping websites of enterprises
     • Identify family businesses (search and / or crawling) (2016)
     • Identify businesses with Corporate Social Responsibility (CSR) (search and / or crawling) (2016)
     • Research program:
         • “Extracting information from websites to improve economic figures”
     • This ESSnet BD WP2 !!!

  14. Crawling for Statistics
     [Diagram labels: Incomplete statistical data; URL base; Search terms; Navigation terms; Focused Crawler (Roboto); Internet; Item identifier terms (“year report, family business”); Search & Match; ElasticSearch; Datastore; More complete statistical data.]
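One way to read the roles of navigation terms and item identifier terms on this slide is sketched below in Python. This is an assumption about how a focused crawler could use them, not the Roboto API or the crawler actually used: `term_hits` and `focused_crawl` are hypothetical names, and `get_page` stands in for real HTTP fetching and link extraction so the example runs on an in-memory site.

```python
import heapq
from typing import Callable, List, Tuple

# A page is (page text, [(link text, link url), ...]).
Page = Tuple[str, List[Tuple[str, str]]]


def term_hits(text: str, terms: List[str]) -> int:
    """Count how many of the terms occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for term in terms if term.lower() in lowered)


def focused_crawl(start_url: str,
                  get_page: Callable[[str], Page],
                  navigation_terms: List[str],
                  item_terms: List[str],
                  max_pages: int = 10) -> List[str]:
    """Follow the most promising links first and return URLs of matching pages."""
    queue = [(0, start_url)]      # (negative link score, url); heapq pops the smallest
    seen = {start_url}
    matches: List[str] = []
    visited = 0
    while queue and visited < max_pages:
        _, url = heapq.heappop(queue)
        text, links = get_page(url)
        visited += 1
        if term_hits(text, item_terms) > 0:
            matches.append(url)   # page mentions an item identifier term
        for link_text, link_url in links:
            if link_url not in seen:
                seen.add(link_url)
                score = term_hits(link_text, navigation_terms)
                heapq.heappush(queue, (-score, link_url))
    return matches


# Tiny in-memory "website" used instead of real HTTP requests.
site = {
    "/": ("Welcome", [("Products", "/products"), ("About us", "/about")]),
    "/products": ("Our jumpers", []),
    "/about": ("We are a family business since 1950", [("Annual report", "/report")]),
    "/report": ("Year report 2015", []),
}
found = focused_crawl("/", site.__getitem__, ["about", "report"],
                      ["family business", "year report"], max_pages=3)
print(found)
```

With a budget of three pages the crawl never visits `/products`, because links whose text matches a navigation term are tried first.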

  15. Technologies used
     • Perl (2009), Djuggler (2010)
     • Python, Scrapy (2010)
     • R (2011-2015)
     • Node.js (JavaScript on the server) (2014-)
     • Google Search API (2014-)
     • ElasticSearch (2016)
     • Roboto (Node.js package, 2015-2016)
     • Nutch: tested, not used
     • Generic Framework (robot framework) for bulk scraping of prices

  16. Summary / trends

  17. Observations / thoughts …
     • If it is there, we can get it
     • Technology is (usually) not the problem!
     • The internet is a living thing!
     • It’s too simple to think we can just buy the internet somewhere and then make statistics!
     • It’s powerful to combine something we know with something we observe!
     • External companies can help, but be careful …

  18. Legal
     • Dutch Statistics Law:
         • Enterprises have to provide data to Statistics Netherlands on request
         • Scraping information from websites reduces response burden
         • Statistics Netherlands uses data for official statistics only
     • Dutch database legislation:
         • Commercial re-use of intellectual property is forbidden
         • This may also apply to internet sources
     • Privacy:
         • Dutch (statistical) legislation on protection of personal information
         • Statistics Netherlands scrapes only public sources and processes the data within its own safe environment, just as it does with other (privacy-sensitive) data
     • Netiquette:
         • Respect robots.txt
         • Identify yourself (user-agent)
         • Do not overload servers; use some idle time between requests
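The three netiquette rules can be sketched with Python's standard library. This is a minimal illustration, not Statistics Netherlands' scraper: the `USER_AGENT` string, the 2-second default pause and the helper names are assumptions, and `fetch` is passed in as a callable so the example runs without network access.

```python
import time
import urllib.robotparser
from typing import Callable, Dict, List

# Hypothetical identification string; a real scraper would name its organisation and a contact.
USER_AGENT = "example-stats-bot/0.1 (+mailto:scraper@example.org)"


def build_rules(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules: urllib.robotparser.RobotFileParser, urls: List[str]) -> List[str]:
    """Respect robots.txt: keep only the URLs this user-agent may fetch."""
    return [url for url in urls if rules.can_fetch(USER_AGENT, url)]


def polite_delay(rules: urllib.robotparser.RobotFileParser, default_seconds: float = 2.0) -> float:
    """Use the site's Crawl-delay if it states one, else an assumed default pause."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


def fetch_all(rules: urllib.robotparser.RobotFileParser,
              urls: List[str],
              fetch: Callable[[str, str], str],
              sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Fetch the allowed URLs one by one, idling between requests."""
    pages = {}
    for url in allowed_urls(rules, urls):
        pages[url] = fetch(url, USER_AGENT)   # identify yourself on every request
        sleep(polite_delay(rules))            # do not overload the server
    return pages


robots = "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n"
rules = build_rules(robots)
pauses = []
pages = fetch_all(rules,
                  ["https://example.org/prices", "https://example.org/private/x"],
                  fetch=lambda url, agent: f"fetched {url} as {agent}",
                  sleep=pauses.append)
print(list(pages), pauses)
```

In a real run, `fetch` would issue an HTTP request that sends `USER_AGENT` in its `User-Agent` header, and `sleep` would be left at its default `time.sleep`.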

  19. Dutch Business Register (simplified): from administrative units to statistical units
     • Sources:
         • Trade Register
         • Tax Register
         • Social security register (employees)
         • Profilers
     • About 1.5 million administrative entities
     • About 0.5 million have a URL
     • Quality of the URL field is not known, but it seems usable
