
JSI News Crawler

Blaz Novak, Mitja Trampus, Blaz Fortuna, Marko Grobelnik (JSI). The goal is to collect most of the world's news articles, including relevant blog posts. Why collect data? To be independent of commercial data providers.




Presentation Transcript


  1. JSI News Crawler Blaz Novak, Mitja Trampus, Blaz Fortuna, Marko Grobelnik JSI

  2. JSI News Crawler
     • The goal is to collect most of the world's news articles, including relevant blog posts
     • Why collect data?
       • To be independent of commercial data providers
       • Commercial data providers (like Spinn3r, GNIP, DataSift) are expensive and not flexible in terms of data sources and additional services
       • To provide a data stream free of charge for research
     • What data is available?
       • Database dumps
       • Articles annotated with Enrycher metadata
       • Clusters of similar articles
       • Real-time feed

  3. Architecture
     • Content in form:
       • Clean text
       • Linguistics
       • Social Graph
       • LOD Links
       • Time
     [Architecture diagram; component labels: Database of Collected Articles, Open Web, JSI Crawler, Enrycher, XML/RDF, Control Panel, Web Service API, Developers, Real-Time Analytics, Archive Explorer]
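The content forms listed on this slide (clean text, linguistics, social graph, LOD links, time) can be pictured as one record per article. The sketch below is only an illustration under stated assumptions: the class `Article`, its field names, the `url` field, and the XML element names are all hypothetical and are not taken from the crawler's actual schema or its XML/RDF output.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import xml.etree.ElementTree as ET


@dataclass
class Article:
    """Hypothetical article record; field names are illustrative only."""
    url: str                                          # assumed identifier, not named on the slide
    clean_text: str                                   # "Clean text"
    time: datetime                                    # "Time"
    linguistics: dict = field(default_factory=dict)   # "Linguistics" annotations
    social_graph: list = field(default_factory=list)  # "Social Graph" entries
    lod_links: list = field(default_factory=list)     # "LOD Links" (URIs)


def to_xml(article: Article) -> str:
    """Serialize the record to a small XML document (element names are made up)."""
    root = ET.Element("article", {"url": article.url, "time": article.time.isoformat()})
    ET.SubElement(root, "text").text = article.clean_text
    ling = ET.SubElement(root, "linguistics")
    for name, value in article.linguistics.items():
        ET.SubElement(ling, "feature", {"name": name}).text = str(value)
    social = ET.SubElement(root, "social")
    for node in article.social_graph:
        ET.SubElement(social, "node").text = node
    lod = ET.SubElement(root, "lod")
    for uri in article.lod_links:
        ET.SubElement(lod, "link", {"href": uri})
    return ET.tostring(root, encoding="unicode")


example = Article(
    url="http://example.org/news/1",
    clean_text="Example article body.",
    time=datetime(2011, 11, 1, 12, 0, tzinfo=timezone.utc),
    linguistics={"language": "en"},
    lod_links=["http://dbpedia.org/resource/Ljubljana"],
)
print(to_xml(example))
```

Keeping every content form on a single record means a consumer of the stream would not need to join several sources to reconstruct one article; whether the real service is organized this way is not stated in the slides.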

  4. Current statistics
     • Data sources: ~110,000 unique websites
     • Stream size: ~192,000 articles/day
     • ~150 distinct languages
       • Good coverage of minority languages
     • Current archive of ~35,000,000 articles
     • Clear text and language identification available
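To put these figures in proportion, the quick calculation below derives a few rates from them. The input numbers are the approximate values from the slide; the function and key names are just illustrative, and the "archive days" figure assumes a constant stream rate, which the slides do not claim.

```python
SECONDS_PER_DAY = 86_400

# Approximate figures quoted on the statistics slide.
SOURCES = 110_000          # unique websites
ARTICLES_PER_DAY = 192_000
ARCHIVE_SIZE = 35_000_000  # articles


def stream_stats(per_day: int, archive: int, sources: int) -> dict:
    """Derive rough rates from the slide's headline numbers."""
    return {
        "articles_per_second": per_day / SECONDS_PER_DAY,
        "articles_per_source_per_day": per_day / sources,
        # Only meaningful if the stream rate had been constant (an assumption).
        "archive_days_at_current_rate": archive / per_day,
    }


stats = stream_stats(ARTICLES_PER_DAY, ARCHIVE_SIZE, SOURCES)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

This works out to roughly 2.2 articles per second and fewer than two articles per source per day on average; at the current rate the archive would correspond to about half a year of crawling.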

  5. Sample Article from the stream

  6. Control Panel
     [Chart captions:]
     • Download volume, yearly scale (2010)
     • Today's download volume, after adding 3k new sources + 1 week of backlog
     • Average and maximum number of story articles in a cluster (today)

  7. Plans
     • In the first half of 2012 the plan is to release the service for public use
     • …in the future, additional semantic annotation services will be added to provide additional value to the streamed data
