JSI News Crawler



Blaz Novak, Mitja Trampus, Blaz Fortuna, Marko Grobelnik

JSI

JSI News Crawler
  • The goal is to collect most of the world's news articles, including relevant blog posts
  • Why collect data?
    • To be independent of commercial data providers (such as Spinn3r, GNIP, DataSift), which are expensive and inflexible in terms of data sources and additional services
    • To provide a data stream free of charge for research
  • What data is available?
    • Database dumps
    • Articles annotated with Enrycher metadata
    • Clusters of similar articles
    • Real-time feed
Architecture
  • Content in form:
    • Clean text
    • Linguistics
    • Social Graph
    • LOD Links
    • Time

[Architecture diagram; component labels: Database of Collected Articles, Open Web, JSI Crawler, Enrycher, XML/RDF, Control Panel, Web Service API, Developers, Real-Time, Analytics, Archive, Explorer]
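
The content forms listed on this slide (clean text, linguistics, social graph, LOD links, time) suggest what a single enriched article record might carry. The following is a minimal sketch in Python; the class name, field names, and types are all assumptions for illustration, since the slides do not specify the actual schema or the XML/RDF layout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EnrichedArticle:
    """Hypothetical article record; not the crawler's real schema."""
    url: str
    clean_text: str          # "Clean text": article body without page markup
    language: str            # result of language identification
    retrieved_at: datetime   # "Time"
    # The slides name these content forms but not their structure,
    # so generic containers are used here.
    linguistics: dict = field(default_factory=dict)    # "Linguistics"
    social_graph: list = field(default_factory=list)   # "Social Graph"
    lod_links: list = field(default_factory=list)      # "LOD Links"


# Example values are made up for illustration.
article = EnrichedArticle(
    url="http://example.com/news/1",
    clean_text="Example article body.",
    language="en",
    retrieved_at=datetime(2012, 1, 1, tzinfo=timezone.utc),
    lod_links=["http://dbpedia.org/resource/Ljubljana"],
)
```

Using `default_factory` keeps the optional annotation fields independent between records, so an article that has not yet been annotated simply carries empty containers.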

Current statistics
  • Data sources: ~110,000 unique websites
  • Stream size: ~192,000 articles/day
    • ~150 distinct languages
    • Good coverage of minority languages
  • Current archive of ~35,000,000 articles
  • Clear-text and language identification available
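
To put these figures in perspective, a few derived rates can be computed directly from the numbers on this slide. The sketch below is plain arithmetic; the "days of archive" figure assumes a constant download rate, which is only a rough approximation because the stream has grown over time.

```python
# Approximate figures taken from the slide.
SOURCES = 110_000
ARTICLES_PER_DAY = 192_000
ARCHIVE_SIZE = 35_000_000

SECONDS_PER_DAY = 24 * 60 * 60

articles_per_second = ARTICLES_PER_DAY / SECONDS_PER_DAY
articles_per_source_per_day = ARTICLES_PER_DAY / SOURCES
# Assumes the current rate held for the whole archive (a simplification).
archive_days_at_current_rate = ARCHIVE_SIZE / ARTICLES_PER_DAY

print(f"{articles_per_second:.2f} articles/second")          # 2.22
print(f"{articles_per_source_per_day:.2f} articles/source/day")  # 1.75
print(f"{archive_days_at_current_rate:.0f} days of archive")  # 182
```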
Control Panel
  • [Figure: download volume, yearly scale (2010)]
  • [Figure: today's download volume, after adding 3k new sources + 1 week of backlog]
  • [Figure: average and maximum number of story articles in a cluster (today)]
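
The last chart reports the average and maximum number of articles per cluster. As a rough illustration of how such a statistic can be derived, the sketch below counts cluster sizes from article-to-cluster assignments. The function name, the `(article_id, cluster_id)` input format, and the sample data are assumptions; the slides do not describe how the control panel computes it.

```python
from collections import Counter


def cluster_size_stats(assignments):
    """Return (average size, maximum size) over all clusters.

    `assignments` is an iterable of (article_id, cluster_id) pairs.
    """
    sizes = Counter(cluster_id for _, cluster_id in assignments)
    if not sizes:
        return 0.0, 0
    return sum(sizes.values()) / len(sizes), max(sizes.values())


# Hypothetical sample: cluster c1 has 3 articles, c2 has 2, c3 has 1.
assignments = [
    ("a1", "c1"), ("a2", "c1"), ("a3", "c2"),
    ("a4", "c1"), ("a5", "c3"), ("a6", "c2"),
]

average_size, max_size = cluster_size_stats(assignments)
print(average_size, max_size)  # 2.0 3
```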

Plans
  • The plan is to release the service for public use in the first half of 2012
  • …in the future, additional semantic annotation services will be added to provide further value to the streamed data