
Text Streams for PlanetData

Text Streams for PlanetData. Marko Grobelnik, Blaz Fortuna, JSI. Objectives: the goal is to implement a system for exploratory analysis of textual data streams. As a case study we'll use the Spinn3r (http://spinn3r.com/) stream with ~40M documents per day.


Presentation Transcript


  1. Text Streams for PlanetData
     Marko Grobelnik, Blaz Fortuna, JSI

  2. Objectives
     • The goal is to implement a system for exploratory analysis of textual data streams
     • As case studies we'll use:
       • Spinn3r (http://spinn3r.com/) stream with ~40M documents per day
       • News archives (NYTimes, Reuters, Nature)
     • Different components of the system are developed in different projects (RENDER, ALERT, ENVISION, XLike, LTWeb, …)
     • The part unique to PlanetData is the "exploratory part", which will enable content browsing of the incoming data through a web service or GUI
     • Another contribution is a common data format for textual streams, including content and social data
     • The solution is aligned with the "Social context" activity in PlanetData

  3. Architecture (layers, top to bottom)
     • Interface (GUI, API): user interface and web service API to the lower levels
     • Mining Algorithms: wrappers around mining algorithms from GLib
     • Data Layer: efficient access and sampling from the data sources
     • Data Source Adapters: one adapter per data source
     • Data Sources: multiple data inputs of structured and non-structured data
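The adapter and data-layer levels of the architecture slide can be illustrated with a minimal Python sketch. Everything named here (`ListAdapter`, `DataLayer`, `scan`, `sample`) is hypothetical and not taken from the slides or from GLib. The slides only say that the data layer supports sampling; reservoir sampling is swapped in as one plausible way to draw a fixed-size sample from a stream of unknown length.

```python
import random
from itertools import chain
from typing import Dict, Iterator, List

Record = Dict[str, object]


class ListAdapter:
    """Hypothetical adapter: exposes an in-memory list as a stream of records."""

    def __init__(self, name: str, rows: List[Record]):
        self.name = name
        self.rows = rows

    def records(self) -> Iterator[Record]:
        for row in self.rows:
            # Tag each record with its origin so upper layers can tell sources apart.
            yield {"source": self.name, **row}


class DataLayer:
    """Hypothetical data layer: uniform access to all adapters, plus sampling."""

    def __init__(self, adapters):
        self.adapters = list(adapters)

    def scan(self) -> Iterator[Record]:
        # Concatenate the record streams of all adapters lazily.
        return chain.from_iterable(a.records() for a in self.adapters)

    def sample(self, k: int, seed: int = 0) -> List[Record]:
        # Reservoir sampling (Algorithm R): one pass, at most k records in memory.
        rng = random.Random(seed)
        reservoir: List[Record] = []
        for i, rec in enumerate(self.scan()):
            if i < k:
                reservoir.append(rec)
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = rec
        return reservoir


layer = DataLayer([
    ListAdapter("blogs", [{"title": "post %d" % i} for i in range(5)]),
    ListAdapter("news", [{"title": "article %d" % i} for i in range(3)]),
])
sample = layer.sample(4)
```

The upper layers would only ever talk to `DataLayer`, so adding a new source means writing one more adapter with a `records()` method.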

  4. Multi-view Support
     • Feature extractor
       • Transforms a record field into a sparse vector
       • Types
         • Numeric (e.g. visit count)
         • Nominal (e.g. country)
         • Multinomial (e.g. categories)
         • Tokenizable (e.g. title): extracts TF-IDF vectors
         • DateTime (e.g. visit time): extracts time-of-day, day-of-week, etc.
         • …
     • Dynamic extraction of feature vectors
       • Can combine any set of fields
       • Implicit joins during feature extraction
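The field types listed on this slide can be sketched as small per-field extractors whose outputs are merged into one sparse vector. This is an illustrative sketch only: the class names, the dict-based sparse vector, the feature-name scheme, and the whitespace tokenizer with `tf * log(N / df)` weighting are all assumptions, not the system's actual implementation. Implicit joins are not covered.

```python
import math
from collections import Counter
from datetime import datetime


class Extractor:
    """Base class: maps one record field to a sparse vector (dict name -> weight)."""

    def __init__(self, field):
        self.field = field

    def fit(self, records):
        pass  # most extractors need no corpus statistics


class Numeric(Extractor):
    def extract(self, rec):
        return {self.field: float(rec[self.field])}


class Nominal(Extractor):
    def extract(self, rec):
        return {"%s=%s" % (self.field, rec[self.field]): 1.0}


class Multinomial(Extractor):
    def extract(self, rec):
        return {"%s=%s" % (self.field, v): 1.0 for v in rec[self.field]}


class DateTime(Extractor):
    def extract(self, rec):
        t = rec[self.field]
        return {"%s:hour=%d" % (self.field, t.hour): 1.0,
                "%s:weekday=%d" % (self.field, t.weekday()): 1.0}


class Tokenizable(Extractor):
    def fit(self, records):
        # Document frequency of each token over the fitted records.
        docs = [set(str(r[self.field]).lower().split()) for r in records]
        df = Counter(tok for doc in docs for tok in doc)
        self.idf = {tok: math.log(len(docs) / n) for tok, n in df.items()}

    def extract(self, rec):
        tf = Counter(str(rec[self.field]).lower().split())
        vec = {}
        for tok, n in tf.items():
            w = n * self.idf.get(tok, 0.0)
            if w:  # keep the vector sparse: drop zero weights and unseen tokens
                vec["%s:%s" % (self.field, tok)] = w
        return vec


class FeatureSpace:
    """Combines any set of field extractors into a single feature vector."""

    def __init__(self, extractors):
        self.extractors = list(extractors)

    def fit(self, records):
        for e in self.extractors:
            e.fit(records)
        return self

    def transform(self, rec):
        vec = {}
        for e in self.extractors:
            vec.update(e.extract(rec))
        return vec


records = [
    {"visits": 3, "country": "SI", "categories": ["news", "tech"],
     "title": "text streams", "visit_time": datetime(2011, 3, 7, 14, 30)},
    {"visits": 1, "country": "UK", "categories": ["news"],
     "title": "news streams", "visit_time": datetime(2011, 3, 8, 9, 0)},
]
space = FeatureSpace([Numeric("visits"), Nominal("country"),
                      Multinomial("categories"), Tokenizable("title"),
                      DateTime("visit_time")]).fit(records)
vec = space.transform(records[0])
```

Because each extractor is bound to a field name, a different view of the same records is just a different list passed to `FeatureSpace`.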

  5. Mining Algorithms as plug-ins
     • Operators
       • Wrapper around mining and learning algorithms
       • Input: set(s) of records
       • Output: set(s) of records
       • Examples:
         • Search
         • K-Means clustering
         • Active Learning
         • SVM classifier
         • …
     • Aggregators
       • Wrapper around visualization and summarization algorithms
       • Input: set(s) of records
       • Examples:
         • Keyword extraction
         • Document Atlas
         • Histograms
         • …
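The operator/aggregator split on this slide can be sketched as two small plug-in interfaces: operators map record sets to record sets and can therefore be chained, while an aggregator ends the chain with a summary. The interface names, method names, and the `run` helper below are hypothetical; `Search` and `Histogram` are toy stand-ins for the examples named on the slide, not the actual GLib wrappers.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Operator(ABC):
    """Plug-in that takes a set of records and returns a set of records."""

    @abstractmethod
    def apply(self, records):
        ...


class Aggregator(ABC):
    """Plug-in that takes a set of records and returns a summary."""

    @abstractmethod
    def summarize(self, records):
        ...


class Search(Operator):
    """Toy search: keep records whose field contains the term (case-insensitive)."""

    def __init__(self, field, term):
        self.field = field
        self.term = term.lower()

    def apply(self, records):
        return [r for r in records
                if self.term in str(r.get(self.field, "")).lower()]


class Histogram(Aggregator):
    """Counts how often each value of a field occurs."""

    def __init__(self, field):
        self.field = field

    def summarize(self, records):
        return dict(Counter(r[self.field] for r in records if self.field in r))


def run(records, operators, aggregator):
    # Operators compose because their output type equals their input type.
    for op in operators:
        records = op.apply(records)
    return aggregator.summarize(records)


docs = [
    {"title": "Text streams", "country": "SI"},
    {"title": "News archive", "country": "UK"},
    {"title": "Stream mining", "country": "SI"},
]
result = run(docs, [Search("title", "stream")], Histogram("country"))
```

Under these assumptions, a clustering or classification operator would plug in the same way as `Search`, as long as it returns records (for example, records annotated with a cluster or class label).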
