b ig data at b item r esearch group n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
B ig Data at B ITEM R esearch Group PowerPoint Presentation
Download Presentation
B ig Data at B ITEM R esearch Group

Loading in 2 Seconds...

play fullscreen
1 / 4

B ig Data at B ITEM R esearch Group - PowerPoint PPT Presentation


  • 51 Views
  • Uploaded on

B ig Data at B ITEM R esearch Group. ( Text|Web ) Mining Research Group patrick.ruch@hesge.ch, http:// bitem.hesge.ch Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance , Clinical trials…

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'B ig Data at B ITEM R esearch Group' - calvin-roach


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
b ig data at b item r esearch group
Big Data atBITEM Research Group
  • (Text|Web) Mining Research Group
    • patrick.ruch@hesge.ch, http://bitem.hesge.ch
  • Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance, Clinical trials…
  • Specialised in (semi|un)structured data
    • We like text, text and more text
    • Especially on the noisy/dirty Web
  • Technological expertise: CouchDB replication, SolrCloud (distributed indexing and search), indexing/searching in SSD/HDFS/Hadoop, SPARQL endpoints…
slide2

Web Sources

NoSQL Replication

Forum

CouchDB

CouchDB

RSS

CouchDB

CouchDB

Twitter API

Cleaning

Normalisation

Solr Cloud

26’000 per day

Drugbank

19’000 drugnames

checkedeach 10 mn

7 M of docs in 9 months

Pharmacovigilance

on Big Social Media Data

Dynamic and Real Time Data Analysis

Correlation

Analysis

NoveltyDetection

Trends

Analysis

slide3

Proteins annotation

based on litterature

by curators

23 000 000 articles

40’000 concepts

[Big-scaleMulticlass

Multilabel Classifier]

 Lazylearning !

annotated articles

Manual annotation planned for 2045 !

(Baumgartner et al)

GOA

Machine Learning based on Information Retrievalmethods

Assisting

curators

Macro reading of litterature

Profilinganytextual content

Managing the data deluge for proteins annotation

patent retrieval
Patent retrieval

The real situation (0.5-1 TB)

Experiments

Database

13 millions of patents

Database

A sample of 1 million of patents

Extraction

33 days

Extraction

2.5 days

XML patents

17 Gb

XML patents

0.221 Tb

Normalization

33 days

Normalization

2.5 days

XML patents + metadata

0.234 Tb

XML patents + metadata

18 Gb

Indexing

5 days

Indexing

10 hours

Index

0.1 Tb

Index

3 Gb