Cutting-Edge Text Mining Research at BITEM Research Group

Big Data atBITEM Research Group • (Text|Web) Mining Research Group • patrick.ruch@hesge.ch, http://bitem.hesge.ch • Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance, Clinical trials… • Specialised in (semi|un)structured data • We like text, text and more text • Especially on the noisy/dirty Web • Technological expertise: CouchDB replication, SolrCloud (distributed indexing and search), indexing/searching in SSD/HDFS/Hadoop, SPARQL endpoints…

Web Sources NoSQL Replication Forum CouchDB CouchDB RSS CouchDB CouchDB Twitter API Cleaning Normalisation Solr Cloud 26’000 per day Drugbank 19’000 drugnames checkedeach 10 mn 7 M of docs in 9 months Pharmacovigilance on Big Social Media Data Dynamic and Real Time Data Analysis Correlation Analysis NoveltyDetection Trends Analysis

Proteins annotation based on litterature by curators 23 000 000 articles 40’000 concepts [Big-scaleMulticlass Multilabel Classifier]  Lazylearning ! annotated articles Manual annotation planned for 2045 ! (Baumgartner et al) GOA Machine Learning based on Information Retrievalmethods Assisting curators Macro reading of litterature Profilinganytextual content Managing the data deluge for proteins annotation

Patent retrieval The real situation (0.5-1 TB) Experiments Database 13 millions of patents Database A sample of 1 million of patents Extraction 33 days Extraction 2.5 days XML patents 17 Gb XML patents 0.221 Tb Normalization 33 days Normalization 2.5 days XML patents + metadata 0.234 Tb XML patents + metadata 18 Gb Indexing 5 days Indexing 10 hours Index 0.1 Tb Index 3 Gb

Cutting-Edge Text Mining Research at BITEM Research Group

Cutting-Edge Text Mining Research at BITEM Research Group

Presentation Transcript

S mall B usiness I nnovation R esearch

S mall B usiness I nnovation R esearch

B ig Maths

B ig Question #1

S mall B usiness I nnovation R esearch

S mall B usiness I nnovation R esearch

Group B

Group B

Group B

B REAKOUT S ESSION - B IG B ENCH -

Item #6-B

b R

GROUP B

Group B

R ESEARCH at G ENOME B IOINFORMATICS L AB

The B ig Bang

Group B

B 0 B 0 Oscillations and sin2 b at B A B A R

Group B

R’ b

S mall B usiness I nnovation R esearch

Group B