1 / 10

Parallelization of a semantic discovery tool for a search engine, by Kalle Happonen and Wray Buntine

This presentation discusses the parallelization of a semantic discovery tool for a search engine using MPI. It covers building a sparse document-word matrix, creating categories based on word occurrence, and assigning membership levels to documents based on relevance. It also discusses the challenges encountered in data transfer and memory requirements, proposes possible solutions, and reviews the tool's current state and future prospects.


Presentation Transcript


  1. Parallelization of a semantic discovery tool for a search engine
  Kalle Happonen, Wray Buntine

  2. MPCA
  MPCA is a discrete version of PCA (principal components, the top K eigenvectors of a matrix) for sparse integer data. To make it simple, it
  • goes through documents to build a sparse document-word matrix (see the sketch below)
  • creates different categories based on word occurrence
  • assigns a membership level to documents based on relevance
  Uses include
  • building categories of documents, e.g., the Open Directory Project (DMOZ) has hierarchical categories
  • partitioning a subset of the web into strongly connected regions and identifying hubs (good outgoing links) and authorities (commonly linked to)
  • genotype discovery in the genome project, although not with this particular software
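The slide only names the data structure, so here is a minimal, hypothetical illustration in C of a sparse document-word matrix stored as (doc, word, count) triplets; the names dw_entry, dw_matrix and dw_add are made up for this sketch and are not from the authors' code.

  #include <stdio.h>
  #include <stdlib.h>

  /* One non-zero entry of the sparse document-word matrix. */
  typedef struct {
      int doc;    /* document index */
      int word;   /* word (term) id */
      int count;  /* occurrences of the word in the document */
  } dw_entry;

  typedef struct {
      dw_entry *entries;
      size_t    size;
      size_t    capacity;
  } dw_matrix;

  /* Append one triplet, growing the array as needed. */
  static void dw_add(dw_matrix *m, int doc, int word, int count)
  {
      if (m->size == m->capacity) {
          m->capacity = m->capacity ? m->capacity * 2 : 1024;
          m->entries  = realloc(m->entries, m->capacity * sizeof(dw_entry));
      }
      m->entries[m->size++] = (dw_entry){ doc, word, count };
  }

  int main(void)
  {
      dw_matrix m = { NULL, 0, 0 };

      /* Toy data: document 0 contains word 7 three times and word 12 once,
         document 1 contains word 7 twice. */
      dw_add(&m, 0, 7, 3);
      dw_add(&m, 0, 12, 1);
      dw_add(&m, 1, 7, 2);

      for (size_t i = 0; i < m.size; i++)
          printf("doc %d, word %d, count %d\n",
                 m.entries[i].doc, m.entries[i].word, m.entries[i].count);

      free(m.entries);
      return 0;
  }

A real run would fill the matrix by scanning every document and counting term occurrences; storing only the non-zero counts is what keeps the matrix sparse.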

  3. The matrices
  [diagram: the document-word input matrix and the document-component and component-word output matrices, distributed across node 1 and node 2]

  4. How it works
  [diagram: the document-word, document-component and component-word matrices held and exchanged by the Master (node 1) and Slave 1 (node 2)]

  5. The parallel version
  Relevant facts
  • The task is divided by splitting up the documents
  • The component-word matrix is large, so there is a lot of data to transfer
  • The total amount of data to transfer between loops is linearly dependent on the number of nodes
  • The component-word matrix is always in memory, so the available memory limits the number of components for a problem
  What we have done
  • We are using MPI for the parallelization (a sketch follows below)
  • We have a working, tested parallel version, with room for improvement
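As a rough illustration of the document-split scheme described above, the sketch below uses MPI to give each rank a slice of the documents and to rebuild the full component-word matrix on every rank between iterations. The slides describe a master/slave exchange, so the MPI_Allreduce used here, the constants K_COMPONENTS and N_WORDS, the fixed iteration count and the placeholder accumulate_local_counts() are all assumptions made for this sketch, not the authors' actual communication pattern.

  #include <mpi.h>
  #include <stdlib.h>
  #include <string.h>

  #define K_COMPONENTS 10        /* hypothetical component count */
  #define N_WORDS      100000    /* hypothetical vocabulary size */

  /* Hypothetical placeholder for the per-document update step. */
  static void accumulate_local_counts(double *local_cw,
                                      int first_doc, int last_doc)
  {
      (void)first_doc; (void)last_doc;
      /* ... update local_cw[k * N_WORDS + w] from this rank's documents ... */
  }

  int main(int argc, char **argv)
  {
      int rank, size, n_docs = 3000000;   /* e.g. the Finnish web run */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Split the documents evenly across the ranks. */
      int docs_per_rank = (n_docs + size - 1) / size;
      int first_doc = rank * docs_per_rank;
      int last_doc  = first_doc + docs_per_rank;
      if (last_doc > n_docs) last_doc = n_docs;

      size_t cw_len = (size_t)K_COMPONENTS * N_WORDS;
      double *local_cw  = calloc(cw_len, sizeof(double));
      double *global_cw = calloc(cw_len, sizeof(double));

      for (int iter = 0; iter < 50; iter++) {   /* illustrative iteration count */
          /* Recompute this rank's contribution from its own documents. */
          memset(local_cw, 0, cw_len * sizeof(double));
          accumulate_local_counts(local_cw, first_doc, last_doc);

          /* The whole component-word matrix crosses the network every
             iteration; this is the transfer cost discussed above. */
          MPI_Allreduce(local_cw, global_cw, (int)cw_len,
                        MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      }

      free(local_cw);
      free(global_cw);
      MPI_Finalize();
      return 0;
  }

Because the full component-word matrix has to be exchanged on every loop, the total volume sent over the network grows with the number of nodes, which is exactly the scaling limit the slide points out.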

  6. Complexity

  7. Some performance data
  A small test run on a 4-node cluster
  • 2 × Athlon64 CPUs per node
  • 2 GB memory per node
  • Gigabit Ethernet

  8. Scalability
  • Finnish web: 3 M documents, 10 components
  • Reuters news archive: 800 k documents, 100 components

  9. Problems
  Data transfer is the limiting factor; possible solutions:
  • Use a faster network
  • Compress data before transfer (see the sketch below)
  • Change the communication model, which may require breaking the algorithm
  • Run on a reduced problem to convergence, then run on the full problem
  Memory requirements
  • Larger jobs mean more words, and therefore fewer components with the same amount of memory
  • Run large jobs on nodes with more memory?
  • Compress data in memory?
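One of the proposed fixes is to compress the component-word matrix before sending it. The sketch below shows one possible way to do that with zlib on top of MPI (compile with -lz, run with at least 2 ranks); the helper name send_compressed(), the toy matrix size, and the two-message protocol (length first, then bytes) are assumptions made for illustration, not the tool's actual code.

  #include <mpi.h>
  #include <zlib.h>
  #include <stdlib.h>

  /* Compress a block of doubles with zlib and send it to another rank. */
  static int send_compressed(const double *matrix, size_t n_entries,
                             int dest, int tag, MPI_Comm comm)
  {
      uLong  src_len  = (uLong)(n_entries * sizeof(double));
      uLongf dest_len = compressBound(src_len);
      Bytef *buf      = malloc(dest_len);

      if (buf == NULL ||
          compress(buf, &dest_len, (const Bytef *)matrix, src_len) != Z_OK) {
          free(buf);
          return -1;
      }

      /* Send the compressed size first, then the compressed bytes. */
      unsigned long len = (unsigned long)dest_len;
      MPI_Send(&len, 1, MPI_UNSIGNED_LONG, dest, tag, comm);
      MPI_Send(buf, (int)dest_len, MPI_BYTE, dest, tag + 1, comm);

      free(buf);
      return 0;
  }

  int main(int argc, char **argv)
  {
      int rank;
      double matrix[1024] = { 0 };   /* toy stand-in for the component-word matrix */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          /* Rank 0 compresses its matrix and ships it to rank 1. */
          send_compressed(matrix, 1024, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          unsigned long len;
          MPI_Recv(&len, 1, MPI_UNSIGNED_LONG, 0, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          Bytef *buf = malloc(len);
          MPI_Recv(buf, (int)len, MPI_BYTE, 0, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          /* Decompress back into the receiver's matrix buffer. */
          uLongf out_len = sizeof(matrix);
          uncompress((Bytef *)matrix, &out_len, buf, (uLong)len);
          free(buf);
      }

      MPI_Finalize();
      return 0;
  }

Whether this helps depends on how compressible the matrix is and on how the CPU cost of compress()/uncompress() compares with the time saved on the network.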

  10. The present and the future
  The present
  • Working parallelization with MPI
  • Successful runs on clusters
  • Has been used on the whole Finnish web (3 M docs), Wikipedia (600 k docs) and DMOZ (6 M docs)
  • Problems with memory requirements and communication
  The future
  • Run on much larger document sets
  • Improve scalability
  • Runs on a grid
