1 / 10

Parallelization of a semantic discovery tool for a search engine, by Kalle Happonen and Wray Buntine

This presentation discusses the parallelization of a semantic discovery tool for a search engine using MPI. It covers building a sparse document-word matrix, creating categories based on word occurrence, and assigning membership levels to documents based on relevance. It also discusses the challenges encountered in data transfer and memory requirements, proposes possible solutions, and reviews the tool's current state and future prospects.


Presentation Transcript


  1. Parallelization of a semantic discovery tool for a search engine
  Kalle Happonen, Wray Buntine

  2. MPCA
  MPCA is a discrete version of PCA (principal components, the top K eigenvectors of a matrix) for sparse integer data. To make it simple, it
  • goes through documents to build a sparse document-word matrix (see the sketch below)
  • creates different categories based on word occurrence
  • assigns a membership level to documents based on relevance
  Uses include
  • building categories of documents, e.g., the Open Directory Project (DMOZ) has hierarchical categories
  • partitioning a subset of the web into strongly connected regions and identifying hubs (good outgoing links) and authorities (commonly linked to)
  • genotype discovery in the genome project, although not with this particular software
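The slide only names the data structure, so here is a minimal, hypothetical illustration in C of a sparse document-word matrix stored as (doc, word, count) triplets; the names dw_entry, dw_matrix and dw_add are made up for this sketch and are not from the authors' code.

  #include <stdio.h>
  #include <stdlib.h>

  /* One non-zero entry of the sparse document-word matrix. */
  typedef struct {
      int doc;    /* document index */
      int word;   /* word (term) id */
      int count;  /* occurrences of the word in the document */
  } dw_entry;

  typedef struct {
      dw_entry *entries;
      size_t    size;
      size_t    capacity;
  } dw_matrix;

  /* Append one triplet, growing the array as needed. */
  static void dw_add(dw_matrix *m, int doc, int word, int count)
  {
      if (m->size == m->capacity) {
          m->capacity = m->capacity ? m->capacity * 2 : 1024;
          m->entries  = realloc(m->entries, m->capacity * sizeof(dw_entry));
      }
      m->entries[m->size++] = (dw_entry){ doc, word, count };
  }

  int main(void)
  {
      dw_matrix m = { NULL, 0, 0 };

      /* Toy data: document 0 contains word 7 three times and word 12 once,
         document 1 contains word 7 twice. */
      dw_add(&m, 0, 7, 3);
      dw_add(&m, 0, 12, 1);
      dw_add(&m, 1, 7, 2);

      for (size_t i = 0; i < m.size; i++)
          printf("doc %d, word %d, count %d\n",
                 m.entries[i].doc, m.entries[i].word, m.entries[i].count);

      free(m.entries);
      return 0;
  }

A real run would fill the matrix by scanning every document and counting term occurrences; storing only the non-zero counts is what keeps the matrix sparse.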

  3. The matrices
  [diagram: the document-word input matrix and the document-component and component-word output matrices, distributed across node 1 and node 2]

  4. How it works
  [diagram: the document-word, document-component and component-word matrices held and exchanged by the Master (node 1) and Slave 1 (node 2)]

  5. The parallel version
  Relevant facts
  • The task is divided by splitting up the documents
  • The component-word matrix is large, so there is a lot of data to transfer
  • The total amount of data to transfer between loops is linearly dependent on the number of nodes
  • The component-word matrix is always in memory, so the available memory limits the number of components for a problem
  What we have done
  • We are using MPI for the parallelization (a sketch follows below)
  • We have a working, tested parallel version, with room for improvement
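As a rough illustration of the document-split scheme described above, the sketch below uses MPI to give each rank a slice of the documents and to rebuild the full component-word matrix on every rank between iterations. The slides describe a master/slave exchange, so the MPI_Allreduce used here, the constants K_COMPONENTS and N_WORDS, the fixed iteration count and the placeholder accumulate_local_counts() are all assumptions made for this sketch, not the authors' actual communication pattern.

  #include <mpi.h>
  #include <stdlib.h>
  #include <string.h>

  #define K_COMPONENTS 10        /* hypothetical component count */
  #define N_WORDS      100000    /* hypothetical vocabulary size */

  /* Hypothetical placeholder for the per-document update step. */
  static void accumulate_local_counts(double *local_cw,
                                      int first_doc, int last_doc)
  {
      (void)first_doc; (void)last_doc;
      /* ... update local_cw[k * N_WORDS + w] from this rank's documents ... */
  }

  int main(int argc, char **argv)
  {
      int rank, size, n_docs = 3000000;   /* e.g. the Finnish web run */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Split the documents evenly across the ranks. */
      int docs_per_rank = (n_docs + size - 1) / size;
      int first_doc = rank * docs_per_rank;
      int last_doc  = first_doc + docs_per_rank;
      if (last_doc > n_docs) last_doc = n_docs;

      size_t cw_len = (size_t)K_COMPONENTS * N_WORDS;
      double *local_cw  = calloc(cw_len, sizeof(double));
      double *global_cw = calloc(cw_len, sizeof(double));

      for (int iter = 0; iter < 50; iter++) {   /* illustrative iteration count */
          /* Recompute this rank's contribution from its own documents. */
          memset(local_cw, 0, cw_len * sizeof(double));
          accumulate_local_counts(local_cw, first_doc, last_doc);

          /* The whole component-word matrix crosses the network every
             iteration; this is the transfer cost discussed above. */
          MPI_Allreduce(local_cw, global_cw, (int)cw_len,
                        MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      }

      free(local_cw);
      free(global_cw);
      MPI_Finalize();
      return 0;
  }

Because the full component-word matrix has to be exchanged on every loop, the total volume sent over the network grows with the number of nodes, which is exactly the scaling limit the slide points out.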

  6. Complexity

  7. Some performance data
  A small test run on a 4-node cluster
  • 2 × Athlon64 CPUs per node
  • 2 GB memory per node
  • Gigabit Ethernet

  8. Scalability
  • Finnish web: 3 M documents, 10 components
  • Reuters news archive: 800 k documents, 100 components

  9. Problems
  Data transfer is the limiting factor; possible solutions:
  • Use a faster network
  • Compress data before transfer (see the sketch below)
  • Change the communication model, which may require breaking the algorithm
  • Run on a reduced problem to convergence, then run on the full problem
  Memory requirements
  • Larger jobs mean more words, and therefore fewer components with the same amount of memory
  • Run large jobs on nodes with more memory?
  • Compress data in memory?
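One of the proposed fixes is to compress the component-word matrix before sending it. The sketch below shows one possible way to do that with zlib on top of MPI (compile with -lz, run with at least 2 ranks); the helper name send_compressed(), the toy matrix size, and the two-message protocol (length first, then bytes) are assumptions made for illustration, not the tool's actual code.

  #include <mpi.h>
  #include <zlib.h>
  #include <stdlib.h>

  /* Compress a block of doubles with zlib and send it to another rank. */
  static int send_compressed(const double *matrix, size_t n_entries,
                             int dest, int tag, MPI_Comm comm)
  {
      uLong  src_len  = (uLong)(n_entries * sizeof(double));
      uLongf dest_len = compressBound(src_len);
      Bytef *buf      = malloc(dest_len);

      if (buf == NULL ||
          compress(buf, &dest_len, (const Bytef *)matrix, src_len) != Z_OK) {
          free(buf);
          return -1;
      }

      /* Send the compressed size first, then the compressed bytes. */
      unsigned long len = (unsigned long)dest_len;
      MPI_Send(&len, 1, MPI_UNSIGNED_LONG, dest, tag, comm);
      MPI_Send(buf, (int)dest_len, MPI_BYTE, dest, tag + 1, comm);

      free(buf);
      return 0;
  }

  int main(int argc, char **argv)
  {
      int rank;
      double matrix[1024] = { 0 };   /* toy stand-in for the component-word matrix */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          /* Rank 0 compresses its matrix and ships it to rank 1. */
          send_compressed(matrix, 1024, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          unsigned long len;
          MPI_Recv(&len, 1, MPI_UNSIGNED_LONG, 0, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          Bytef *buf = malloc(len);
          MPI_Recv(buf, (int)len, MPI_BYTE, 0, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          /* Decompress back into the receiver's matrix buffer. */
          uLongf out_len = sizeof(matrix);
          uncompress((Bytef *)matrix, &out_len, buf, (uLong)len);
          free(buf);
      }

      MPI_Finalize();
      return 0;
  }

Whether this helps depends on how compressible the matrix is and on how the CPU cost of compress()/uncompress() compares with the time saved on the network.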

  10. The present and the future
  The present
  • Working parallelization with MPI
  • Successful runs on clusters
  • Has been used on the whole Finnish web (3 M docs), Wikipedia (600 k docs) and DMOZ (6 M docs)
  • Problems with memory requirements and communication
  The future
  • Run on much larger document sets
  • Improve scalability
  • Runs on a grid
