
Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE

Bruno Pinheiro (bfp@cin.ufpe.br), Renato Correa (renato.correa@ufpe.br)

Presentation Transcript


  1. Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE Bruno Pinheiro bfp@cin.ufpe.br Renato Correa renato.correa@ufpe.br

  2. Guide • Information Retrieval Systems (IRS) • IRS + SOM • Related Works • Document Collection • System Architecture • Methodology • Results

  3. Information Retrieval Systems (IRS) • Indexing, searching, and classifying textual documents. • Users' information needs • Matching users' queries against the system's vocabulary.

  4. IRS + SOM • Self-Organizing Maps • Information Retrieval System

  5. IRS + SOM • Navigation interface built through document maps • Document maps • Self-organizing map trained with document vectors

  6. Related Works • First works (1991-1995): Lin, Merkl • Major projects (1996-2000): Arizona Digital Library, WEBSOM, SOMLib • Diversification (2001-2005): LiGHtSOM, GHSOM, H2SOM • Convergence (2006)

  7. Document Collection • UFPE Digital Library of Theses and Dissertations (BDTD-UFPE) • Offers the full text of all theses and dissertations produced in the university's graduate programs. • Approximately 6000 documents. • Linked to the Brazilian BDTD and to the NDLTD (Networked Digital Library of Theses and Dissertations)

  8. System Architecture • Document Acquisition → Documents' Content • Document Indexing → Inverted Index • Document Representation → Document Vectors • Dimensionality Reduction → Reduced Vectors • Volume Reduction → Prototype Vectors • Construction of Document Map → Document Map • Construction of User Interface

  9. Methodology • Document acquisition • Harvesting through the OAI-PMH protocol • XML files containing the documents' metadata • Data extraction with the Java library JColtrane
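
The extraction step can be sketched as follows. This is a minimal illustration, not the authors' code: the slides use Java with JColtrane, while Python's standard XML parser is swapped in here. The sample response and its record are invented; only the OAI-PMH and Dublin Core namespace URIs are standard, and the choice of fields (`title`, `description`, `subject`) is an assumption about what was extracted.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for OAI-PMH responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response; the record content is made up.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example thesis title</dc:title>
          <dc:description>Example abstract text.</dc:description>
          <dc:subject>Computer Science</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return one dict per harvested record: identifier, title, abstract, subjects."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "abstract": rec.findtext(".//dc:description", default="", namespaces=NS),
            "subjects": [s.text for s in rec.iterfind(".//dc:subject", NS)],
        })
    return records
```

In a real harvest the XML would come from the repository's OAI-PMH endpoint rather than from an embedded string.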

  10. Methodology • Indexing • Java library Lucene. • Stemming, and elimination of digits and stopwords. • Inverted index built according to the vector space model.
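
A rough picture of this indexing pipeline, written in plain Python rather than Lucene. The stopword list and the suffix-stripping "stemmer" below are deliberately tiny stand-ins invented for this sketch; they are not Lucene's analyzers.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real system would use a full one.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "for", "is", "on"}


def naive_stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, keep runs of letters only (digits are dropped),
    remove stopwords, then stem what is left."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)
```

Storing term frequencies in the postings is what lets the next step build vector space model representations directly from the index.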

  11. Methodology • Document representation • Documents are represented by vectors, where terms are the indexes and the corresponding values are functions of each term's frequency of occurrence in the document.
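
The representation described here can be sketched like this. The slides do not say which function of the term frequency was used, so the raw count is the default and the log-scaled variant is only one common alternative, shown as an assumption.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Give each distinct term a fixed vector position (sorted for determinism)."""
    terms = sorted({t for tokens in tokenized_docs for t in tokens})
    return {term: i for i, term in enumerate(terms)}


def to_vector(tokens, vocabulary, weight=lambda tf: tf):
    """Dense vector whose i-th entry is weight(tf) for the i-th vocabulary term."""
    vector = [0.0] * len(vocabulary)
    for term, tf in Counter(tokens).items():
        if term in vocabulary:
            vector[vocabulary[term]] = float(weight(tf))
    return vector


def log_tf(tf):
    """Sublinear scaling of the term frequency (an assumed choice, not from the slides)."""
    return 1.0 + math.log(tf)
```

For example, with the vocabulary `{"document": 0, "map": 1, "train": 2}`, the token list `["map", "train", "map"]` becomes `[0.0, 2.0, 1.0]` under raw counts.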

  12. Methodology • Dimensionality reduction • Feature selection based on word frequency • Stopword elimination • Final dimensionality: 13095 terms • Volume reduction • Not used. • Volume: 4781 documents
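
Frequency-based feature selection might look like the sketch below. It assumes "word frequency" means document frequency and uses illustrative thresholds; the slides give neither the exact criterion nor the cut-off values.

```python
from collections import Counter


def select_terms(tokenized_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms that occur in at least min_df documents and in at most
    max_df_ratio of all documents. Both thresholds are illustrative."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    return sorted(t for t, n in df.items() if n >= min_df and n / n_docs <= max_df_ratio)
```

Very rare terms carry little shared structure for the map to organize, and terms present in almost every document do not discriminate between documents, which is why both ends are trimmed here.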

  13. Methodology • Document map construction • Single stage • somtoolbox functions for MATLAB • Document vectors normalized before training • SOM with a rectangular structure (10 x 12) and hexagonal neighborhood
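
Two pieces of this setup can be illustrated in plain Python (the slides use the MATLAB toolbox; this is not its code). Unit-length normalization is an assumption, since the slides do not name the normalization method, and reading "10 x 12" as 10 rows by 12 columns is also an assumption. The grid function follows a common convention for hexagonal lattices on a rectangular sheet.

```python
import math


def unit_normalize(vector):
    """Scale a vector to Euclidean length 1; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]


def hex_grid_coords(rows, cols):
    """Planar coordinates for a rectangular sheet of units on a hexagonal lattice:
    every other row is shifted by half a unit and rows are sqrt(3)/2 apart,
    so each interior unit has six neighbors at distance 1."""
    coords = []
    for r in range(rows):
        for c in range(cols):
            x = c + (0.5 if r % 2 else 0.0)
            y = r * math.sqrt(3) / 2
            coords.append((x, y))
    return coords


# 10 x 12 map as on the slide (rows/columns order assumed): 120 units.
coords = hex_grid_coords(10, 12)
```

These coordinates are what a neighborhood function measures distances on during training.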

  14. Methodology • Document map construction • Weights initialized linearly along the two eigenvectors with the largest eigenvalues • Batch-type SOM algorithm with dot product metric • Gaussian neighborhood function • Neighborhood size linearly decreasing with the number of epochs
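
One epoch of a batch SOM with dot-product matching and a Gaussian neighborhood can be sketched as below. This is a generic textbook-style formulation in plain Python under the assumption that inputs are unit length; it is not the SOM Toolbox implementation, and the function and parameter names are invented for the sketch.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return list(v) if n == 0 else [x / n for x in v]


def batch_epoch(data, weights, coords, radius):
    """One batch SOM epoch with dot-product matching and a Gaussian neighborhood.

    data: unit-length input vectors; weights: one model vector per map unit;
    coords: (x, y) position of each unit on the map; radius: neighborhood width.
    """
    # 1. Best-matching unit per input = unit with the largest dot product.
    bmus = [max(range(len(weights)), key=lambda j: dot(x, weights[j])) for x in data]

    new_weights = []
    for j, (xj, yj) in enumerate(coords):
        acc = [0.0] * len(weights[j])
        total = 0.0
        for x, b in zip(data, bmus):
            d2 = (xj - coords[b][0]) ** 2 + (yj - coords[b][1]) ** 2
            h = math.exp(-d2 / (2.0 * radius ** 2))  # Gaussian neighborhood
            total += h
            acc = [a + h * xi for a, xi in zip(acc, x)]
        # 2. Neighborhood-weighted mean of the inputs, renormalized to unit
        #    length so that dot products stay comparable across units.
        new_weights.append(normalize([a / total for a in acc]) if total > 0 else weights[j])
    return new_weights
```

Unlike the sequential algorithm, all best-matching units are found first and every model vector is then replaced in one step, so the result does not depend on the order of the documents.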

  15. Methodology • Document map construction • Parameters • Number of epochs • Rough phase: 10 epochs • Fine-tuning phase: 10 epochs • Neighborhood size • Rough phase • Initial: [(number of units along the largest map dimension) / 2] + 1 • Final: 2 • Fine-tuning phase • Initial: 2 • Final: 0.8
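
For the 10 x 12 map, the rough-phase starting radius works out to 12 / 2 + 1 = 7. The schedule can be sketched as below, assuming the square brackets denote the integer part and that the final value is reached in the last epoch of each phase; neither detail is stated on the slide.

```python
def initial_radius(rows, cols):
    """Rough-phase starting radius: floor(largest map dimension / 2) + 1."""
    return max(rows, cols) // 2 + 1


def linear_schedule(start, end, epochs):
    """Radius for each epoch, decreasing linearly from start to end inclusive."""
    if epochs == 1:
        return [float(start)]
    step = (end - start) / (epochs - 1)
    return [start + i * step for i in range(epochs)]


rough = linear_schedule(initial_radius(10, 12), 2, 10)  # 7 down to 2 over 10 epochs
fine = linear_schedule(2, 0.8, 10)                      # 2 down to 0.8 over 10 epochs
```

The wide rough-phase radius orders the map globally; the narrow fine-tuning radius then adjusts individual units with little influence on their neighbors.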

  16. Methodology • User interface construction • Documents are mapped to the node with the closest model vector in terms of cosine distance • Each map node is labeled according to category • Knowledge areas (CHLA, CBS, TCEN) • Graduate programs
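
The mapping and labeling steps can be sketched as follows. Assigning each node the most frequent category among its documents is an assumption; the slides only say that nodes are labeled according to category.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def assign_to_nodes(doc_vectors, model_vectors):
    """Index of the closest node for each document (highest cosine similarity,
    which is the same as the smallest cosine distance)."""
    return [
        max(range(len(model_vectors)), key=lambda j: cosine_similarity(v, model_vectors[j]))
        for v in doc_vectors
    ]


def label_nodes(assignments, categories):
    """Label each non-empty node with the most frequent category among its
    documents (majority voting is an assumed rule)."""
    per_node = defaultdict(Counter)
    for node, category in zip(assignments, categories):
        per_node[node][category] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in per_node.items()}
```

The same two functions serve both labelings: pass knowledge areas as `categories` for one map view and graduate programs for the other.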

  17. Results

  18. Results • Knowledge Areas • Graduate Programs

  19. Acknowledgement

  20. THANK YOU! Questions? Bruno Pinheiro bfp@cin.ufpe.br Renato Correa renato.correa@ufpe.br
