
J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria. Münster, 26-27 June 2003



Presentation Transcript


  1. GI-DAYS MÜNSTER. A software tool for thesauri management, browsing and supporting advanced searches. J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria. Münster, 26-27 June 2003

  2. Contents
     • Introduction
     • Architecture of THManager application
     • Basic capabilities
     • Enhanced capabilities
     • Conclusions

  3. Introduction to thesauri
     • "A thesaurus is a set of terms that describe the vocabulary of a controlled indexing language, formally organized so that the a priori relationships between concepts (for example synonymous terms, broader terms, narrower terms and related terms) are made explicit" [ISO 2788]
     • Used to improve the precision and recall of information retrieval in digital libraries
       • provide a uniform and consistent vocabulary for indexing metadata ("description of the data holdings")
       • supply users with a suitable vocabulary for retrieval
       • expand users' queries by automatically adding new terms to the query
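The query expansion mentioned on this slide can be illustrated with a minimal sketch. It is not ThManager code: the toy thesaurus, its entries and the `expand_query` helper are all assumptions made for the example, and only one relation step is followed.

```python
# Toy thesaurus: every entry and relation below is invented for illustration.
THESAURUS = {
    "accident": {"NT": ["traffic accident", "nuclear accident"], "SYN": ["mishap"]},
    "traffic accident": {"NT": [], "SYN": ["road accident"]},
    "nuclear accident": {"NT": [], "SYN": []},
}


def expand_query(terms, thesaurus, relations=("SYN", "NT")):
    """Return the query terms plus the terms reachable through one relation step."""
    expanded = list(terms)
    for term in terms:
        entry = thesaurus.get(term, {})
        for relation in relations:
            for candidate in entry.get(relation, []):
                if candidate not in expanded:
                    expanded.append(candidate)
    return expanded


print(expand_query(["accident"], THESAURUS))
```

Following only synonyms and narrower terms tends to favour recall without drifting too far from the original query; which relations to follow is a design choice the slides do not specify.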

  4. Introduction to thesauri
     • A thesaurus management tool becomes a vital component in the development of any kind of digital library
     • One of the main objectives of Spatial Data Infrastructures (SDIs) is to provide discovery, evaluation and access to spatial data for a community of users
     • An SDI can be considered a digital library specialised in geographic information resources
     • A thesaurus management tool is therefore also a vital component in the development of SDIs

  5. Architecture of THManager application
     [Layered architecture diagram; labels recovered from the slide]
     • Level 3. Application: ThManager (ThesaurusMngmt)
     • Level 2. GUI: Thesaurus.gui, generic GUI components for thesauri visualization
     • Level 1. Model: Thesaurus.model (basic: thesaurus management, import/export); enhanced components labelled Keywords (keywords expansion), Polisemy (WordNet Polisemy extraction, branch disambiguation) and Lexicon
     • << JDBC >>
     • Level 0. Database: Thesaurus; 100% SQL (basic); Oracle IntermediaText (enhanced)
     • Other data sources: WordNet files, metadata records

  6. Basic Capabilities
     • Editing of thesauri according to ISO norms
       • Broader terms (BT), narrower terms (NT)
       • Related terms (RT), preferred terms (PT)
       • Scope notes (SN), synonyms (SYN, USE)
       • Language translations (TR)
     • Visualization of thesauri
       • Hierarchical, alphabetical
       • Search of terms
     • Multilingual access support
       • Browsing according to the language selected by users
     • Import/Export
       • Text file proprietary formats
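A minimal sketch of a data model that could hold the relationships listed on this slide. The `Term` and `Thesaurus` classes and their method names are hypothetical, chosen for illustration; the slides do not describe the internal classes of Thesaurus.model.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry; the comments give the relationship codes from the slide."""
    label: str
    broader: list = field(default_factory=list)       # BT
    narrower: list = field(default_factory=list)      # NT
    related: list = field(default_factory=list)       # RT
    scope_note: str = ""                              # SN
    synonyms: list = field(default_factory=list)      # SYN
    translations: dict = field(default_factory=dict)  # TR: language code -> label


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def add(self, term):
        self.terms[term.label] = term

    def link_broader(self, narrower_label, broader_label):
        """Record BT and its inverse NT together so the hierarchy stays consistent."""
        self.terms[narrower_label].broader.append(broader_label)
        self.terms[broader_label].narrower.append(narrower_label)

    def alphabetical(self, language):
        """Alphabetical view; falls back to the main label if no translation exists."""
        return sorted(t.translations.get(language, t.label) for t in self.terms.values())


th = Thesaurus()
th.add(Term("accident", translations={"es": "accidente"}))
th.add(Term("traffic accident"))
th.link_broader("traffic accident", "accident")
print(th.terms["accident"].narrower)
print(th.alphabetical("es"))
```

Storing BT and NT as a pair in one call is one way to keep the hierarchical view and the editing operations from drifting apart.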

  7. Browsing /Edition

  8. Import/export formats
     • Formats
       • Dot based notation
         • succession of narrower terms + additional relationships (SYN, TR, ...)
       • Hierarchical numbering of terms
     • It should use more standardized formats:
       • RDFS/XML, ...
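The slides do not give the syntax of the dot based notation, so the following is only a sketch under an assumed layout: the number of leading dots is the depth of a term, and each term's broader term is the closest preceding line with one dot fewer. The function name `parse_dot_notation` and the sample hierarchy are invented for the example.

```python
def parse_dot_notation(text):
    """Parse lines such as '..nuclear accident' into (term, broader term) pairs.

    Assumed layout (not taken from the slides): leading dots give the depth,
    and the broader term is the closest preceding line one level up.
    """
    pairs = []
    path = []  # path[d] holds the most recent term seen at depth d
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        name = line.lstrip(".")
        depth = len(line) - len(name)
        name = name.strip()
        if depth > len(path):
            raise ValueError(f"line skips a level: {raw!r}")
        broader = path[depth - 1] if depth > 0 else None
        del path[depth:]  # forget deeper terms from the previous subtree
        path.append(name)
        pairs.append((name, broader))
    return pairs


# Invented sample; the real hierarchy of these terms is not given in the slides.
SAMPLE = """
accident
.environmental accident
.major accident
..nuclear accident
.traffic accident
"""

for term, broader in parse_dot_notation(SAMPLE):
    print(term, "->", broader)
```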

  9. Enhanced capabilities
     • Thesauri are intended for the homogeneous classification of resources
       • They are used to fill metadata keywords
     • However, there is still heterogeneity in metadata keywords
       • Metadata creators use different thesauri in different application domains
       • If metadata catalogs provide access to the general public
         • Queries may not contain the same terms as the keywords in metadata records
     • A possible solution to bridge the semantic gap
       • Disambiguation of thesauri (and queries) in relation to the concepts of an upper level ontology

  10. Enhanced capabilities
     [Diagram labels: WordNet; Thesaurus 1, Thesaurus 2, ..., Thesaurus N; Controlled list 1, Controlled list 2, ..., Controlled list N; other knowledge representation models]
     • Additional tools around semantic disambiguation
       • Browsing WordNet as another thesaurus
       • Searching polysemic senses in WordNet
       • Thesauri disambiguation
       • Automatic expansion of keywords

  11. Browsing WordNet
     • WordNet is structured in a hierarchy of synsets
     • Synsets are defined as sets of synonyms representing a particular concept (sense)
     • WordNet libraries and files are accessed through JNI

  12. Searching polysemic senses in WordNet
     • Functionality provided by the Polisemy package
     • Compound terms are partitioned if no synset is found
     • If adjectives are found, their associated nouns are also searched to reduce the number of not-found words
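The lookup strategy on this slide can be sketched as follows. This is not the Polisemy package: the dictionaries stand in for WordNet, every sense identifier is invented, and `find_senses` is a hypothetical helper.

```python
# Toy stand-ins for WordNet; all entries and sense identifiers are invented.
SYNSETS = {
    "accident": ["accident#1", "accident#2"],
    "environment": ["environment#1", "environment#2"],
    "traffic": ["traffic#1"],
}
ADJECTIVE_TO_NOUN = {"environmental": "environment"}


def find_senses(term):
    """Look up the whole term first; partition it into words if nothing is found."""
    term = term.lower()
    if term in SYNSETS:
        return {term: SYNSETS[term]}
    senses = {}
    for word in term.split():
        found = list(SYNSETS.get(word, []))
        noun = ADJECTIVE_TO_NOUN.get(word)
        if noun is not None:
            # Adjective: also search the associated noun.
            found += SYNSETS.get(noun, [])
        senses[word] = found
    return senses


print(find_senses("environmental accident"))
```

A word that still has no senses after both steps is returned with an empty list, which makes the not-found words easy to count.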

  13. Thesauri Disambiguation
     • Unsupervised disambiguation method
       • The senses of every thesaurus term are searched in WordNet
       • The hierarchical structure of the thesaurus is used as the word context for a voting algorithm to find the closest sense
     • Thesauri are partitioned into branches (trees formed by BT/NT terms whose root has no BT)
     [Figure labels: accident, administration, environmental accident, major accident, traffic accident, work accident, accident source, technological accident, ..., nuclear accident, shipping accident; accident, explosion, oil slick, leakage, core meltdown]
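Partitioning into branches can be sketched as grouping every term under the root reached by following its BT links. The function below is an illustrative assumption, not the tool's implementation; it supposes a single BT per term and no cycles, and the sample hierarchy is invented.

```python
from collections import defaultdict


def partition_into_branches(broader_of):
    """Group terms by the root (a term with no BT) reached by following BT links.

    `broader_of` maps each term to its broader term, or None for a top term.
    Assumes one BT per term and no cycles.
    """
    def root(term):
        while broader_of[term] is not None:
            term = broader_of[term]
        return term

    branches = defaultdict(list)
    for term in broader_of:
        branches[root(term)].append(term)
    return dict(branches)


# Invented hierarchy using terms that appear on the slide.
BROADER_OF = {
    "accident": None,
    "traffic accident": "accident",
    "nuclear accident": "accident",
    "core meltdown": "nuclear accident",
    "administration": None,
}

print(partition_into_branches(BROADER_OF))
```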

  14. Thesauri Disambiguation II
     • Voting algorithm to obtain the disambiguated synset of a term "a"
       • Every synset "s" associated with the other terms in the branch votes (with a proximity weight) for the synsets of term "a"
       • Main weight: number of subsumers in the WordNet hierarchy
         • Matches in the WordNet hierarchy of ancestors
       • Discounting factors:
         • Synset depth
         • Branch distance
         • Polysemy of the term associated with synset "s"
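The slide names the ingredients of the vote but not the formula. The sketch below combines them in one plausible way, stated here as an assumption: the number of shared subsumers divided by the product of the candidate's depth, the branch distance and the polysemy of the voting term. The hypernym hierarchy is a toy stand-in for WordNet and all names are hypothetical.

```python
# Toy hypernym hierarchy standing in for WordNet (synset -> parent synset).
HYPERNYM = {
    "entity": None,
    "event": "entity",
    "mishap": "event",
    "accident#1": "mishap",       # an unfortunate event
    "collision#1": "accident#1",
    "attribute": "entity",
    "chance": "attribute",
    "accident#2": "chance",       # something that happens by chance
}


def ancestors(synset):
    """Return the synset followed by all of its subsumers up to the root."""
    chain = []
    while synset is not None:
        chain.append(synset)
        synset = HYPERNYM[synset]
    return chain


def vote(candidate, voter, branch_distance, voter_polysemy):
    """Weight that one voter synset gives to a candidate synset (assumed formula)."""
    shared = len(set(ancestors(candidate)) & set(ancestors(voter)))
    depth = len(ancestors(candidate))
    return shared / (depth * branch_distance * voter_polysemy)


def disambiguate(candidates, context):
    """Pick the candidate synset with the highest total vote.

    `context` lists (senses of another branch term, its distance in the branch).
    """
    totals = {c: 0.0 for c in candidates}
    for senses, distance in context:
        for voter in senses:
            for c in candidates:
                totals[c] += vote(c, voter, distance, len(senses))
    return max(totals, key=totals.get), totals


best, totals = disambiguate(["accident#1", "accident#2"], [(["collision#1"], 1)])
print(best, totals)
```

In this toy case the "collision" sense shares four subsumers with the first sense of "accident" and only the root with the second, so the first sense collects the larger vote.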

  15. Thesauri disambiguation III
     • Annotation of disambiguated synsets

  16. Automatic expansion of keywords with new disambiguated thesauri
     • Comparison between the initial collection of synsets and the synsets of a new term
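One way to read the comparison on this slide is sketched below, assuming that a term from a newly disambiguated thesaurus is added when its synset already appears among the synsets of the existing keywords. The matching rule, the function name `expand_keywords` and the sample data are assumptions; the slides only state that a comparison takes place.

```python
def expand_keywords(keyword_synsets, new_thesaurus):
    """Return terms of a new thesaurus that share a synset with the keywords.

    keyword_synsets: keyword -> synset already assigned to it
    new_thesaurus:   term -> disambiguated synset
    Matching on identical synsets is an assumption made for this sketch.
    """
    known = set(keyword_synsets.values())
    added = [
        term
        for term, synset in new_thesaurus.items()
        if synset in known and term not in keyword_synsets
    ]
    return sorted(added)


# Invented data: sense identifiers are placeholders, not real WordNet synsets.
KEYWORDS = {"accident": "accident#1"}
NEW_THESAURUS = {
    "mishap": "accident#1",
    "chance event": "accident#2",
    "accident": "accident#1",
}

print(expand_keywords(KEYWORDS, NEW_THESAURUS))
```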

  17. Expansion of keywords II

  18. Conclusions & future lines
     • ThManager is a flexible tool to manage thesauri
       • It provides enhanced functionality for the improvement of classifications
     • This tool can be easily integrated in other tools
       • It is used by a metadata editing tool (also presented here) to select the appropriate term for the distinct metadata fields
     • Future lines:
       • Creation of a thesaurus Web Service providing some of the functionality offered by this tool
         • thesaurus browsing, WordNet polysemy extraction, keywords expansion, ...
       • Concept based retrieval
         • Exploit the semantic disambiguation of thesauri to test different information retrieval strategies for geographic data catalogs
         • It is possible to index metadata records according to a unified system: the disambiguated WordNet synsets

  19. Advanced Information Systems Laboratory http://iaaa.cps.unizar.es
