
Helping Interdisciplinary Vocabulary Engineering (HIVE)

Helping Interdisciplinary Vocabulary Engineering (HIVE). October 31, 2011. Joan Boone, Nico Carver, Jane Greenberg, Lina Huang, Robert Losee, Mady Madhura, José Ramón Pérez Agüera, Lee Richardson, Ryan Scherle, Todd Vision, Hollie White, Craig Willis.


Presentation Transcript


  1. Helping Interdisciplinary Vocabulary Engineering (HIVE) • October 31, 2011 • Joan Boone, Nico Carver, Jane Greenberg, Lina Huang, Robert Losee, Mady Madhura, José Ramón Pérez Agüera, Lee Richardson, Ryan Scherle, Todd Vision, Hollie White, Craig Willis

  2. Overview Part 1 • Introduction to HIVE • Underlying rationale • A scenario • Research and challenges Part 2 • Technical overview and implementation • Progress and challenges • Next steps Part 3 • Let you experiment

  3. HIVE Team • José R. P. Agüera • Lina Huang • Bob Losee • Lee Richardson • Madhura Marathe • Hollie White • Jane Greenberg • Craig Willis • Ryan Scherle

  4. HIVE model • <AMG> approach for integrating discipline CVs • Model addressing CV cost, interoperability, and usability constraints (interdisciplinary environment)

  5. Data underlying peer-reviewed articles in the basic and applied biosciences

  6. Vocabulary needs for Dryad • Vocabulary analysis • 600 keywords, Dryad partner journals • Vocabularies: NBII Thesaurus, LCSH, the Getty’s TGN, ERIC Thesaurus, Gene Ontology, ITIS (10 vocabularies) • Facets: taxon, geographic name, time period, topic, research method, genotype, phenotype… • Results • 431 topical terms, exact matches: NBII Thesaurus, 25%; MeSH, 18% • 531 terms (topical terms, research method and taxon): LCSH, 22% found exact matches, 25% partial • Conclusion: Need multiple vocabularies
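The kind of vocabulary analysis summarized on this slide can be sketched as a small matching routine. This is only an illustrative sketch: the normalization rules, the definition of a "partial" match (sharing at least one word with a label), and the sample keywords and labels are all assumptions, not the procedure or data used in the Dryad study.

```python
import re
from typing import Dict, List


def normalize(label: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", label.lower()).split())


def match_keyword(keyword: str, vocabulary: List[str]) -> str:
    """Classify a keyword as 'exact', 'partial', or 'none' against one vocabulary."""
    key = normalize(keyword)
    labels = [normalize(v) for v in vocabulary]
    if key in labels:
        return "exact"
    key_tokens = set(key.split())
    # Assumed definition of "partial": the keyword shares a word with some label.
    if any(key_tokens & set(label.split()) for label in labels):
        return "partial"
    return "none"


def coverage(keywords: List[str], vocabulary: List[str]) -> Dict[str, float]:
    """Return the fraction of keywords in each match class."""
    counts = {"exact": 0, "partial": 0, "none": 0}
    for keyword in keywords:
        counts[match_keyword(keyword, vocabulary)] += 1
    total = len(keywords) or 1  # avoid dividing by zero on empty input
    return {kind: n / total for kind, n in counts.items()}


# Made-up sample data, for illustration only.
keywords = ["Seed dispersal", "wood density", "phylogeny", "body size"]
vocabulary = ["Seed dispersal", "Wood", "Phylogeny", "Population density"]
result = coverage(keywords, vocabulary)
```

Running the same routine once per vocabulary and comparing the exact-match fractions gives the sort of per-vocabulary percentages the slide reports.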

  7. HIVE work-plan HIVE Goals • Provide efficient, affordable, interoperable, and user-friendly access to multiple vocabularies during metadata creation activities • Present a model and an approach that can be replicated -> not necessarily a service 3 Phases 1. Building HIVE: vocabulary preparation, server development 2. Sharing HIVE: continuing education (empowering information professionals) 3. Evaluating HIVE: examining HIVE in Dryad

  8. HIVE Partners Vocabulary Partners • Library of Congress: LCSH • The Getty Research Institute (GRI): TGN (Thesaurus of Geographic Names) • United States Geological Survey (USGS): NBII Thesaurus, Integrated Taxonomic Information System (ITIS) • National Library of Medicine and the National Agricultural Library Advisory Board • Jim Balhoff, NESCent • Libby Dechman, LCSH • Mike Frame, USGS • Alistair Miles, Oxford, UK • William Moen, University of North Texas • Eva Méndez Rodríguez, University Carlos III of Madrid • Joseph Shubitowski, Getty Research Institute • Ed Summers, LCSH • Barbara Tillett, Library of Congress • Kathy Wisser, Simmons • Lisa Zolly, USGS Workshop hosts: Columbia Univ.; Univ. of California, San Diego; George Washington University; Univ. of North Texas; Universidad Carlos III de Madrid, Madrid, Spain

  9. HIVE is for… • HIVE for resource creators - w/Dryad: scientists, depositors • HIVE for information professionals: curators, professional librarians, archivists, museum catalogers

  10. Amy • Meet Amy Zanne. She is a botanist. • Like every good scientist, she publishes, and she deposits data in Dryad. • Amy’s data

  11. Usability • Formal usability study: 4 biologists, 5 information professionals; tasks, usability ratings, satisfaction ranking • Average time to search a concept: librarians, 6.53 minutes; scientists, 3.82 minutes (consistent with research at NIEHS, 2 times as long) • Average time for automatic indexing sequence: librarians, 1.91 minutes; scientists, 2.1 minutes (Huang, 2010)

  12. System usability and flow metrics Huang, 2010

  13. Challenges • Building vs. doing/analysis • Source for HIVE generation, beyond abstracts • Combining many vocabularies during the indexing/term matching phase is difficult, time consuming, inefficient. • NLP and machine learning offer promise • Interoperability = dumbing down ontologies • Proof-of-concept: illustrate the differences between HIVE and other vocabulary registries (NCBO and OBO Foundry) • People wanting a service • General large team logistics, and having people from multiple disciplines (also the ++)

  14. HIVE Technical Overview Craig Willis (craig.willis@unc.edu)

  15. Credits • Ryan Scherle (NESCent) • José Ramón Pérez Agüera (UNC) • Lina Huang (UNC) • Duane Costa (LTER) • Alyona Medelyan & Ian Witten (Univ. of Waikato/NZDL)

  16. HIVE Technical Overview • HIVE combines several open-source technologies to provide a framework for vocabulary services. • Java-based web services can run in any Java application server • Demonstration website (http://hive.nescent.org/) • Open-source Google Code project (http://code.google.com/p/hive-mrc/) • Source code, pre-compiled releases, documentation, mailing lists

  17. Who’s using HIVE? HIVE is being evaluated by several institutions and organizations: • Long Term Ecological Research Network (LTER): prototype for keyword suggestion for Ecological Metadata Language (EML) documents. • Library of Congress Web Archives (Minerva): evaluating HIVE for automatic LCSH subject heading suggestion for web archives. • Dryad Data Repository: evaluating HIVE for suggestion of controlled terms during the submission and curation process (scientific name, spatial coverage, temporal coverage, keywords). • Yale University, Smithsonian Institution Archives

  18. HIVE Functions • System for management of multiple controlled vocabularies in SKOS format • Single interface for browsing, searching, and indexing using multiple vocabularies. • Natural language and structured (SPARQL) queries • Rich internet application (RIA) demonstration interface • Java API and REST interfaces for programmatic access • Framework for conversion of vocabularies to SKOS
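The "single interface" idea on this slide can be illustrated with a toy in-memory registry that searches several SKOS-style vocabularies at once. This is a sketch under stated assumptions, not the HIVE Core API: the class names `Concept` and `VocabularyRegistry`, the `example.org` URIs, and the sample concepts are all hypothetical, and HIVE itself relies on Sesame and Lucene rather than Python lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Concept:
    """Minimal stand-in for a SKOS concept (hypothetical structure)."""
    uri: str
    pref_label: str
    alt_labels: List[str] = field(default_factory=list)
    broader: List[str] = field(default_factory=list)  # URIs of broader concepts


class VocabularyRegistry:
    """Holds several vocabularies and searches them through one call."""

    def __init__(self) -> None:
        self._vocabularies: Dict[str, List[Concept]] = {}

    def add(self, name: str, concepts: List[Concept]) -> None:
        self._vocabularies[name] = list(concepts)

    def search(self, text: str) -> List[Tuple[str, Concept]]:
        """Return (vocabulary name, concept) pairs whose preferred or
        alternate label contains the search text, ignoring case."""
        needle = text.lower()
        hits = []
        for name, concepts in self._vocabularies.items():
            for concept in concepts:
                labels = [concept.pref_label, *concept.alt_labels]
                if any(needle in label.lower() for label in labels):
                    hits.append((name, concept))
        return hits


# Made-up vocabularies, for illustration only.
registry = VocabularyRegistry()
registry.add("vocab_a", [Concept("http://example.org/a/1", "Plants", ["Flora"])])
registry.add("vocab_b", [
    Concept("http://example.org/b/7", "Botany"),
    Concept("http://example.org/b/9", "Aquatic plants",
            broader=["http://example.org/b/7"]),
])
hits = [(name, concept.pref_label) for name, concept in registry.search("plants")]
```

A single `search` call returns matches tagged with their source vocabulary, which is the behavior a shared browsing and indexing interface needs.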

  19. HIVE Components • HIVE Core API • Java API for management of HIVE vocabularies. • HIVE Web Service • Google Web Toolkit (GWT) based interface to demonstrate the HIVE service. Includes Concept Browser and Indexer. • HIVE REST API • RESTful API developed by Duane Costa of the Long Term Ecological Research Network (LTER)

  20. Supporting Technologies • Sesame: Open-source triple store and framework for storing and querying RDF data • Used for primary storage, structured queries • Lucene: Java-based full-text search engine • Used for keyword searching, autocomplete (version 2.0) • H2: Embedded relational database • Stores administrative data, fast concept index, KEA++ lookup tables. • KEA++: Algorithm and Java API for automatic indexing

  21. Architecture

  22. Converting Vocabularies to SKOS Van Assem, Mark. (2010). Converting and Integrating Vocabularies for the Semantic Web. Unpublished dissertation. “We learned that some thesauri have complex structures for which no SKOS counterparts can be found and that for some features care is required in converting them in such a way that they are still usable for their original purpose.”

  23. Converting Vocabularies to SKOS • SKOS does not fit all vocabularies/thesauri • For example, MeSH • Is a MeSH descriptor a SKOS Concept? • “A Method to Convert Thesauri to SKOS” (van Assem et al) • http://thesauri.cs.vu.nl/eswc06/ • Or is a MeSH concept a SKOS concept? • “Converting MeSH to SKOS for HIVE” • http://code.google.com/p/hive-mrc/wiki/MeshToSKOS • Either way, information is lost about the vocabulary
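The two modeling choices on this slide can be contrasted with a small sketch. Everything here is an assumption made for illustration: the record is a hypothetical, heavily simplified MeSH-like structure (a descriptor holding concepts, each holding terms), the identifiers and `example.org` URIs are invented, and linking non-preferred concepts with a `related` property is just one possible mapping, not the conversion documented on the HIVE wiki.

```python
BASE = "http://example.org/mesh/"  # hypothetical namespace

# Hypothetical, simplified MeSH-like record: one descriptor, two concepts.
record = {
    "id": "DX1",
    "name": "Example descriptor",
    "tree_numbers": ["A01.100"],
    "concepts": [
        {"id": "MX1", "name": "Example descriptor", "preferred": True,
         "terms": ["Example descriptor", "Sample descriptor"]},
        {"id": "MX2", "name": "Narrower example", "preferred": False,
         "terms": ["Narrower example"]},
    ],
}


def descriptor_as_concept(descriptor):
    """Option 1: the whole descriptor becomes one SKOS concept.
    Terms from every MeSH concept collapse into altLabels, so the
    information about which term belonged to which concept is lost."""
    labels = {t for c in descriptor["concepts"] for t in c["terms"]}
    return [{
        "uri": BASE + descriptor["id"],
        "prefLabel": descriptor["name"],
        "altLabel": sorted(labels - {descriptor["name"]}),
        "notation": descriptor["tree_numbers"],
    }]


def concept_as_concept(descriptor):
    """Option 2: each MeSH concept becomes its own SKOS concept.
    Descriptor-level data (here, tree numbers) has no natural home and is
    attached to the preferred concept only; that choice is an assumption."""
    preferred = next(c for c in descriptor["concepts"] if c["preferred"])
    result = []
    for c in descriptor["concepts"]:
        result.append({
            "uri": BASE + c["id"],
            "prefLabel": c["name"],
            "altLabel": sorted(set(c["terms"]) - {c["name"]}),
            "notation": descriptor["tree_numbers"] if c["preferred"] else [],
            "related": [] if c["preferred"] else [BASE + preferred["id"]],
        })
    return result


flat = descriptor_as_concept(record)
split = concept_as_concept(record)
```

Comparing `flat` and `split` shows the trade-off: the first output forgets the concept boundaries, while the second keeps them but weakens the descriptor as a grouping unit.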

  24. Converting Vocabularies to SKOS • Additional information • http://code.google.com/p/hive-mrc/wiki/VocabularyConversion • Each vocabulary has different requirements

  25. KEA++ for Keyphrase Extraction Medelyan, O. and Witten, I.H. (2008). “Domain independent automatic keyphrase indexing with small training sets.” Journal of the American Society for Information Science and Technology, 59(7): 1026-1040. • Algorithm and open-source Java library for extracting keyphrases from documents using SKOS vocabularies. • Domain-independent machine learning approach with minimal training set (~50 documents). • Leverages SKOS relationships and alternate/preferred labels • Developed by Alyona Medelyan (KEA++), based on earlier work by Ian Witten (KEA), University of Waikato, New Zealand (http://www.nzdl.org/Kea/) • (Expanded implementation in Medelyan’s MAUI)

  26. KEA++: Feature definition Medelyan, O. and Witten, I.H. (2008). “Domain independent automatic keyphrase indexing with small training sets.” Journal of the American Society for Information Science and Technology, 59(7): 1026-1040. Medelyan, O. (2010). Human-competitive automatic topic indexing. Unpublished dissertation. • Term Frequency/Inverse Document Frequency: Frequency of a phrase’s occurrence in a document compared with its frequency in general use. • Position of first occurrence: Distance from the beginning of the document. Candidates with very high or very low values are more likely to be valid (introduction/conclusion). • Phrase length: Analysis suggests that indexers prefer to assign two-word descriptors. • Node degree: Number of relationships connecting the term to other terms in the CV. • (MAUI expands feature set)
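The four features named on this slide can be computed for a single candidate phrase roughly as follows. This is a minimal sketch, not the KEA++ implementation: the tokenizer, the smoothed IDF formula, and the reading of node degree as "related vocabulary terms that also occur in the document" are assumptions, and the function name `candidate_features` and the sample texts are invented for illustration.

```python
import math
import re
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(phrase: List[str], doc: List[str]) -> List[int]:
    """Token offsets at which the phrase occurs in the document."""
    n = len(phrase)
    return [i for i in range(len(doc) - n + 1) if doc[i:i + n] == phrase]


def candidate_features(phrase: str, document: str, corpus: List[str],
                       related: Dict[str, Set[str]]) -> dict:
    """Compute TF-IDF, first occurrence, phrase length, and node degree."""
    p, doc = tokenize(phrase), tokenize(document)
    hits = positions(p, doc)
    tf = len(hits) / len(doc)
    df = sum(1 for text in corpus if positions(p, tokenize(text)))
    idf = math.log((len(corpus) + 1) / (df + 1))  # assumed smoothing
    # Assumed reading of node degree: related vocabulary terms that
    # also appear in this document.
    degree = sum(1 for term in related.get(phrase.lower(), set())
                 if positions(tokenize(term), doc))
    return {
        "tf_idf": tf * idf,
        # Relative offset of the first match; None if the phrase is absent.
        "first_occurrence": hits[0] / len(doc) if hits else None,
        "length": len(p),
        "node_degree": degree,
    }


# Made-up document, corpus, and vocabulary relations, for illustration only.
document = "Wood density varies among species. Wood density predicts growth."
corpus = [document, "Seed dispersal by birds."]
related = {"wood density": {"growth", "seed dispersal"}}
features = candidate_features("wood density", document, corpus, related)
```

In a KEA-style pipeline, feature vectors like this one would feed a classifier trained on documents with known index terms; that training step is outside the scope of this sketch.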

  27. HIVE – Upcoming • Vocabulary synchronization • Integration of HIVE with LCSH Atom Feed (http://id.loc.gov/authorities/feed/) • Integration and evaluation of alternative algorithms • As part of the Dryad/HIVE integration • Questions: • What is the best algorithm for automatic term suggestion for Dryad vocabularies? • Do different algorithms perform better for title, abstract, full-text, data? • Do different algorithms perform better for a particular vocabulary/taxonomy/ontology?
