Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental Sciences. Data portal based on Open Archives Initiative Protocols and Apache Lucene. Uwe Schindler, uschindler@wdc-mare.org; Michael Diepenbroek, mdiepenbroek@wdc-mare.org. MARUM, University of Bremen, Germany. EGU 2006, Vienna, 2006-04-03.

Presentation Transcript


  1. WDC-MARE – World Data Center for Marine Environmental Sciences • Data portal based on Open Archives Initiative Protocols and Apache Lucene • Uwe Schindler, uschindler@wdc-mare.org • Michael Diepenbroek, mdiepenbroek@wdc-mare.org • MARUM, University of Bremen, Germany • EGU 2006, Vienna, 2006-04-03

  2. Data Portals • WDC-MARE with its information system PANGAEA provides data portals for several EU/international projects: CARBOOCEAN, EUR-OCEANS, IODP • Problem: not all data are stored centrally, so the datasets offered in a portal must be consolidated from different sources!

  3. Example: CARBOOCEAN data portal • Data stays at the data providers • Metadata is harvested by the portal • Search queries are handled by the centralized catalogue • Scientist gets link to data at the provider
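
The workflow on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `MetadataRecord` and `Catalogue` names, the record fields, and the example URLs are hypothetical and do not come from the portal software. The point it shows is that the catalogue keeps metadata only and answers a query with links that lead back to the providers.

```python
from dataclasses import dataclass


@dataclass
class MetadataRecord:
    """Hypothetical harvested record: metadata only, the data stays remote."""
    identifier: str
    title: str
    data_url: str  # link to the dataset at the data provider


class Catalogue:
    """Hypothetical central catalogue that is filled by harvesting."""

    def __init__(self):
        self._records = {}

    def harvest(self, records):
        # Re-harvesting a record with the same identifier replaces the old copy.
        for record in records:
            self._records[record.identifier] = record

    def search(self, term):
        # Naive substring match on the title; returns links, not data.
        term = term.lower()
        return [r.data_url for r in self._records.values()
                if term in r.title.lower()]


# Two made-up providers exposing metadata for their own datasets.
provider_a = [
    MetadataRecord("a:1", "Surface pCO2 in the North Atlantic",
                   "https://provider-a.example/data/1"),
]
provider_b = [
    MetadataRecord("b:7", "Deep-water alkalinity profiles",
                   "https://provider-b.example/data/7"),
    MetadataRecord("b:8", "pCO2 time series",
                   "https://provider-b.example/data/8"),
]

catalogue = Catalogue()
catalogue.harvest(provider_a)
catalogue.harvest(provider_b)
links = catalogue.search("pco2")
print(links)
```

A real portal would replace the substring match with a full-text index, but the division of labour stays the same: search happens centrally, download happens at the provider.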

  4. Open Archives Protocol • The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed by the Open Archives Initiative. • Google uses it during web crawling (Google Scholar) • Almost all digital libraries support it (most famous ones: arXiv and the CERN Document Server) • Very simple to implement (based on XML over HTTP) • Repository software for databases or file system metadata providers is widely available
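
To make "XML over HTTP" concrete, here is a minimal sketch of one harvesting step: building a `ListRecords` request URL and reading record headers and the resumption token from a response. The verb, the argument names, and the namespace are those of OAI-PMH 2.0; the base URL, the set name, and the sample response are made up for illustration, and no network request is sent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request URL for a repository."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: only the verb goes with it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, datestamp), ...], resumption token or None)."""
    root = ET.fromstring(xml_text)
    records = []
    for header in root.findall("oai:ListRecords/oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        records.append((identifier, datestamp))
    token = root.findtext("oai:ListRecords/oai:resumptionToken",
                          namespaces=OAI_NS)
    return records, token or None  # an empty token marks the last page


# Made-up response fragment; a real response carries more elements.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:provider.example:dataset-1</identifier>
        <datestamp>2006-03-01</datestamp>
      </header>
      <metadata/>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("https://provider.example/oai",
                             metadata_prefix="oai_dc", set_spec="carboocean")
records, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("https://provider.example/oai",
                            resumption_token=token)
print(first_url, records, next_url, sep="\n")
```

A harvester would repeat the request with each returned token until the repository sends no token or an empty one.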

  5. Current OAI-PMH software • Limited to Dublin Core metadata (libraries)! • Limited full-text search functionality due to relational databases in the background! • No geographic retrievals (because of the Dublin Core limitation)! • The end user interface is part of the software, which limits its usability inside content management systems

  6. Requirements for portal software • Open for any XML metadata format • Any mappings to document fields should be done by XPath • Possibility to map incompatible XML schemas during harvesting by XSL • No relational database, only a full-text search engine that contains everything needed for operation • Range queries for specific fields (date/time or numeric) • Web service interface for the end user software that is accessible from any language (Java/JSP, PHP, Perl, ...)
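
The range-query requirement can be illustrated with a small sketch. The documents, field names, and the `range_query` helper below are hypothetical, and the linear scan only demonstrates the intended semantics (inclusive bounds, either side optional); a search engine would answer the same question from its index.

```python
from datetime import date

# Made-up documents with one date field and one numeric field.
docs = [
    {"identifier": "ds-1", "date": date(2004, 5, 17), "depth_m": 120.0},
    {"identifier": "ds-2", "date": date(2005, 1, 3), "depth_m": 2500.0},
    {"identifier": "ds-3", "date": date(2005, 11, 20), "depth_m": 10.0},
    {"identifier": "ds-4", "depth_m": 800.0},  # no date recorded
]


def range_query(documents, field, lower=None, upper=None):
    """Identifiers of documents whose field lies in [lower, upper].

    A bound of None leaves that side open; documents lacking the
    field never match.
    """
    hits = []
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue
        if lower is not None and value < lower:
            continue
        if upper is not None and value > upper:
            continue
        hits.append(doc["identifier"])
    return hits


in_2005 = range_query(docs, "date", date(2005, 1, 1), date(2005, 12, 31))
deep = range_query(docs, "depth_m", lower=500.0)
print(in_2005, deep)
```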

  7. MetadataPortal Java Package (architecture diagram; labels only) • Harvesters: OAI-Harvester (OAI-PMH, OAI protocol in HTTP); OAI-Harvester (OAI-PMH, OAI protocol in HTTP, specific set); Filesystem-Harvester (XML files; filesystem directory, FTP, …) • Indexing: XSL, Lucene, Virtual Index • Web service and servers: Apache Axis, Mini PanHTTP Server Jetty, HTTP Server Tomcat • Clients: Portal 1 (Webserver, PHP), Portal 2 (Webserver, JSP) • Stored: xmldata (same format everywhere, XSL before indexing), identifier, lastModified, sets • Searchable: field1: "/oai_dc:dc/dc:author"; field2: "/oai_dc:dc/dc:title"; field3: "java:org.test.LatLon.parse(/oai_dc:dc/dc:coverage)"; default: "." • Namespace for the java: prefix: xmlns:java="http://xml.apache.org/xalan/java"
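
The searchable-field definitions on this slide map field names to XPath expressions. The following sketch shows the idea with Python's standard library rather than the Java package itself. Assumptions: `FIELDS`, `extract_fields`, and the sample record are hypothetical; `xml.etree.ElementTree` supports only a relative XPath subset, so "/oai_dc:dc/dc:title" is written as "dc:title" relative to the root element; and the default field "." is emulated by concatenating all text in the document. The `java:` extension function from field3 is specific to Xalan and is not reproduced.

```python
import xml.etree.ElementTree as ET

# Namespaces of the oai_dc container and the Dublin Core elements.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Field name -> path relative to the <oai_dc:dc> root (hypothetical config;
# the element names follow the paths shown on the slide).
FIELDS = {
    "author": "dc:author",
    "title": "dc:title",
}


def extract_fields(xml_text):
    """Map one metadata record to {field name: list of text values}."""
    root = ET.fromstring(xml_text)
    doc = {}
    for name, path in FIELDS.items():
        doc[name] = [el.text.strip() for el in root.findall(path, NS)
                     if el.text and el.text.strip()]
    # Default field: all text of the record, like the string value of ".".
    doc["default"] = " ".join(t.strip() for t in root.itertext() if t.strip())
    return doc


# Made-up record in the oai_dc container format.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Carbon flux measurements</dc:title>
  <dc:author>A. Example</dc:author>
</oai_dc:dc>"""

fields = extract_fields(SAMPLE_RECORD)
print(fields)
```

Keeping the mapping in a table like `FIELDS` is what lets the same code index any XML metadata format: only the paths change.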

  8. CARBOOCEAN Data Portal • Metadata standard harvested for search: DIF v9.4 • Searchable fields: Bounding box, date/time, parameters, authors, investigators, title • Data centers: • World Data Center for Marine Environmental Sciences (WDC-MARE), University of Bremen and Alfred-Wegener-Institute in Bremerhaven, Germany • French National Oceanographic Data Centre, SISMER (Systèmes d'Informations Scientifiques pour la Mer) at the Ifremer in Brest, France • Carbon Dioxide Information Analysis Center (CDIAC), Environmental Sciences Division at Oak Ridge National Laboratory, USA
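
A bounding-box search like the one listed among the searchable fields comes down to a rectangle intersection test. The sketch below is one common way to express it, not the portal's actual implementation. The `BBox` type, the dataset names, and the coordinates are hypothetical, and boxes crossing the 180° meridian are deliberately not handled.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Geographic bounding box in decimal degrees (hypothetical type)."""
    west: float
    south: float
    east: float
    north: float


def intersects(a, b):
    """True if the two boxes overlap or touch.

    Assumes west <= east for both boxes, i.e. no antimeridian crossing.
    """
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


# Made-up dataset extents.
datasets = {
    "north-atlantic-cruise": BBox(-60.0, 30.0, -10.0, 65.0),
    "southern-ocean-moorings": BBox(-30.0, -70.0, 20.0, -50.0),
    "mediterranean-survey": BBox(-5.0, 30.0, 36.0, 46.0),
}

# Query box roughly covering the north-east Atlantic.
query = BBox(-20.0, 40.0, 0.0, 60.0)
hits = [name for name, box in datasets.items() if intersects(box, query)]
print(hits)
```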

  9. Thank you!
