Extracting XML from Unicorn with OAI and SRU


Presentation Transcript


    1. Extracting XML from Unicorn with OAI and SRU
       European Unicorn User Group Conference, Glasgow Caledonian University, September 7th & 8th, 2006

    2. Agenda
       Introduction – Unicorn interfaces
       Part 1: An OAI frontend for Unicorn
       Part 2: An SRU frontend for Unicorn
       Short description of OAI and SRU protocols
       Overview of technical implementation
       Use cases and demos

    3. OAI and SRU are ‘open’ protocols that permit exchange of metadata between information systems
       Well-known Unicorn interfaces:
           Unicorn API server
           Unicorn Webcat/iBistro/iLink server
           Unicorn Z39.50 server
       All comply with the philosophy of request/response sequences

    7. API: proprietary; low interoperability level
       HTML: record data not well structured; low reusability level
       Z39.50: protocol specific; more difficult to implement (steep learning curve); Z39.50 is stateful
       => Difficult to integrate into today’s web services environments
       => communication: use HTTP
       => information exchange: use open protocols (like OAI and SRU)
       => record data structure: use XML (according to a well-defined XML Schema)

    8. HTTP / Open / XML
       OAI-PMH: Open Archives Initiative – Protocol for Metadata Harvesting
       SRU: Search and Retrieve via URL

    10. ‘Harvester collects metadata from archives’
        Stateless protocol: sequence of OAI requests/responses over HTTP
        Just harvesting, NOT searching

    11. OAI requests
        HTTP GET|POST requests
        Syntax:
            BASE URL: host + port + path of OAI request handler
            key=value pairs
        Examples:
            http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog?verb=Identify
            http://www.biomedcentral.com/oai/1.1/bmcoai.asp?verb=GetRecord&identifier=oai:bmc:1471-2105-1-1&metadataPrefix=oai_dc
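The request syntax on this slide (base URL plus key=value pairs) can be sketched in a few lines of Python. This is only an illustration: the helper name `oai_request_url` is hypothetical, and the colon is kept unescaped so the output matches the URLs shown on the slide.

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI GET URL: base URL, '?', then key=value pairs.

    Hypothetical helper for illustration; ':' is left unescaped so that
    identifiers such as oai:bmc:... stay readable.
    """
    params = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    print(oai_request_url("http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog", "Identify"))
    print(oai_request_url(
        "http://www.biomedcentral.com/oai/1.1/bmcoai.asp",
        "GetRecord",
        identifier="oai:bmc:1471-2105-1-1",
        metadataPrefix="oai_dc",
    ))
```

Run as a script, it prints the two example URLs from the slide.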

    12. OAI responses
        XML encoded bytestreams, containing the records
        Record = triplet:
            header (unique OAI identifier)
            metadata
            about
        Metadata schemes:
            XML Schema
            Minimum: unqualified Dublin Core
            Community specific
        Example of a record (catkey 450000 from ULB catalogue): oai_dc, marc21, umods
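To make the header/metadata structure concrete, here is a minimal Python sketch that reads the identifier and a few Dublin Core fields from one record. The sample XML is made up for illustration (it is not the actual ULB record); only the namespace URIs are the standard OAI-PMH 2.0, oai_dc and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, invented for this example.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:ulbcat:450000</identifier>
    <datestamp>2006-08-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Hypothetical title</dc:title>
      <dc:creator>Hypothetical author</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text: str) -> dict:
    """Extract header fields and Dublin Core titles/creators from one record."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    dc = record.find("oai:metadata/oai_dc:dc", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "titles": [e.text for e in dc.findall("dc:title", NS)],
        "creators": [e.text for e in dc.findall("dc:creator", NS)],
    }


if __name__ == "__main__":
    print(parse_record(SAMPLE))
```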

    13. Simple: 6 OAI requests/responses
        Identify
            http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=Identify
        ListMetadataFormats [identifier]
            http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListMetadataFormats
        ListSets
            http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListSets
        GetRecord identifier, metadataPrefix
            http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=marc21

    14. Simple: 6 OAI requests/responses
        ListRecords metadataPrefix, [from,until,set]
            http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc
            http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper
            http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=marc21&from=2006-08-01
        ListIdentifiers metadataPrefix, [from,until,set]
            http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListIdentifiers&metadataPrefix=oai_dc
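The six verbs and their required and optional (bracketed) arguments, as listed on slides 13 and 14, fit in a small lookup table. The sketch below checks a request against that table. The function name and the message wording are hypothetical, and the OAI-PMH flow-control argument `resumptionToken` is left out because the slides do not cover it.

```python
# verb -> (required arguments, optional arguments), as listed on the slides
VERBS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
}


def check_arguments(verb: str, arguments: dict) -> list:
    """Return a list of problems; an empty list means the request is well formed."""
    if verb not in VERBS:
        return [f"unknown verb: {verb}"]
    required, optional = VERBS[verb]
    given = set(arguments)
    problems = [f"missing argument: {name}" for name in sorted(required - given)]
    problems += [
        f"unexpected argument: {name}"
        for name in sorted(given - required - optional)
    ]
    return problems


if __name__ == "__main__":
    print(check_arguments("GetRecord", {"identifier": "oai:ulbcat:245000"}))
    print(check_arguments("ListRecords", {"metadataPrefix": "marc21", "from": "2006-08-01"}))
```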

    15. Implementation of the data provider functionality (2001)
        http://www.openarchives.org/tools/tools.html
        Pick a template and interface with Unicorn through Unicorn database tools
        Our choice: Object Oriented Perl frontend (H. Suleman – Virginia Tech)

    17. Example: implementation of the GetRecord request
        http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=oai_dc
        1. Get metadata from Unicorn for catkey 245000
            $record = `echo $catkey | catalogdump -of | filtermarc -iALL -od -Ds`;
            @dates = split('\|', `echo $catkey | selcatalog -iK -opr`);
        2. Convert ANSEL character set into ISO-LATIN-1
        3. Map from MARC to oai_dc
        4. Format into XML
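Steps 3 and 4 above (map MARC to oai_dc, then format as XML) could look roughly like the following Python sketch. It is not the presenters' Perl code. The input format (tuples of tag, subfield code, value) and the small mapping table are assumptions chosen for illustration, loosely following common MARC-to-Dublin-Core crosswalk conventions. Step 2 (ANSEL conversion) is not covered.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

# Illustrative mapping only (an assumption, not the mapping used at ULB).
MARC_TO_DC = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
    ("650", "a"): "subject",
}


def marc_to_oai_dc(fields) -> str:
    """Map (tag, subfield, value) tuples to an oai_dc XML string.

    Fields without an entry in MARC_TO_DC are silently dropped.
    """
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for tag, code, value in fields:
        element = MARC_TO_DC.get((tag, code))
        if element:
            ET.SubElement(root, f"{{{DC}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(marc_to_oai_dc([
        ("245", "a", "Relativity & physics"),
        ("100", "a", "Einstein, Albert"),
    ]))
```

Building the tree with ElementTree rather than string concatenation means characters such as `&` are escaped automatically in step 4.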

    18. Example: implementation of the ‘set’ parameter of the ListRecords request
        http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&set=elper
        Precompile each set as a file of catkeys
        Name of file: « <name of set>_catkeys »
            einstein_albert_catkeys
            elper_catkeys
            sd_catkeys
            all_catkeys
        Files are built through periodic execution of the « mkoaisets » custom report
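Resolving a `set` value to its precompiled file can be sketched as below. The helper names are hypothetical, and several details are assumptions rather than facts from the slide: that `all_catkeys` is used when no set is given, that the files hold one catkey per line, and that set names are restricted to letters, digits and underscores. That last check is added here because the name comes straight from the request URL.

```python
import re

_SET_NAME = re.compile(r"[A-Za-z0-9_]+")


def set_file_name(set_name=None) -> str:
    """Return '<name of set>_catkeys'; assumes 'all' when no set is requested."""
    name = "all" if set_name is None else set_name
    if not _SET_NAME.fullmatch(name):
        # Reject anything that could escape the directory holding the set files.
        raise ValueError(f"invalid set name: {name!r}")
    return f"{name}_catkeys"


def read_catkeys(lines) -> list:
    """Read catkeys from an iterable of lines (assumed: one catkey per line)."""
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    print(set_file_name("elper"))
    print(read_catkeys(["245000\n", "\n", "450000\n"]))
```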

    19. Example: implementation of the ‘from/until’ parameters of the ListRecords request
        http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&from=2006-08-01&until=2006-08-31
        BRS index on creation/modification date?
        Every Unicorn record that gets created or modified is ‘touched’ in the ‘textedit’ and ‘browsedit’ directories
        Custom report ‘cadutext’:
            saves catkeys to <ud>/Savedkeys/adutext/rptid
            adds line ‘rptid|date|status’ to <ud>/Lastruns/cadutext
        Example: « from=2006-08-01&until=2006-08-31 »
            obtain report ids for all runs of cadutext after 2006-08-01 and before 2006-08-31 from the file <ud>/Lastruns/cadutext
            for each of these report ids: obtain catkeys from <ud>/Savedkeys/adutext/rptid and save them to randomnumber_catkeys file
            sort and uniq the randomnumber_catkeys file
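The lookup described on this slide can be sketched in Python as follows. This is a simplified model, not the original implementation: the date column of the Lastruns file is assumed to be in YYYY-MM-DD form, the saved key files are represented by a dictionary (`saved_keys`, a stand-in for the files under `<ud>/Savedkeys/adutext/`), both bounds are treated as inclusive, and keys are sorted numerically.

```python
from datetime import date


def report_ids_between(lastruns_lines, from_date: date, until_date: date) -> list:
    """Select report ids from 'rptid|date|status' lines within the date range."""
    ids = []
    for line in lastruns_lines:
        line = line.strip()
        if not line:
            continue
        rptid, run_date, _status = line.split("|", 2)
        # Assumption: the date field is ISO formatted (YYYY-MM-DD).
        if from_date <= date.fromisoformat(run_date) <= until_date:
            ids.append(rptid)
    return ids


def changed_catkeys(lastruns_lines, saved_keys: dict, from_date: date, until_date: date) -> list:
    """Collect catkeys of all selected runs, deduplicated and sorted numerically."""
    keys = set()
    for rptid in report_ids_between(lastruns_lines, from_date, until_date):
        keys.update(saved_keys.get(rptid, []))
    return sorted(keys, key=int)


if __name__ == "__main__":
    runs = ["r1|2006-07-30|OK", "r2|2006-08-02|OK", "r3|2006-08-20|OK"]
    saved = {"r2": ["245000", "450000"], "r3": ["450000", "1000"]}
    print(changed_catkeys(runs, saved, date(2006, 8, 1), date(2006, 8, 31)))
```

Using a set gives the same result as the "sort and uniq" step on the temporary file.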

    20. Limitations of implementation
        ListRecords/ListIdentifiers:
            The from and until parameters are not permitted if the set parameter is given on the request
            The from and until parameters are permitted if the set parameter is not given on the request, but their values should fall within a certain date range (at this moment arbitrarily set to ‘today - 2 months’ and ‘today’)
        Deleted records
        Complete source code and documentation available on the API Repository (http://sirsiapi.org)
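The two ListRecords/ListIdentifiers restrictions can be expressed as a small validation step. The sketch below is hypothetical (function names and messages are not from the original code) and interprets ‘today - 2 months’ as the same day of the month two calendar months earlier, clamped to the end of shorter months.

```python
import calendar
from datetime import date


def two_months_before(today: date) -> date:
    """Same day of month two calendar months earlier, clamped to month length."""
    month_index = today.year * 12 + (today.month - 1) - 2
    year, month0 = divmod(month_index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))


def check_selective_harvest(arguments: dict, today: date) -> list:
    """Return a list of problems for the from/until/set combination."""
    has_dates = "from" in arguments or "until" in arguments
    if "set" in arguments and has_dates:
        return ["from/until cannot be combined with set"]
    problems = []
    earliest = two_months_before(today)
    for name in ("from", "until"):
        if name in arguments:
            value = date.fromisoformat(arguments[name])
            if not earliest <= value <= today:
                problems.append(f"{name} must fall between {earliest} and {today}")
    return problems


if __name__ == "__main__":
    today = date(2006, 9, 7)
    print(check_selective_harvest({"set": "elper", "from": "2006-08-01"}, today))
    print(check_selective_harvest({"from": "2006-08-01", "until": "2006-08-31"}, today))
```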

    23. Use case 1: Vlink - OpenURL resolver system
        OpenURL sent from iLink:
            http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924
        This OpenURL does not contain enough metadata for the specific item
        ==> Vlink does a fetch back to Unicorn through an OAI GetRecord request to obtain a full MARC21 bibliographic description:
            http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:617924&metadataPrefix=marc21
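The fetch-back step amounts to taking the `id` value out of the incoming OpenURL and putting it into a GetRecord request. A minimal Python sketch, with a hypothetical function name and no claim about how Vlink actually does it:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

OAI_BASE_URL = "http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog"


def fetch_back_url(openurl: str, metadata_prefix: str = "marc21") -> str:
    """Turn the 'id' of an incoming OpenURL into an OAI GetRecord URL."""
    query = parse_qs(urlsplit(openurl).query)
    identifier = query["id"][0]  # raises KeyError if the OpenURL has no id
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return f"{OAI_BASE_URL}?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    print(fetch_back_url(
        "http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924"
    ))
```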

    24. Use case 1: Vlink - OpenURL resolver system Feed Vlink Knowledge Base through OAI harvesting

    25. Use case 2: Unicat - Virtual Union Catalog of Belgium

    27. ‘Client searches and retrieves metadata records from an archive’
        Stateless protocol: sequence of SRU requests/responses over HTTP
        Search and Retrieve (<-> OAI: harvesting)

    28. SRU requests
        HTTP GET requests
        Syntax:
            BASE URL: host + port + path of SRU request handler
            key=value pairs
        3 possible requests (operations):
            explain
                records the facilities available at an SRU server
                used by clients to self-configure
                returned explain record is in XML and follows the ZeeRex Schema
                Example: http://z3950.loc.gov:7090/voyager?version=1.1&operation=explain
            scan
                allows the client to request a range of the available terms at a given point within a list of indexed terms
                enables clients to present an ordered list of values and, if supported, how many hits there would be for a search on that term
            searchRetrieve

    29. searchRetrieve operation
        searchRetrieve (principal) parameters:
            version: version of the request; current protocol version: 1.1
            query: query expressed in CQL
            startRecord: position within the sequence of matched records of the first record to be returned
            maximumRecords: number of records requested to be returned
            recordSchema: schema requested for the records to be returned
            stylesheet: URL for an XML stylesheet. The client requests that the server simply return this URL in the response.
        CQL:
            « Traditionally, query languages have fallen into two camps: Powerful, expressive languages, not easily readable nor writable by non-experts (e.g. SQL, PQF, and XQuery); or simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL and google). CQL tries to combine simplicity and intuitiveness of expression for simple, every day queries, with the richness of more expressive languages to accommodate complex concepts when necessary. » (http://www.loc.gov/standards/sru/cql)
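As an illustration of how the parameters above combine into a request, the following Python sketch assembles a searchRetrieve URL. The function and its argument names are hypothetical; only the SRU parameter names come from the slide. Spaces are percent-encoded as %20.

```python
from urllib.parse import quote, urlencode


def search_retrieve_url(base_url, query, start_record=None, maximum_records=None,
                        record_schema=None, stylesheet=None, version="1.1"):
    """Build an SRU searchRetrieve URL; optional parameters are added only if set."""
    params = {"version": version, "operation": "searchRetrieve", "query": query}
    optional = {
        "startRecord": start_record,
        "maximumRecords": maximum_records,
        "recordSchema": record_schema,
        "stylesheet": stylesheet,
    }
    params.update({key: value for key, value in optional.items() if value is not None})
    # quote_via=quote encodes spaces as %20 instead of '+'
    return f"{base_url}?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    print(search_retrieve_url(
        "http://bib49.ulb.ac.be:9000/Cible",
        'title all "einstein albert"',
        start_record=1,
        maximum_records=10,
    ))
```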

    30. searchRetrieve operation
        Examples of CQL queries:
            dinosaur
            title = "complete dinosaur"
            title exact "the complete dinosaur"
            dinosaur not reptile
            dinosaur and bird or dinobird
            publicationYear < 1980
            title all "complete dinosaur"
                title contains all of the words: ‘complete’, and ‘dinosaur’
            title any "dinosaur bird reptile"
                title contains any of the words: ‘dinosaur’, ‘bird’, or ‘reptile’
            ribs prox/distance<=5 chevrons
                a more specific proximity query: ‘ribs’ within 5 words of ‘chevrons’
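When CQL queries like these are built from user input, terms containing whitespace or reserved characters need double quotes. The sketch below shows one way to do that in Python. The helper names are hypothetical, and the quoting rule is a simplified reading of CQL (quote when the term contains whitespace or any of `< > = / ( ) "`), not a full implementation of the grammar.

```python
import re

# Simplified rule: quote terms containing whitespace or CQL-reserved characters.
_NEEDS_QUOTES = re.compile(r'[\s<>=/()"]')


def quote_term(term: str) -> str:
    """Wrap a search term in double quotes when required, escaping inner quotes."""
    if term == "" or _NEEDS_QUOTES.search(term):
        return '"' + term.replace('"', '\\"') + '"'
    return term


def clause(index: str, relation: str, term: str) -> str:
    """Build a single 'index relation term' search clause."""
    return f"{index} {relation} {quote_term(term)}"


if __name__ == "__main__":
    print(clause("title", "all", "complete dinosaur"))
    print(clause("publicationYear", "<", "1980"))
    print(f"{quote_term('dinosaur')} not {quote_term('reptile')}")
```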

    31. searchRetrieve operation -- examples
        http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&query=author=einstein
        http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein
        http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein&recordSchema=dc
        http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author all "einstein albert"
        http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"
        http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleCanevas.xsl
        http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl

    34. SRU/Z39.50 Gateway: YAZ Proxy (Index Data)
        Implemented at ULB: 7/2006 (2 days)
        config.xml
            <target name="cible" default="1">
              <url>bib7.ulb.ac.be:2200</url>
              <xi:include href="explain.xml"/>
              <cql2rpn>pqf.properties</cql2rpn>
            </target>
            <target name="slavko" default="1">
              <url>velma.library.mun.ca:2200</url>
              <xi:include href="explain.slavko.xml"/>
              <cql2rpn>pqf.slavko.properties</cql2rpn>
            </target>
        explain.xml
            ZeeRex XML record as response to ‘explain’ operation
        pqf.properties
            specifies the mapping of various CQL indexes, relations, etc. into Type-1 query attributes

    35. YAZ Proxy
        http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
        http://bib49.ulb.ac.be:9000/Slavko?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl

    36. Seamless integration of catalog searches in CMS Typo3
        Example:
            HTML page containing a biography of the famous Belgian historian Henri Pirenne
            frame pointing to the following URL:
            http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=pirenne%20and%20epub-dnu-*&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
        Project:
            Unicorn contains descriptions of databases, websites, etc. with local thematic classification codes in MARC field 653
            create thematic websites within our CMS, containing frames that list available databases per theme
