Metadata concepts, issues and experiences – lessons from 8 years of metadata management at CMR

Tony Rees, Divisional Data Centre, CSIRO Marine Research, Australia ([email protected])



For CSE Metadata Workshop, Canberra, May 2005.



Tony rees divisional data centre csiro marine research australia tony rees csiro au

Metadata concepts, issues and experiences – lessons from 8 years of metadata management at CMR

- for CSE Metadata Workshop, Canberra, May 2005

Tony Rees

Divisional Data Centre

CSIRO Marine Research, Australia

([email protected])


Overview

  • Some definitions / concepts

  • Who are the clients for metadata? (what is our target audience)

  • How do people find metadata? (discovery / search mechanisms)

  • The national metadata infrastructure context (ASDD etc.)

  • Search methods – free text vs. structured searches, and the CMR (MarLIN) approach

  • What metadata to collect?

  • Space and time “footprints” in metadata records (storage and search implications)

  • How do we populate the system...

  • Selected implementation aspects (when actually building a system).


Metadata is …

  • Structured, summary information regarding a dataset or similar resource

  • Conforms to some standard – e.g. ANZLIC (for our region), ISO 19115, can have agency-specific extensions

  • Provides both descriptions of resources (cataloguing / documentation function) and potentially, previews of / access point to the data

  • Definition of “Dataset” – in the eye of the beholder – a logical set of data sharing common attributes, e.g. data type, collection method, survey / experiment ... – size of data “chunks” (granularity of the metadata) determined by agency practices and preferences

  • Probably good to distinguish dataset-level metadata from item level descriptions (keep in separate, tailored systems).


Some example metadata systems …

  • GCMD (NASA)

  • NERC Metadata Gateway (UK)

  • Australian Spatial Data Directory (another gateway)

  • MarLIN (CMR metadata system)


What are we trying to do here?

  • Describe our data holdings – to the inside and outside world

  • Bring together relevant dataset documentation (or pointers to it) in a single, www-accessible location

  • Provide a good (i.e.: tailored) set of search tools which suit our data holdings and “target” users

  • Facilitate access to our data – on a self serve basis (where possible) **

  • Connect our entered information to the wider world for “discovery” purposes, e.g. to metadata gateways and internet search engines

  • Re-use metadata as a “building block” in broader Divisional systems (capture once, use many times) **

    (** = value adding)


Who are the clients for our metadata?

(hopefully not...)


Who are the clients for our metadata?

(hopefully yes...)


Who are the clients for our metadata?

  • CSIRO researchers and their internal / external collaborators (e.g. for data discovery)

  • Divisional management

  • External parties – schools, public, scientific community, policy makers, consultants

  • Ourselves – if an extensive data custodian (use for internal cataloguing / data access purposes)

  • Recipients of CSIRO data – can supply metadata along with data products (also, may be a project deliverable)

  • Future users (v. important) – “corporate memory”


How do people find metadata?

  • Agency-level systems (own access points)

  • Metadata gateways – e.g. ASDD (Australian Spatial Data Directory) for Australia, NERC metadata gateway for UK

  • Future one-CSIRO system (??)

  • Internet search engines e.g. Google (if mechanism for crawling is enabled)

  • Standalone metadata files (e.g. supplied with data).

    NB: all have their place, e.g. agency-level systems may support richer or better targeted search facilities than those available via gateways.


Australian Spatial Data Directory – national cross-agency metadata gateway

(Diagram: National Metadata Infrastructure. ASDD sits above the agency metadata systems – CMR, DEH, BoM, GA, etc.; labels include MarLIN, EDD and “future agency system” – which describe / point to each agency’s data: CMR data, DEH data, BoM data, GA data, etc.)

  • search via ASDD – search across multiple agencies, basic functionality

  • search via MarLIN – search only CMR holdings, but extra functionality (also view “CMR internal” records not visible to external users)


ASDD search – across multiple agency systems


(etc.)


Limitations of text-based searching...

  • Basically a “hit and miss” method – no “browse” capability, or method to broaden / focus the search

  • Relies on searcher and metadata creator using same words for same concepts (does not happen in practice, with free text entry across multiple systems)

  • ... e.g. “whales” vs. “cetaceans” vs. “marine mammals” vs. species scientific names (multiple wordings covering potentially the same concept)

  • Also, converse applies – one word, multiple uses, e.g. shark (fish), shark cat (type of boat), Shark Bay (place)...

  • Variant spellings also a problem (e.g. sea lion vs. sea-lion vs. sealion; fishery vs. fisheries; organization vs. organisation; Mt. vs. Mount ...)

  • Typographical errors may render document invisible to a free text search (can be at either end, e.g. searcher or stored data).
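The failure modes listed above can be reproduced with a naive case-insensitive substring match. This is only an illustrative sketch: the record titles and the `free_text_search` helper are invented for the example and are not taken from ASDD or MarLIN.

```python
# Hypothetical record titles, invented for illustration only.
RECORDS = [
    "Sightings of cetaceans off southern Australia",
    "Shark Bay seagrass survey",
    "Sea-lion pup counts, South Australia",
    "Gummy shark catch and effort data",
]


def free_text_search(term, records=RECORDS):
    """Return the records whose text contains the search term (case-insensitive)."""
    needle = term.lower()
    return [r for r in records if needle in r.lower()]


# Different words for the same concept: the cetacean record is missed.
print(free_text_search("whales"))
# One word, several uses: a place name comes back alongside the fish.
print(free_text_search("shark"))
# Variant spelling: "sea lion" does not match "Sea-lion".
print(free_text_search("sea lion"))
```

Stemming or synonym expansion could patch individual cases, but each patch has to anticipate the variants in advance, which is the underlying weakness of free text.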


cf – Advantages of picklists (“controlled vocabularies”)...

  • Steers users to use “one concept, one descriptor” approach; no spelling variants / errors

  • Can organise thematically / hierarchically, i.e. “shark” under zoology, “Shark Bay” under localities... (less confusion); also can have explicit relationships (broader / narrower, related categories, etc.)

  • Supports structured information retrieval and browsing

  • Good prompt for terms that the searcher (or content creator) may not otherwise think to enter

  • Amenable to global updates (hold list item IDs in the record, actual values in a look-up table, change in one place only)

  • Can be access point to more extensive stored additional information (e.g. via project, voyage, organisation, publication ID) – content creator picks a value from the list, system automatically adds the rest

    Main difficulties: getting agreement on list content; anticipating all user needs; loss of flexibility / fine detail of expression (i.e., still a need for free text as optional supplement). Also, list maintenance is an overhead.
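The “IDs in the record, values in a look-up table” idea can be sketched as follows. The keyword IDs, descriptors and dataset titles are hypothetical, and plain dictionaries stand in for what would be database tables in a real system.

```python
# Look-up table: list item ID -> descriptor. IDs and terms are hypothetical.
KEYWORDS = {
    101: "Marine mammals",
    102: "Sharks and rays",
    201: "Shark Bay (locality)",
}

# Records hold only the IDs chosen from the picklist, never the text itself.
DATASETS = [
    {"title": "Aerial survey 1998", "keyword_ids": [101]},
    {"title": "Trawl bycatch study", "keyword_ids": [101, 102]},
    {"title": "Seagrass mapping", "keyword_ids": [201]},
]


def search_by_keyword(keyword_id):
    """Structured search: return titles of records tagged with the given ID."""
    return [d["title"] for d in DATASETS if keyword_id in d["keyword_ids"]]


def display_keywords(dataset):
    """Resolve stored IDs to their current descriptors at display time."""
    return [KEYWORDS[k] for k in dataset["keyword_ids"]]


# Global update: rename the descriptor once; no dataset record is edited.
KEYWORDS[101] = "Cetaceans and other marine mammals"
print(search_by_keyword(101))
print(display_keywords(DATASETS[0]))
```

Because the search runs on IDs rather than words, the fish and the locality can never be confused, and the rename reaches every record that uses the term.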


e.g. MarLIN approach... (example: search by taxonomic group)


(etc.)

NB:

(1) this method (in principle) maximises both “recall” (getting records that you do want) and “precision” (not getting records that you don’t want)

(2) fewer “0 records returned” messages (user cannot search on terms not actually used)
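For readers who want the two terms pinned down, recall and precision are usually computed as below. The record IDs are made up purely to show the arithmetic and are not measurements from any system; both functions assume non-empty sets.

```python
def recall(retrieved, relevant):
    """Fraction of the wanted records that the search actually returned."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the returned records that were actually wanted."""
    return len(retrieved & relevant) / len(retrieved)


# Invented record IDs: four records are truly relevant to the query.
relevant = {1, 2, 3, 4}
free_text_hits = {1, 2, 7, 8, 9}   # misses 3 and 4, returns three unwanted records
picklist_hits = {1, 2, 3, 4}       # the "in principle" ideal case

print(recall(free_text_hits, relevant), precision(free_text_hits, relevant))
print(recall(picklist_hits, relevant), precision(picklist_hits, relevant))
```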


What metadata to collect? – 1

  • Core ANZLIC fields – title, abstract, space and time ranges, data quality, data contact point, ANZLIC search words... (c. 40 fields)


What metadata to collect? – 2

  • Other fields of value to the agency – e.g...

    • project codes + associated info.

    • more specialised keywords or search terms

    • controlled defined regions list

    • links - data documentation, graphics links, data access

    • stored data volume, stored data location

    • references, contributors, acknowledgements (e.g. funding) ...

  • Some of the above correspond to elements in the ISO standard (c. 400 fields), some will be new

  • Tension between simple metadata set (few elements, but easy to collect) and more extensive dataset information (more effort to collect, but increased future value and / or structured search options).
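One way to picture the trade-off is a record type whose core fields are mandatory and whose agency extensions are optional. This is a sketch only: the field names are hypothetical, cover a small subset of the fields mentioned above, and do not reproduce the ANZLIC, ISO 19115 or MarLIN schemas.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DatasetRecord:
    # Core fields (a small, hypothetical subset of the c. 40 ANZLIC fields).
    title: str
    abstract: str
    start_date: date
    end_date: date
    south: float
    north: float
    west: float
    east: float
    contact: str
    # Agency-specific extensions: optional, so a minimal record stays easy to create.
    project_code: Optional[str] = None
    keyword_ids: List[int] = field(default_factory=list)
    region_ids: List[int] = field(default_factory=list)
    doc_links: List[str] = field(default_factory=list)
    data_links: List[str] = field(default_factory=list)
    stored_volume_mb: Optional[float] = None


# A minimal record: only the core fields are supplied (values are invented).
minimal = DatasetRecord(
    title="Example trawl survey",
    abstract="Hypothetical record used for illustration.",
    start_date=date(2000, 1, 1),
    end_date=date(2000, 12, 31),
    south=-44.0, north=-40.0, west=144.0, east=149.0,
    contact="Data Centre",
)
print(minimal.project_code, minimal.keyword_ids)
```

Leaving the extensions optional keeps entry cheap, at the cost that structured searches on those fields only find the records where someone filled them in.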


CMR Metadata search page (portion)

... in order to be useful for structured searches, relevant information must be captured at metadata entry time, in a consistent way (e.g. via picklists and supporting tables).


Also need to consider space, time “footprints”, i.e. how to support these at search time

Example for a CMR dataset (“Lira” catch dataset from 1973):


Storage of relevant temporal and spatial search info (default):

Machine-readable temporal search:

  • Dataset time range (as start, end dates) is compared with the search time range (as start, end dates); overlap = “hit”

  • Tend to not worry about temporal patchiness (maybe just add text comment in “completeness” field)

Machine-readable spatial search:

  • Dataset bounding box (as start, end lat & lon) is compared with the search bounding box (as start, end lat & lon); overlap = “hit”

  • Spatial patchiness (or irregular polygon shapes) can be a more serious problem – CMR solution on next slide
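The two overlap tests can be sketched as below. The function names, the (south, north, west, east) tuple layout and the example values are assumptions made for illustration; the box test also assumes that neither box crosses the 180° meridian.

```python
from datetime import date


def time_ranges_overlap(ds_start, ds_end, q_start, q_end):
    """True if the dataset and search date ranges share at least one day."""
    return ds_start <= q_end and q_start <= ds_end


def boxes_overlap(ds, q):
    """True if two (south, north, west, east) boxes intersect.

    Assumes neither box crosses the 180-degree meridian.
    """
    ds_s, ds_n, ds_w, ds_e = ds
    q_s, q_n, q_w, q_e = q
    return ds_s <= q_n and q_s <= ds_n and ds_w <= q_e and q_w <= ds_e


# Hypothetical dataset: one year of data, bounding box spanning the continent.
dataset_box = (-45.0, -10.0, 110.0, 155.0)
inland_search = (-24.0, -23.0, 133.0, 134.0)

print(time_ranges_overlap(date(1973, 1, 1), date(1973, 12, 31),
                          date(1970, 1, 1), date(1973, 6, 30)))
# A small inland search box still "hits" a large box drawn around coastal data.
print(boxes_overlap(dataset_box, inland_search))
```

The second result illustrates the patchiness problem: a single rectangle drawn around scattered or coastal data also covers everything in between.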


Spatial footprints – improved method

CMR has implemented a grid squares-based system for improved spatial “footprint” representation and querying (without requirement for a full GIS back end):

  • Dataset spatial extent – stored as list of squares intersected

  • Search by grid square (or set of squares): in list = “hit”, not in list = “miss”

  • We use 0.5° x 0.5° squares – same resolution as 1:100 000 mapsheet series (approx. 50 x 50 km)

  • Global “c-squares” notation covers marine as well as land areas.
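A minimal sketch of the grid-square idea follows. Note that the cell identifier used here is a made-up (row, column) pair, not the real c-squares notation, and the sample positions are invented; only the “store the squares, test membership” principle is taken from the slide.

```python
import math

CELL = 0.5  # cell size in degrees, as on the slide


def cell_id(lat, lon):
    """Return a hypothetical (row, col) ID for the 0.5-degree cell containing a point.

    This is NOT c-squares notation, just a stand-in identifier.
    """
    return (math.floor(lat / CELL), math.floor(lon / CELL))


def footprint(points):
    """Dataset footprint: the set of cells touched by its sample positions."""
    return {cell_id(lat, lon) for lat, lon in points}


def is_hit(dataset_cells, search_cells):
    """A search hits if any searched cell is in the dataset's stored list."""
    return not dataset_cells.isdisjoint(search_cells)


# Invented sample positions off south-eastern Tasmania.
dataset_cells = footprint([(-43.2, 147.6), (-42.9, 148.1)])

print(is_hit(dataset_cells, {cell_id(-43.1, 147.9)}))   # nearby search cell
print(is_hit(dataset_cells, {cell_id(-23.7, 133.87)}))  # inland search cell
```

Membership testing on a stored set of cell IDs needs no geometry at query time, which is why no GIS back end is required.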


Related functionality on Museum Victoria “Bioinformatics” site (search interface shown):

  • Searcher can use this approach to define a non-rectangular region of interest (green highlighted cells)

    (NB, this uses a different [non global] notation for the cells, however the basic principle is the same)


Result for the relevant “Lira” CMR metadata record...

  • Red squares (as square IDs) are what is actually stored, can then be superimposed on any user-selected base map for display purposes

  • Now will not get “false positives” – e.g. from searching at Alice Springs


Remainder is “standard” metadata (ANZLIC + CMR extensions), e.g...

(etc.)


How do we populate the system (get people to describe their data)?

  • Non-trivial problem

  • Education – value of metadata, responsibility of data custodians to describe their data in designated system/s

  • Prescriptive approach – build into project planning, sign-off, APA’s

  • Facilitation – dedicated personnel assist scientists, knock on doors

  • Making records on researchers’ behalf – resource intensive, also not ideal since person making the metadata does not have the best understanding of the data

  • Incrementally – e.g. as data is migrated into corporate systems, require the metadata to go with it (robust linkage) – NB, will probably always be “data islands” that this approach misses.


How far have we got...?

  • Currently there are some 2,100 records in the MarLIN system

(etc.)


How far have we got...? – cont’d

  • 90-95% of “Data Centre” holdings described – after 8 yr process! (<1000 records, mostly ships’ data, by voyage and data type)

  • a few “data islands” have made concerted attempts to describe their data (e.g. 10-20+ records each)

  • some major data acquisition exercises have generated 50-100+ records, mostly for third party data (generally not visible on extranet) – e.g. where metadata is a specified project deliverable along with the data (good!)

  • remainder is pretty patchy (maybe 10% compliance) – hope to kickstart with project-based “skeleton records”, also more rigid directives / follow up from Divisional management.


Project data template (example):

(etc.)


What information model to use?

Ideal world (probably unattainable):

(Diagram: the metadata system linked to a library pubs. list, projects database, persons database, item-level catalogues, ancillary information and stored data.)

... all information would be entered / maintained in one place only; updates would propagate automatically through the system; all resources would be electronic and seamlessly accessible


Best we can do for now...

(Diagram: the metadata system’s main “datasets” table, with links to:)

  • MarLIN “projects” table

  • MarLIN “persons” table

  • MarLIN “references” table? (or text descriptions)

  • MarLIN “doc” links (URLs) in table (also text descriptions)

  • MarLIN “data” links (URLs) in table (also text descriptions)

  • MarLIN “doc” + “graphic” links (URLs) in table (also text descriptions)

  • plus some other tables (not shown) for voyages, organisations, keywords...

The links point to item-level catalogues, stored data and ancillary information (each digital + non-digital).


Functionality / Processes to be supported (... list probably incomplete!)

  • User interfaces – create, edit, search metadata records

  • Administrator functions – user identities and privileges, “super-user”-level record modification, deletion, list maintenance

  • Moderator function – approve / edit content to be published

  • Security / authentication – who can access “internal” records (e.g. by specified IP domains or other mechanism)

  • Access logging – including what search terms used, how many “hits”, etc. (plus applications to review user log and access stats)

  • Application maintenance, tech. support, user training

  • Automated connections to remote systems, plus on-demand import / export features (e.g. via XML)

  • Ongoing development / modification to functionality or database structure – process, resources...
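As one example from the list above, restricting “internal” records by client address could look roughly like this. The network ranges, record fields and function names are all hypothetical; a real system would use its own configuration and probably combine this with user logins.

```python
import ipaddress

# Hypothetical "internal" address ranges; a real deployment would configure its own.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]

# Invented records; "internal" marks those hidden from external users.
RECORDS = [
    {"title": "Public voyage summary", "internal": False},
    {"title": "Third-party catch data", "internal": True},
]


def is_internal_client(client_ip):
    """True if the client's address falls inside any configured internal network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in INTERNAL_NETWORKS)


def visible_records(client_ip, records=RECORDS):
    """Internal clients see everything; external clients see public records only."""
    if is_internal_client(client_ip):
        return list(records)
    return [r for r in records if not r["internal"]]


print([r["title"] for r in visible_records("10.1.2.3")])
print([r["title"] for r in visible_records("203.0.113.9")])
```

Filtering at query time means the same search interface can serve both audiences without maintaining two copies of the catalogue.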


Metadata integration / remote calls (examples)

  • Project work space (HTML page)

  • Custom MarLIN search via web call (from different database)

  • Re-use of MarLIN supporting tables content (in other contexts)


Concluding remarks

  • Simple in theory, not so simple in practice, to design and implement a good system (especially in a research, rather than basic “products set” environment) – no “off the shelf” solution (or even key components) available

  • Designing a system gives the opportunity to incorporate new / improved concepts (scope for innovation, design challenges)

  • Should be benefits in sharing code, approaches, experiences across Divisions or other groups

  • Populating the system is as important as building it!

  • Connection to external gateways is not too hard, once system plus some publishable content exists

  • CMR is a lonely trailblazer within CSIRO ... still considered an example of “best practice” (a bit of a worry, seeing how far we still have to go)...


  • Thanks!

  • To visit MarLIN:

    go to www.marine.csiro.au

    >> Data Centre (www.marine.csiro.au/datacentre/)

    >> MarLIN (www.marine.csiro.au/marlin/)

  • MarLIN “Edit” interface – currently requires access privileges to visit (will look at online in tomorrow’s session).

