
Identifying and accessing imaging Flowcytobot data and imagery

Information architecture and prototype. Joe Futrelle, Heidi Sosik, Woods Hole Oceanographic Institution, August 2011.



  1. Identifying and accessing imaging Flowcytobot data and imagery
     Information architecture and prototype
     Joe Futrelle, Heidi Sosik
     Woods Hole Oceanographic Institution
     August 2011

  2. What / why / so?
     • Goal: improve access to IFCB data
         • Consistent, unique identifiers for all important datasets
         • Standard representations of measurements and other metadata
         • Ability to cite and link to IFCB data in a variety of contexts
     • What problem does this solve?
         • Current data access requires access to IFCB laboratory computers, a Matlab license, ability to read Matlab code, and advice from Heidi and/or Rob
     • Who cares?
         • Users of IFCB data: get improved ability to use existing tools with IFCB data, and develop new tools using a variety of technologies
         • Heidi: gets new capabilities “for free” (more on that later)

  3. Some informatics terminology
     • An identifier is a short piece of text that identifies a digital object (e.g., “myfile.txt”)
     • An identifier can be resolved to the digital object by software that uses the identifier to access the object (e.g., printing the file called “myfile.txt”)
     • The scope (i.e., context) of an identifier is the set of conditions under which it can be resolved to one and only one object (e.g., a folder containing “myfile.txt”)
     • The global scope is the international community. Any other scope is local.
     • An identifier scheme is a format that specifies the syntax of identifiers (e.g., name + “.” + extension)

  4. Some terminology (cont.)
     • A namespace is an identifier for the scope of another identifier (e.g., “edu” is a namespace that makes “whoi.edu” distinct from “whoi.org”)
     • Namespaces are generally appended to names as a prefix or suffix
     • Namespaces are generally used to transform local identifiers into global identifiers (e.g., via prefixing)

  5. Global vs. Local identifier schemes
     Local identifier schemes
     • e.g., pathnames
     • Non-standard
     • Dependent on current software used
     • Generally undocumented
     • “Break” when data changes
     • May “collide”
         • e.g., my “data.csv” is a different file than your “data.csv”
     Global identifier schemes
     • e.g., URL’s
     • Standard
     • Specified by standards bodies
     • Exhaustively documented
     • Data-independent
     • Do not “collide”
         • e.g., I cannot replace a web page at a URL you control

  6. IFCB data acquisition flow
     • Seawater is sampled and forced through the flow channel
     • Photomultiplier triggers many frame grabs (“ROI’s” or “targets”)
     • Data and imagery are written to a set of files
     • At the end of a sample, the files are closed and new ones are opened

  7. Imaging FlowCytobot existing ID
     Example: IFCB1_2011_234_052230
     • Identifies a bin of observations, generally over an entire seawater sample
     • Contains the instrument number and UTC date/time
     • Used as a filename
     • Local identifier
     • Non-standard scheme
     • Non-standard resolution mechanism
     • Scope = all existing IFCB deployments and software (not so far off from global scope)
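
The slide says the ID carries the instrument number and a UTC date/time but does not spell out the field layout. A minimal sketch of decoding it, assuming the layout is IFCB<instrument>_<year>_<day of year>_<HHMMSS> (inferred from the example; `parse_bin_id` is a hypothetical helper, not part of the IFCB software):

```python
import re
from datetime import datetime, timezone

# Assumed layout: IFCB<instrument>_<year>_<day of year>_<HHMMSS>
BIN_ID = re.compile(r"^IFCB(\d+)_(\d{4})_(\d{3})_(\d{6})$")

def parse_bin_id(bin_id):
    """Return (instrument number, UTC datetime) for a bin identifier."""
    match = BIN_ID.match(bin_id)
    if match is None:
        raise ValueError(f"not a bin identifier: {bin_id!r}")
    instrument, year, day_of_year, hhmmss = match.groups()
    # %j parses the day of the year (001-366)
    stamp = datetime.strptime(f"{year}_{day_of_year}_{hhmmss}", "%Y_%j_%H%M%S")
    return int(instrument), stamp.replace(tzinfo=timezone.utc)

instrument, stamp = parse_bin_id("IFCB1_2011_234_052230")
print(instrument, stamp.isoformat())  # 1 2011-08-22T05:22:30+00:00
```

Under that assumption the example ID reads as instrument 1, 22 August 2011, 05:22:30 UTC.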

  8. Resolving IFCB identifiers to files
     \\cheese.whoi.edu\J_IFCB\ifcb_data_MVCO_jun06\IFCB1_2011_234\IFCB1_2011_234_052230.roi

  9. Is this pathname a global ID? No
     \\cheese.whoi.edu\J_IFCB\ifcb_data_MVCO_jun06\IFCB1_2011_234\IFCB1_2011_234_052230.roi
     • It’s global, but it’s not a global identifier of an IFCB dataset; rather, it identifies a location on a file server
     • If the files are moved to a different server, share, or directory, the pathnames will change but the dataset will not
     • The .roi file represents the same dataset as the .adc and .hdr files, so those pathnames are different but do not identify a different dataset (uniqueness depends on exact matches, not partial matches)
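
The relationship between one dataset ID and its several pathnames can be sketched as follows. This is an illustration only: `DATA_ROOT`, `day_of`, and `pathnames` are hypothetical names, the root is copied from the example pathname, and `D:\archive` is an invented alternative location.

```python
# Hypothetical configuration: where the raw files currently live.
DATA_ROOT = r"\\cheese.whoi.edu\J_IFCB\ifcb_data_MVCO_jun06"
EXTENSIONS = (".roi", ".adc", ".hdr")

def day_of(bin_id):
    """IFCB1_2011_234_052230 -> IFCB1_2011_234 (the per-day directory name)."""
    return bin_id.rsplit("_", 1)[0]

def pathnames(bin_id, data_root=DATA_ROOT):
    """Map one dataset identifier to the pathnames of its files."""
    directory = "\\".join([data_root, day_of(bin_id)])
    return {ext: "\\".join([directory, bin_id + ext]) for ext in EXTENSIONS}

# One ID, three pathnames
paths = pathnames("IFCB1_2011_234_052230")
print(paths[".roi"])

# Moving the files changes every pathname but leaves the ID untouched
moved = pathnames("IFCB1_2011_234_052230", data_root=r"D:\archive")
print(moved[".roi"])
```

The ID is the only stable part: three pathnames map to it, and all three change when the data root changes.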

  10. Proposed global ID scheme
      http://ifcb-data.whoi.edu/IFCB1_2011_234_052230
      • Standard scheme (URL)
      • Identifies a single instrument, single time bin
      • Single ID per dataset (i.e., no extension)
      • No “day’s worth of data” directory (redundant)
      • Preserves existing local ID scheme (no need to generate new ID’s)
      • Works unmodified as a web page URL, XML tag name, or RDF resource

  11. ID variant: a single ROI
      http://ifcb-data.whoi.edu/IFCB1_2011_234_052230_00031
      • Identifies a single observation (image + scattering data)
      • Observations are numbered sequentially in a time bin

  12. ID variant: a day’s worth of data
      http://ifcb-data.whoi.edu/IFCB1_2011_234
      • Prefix of existing identifiers
      • Acts as a namespace for each bin in that day
      • Note that the instrument number makes this per-instrument

  13. ID variant: an instrument’s data
      http://ifcb-data.whoi.edu/IFCB1
      • All data from a given instrument
      • Metadata about the instrument

  14. ID variant: a formatted representation
      http://ifcb-data.whoi.edu/IFCB1_2011_234_052230.xml
      • Extension added to global ID
      • Returns an XML representation of a bin’s worth of data
      • Includes metadata and links to individual ROI’s contained in that bin
      • Other formats available based on extension
          • HTML
          • RDF/XML (Resource Description Framework)
          • JSON (Javascript Object Notation)
          • JPEG / TIFF / PNG / etc. for ROI images
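
The ID variants (instrument, day, bin, single ROI, each with an optional format extension) nest as prefixes of one another, so a client could tell them apart with one pattern. A sketch under assumptions: the field widths are inferred from the example URLs, and `classify` is a hypothetical helper, not part of the prototype.

```python
import re

BASE_URL = "http://ifcb-data.whoi.edu/"

# Assumed field widths, inferred from the example identifiers on the slides.
PATTERN = re.compile(
    r"^(?P<instrument>IFCB\d+)"
    r"(?:_(?P<year>\d{4})_(?P<day>\d{3})"
    r"(?:_(?P<time>\d{6})"
    r"(?:_(?P<roi>\d{5}))?)?)?"
    r"(?:\.(?P<format>[a-z]+))?$"
)

def classify(url):
    """Return (kind, fields) for a global ID, where kind names its granularity."""
    if not url.startswith(BASE_URL):
        raise ValueError("outside the ifcb-data namespace")
    match = PATTERN.match(url[len(BASE_URL):])
    if match is None:
        raise ValueError("unrecognized identifier")
    fields = {k: v for k, v in match.groupdict().items() if v is not None}
    if "roi" in fields:
        kind = "roi"
    elif "time" in fields:
        kind = "bin"
    elif "day" in fields:
        kind = "day"
    else:
        kind = "instrument"
    return kind, fields

for example in ("IFCB1", "IFCB1_2011_234", "IFCB1_2011_234_052230",
                "IFCB1_2011_234_052230_00031", "IFCB1_2011_234_052230.xml"):
    print(example, "->", classify(BASE_URL + example)[0])
```

Because each coarser ID is a prefix of the finer ones, the day and instrument URLs act as namespaces for the bins and ROIs beneath them.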

  15. Resolution of IFCB global ID’s
      [Diagram: a request carrying a GID reaches the endpoint on the Web Server (Apache) @ http://ifcb-data.whoi.edu; labeled components are mod_rewrite, resolve.py?id=…, convert.py (path, format), memcached, and samba to the Windows file server @ \\cheese.whoi.edu; the response is the requested representation (XML, JSON, RDF, jpg, tiff).]
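
The request path through this diagram can be sketched roughly as below. None of this is the actual resolve.py or convert.py: `split_request`, `resolve`, and `plan` are hypothetical, `DATA_ROOT` is copied from the example pathname on the earlier slide, `lru_cache` is only an in-process stand-in for memcached, and the sketch handles bin-level IDs only.

```python
from functools import lru_cache

# Assumption: the share and folder from the example pathname; deployments may differ.
DATA_ROOT = r"\\cheese.whoi.edu\J_IFCB\ifcb_data_MVCO_jun06"

def split_request(request_path):
    """'/IFCB1_2011_234_052230.xml' -> ('IFCB1_2011_234_052230', 'xml')."""
    name = request_path.lstrip("/")
    gid, dot, fmt = name.partition(".")
    return gid, (fmt if dot else None)

@lru_cache(maxsize=1024)  # stand-in for the memcached box in the diagram
def resolve(gid):
    """Map a bin-level ID to the extension-less stem of its files."""
    day = gid.rsplit("_", 1)[0]
    return "\\".join([DATA_ROOT, day, gid])

def plan(request_path):
    """Return the (path, format) pair that would be handed to format conversion."""
    gid, fmt = split_request(request_path)
    return resolve(gid), fmt

print(plan("/IFCB1_2011_234_052230.xml"))
```

Keeping resolution (ID to location) separate from conversion (location plus format to representation) means a change of file server touches only `resolve`, and a new output format touches only the converter.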

  16. IFCB global ID resolution in action

  17. Interoperability: RSS feed of live data

  18. Interoperability: Android / iPhone

  19. Approach: leave data alone
      • Reuse as much of existing local ID scheme as possible
      • “Wrap” with global ID resolution backed by format conversion service
      • Do not require data to be reformatted and put in a repository for management
      • If data moves, point the services at the new location
      • If data format changes, tweak the format conversion service and reuse / extend provided representations
      • Clients using the ID resolution and format conversion service (e.g., manual annotation tool TBD, image processing workflow TBD) will be unaffected

  20. Roles of scientist vs. informaticist
      Joe the informaticist
      • Ask questions
      • Co-develop documentation of data formats
      • Develop ID scheme
      • Develop resolution service
      • Develop representations and format service
      Heidi / Rob the scientists
      • Answer questions
      • Co-develop documentation of data formats
      • Provide access to data
      • Share existing data handling code
      • Review ID scheme / representations

  21. What did we just do?
      • Created long-term, global identifiers for IFCB data
          • Citable
          • “Actionable” (Kunze) = live URL’s
          • Can continue to be used in metadata and digital preservation packages even if they are no longer live URL’s
      • Prototyped services providing access to IFCB data in standard formats (XML, JSON, RDF)
          • Supports building web applications using HTML5
          • Supports web service data access workflow modules
          • Provides a way to align to standard vocabularies and ontologies
      • And what is left to do (… on next slides)

  22. Additional issues to address
      • Timestamps only recorded in filenames (!)
          • Syringes with many ROI’s are split across multiple bins, and timestamp of observations in second bin must come from the filename of the first bin
      • No way to identify time series that use more than one instrument
          • MVCO time series involves IFCB1 being occasionally swapped with IFCB5
      • No way to identify deployments generally
          • IFCB1 could be moved to a different location to sample plankton as part of a non-MVCO study; there is no way in this scheme to figure out which data goes with which study

  23. Next steps
      • Clients!
          • Manual annotation prototype (using HTML5 / AJAX and JSON format conversion)
          • MATLAB (retrofit existing code to use global ID’s)
          • Kepler (already supports fetching data from web services)
      • Improving next-generation IFCB’s data acquisition
          • Modify on-instrument code
          • Include timestamp in data (not just filenames)
          • Use ISO 8601 standard time formats
          • Generate column headers on CSV data
          • Record units of measure where appropriate
          • Align terms in IFCB data (e.g., “temperature”) with standard terms where appropriate
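
What the data-acquisition improvements might look like in an output file can be sketched as follows. The column names, the unit suffix convention, and the row values are all invented placeholders for illustration, not real IFCB fields or measurements.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical column names; the unit of measure is carried in the header itself.
COLUMNS = ["timestamp_utc", "roi_number", "temperature_degC"]

def write_rows(rows, out):
    """Write a header line, then one row per observation with an ISO 8601 timestamp."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    for when, roi_number, temperature in rows:
        writer.writerow([when.isoformat(), roi_number, temperature])

buffer = io.StringIO()
# Placeholder values, not real measurements
write_rows([(datetime(2011, 8, 22, 5, 22, 30, tzinfo=timezone.utc), 31, 18.5)], buffer)
print(buffer.getvalue())
```

With the timestamp inside each row, an observation no longer depends on the filename of the bin it happened to land in.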
