Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos

Moorea Biocode Project, supported by the Gordon and Betty Moore Foundation. Presentation by John Deck, University of California at Berkeley.


Presentation Transcript


  1. Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos. Moorea Biocode Project, supported by the Gordon and Betty Moore Foundation. Presentation by John Deck, University of California at Berkeley.

  2. Outline • Part 1: Background on Moorea Biocode Project • Part 2: bioValidator: field based data validation • Part 3: A case study in handling taxonomic names in a field based client application

  3. Part 1: Background on Moorea Biocode Project

  4. Moorea Biocode: The Collecting

  5. Moorea Biocode: Sorting Specimens

  6. Moorea Biocode: Tissue Sampling

  7. Moorea Biocode LIMS: Binning, Trimming, & Assembly of Sequence Data

  8. Challenges Facing the Moorea Biocode Project IT Team • Multiple taxa and a different team for each group, each with its own culture and workflow. • Everyone is in a hurry, and non-technical biologists are entering data. • Specimens (and metadata) are ultimately owned by multiple host institutions. • Multiple labs process genetic data (with different equipment, processes, and workflows). • Final taxonomic determination is made by the lab and/or host institution (often much later than the collecting event). • No internet or bad internet in the field. • *Need to associate photos and standardized higher taxonomy in the field (before accession into any database).

  9. Field Based System Requirements • Spreadsheets for data entry • Extensible validation rules (each project or sub-project has its own requirements) • Match specimen data to photos • Tag photos and load to an external system (e.g. Flickr) • Query multiple taxonomic authorities (each TaxonTeam selects its own authority) • Update the online database periodically
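The "extensible validation rules" requirement can be pictured with a small sketch. This is not bioValidator's actual API: the interface, class names, and column names (`Specimen_Num_Collector`, `DecimalLatitude`) are all hypothetical, chosen only to show how each project could plug in its own rules against spreadsheet rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is NOT bioValidator's real API.
interface ValidationRule {
    // Returns an error message, or null when the row passes this rule.
    String check(Map<String, String> row);
}

// Rule: the named column must be present and non-blank.
final class RequiredField implements ValidationRule {
    private final String column;

    RequiredField(String column) { this.column = column; }

    public String check(Map<String, String> row) {
        String v = row.get(column);
        return (v == null || v.trim().isEmpty()) ? column + " is required" : null;
    }
}

// Rule: if the named column has a value, it must be a number within [min, max].
final class DecimalRange implements ValidationRule {
    private final String column;
    private final double min;
    private final double max;

    DecimalRange(String column, double min, double max) {
        this.column = column;
        this.min = min;
        this.max = max;
    }

    public String check(Map<String, String> row) {
        String v = row.get(column);
        if (v == null || v.trim().isEmpty()) {
            return null; // a missing value is RequiredField's concern, not this rule's
        }
        try {
            double d = Double.parseDouble(v.trim());
            // Written this way so that NaN also fails the range check.
            return (d >= min && d <= max) ? null
                    : column + " must be between " + min + " and " + max;
        } catch (NumberFormatException e) {
            return column + " is not a number: " + v;
        }
    }
}

// A project assembles its own rule list; every rule runs against every row.
final class RowValidator {
    private final List<ValidationRule> rules = new ArrayList<ValidationRule>();

    RowValidator add(ValidationRule rule) {
        rules.add(rule);
        return this;
    }

    List<String> validate(Map<String, String> row) {
        List<String> errors = new ArrayList<String>();
        for (ValidationRule rule : rules) {
            String error = rule.check(row);
            if (error != null) {
                errors.add(error);
            }
        }
        return errors;
    }
}
```

A sub-project with different requirements would build a different `RowValidator`, for example `new RowValidator().add(new RequiredField("Specimen_Num_Collector")).add(new DecimalRange("DecimalLatitude", -90, 90))`, without touching the existing rule classes.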

  10. Part 2: bioValidator: a Field based Data Validation Tool • Validate data using extensible validation rules • Search multiple taxonomies built in Lucene • Specimen to photo matching • Upload to Flickr using machine tags • No internet required • Java based
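As an illustration of the machine-tag bullet above: Flickr machine tags have the form `namespace:predicate=value`. The sketch below builds and parses such tags so that a photo can be linked back to a specimen record. The `biocode` namespace, the `specimen` predicate, and the class itself are assumptions for illustration, not the tags bioValidator actually writes, and quoting of values that contain spaces for upload is deliberately left out.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of bioValidator or of any Flickr client library.
final class MachineTag {
    // Namespace and predicate: a letter followed by letters, digits, or underscores.
    private static final Pattern NAME = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    private MachineTag() { }

    // Builds "namespace:predicate=value" after validating the two name parts.
    static String format(String namespace, String predicate, String value) {
        if (namespace == null || !NAME.matcher(namespace).matches()) {
            throw new IllegalArgumentException("bad namespace: " + namespace);
        }
        if (predicate == null || !NAME.matcher(predicate).matches()) {
            throw new IllegalArgumentException("bad predicate: " + predicate);
        }
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("value must not be empty");
        }
        return namespace + ":" + predicate + "=" + value.trim();
    }

    // Splits a tag into {namespace, predicate, value}; returns null if it is not a machine tag.
    static String[] parse(String tag) {
        if (tag == null) {
            return null;
        }
        int colon = tag.indexOf(':');
        int equals = tag.indexOf('=', colon + 1);
        if (colon <= 0 || equals <= colon + 1 || equals == tag.length() - 1) {
            return null;
        }
        String namespace = tag.substring(0, colon);
        String predicate = tag.substring(colon + 1, equals);
        if (!NAME.matcher(namespace).matches() || !NAME.matcher(predicate).matches()) {
            return null;
        }
        return new String[] { namespace, predicate, tag.substring(equals + 1) };
    }
}
```

With these assumed names, `MachineTag.format("biocode", "specimen", "MBIO1234")` yields `biocode:specimen=MBIO1234`, and `MachineTag.parse` recovers the specimen identifier from a tag read back from a photo.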

  11. Part 3: A case study in handling taxonomic names in a field based client application

  12. Why Lucene? • Java-based, cross platform • Indexes can be delivered to client apps (can run offline) • Ability to build a standardized interface to multiple taxonomies.

  13. Higher Taxonomic Name Handling in the Field • Initial spreadsheet: just assign the lowest taxon name and the lowest taxon level. • bioValidator: suggests a higher taxonomy based on the name and level provided. • Revised spreadsheet: updated with the suggested higher taxonomic hierarchy.
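The three steps above can be sketched without Lucene by swapping in a plain in-memory list of lineages for the index. Everything here is hypothetical (the class, the fixed rank list, the method names); it only illustrates the lookup: given the lowest name and its level, return the matching higher hierarchy, or several candidates when the name is ambiguous.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for a taxonomy index; not bioValidator's implementation.
final class TaxonomySuggester {
    // Assumed rank list, highest rank first.
    static final List<String> RANKS =
            Arrays.asList("kingdom", "phylum", "class", "order", "family", "genus");

    private final List<Map<String, String>> lineages = new ArrayList<>();

    // Registers one lineage; names are given in RANKS order and may stop early.
    void addLineage(String... names) {
        Map<String, String> lineage = new LinkedHashMap<>();
        for (int i = 0; i < names.length && i < RANKS.size(); i++) {
            lineage.put(RANKS.get(i), names[i]);
        }
        lineages.add(lineage);
    }

    // Returns, for each lineage whose `level` equals `name` (case-insensitive),
    // the ranks from kingdom down to and including `level`.
    List<Map<String, String>> suggest(String level, String name) {
        List<Map<String, String>> suggestions = new ArrayList<>();
        int levelIndex = RANKS.indexOf(level.trim().toLowerCase(Locale.ROOT));
        if (levelIndex < 0) {
            return suggestions; // unknown rank: nothing to suggest
        }
        for (Map<String, String> lineage : lineages) {
            String candidate = lineage.get(RANKS.get(levelIndex));
            if (candidate == null || !candidate.equalsIgnoreCase(name.trim())) {
                continue;
            }
            Map<String, String> higher = new LinkedHashMap<>();
            for (int i = 0; i <= levelIndex; i++) {
                String rank = RANKS.get(i);
                if (lineage.containsKey(rank)) {
                    higher.put(rank, lineage.get(rank));
                }
            }
            if (!suggestions.contains(higher)) {
                suggestions.add(higher); // several lineages can share the same higher ranks
            }
        }
        return suggestions;
    }
}
```

After `addLineage("Animalia", "Mollusca", "Gastropoda", "Neogastropoda", "Conidae", "Conus")`, a spreadsheet row that only says family "Conidae" gets back kingdom through family, which is what the revised spreadsheet would then carry. Returning a list rather than a single map leaves room for the user to choose when a name matches more than one lineage.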

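  14. Lucene Indexer Implementation for Taxonomy. Example of a Lucene index built on ITIS (simplified; field storage and indexing options are omitted):

          String sql = "SELECT tsn FROM taxonomic_units";
          … obtain resultset …
          while (resultset.next()) {
              // One Lucene document per ITIS taxonomic unit
              Document doc = new Document();
              // itisRanks is a class that abstracts the ITIS schema
              itisRanks ir = new itisRanks(resultset.getString("tsn"));
              // One field per rank: field name = rank, field value = taxon name
              while (ir.next()) {
                  doc.add(new Field(ir.rank, ir.name));
              }
              // Add each document inside the loop, while doc is still in scope
              indexWriter.addDocument(doc);
          }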

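  15. Lucene Search Implementation for Taxonomy. Example of (a simplified!) Lucene search:

          public static Hashtable<String, String> searchIndex(String taxonLevel, String taxonName) {
              Hashtable<String, String> map = new Hashtable<String, String>();
              // Construct query (simplified: a real QueryParser also needs an analyzer and a parse() call)
              Query query = new QueryParser(taxonLevel, taxonName);
              // Possible multiple matches (simplified: index location and sort arguments omitted)
              IndexSearcher searcher = new IndexSearcher();
              TopFieldDocs hits = searcher.search(query);
              // Loop through each matching taxonomic unit
              for (int taxonUnit = 0; taxonUnit < hits.scoreDocs.length; taxonUnit++) {
                  Document doc = searcher.doc(hits.scoreDocs[taxonUnit].doc);
                  // Loop over each rank to assign it to the map
                  for (int rank = 0; rank < taxonLevels.getNumLevels(); rank++) {
                      String level = taxonLevels.getLevel(rank);
                      String value = doc.get(level);
                      // Populate a simple table with taxon ranks & values (Hashtable rejects nulls)
                      if (value != null) {
                          map.put(level, value);
                      }
                  }
              }
              return map;
          }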

  16. Further Work • Standardization in validation protocols (expand on CRIA work). As we push the envelope in field-based data collection, this will become more of an issue. • Network of Lucene indexes for taxonomies? • GUID implementation in spreadsheets? • How to track and update data as it changes in dependent systems (LIMS, GenBank, BOLD, CalPhotos). See BiSciCol Grant (NSF)

  17. More Information • John Deck (jdeck@berkeley.edu) • Moorea Biocode Project • http://mooreabiocode.org/ • bioValidator • http://biovalidator.sourceforge.net/ • bioTaxonomy (Lucene index/search) • http://biotaxonomy.sourceforge.net/