
Data Management

Learn how to effectively manage multiple versions of data files, avoid confusion and maintain data integrity. Explore techniques such as version control, metadata management, and sensible archiving. Improve reproducibility and findability of your data.


Presentation Transcript


  1. Data Management

  2. How do we manage multiple versions of data files?

  3. data/samples.mat. So far, so good…

  4. data/samples.mat (new version). Now what?

  5. data/samples.old.mat, data/samples.mat. I guess this is alright?

  6. data/samples.old.mat, data/samples.old2.mat, data/samples.mat. Okay…

  7. One problem: generate_results.sh

     #!/usr/bin/env bash
     cat data/samples.mat \
       | tr "^M" "\n" \
       | sort -k 2,2nr \
       | ... > results/samples.txt

     Which version of samples.mat was used to generate our results?

  8. In general, don’t move or rename files unless absolutely necessary. • makes it harder to reproduce results • makes it harder to find the data later • breaks scripts • breaks symbolic links (ln -s)

  9. But adding new filenames without structure isn’t necessarily any better…

  10. data/ now holds samples.mat, samplesV2.mat, samplesFinal.mat, samplesFinalV2.mat, and samplesUSE_THIS_ONE.mat. Which one is the most recent?

  11. You could… look at the data and guess, or check the last-modified date. And don’t ever move or copy the data!
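
  A quick way to check last-modified dates from the shell (a minimal sketch):

     ls -lt data/    # -t sorts by modification time, newest first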

  12. sounds like a job for…

  13. sounds like a job for… version control!

  14. sounds like a job for… version control! …or is it?

  15. Why shouldn’t we just use Subversion? Subversion is good for: • simultaneous modification • archive • merge • branch • diff

  16. Why shouldn’t we just use Subversion? Subversion is good for: • simultaneous modification • archive • merge • branch • diff. Of these, what makes sense for data?

  18. What we need… a clear, stable, accessible way to archive data

  19. Performance reasons for not using version control on data • We can no longer remove or compress old and obsolete data files. • Checking out the repository means copying all your data files. • Many version control tools store text-based differences between versions of a file. For binary files, they essentially store a new copy of the file for each version.

  20. Common approach: archive old data into subdirectories:

     data/samples.mat
     data/old/samples.mat
     data/old2/samples.mat

     • maintains file names • keeps all data files in the same “version” together • still moves files, so scripts and history now refer to different data

  21. Another common approach:

     data/samples.mat
     data/final/samples.mat
     data/final2/samples.mat
     data/really-final/samples.mat

  22. Two big problems still: • What order do they go in? The naming can easily become ambiguous. • It is hard to distinguish “top-level” data directories from archived or revised data directories: is data/saved/ an old version of all the data, or the current version of “saved” data? Both are easily solved with a little thinking ahead.

  23. From the start:

     data/2010-09-28/samples.mat
     data/2011-03-31/samples.mat

     “yyyy-mm-dd” is a nice date format since: • it is easy to read (easier than yyyymmdd) • the alphabetic ordering is also chronological
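
  For example, starting a new dated data directory from the shell (a minimal sketch; instrument_output.mat is a hypothetical source file):

     today=$(date +%F)                    # %F prints yyyy-mm-dd
     mkdir -p "data/$today"
     cp instrument_output.mat "data/$today/samples.mat"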

  24. We’re still missing something big • We give a collaborator access to the data folder • A new student joins the project • It’s three years later and we’ve forgotten the details of the project

  25. We’re still missing something big • We give a collaborator access to the data folder • A new student joins the project • It’s three years later and we’ve forgotten the details of the project We need context. We need metadata.

  26. Metadata • who is the data from? • when was it generated? • what were the experiment conditions? Where should it live? A header inside the file (see the Provenance essay)? Doesn’t work well with binary files. A separate metadata file for each data file? Can quickly get out of sync or out of hand.

  27. How I view metadata… A list of attributes that I might want to search by later

  28. A README per dated directory:

     data/2010-09-28/samples.mat
     data/2010-09-28/README
     data/2011-03-31/samples.mat
     data/2011-03-31/README

     data/2010-09-28/README:
       batch1, 25°C, 5 hrs
       Initial QC: samples 1328, 1422 poor quality

     data/2011-03-31/README:
       batch1, 28°C, 10 hrs
       Initial QC: okay, but 1 min power failure @ t=2.5

  29. Additional READMEs can be necessary when you have many data files per directory:

     data/README
     data/2010-09-28/ (many data files, plus its own README)
     data/2011-03-31/ (many data files, plus its own README)

  30. The same layout, with starred items under version control:

     data/README                                    *
     data/2010-09-28/ (many data files) + README    *
     data/2011-03-31/ (many data files) + README    *

     * Under version control

  31. The old data files (e.g. everything in data/2010-09-28/) can then be deleted for space, while the version-controlled READMEs survive to describe what they were.

  32. Each README should describe exactly how to reacquire its data, if possible.
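
  A README in that spirit might look like this (a sketch only; the URL is hypothetical):

     data/2010-09-28/README:
       batch1, 25°C, 5 hrs
       Initial QC: samples 1328, 1422 poor quality
       Reacquire: wget http://example.org/archive/batch1-2010-09-28.mat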

  33. The big picture:

     project/README            *
     project/data/README       *
     project/data/yyyy-mm-dd/  (many data files)

     * Under version control

  34. The big picture, with code added:

     project/README            *
     project/data/README       *
     project/data/yyyy-mm-dd/  (many data files)
     project/src/              (scripts, source code, etc.)  *

     * Under version control

  35. The big picture, with results added:

     project/README            *
     project/data/README       *
     project/data/yyyy-mm-dd/  (many data files)
     project/src/              (scripts, source code, etc.)  *
     project/experiments/yyyy-mm-dd/  (results from running scripts on data)

     * Under version control
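
  With Subversion, which the slides mention, “version the metadata and code, not the data” amounts to only ever adding the starred items (a minimal sketch, run inside a working copy):

     svn add project/README project/src
     svn add --parents project/data/*/README   # --parents also adds the dated dirs
     # bulk data files are simply never svn-added, so they never enter the repository
     svn commit -m "track READMEs and source, not data files"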

  36. Why all the fuss? • Sensible archiving: • make it clear and obvious for someone else • “someone else” will likely be you, several months or years from now

  37. Why all the fuss? Example:

     targets/raw/filenameExactlyAsDownloaded.tkq
     targets/targets.gff

     How exactly did I create targets.gff? That would be very difficult to answer without additional information.

  38. Why all the fuss? The directory structure:

     targets/raw/filenameExactlyAsDownloaded.tkq
     targets/raw/otherNecessaryFile.dat
     targets/raw/process_targets.sh
     targets/raw/README
     targets/targets.gff

     README records when and where the files were downloaded from; process_targets.sh records the exact commands used to generate targets.gff from the downloaded files.

  39. Why all the fuss? That structure made it both 1) possible and 2) easy to know exactly where the data came from and how it was processed.
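
  A process_targets.sh in this spirit might look like the following (a sketch; the deck does not show its contents, and convert_to_gff is a stand-in for whatever tool actually did the conversion):

     #!/usr/bin/env bash
     set -euo pipefail
     # Regenerate targets.gff from the raw downloads (sources listed in README).
     convert_to_gff --aux otherNecessaryFile.dat \
         filenameExactlyAsDownloaded.tkq > ../targets.gff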

  40. The right layout for the right project • No hierarchy is perfect for all projects. • Thinking hard at the beginning can save pain and suffering later on. For example…

  41. The right layout for the right project What we’ve talked about so far is great for projects with a strong separation between data and analysis:

     analysis0001/ analysis0002/ analysis0003/ analysis0004/ analysis0005/
     experiment0001/ experiment0002/ experiment0003/ someOtherData/

  42. The right layout for the right project But it doesn’t work as well for pipelines: parse → align → filter → cluster

  43. The right layout for the right project One representation I have seen is:

     data/2010-09-28/{parse,align,filter,cluster}/
     data/2011-03-31/{parse,align,filter,cluster}/

  44. The right layout for the right project But: it doesn’t make the pipeline ordering clear, and it doesn’t scale well with alternatives:

     parse/
       align01/
         filter01/
           cluster/
         filter02/
           cluster/
       align02/
         filter01/
           cluster/

  45. The right layout for the right project A better approach captures the dependency structure of the pipeline, with each stage directory holding that stage’s output:

     2010-09-28/samples.dat
     2010-09-28/parse/samples.dat
     2010-09-28/parse/align01/samples.dat
     2010-09-28/parse/align01/filter01/samples.dat
     2010-09-28/parse/align02/samples.dat
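
  Under such a layout, trying an alternative alignment is just a sibling directory (a sketch of one way to do it; align_tool is a hypothetical command):

     cd 2010-09-28/parse                            # directory holding the parse output
     mkdir align02
     align_tool samples.dat > align02/samples.dat   # output nested under its input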

  46. Summary • Think hard at the beginning • Version control metadata, not data files • An intelligent structure not only makes your data easier to archive, track, and manage; it also reduces the chance that the paths in the pipeline get crossed and that the data coming out the end isn’t what you think it is.

  47. created by Orion Buske • May 2011
