
ROVER: Hardening Data Delivery by the Internet

Collecting terabytes of data from FDSN data centers is possible but challenging. ROVER is a command-line client that runs long-term, verifies that all requested data have been retrieved, and builds a data index for easy integration into workflows.


Presentation Transcript


  1. ROVER: Hardening Data Delivery by the Internet

  2. The challenge and motivation
  Collecting X terabytes of arbitrary data from the FDSN data centers is possible, but:
  • Usually only possible by partitioning the request, orchestrated by the user
  • Orchestration is non-trivial: errors and re-tries must be handled
  • Complete? Downloads may be quietly truncated, a weakness of HTTP + streaming
  • Local data management is left to the user
  • Summarization, indexing and sub-setting are all left to users to (re)invent

  3. Enter rover: Retrieval of Various Experiment data Robustly
  • A command-line client
  • Designed to run long-term, until the request is complete (restartable)
  • Designed to verify that all requested data that can be retrieved has been retrieved, using the DMC’s availability service
  • Designed to check for additions and, in the future, updates to the requested data set
  • Builds a data index for summarization, lookup and extraction
  • The index is stored in SQLite, for which support is ubiquitous; simple text summaries are trivially generated
  • The index is the key to integrating such a data set into a workflow, and a bridge to other systems

  4. Rover workflow
  1. Create the desired data request (Subscription 1, Request 1..N)
  2. Launch retrieval per request; loop until nothing remains to retrieve:
     2a. Check availability
     2b. Compare to local holdings (data index)
     2c. Fetch needed data in parallel (into the miniSEED data set)
     2d. Index the new data

  5. How to install
  Two-part installation:
  1) Install mseedindex from source code: https://github.com/iris-edu/mseedindex
     Requirements: a C compiler and the make program
  2) Install rover using pip:
     > pip install rover
     Requirements: Python >= 2.7 (and pip)

  6. Rover: Quick Start, an example request
  1. Initialize a data repository (and change into that directory):
     $ rover init-repository datarepo
     $ cd datarepo
  2. Create a request file named request.txt containing:
     IU ANMO * LHZ 2012-01-01T00:00:00 2012-02-01T00:00:00
     TA MSTX -- BH? 2012-01-01T00:00:00 2012-02-01T00:00:00
  3. Run rover retrieve to fetch these data:
     $ rover retrieve request.txt
     • HTTP status page & email when done
  Data are saved, in miniSEED format, to files with this organization:
  <datarepo>/data/<network>/<year>/<day>/<station>.<network>.<year>.<day>
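The file organization above can be sketched as a small helper. This is an illustration only (repo_path is a hypothetical function, not part of rover); it assumes <day> is the zero-padded day-of-year, matching the pattern shown:

```python
from datetime import datetime, timezone

def repo_path(datarepo, network, station, starttime):
    """Build a path of the form
    <datarepo>/data/<network>/<year>/<day>/<station>.<network>.<year>.<day>
    where <day> is the zero-padded day-of-year (assumption for illustration)."""
    year = starttime.strftime("%Y")
    day = starttime.strftime("%j")  # day-of-year, e.g. 001 for Jan 1
    return f"{datarepo}/data/{network}/{year}/{day}/{station}.{network}.{year}.{day}"

print(repo_path("datarepo", "IU", "ANMO",
                datetime(2012, 1, 1, tzinfo=timezone.utc)))
# datarepo/data/IU/2012/001/ANMO.IU.2012.001
```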

  7. Once you have data: report what is in the repository
  List a summary (extents) of the data in the repository:
  $ rover list-summary
  IU_ANMO_00_LHZ 2012-01-01T00:00:00.069500 2012-01-31T23:59:59.069500
  IU_ANMO_10_LHZ 2012-01-01T00:00:00.069500 2012-01-31T23:59:59.069500
  TA_MSTX__BHE 2012-01-01T00:00:00.000000 2012-01-31T23:59:59.975000
  TA_MSTX__BHN 2012-01-01T00:00:00.000000 2012-01-31T23:59:59.975000
  TA_MSTX__BHZ 2012-01-01T00:00:00.000000 2012-01-31T23:59:59.975000
  • Limit the summary to specific networks, stations, locations, channels & time ranges
  • Alternatively, use list-index for full details: the actual contiguous traces
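Because the summary is plain whitespace-separated text, it is easy to consume in scripts. A minimal sketch, assuming the output format shown above (parse_summary is a hypothetical helper for illustration, not part of rover):

```python
# Sample lines in the `rover list-summary` format shown above:
# <NET>_<STA>_<LOC>_<CHAN> <earliest> <latest>
SAMPLE = """\
IU_ANMO_00_LHZ 2012-01-01T00:00:00.069500 2012-01-31T23:59:59.069500
TA_MSTX__BHZ 2012-01-01T00:00:00.000000 2012-01-31T23:59:59.975000
"""

def parse_summary(text):
    """Parse summary lines into dicts; an empty location code
    (e.g. TA_MSTX__BHZ) becomes an empty string."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        ident, earliest, latest = line.split()
        net, sta, loc, chan = ident.split("_")
        rows.append({"net": net, "sta": sta, "loc": loc, "chan": chan,
                     "earliest": earliest, "latest": latest})
    return rows

for row in parse_summary(SAMPLE):
    print(row["net"], row["sta"], row["chan"], row["earliest"])
```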

  8. Once you have data: run your own fdsnws-dataselect service
  Run an FDSN web service on your local repository: https://iris-edu.github.io/portable-fdsnws-dataselect/
  • A Python-based web service that returns data based on a time series index
  • Most tools that use FDSN web services (FetchData, ObsPy, etc.) can be redirected to alternate services

  9. Once you have data: direct use with ObsPy (next release)
  The DMC has contributed a new sub-module to ObsPy, to be included in the next release, that allows directly discovering and reading data in a rover-created repository: obspy.clients.filesystem.tsindex.Client
  Very similar to other ObsPy interfaces, this module provides get_waveforms(), get_availability_extent(), get_availability() and a few more.

  10. Once you have the data: use the data index directly
  The data index supports data discovery and summary with no need to crawl through files:
  • Filenames, data identifiers (net, sta, loc, chan), earliest, latest, exact segments, sample rates, low-level details and more...
  The index is stored in SQLite, a very powerful single-file database that is easy to use:
  $ sqlite3 index.sql 'SELECT filename,network,station,location,channel,starttime,endtime FROM tsindex;'
  /path/cola.mseed|IU|COLA|00|LH1|2010-02-27T06:50:00.069539|2010-02-27T07:59:59.069538
  /path/cola.mseed|IU|COLA|00|LH2|2010-02-27T06:50:00.069539|2010-02-27T07:59:59.069538
  /path/cola.mseed|IU|COLA|00|LHZ|2010-02-27T06:50:00.069539|2010-02-27T07:59:59.069538
  ...
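The same query can be issued from any language with SQLite support. A minimal self-contained sketch in Python, using an in-memory stand-in populated with the sample row above (the table here carries only the columns shown in the query; the real tsindex table has more):

```python
import sqlite3

# Build a tiny in-memory stand-in for a rover data index, using only
# the columns referenced in the query above (a simplification).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tsindex (
    filename TEXT, network TEXT, station TEXT, location TEXT,
    channel TEXT, starttime TEXT, endtime TEXT)""")
conn.execute("INSERT INTO tsindex VALUES (?,?,?,?,?,?,?)",
             ("/path/cola.mseed", "IU", "COLA", "00", "LHZ",
              "2010-02-27T06:50:00.069539", "2010-02-27T07:59:59.069538"))

# The same kind of lookup as the sqlite3 command line, from Python:
for row in conn.execute(
        "SELECT filename, network, station, channel FROM tsindex "
        "WHERE network = ? AND channel LIKE ?", ("IU", "LH%")):
    print("|".join(row))
# /path/cola.mseed|IU|COLA|LHZ
```

Against a real repository, the connection would instead point at the index file on disk.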

  11. Main take-away points
  • Addresses robust collection of small to large data sets
  • Provides an indexed data repository
  • Cost: learning a new tool
  • Expected release: Spring 2019
  • Ask if you would like to be an early tester!
  • See a demo at the IRIS booth (808)
