
JSOC Pipeline Processing Overview



Presentation Transcript


  1. JSOC Pipeline Processing Overview Rasmus Munk Larsen, Stanford University rmunk@quake.stanford.edu 650-725-5485

  2. Overview • Hardware overview • JSOC data model • Pipeline infrastructure & subsystems • Pipeline modules

  3. JSOC Connectivity [Diagram: the JSOC disk array at Stanford, connected to the DDS, NASA AMES and LMSAL; a 1 Gb private line links to the MOC "White" Net.]

  4. JSOC Hardware configuration

  5. JSOC data model: Motivation
  • Evolved from MDI dataset concept to
    • Enable record level access to meta-data for queries and browsing
    • Accommodate more complex data models required by higher-level processing
  • Main design features
    • Lesson learned from MDI: Separate meta-data (keywords) and image data
      • No need to re-write large image files when only keywords change (lev1.8 problem)
      • No out-of-date keyword values in FITS headers - can bind to most recent values on export
    • Data access through query-like dataset names
      • All access in terms of (sets of) data records, which are the "atomic units" of a data series
      • A dataset name is a query specifying a set of data records:
        • jsoc:hmi_lev1_V[#3000-#3020] (21 records with known epoch and cadence)
        • jsoc:hmi_lev0_fg[t_obs=2008-11-07_02:00:00/8h][cam='doppler'] (8 hours worth of filtergrams)
    • Storage and tape management must be transparent to the user
      • Chunking of data records into storage units for efficient tape/disk usage done internally
      • Completely separate storage unit and meta-data databases: more modular design
      • MDI data and modules will be migrated to use the new storage service
    • Store meta-data (keywords) in a relational database
      • Can use the power of a relational database to search and index data records
      • Easy and fast to create time series of any keyword value (for trending etc.)
      • Consequence: Data records must be well defined (e.g. have a fixed set of keywords)
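The query-like dataset names above are plain strings, so a module can assemble them with ordinary string formatting. A minimal C sketch (the helper function is illustrative, not part of the DRMS API):

```c
#include <stdio.h>

/* Compose a JSOC dataset name selecting records of a series by a
   primary-index time range, with an optional extra filter clause.
   Illustrative only; the real query parsing lives inside DRMS. */
static void make_dataset_name(char *buf, size_t len, const char *series,
                              const char *t_obs, const char *span,
                              const char *extra)
{
    if (extra)
        snprintf(buf, len, "jsoc:%s[t_obs=%s/%s][%s]", series, t_obs, span, extra);
    else
        snprintf(buf, len, "jsoc:%s[t_obs=%s/%s]", series, t_obs, span);
}
```

Calling `make_dataset_name(buf, sizeof buf, "hmi_lev0_fg", "2008-11-07_02:00:00", "8h", "cam='doppler'")` reproduces the second example query above.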

  6. JSOC data model
  JSOC data will be organized according to a data model with the following classes:
  • Series: A sequence of like data records, typically data products produced by a particular analysis
    • Attributes include: Name, Owner, primary search index, Storage unit size, Storage group
  • Record: Single measurement/image/observation with associated meta-data
    • Attributes include: ID, Storage Unit ID, Storage Unit Slot#
    • Contains Keywords, Links, Data segments
    • Records are the main data objects seen by module programmers
  • Keyword: Named meta-data value, stored in database
    • Attributes include: Name, Type, Value, Physical unit
  • Link: Named pointer from one record to another, stored in database
    • Attributes include: Name, Target series, target record id or primary index value
    • Used to capture data dependencies and processing history
  • Data Segment: Named data container representing the primary data on disk belonging to a record
    • Attributes include: Name, filename, datatype, naxis, axis[0…naxis-1], storage format
    • Can be either structure-less (any file) or an n-dimensional array stored in a tiled, compressed file format
  • Storage Unit: A chunk of data records from the same series stored in a single directory tree
    • Attributes include: Online location, offline location, tape group, retention time
    • Managed by the Storage Unit Manager in a manner transparent to most module programmers
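As a rough illustration of how these classes relate, the hierarchy might be sketched in C as below; the field names and sizes are invented for the sketch, and the real DRMS structures differ:

```c
/* Illustrative data-model sketch; not the actual DRMS type definitions. */
typedef struct {
    char name[32];       /* keyword name, e.g. "T_OBS" */
    char type;           /* value type tag: 'i' int, 'd' double, 's' string */
    char unit[16];       /* physical unit */
    char value[64];      /* value, kept as text here for simplicity */
} Keyword;

typedef struct {
    char name[32];       /* link name, e.g. "ORBIT" */
    char target_series[64];
    long target_recnum;  /* target record id (or primary-index value) */
} Link;

typedef struct {
    char name[32];       /* segment name, e.g. "V_DOPPLER" */
    char filename[128];
    int  naxis;          /* 0 for a structure-less file */
    int  axis[8];        /* axis[0..naxis-1] */
} DataSegment;

typedef struct {
    long recordnum;      /* unique serial number within the series */
    long su_id;          /* storage unit holding the record's files */
    int  su_slot;        /* slot within that storage unit */
    Keyword     *keywords;  int n_keywords;
    Link        *links;     int n_links;
    DataSegment *segments;  int n_segments;
} Record;
```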

  7. JSOC data model
  [Diagram: JSOC data series (hmi_lev0_cam1_fg, aia_lev0_cont1700, hmi_lev1_fd_M, hmi_lev1_fd_V, aia_lev0_FE171), the data records of series hmi_lev1_fd_V (hmi_lev1_fd_V#12345 through #12353; one storage unit = one directory), and the contents of the single record hmi_lev1_fd_V#12345:]
  Keywords:
    RECORDNUM = 12345 # Unique serial number
    SERIESNUM = 5531704 # Slots since epoch
    T_OBS = '2009.01.05_23:22:40_TAI'
    DATAMIN = -2.537730543544E+03
    DATAMAX = 1.935749511719E+03
    ...
    P_ANGLE = LINK:ORBIT,KEYWORD:SOLAR_P
    ...
  Links:
    ORBIT = hmi_lev0_orbit, SERIESNUM = 221268160
    CALTABLE = hmi_lev0_dopcal, RECORDNUM = 7
    L1 = hmi_lev0_cam1_fg, RECORDNUM = 42345232
    R1 = hmi_lev0_cam1_fg, RECORDNUM = 42345233
    ...
  Data Segments:
    V_DOPPLER = ...
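The SERIESNUM keyword above counts cadence slots since the series epoch, so for a series with a known epoch and cadence it can be derived directly from the observation time. A minimal sketch (the function name and rounding choice are assumptions, not the JSOC convention):

```c
/* SERIESNUM = number of cadence slots elapsed since the series epoch.
   t_obs and epoch are times in seconds on the same scale (e.g. TAI
   seconds); cadence is the slot length in seconds. */
static long seriesnum(double t_obs, double epoch, double cadence)
{
    return (long)((t_obs - epoch) / cadence + 0.5); /* nearest slot */
}
```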

  8. JSOC subsystems
  • SUMS: Storage Unit Management System
    • Maintains database of storage units and their location on disk and tape
    • Manages JSOC storage subsystems: Disk array, Robotic tape library
    • Scrubs old data from disk cache to maintain enough free workspace
    • Loads and unloads tapes to/from tape drives and robotic library
    • Allocates disk storage needed by pipeline processes through DRMS
    • Stages storage units requested by pipeline processes through DRMS
    • Design features:
      • RPC client-server protocol
      • Oracle DBMS (to be migrated to PostgreSQL)
  • DRMS: Data Record Management System
    • Maintains database holding
      • Master tables with definitions of all JSOC series and their keyword, link and data segment definitions
      • One table per series containing record meta-data, e.g. keyword values
    • Provides distributed transaction processing framework for the pipeline
    • Provides full meta-data searching through the JSOC query language
      • Multi-column indexed searches on primary index values allow fast and simple querying for common cases
      • Inclusion of free-form SQL clauses allows advanced querying
    • Provides software libraries for querying, creating, retrieving and storing JSOC series, data records and their keywords, links, and data segments
      • Currently available in C. Wrappers (with read-only restriction?) for Fortran, Matlab and IDL are planned.
    • Design features:
      • TCP/IP socket client-server protocol
      • PostgreSQL DBMS
      • Slony DB replication system to be added for managing query load and enabling multi-site distributed archives

  9. Pipeline software/hardware architecture
  [Diagram: a pipeline program ("module") links against the JSOC science libraries, utility libraries and the DRMS library (OpenRecords, CloseRecords, GetKeyword, SetKeyword, GetLink, SetLink, OpenDataSegment, CloseDataSegment, data segment I/O, file I/O) and keeps a record cache (keywords + links + data paths). Modules talk to the Data Record Management Service (DRMS) over the DRMS socket protocol; DRMS issues SQL queries to the database server (record catalogs: series tables, record tables, storage unit tables) and requests storage units (AllocUnit, GetUnit, PutUnit) from the Storage Unit Management Service (SUMS), which transfers storage units between the JSOC disks and the robotic tape archive.]

  10. JSOC Pipeline Workflow
  [Diagram: the Pipeline Operator turns a pipeline processing plan into a processing script ("mapfile"), a list of pipeline modules with the datasets needed for input and output. The PUI (Pipeline User Interface, the scheduler) runs the modules (Module1, Module2, Module3, …) within a DRMS session, writing a processing history log; the DRMS Data Record Management service is backed by the SUMS Storage Unit Management System.]

  11. Analysis modules: co-I contributions and collaboration
  • Contributions from co-I teams:
    • Software for intermediate and high level analysis modules
    • Data series definitions
      • Keywords, links, data segments, size of storage units, primary index keywords etc.
    • Documentation
    • Test data and intended results for verification
    • Time
      • Explain algorithms and implementation
      • Help with verification
      • Collaborate on improvements if required (e.g. performance or maintainability)
  • Contributions from HMI team:
    • Pipeline execution environment
    • Software & hardware resources (Development environment, libraries, tools)
    • Time
      • Help with defining data series
      • Help with porting code to JSOC API
      • If needed, collaborate on algorithmic improvements, tuning for JSOC hardware, parallelization
      • Verification

  12. HMI module status and MDI heritage
  [Diagram of primary observables and intermediate/high-level data products, coded by origin (code developed at Stanford; code developed at HAO) and status (standalone "production" code routinely used; MDI pipeline modules exist; research code in use):
  • Doppler velocity → heliographic Doppler velocity maps → spherical harmonic time series → mode frequencies and splitting → internal rotation, internal sound speed
  • Ring diagrams → local wave frequency shifts → full-disk velocity and sound speed maps (0-30 Mm) → Carrington synoptic v and cs maps (0-30 Mm)
  • Time-distance: tracked tiles of Dopplergrams → cross-covariance function → wave travel times → high-resolution v and cs maps (0-30 Mm), deep-focus v and cs maps (0-200 Mm)
  • Egression and ingression maps → wave phase shift maps → far-side activity index
  • Stokes I,V → line-of-sight magnetograms (full-disk 10-min averaged maps) → line-of-sight magnetic field maps
  • Stokes I,Q,U,V → vector magnetograms (fast algorithm; inversion algorithm) → vector magnetic field maps → coronal magnetic field extrapolations → coronal and solar wind models
  • Continuum brightness → tracked tiles, tracked full-disk 1-hour averaged continuum maps, solar limb parameters, brightness feature maps → brightness images]

  13. Example: Global Seismology Pipeline

  14. Questions to be discussed at working sessions
  • List of standard science data products
    • Which data products, including intermediate ones, should be produced by JSOC to accomplish the science goals of the mission?
    • What cadence, resolution, coverage etc. should each data product have?
    • Which data products should be computed on the fly and which should be archived?
    • What are the challenges to be overcome for each analysis technique?
  • Detailing each branch of the processing pipeline
    • What are the detailed steps in each branch?
    • Can some of the computational steps be encapsulated in general tools that can be shared among different branches (example: tracking)?
    • What are the CPU and I/O resource requirements of computational steps?
  • Contributed analysis modules
    • What groups or individuals will contribute code, and incorporate it in the pipeline?
    • If multiple candidate techniques and/or implementations exist, which should be included in the pipeline?
    • What is the test plan and what data is needed to verify the approach?

  15. JSOC Series Definition

  16. Global Database Tables

  17. Database tables for example series hmi_fd_v
  • Tables specific to each series contain per-record values of
    • Keywords
    • Record numbers of records pointed to by links
    • DSIndex = an index identifying the SUMS storage unit containing the data segments of a record
  • Series sequence counter used for generating unique record numbers
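A per-series table following these conventions might look roughly as below; the column names beyond the ones listed above are illustrative, not the actual JSOC schema:

```sql
-- Illustrative sketch of a per-series table; actual JSOC schemas differ.
CREATE TABLE hmi_fd_v (
    recnum   BIGINT PRIMARY KEY,  -- unique record number (from series sequence)
    t_obs    TIMESTAMP,           -- primary index keyword
    datamin  DOUBLE PRECISION,    -- per-record keyword values
    datamax  DOUBLE PRECISION,
    ln_orbit BIGINT,              -- record number pointed to by the ORBIT link
    dsindex  BIGINT               -- SUMS storage unit holding the data segments
);
CREATE SEQUENCE hmi_fd_v_seq;     -- series sequence counter feeding recnum
```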

  18. Pipeline batch processing
  • A pipeline batch is encapsulated in a single database transaction:
    • If no module fails, all data records are committed and become visible to other clients of the JSOC catalog at the end of the session
    • If a failure occurs, all data records are deleted and the database rolled back
    • It is possible to commit data produced up to intermediate checkpoints during sessions
  [Diagram: a pipeline batch as an atomic transaction. Module 1 registers the session; Modules 2.1, 2.2, …, N read input data records and write output data records through the DRMS API; the DRMS service acts as session master against the record & series database and SUMS; the final step commits the data and deregisters.]
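The all-or-nothing behaviour can be modelled with a toy batch runner (a simulation of the described semantics, not the DRMS implementation): records produced by the modules only become visible if every module in the batch succeeds.

```c
#include <stddef.h>

/* Toy model of a pipeline batch. Each module either succeeds (returns 0,
   reporting how many records it produced) or fails (nonzero). Records are
   counted as visible only if every module succeeds; on any failure the
   whole batch is rolled back and nothing becomes visible. */
typedef int (*module_fn)(int *records_produced);

static int ok_module(int *produced)  { *produced = 3; return 0; }
static int bad_module(int *produced) { (void)produced; return 1; }

static int run_batch(module_fn *modules, size_t n, int *visible_records)
{
    int pending = 0;                /* records staged inside the transaction */
    for (size_t i = 0; i < n; i++) {
        int produced = 0;
        if (modules[i](&produced) != 0)
            return -1;              /* failure: roll back, nothing visible */
        pending += produced;
    }
    *visible_records += pending;    /* commit: records become visible */
    return 0;
}
```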

  19. Example of module code:
  • A module doing a (naïve) Doppler velocity calculation could look as shown below
  • Usage: doppler DRMSSESSION=helios:33546 "2009.09.01_16:00:00_TAI" "2009.09.01_17:00:00_TAI"

  extern CmdParams_t cmdparams;  /* command line args */
  extern DRMS_Env_t *drms_env;   /* DRMS environment */

  int module_main(void)
  {
    DRMS_RecordSet_t *filtergrams, *dopplergram;
    int first_frame, status;
    char query[1024], *start, *end;

    start = cmdparms_getarg(&cmdparams, 1);
    end = cmdparms_getarg(&cmdparams, 2);
    sprintf(query, "hmi_lev0_fg[T_Obs=%s-%s]", start, end);
    filtergrams = drms_open_records(drms_env, query, "RD", &status);
    if (filtergrams->num_recs == 0) {
      printf("Sorry, no filtergrams found for that time interval.\n");
      return -1;
    }
    first_frame = 0;
    /* Start looping over record set. */
    for (;;) {
      first_frame = find_next_framelist(first_frame, filtergrams);
      if (first_frame == -1)  /* No more complete framelists. Exit. */
        break;
      dopplergram = drms_create_records(drms_env, "hmi_fd_v", 1, &status);
      if (status)
        return -1;
      compute_dopplergram(first_frame, filtergrams, dopplergram);
      drms_close_records(drms_env, dopplergram);
    }
    return 0;
  }

  20. Example continued

  int compute_dopplergram(int first_frame, DRMS_RecordSet_t *filtergrams,
                          DRMS_RecordSet_t *dopplergram)
  {
    int i, n_rows, n_cols, tuning;
    DRMS_Segment_t *fg[10], *dop;
    short *fg_data[10];
    char *pol;
    double *dop_data;

    /* Get pointer for Doppler data array. */
    dop = drms_open_datasegment(dopplergram->records[0], "v_doppler", "RDWR");
    n_cols = drms_getaxis(dop, 0);
    n_rows = drms_getaxis(dop, 1);
    dop_data = (double *)drms_getdata(dop, 0, 0);

    /* Get pointers for filtergram data arrays. */
    for (i = 0; i < 10; i++) {
      fg[i] = drms_open_datasegment(filtergrams->records[first_frame+i], "intensity", "RD");
      fg_data[i] = (short *)drms_getdata(fg[i], 0, 0);
      pol = drms_getkey_string(filtergrams->records[first_frame+i], "Polarization");
      tuning = drms_getkey_int(filtergrams->records[first_frame+i], "Tuning");
      printf("Using filtergram (%s, %d)\n", pol, tuning);
    }

    /* Do the actual Doppler computation. */
    calc_v(fg_data, dop_data);
    return 0;
  }
