A Scalable Workflow Scheduling and Gridding of CALIPSO Lidar /Infrared Data. PI: Prof. Yelena Yesha and Prof. Milton Halem Sponsored by NASA Presented by: Phuong Nguyen and Frank Harris IAB Meeting Research Report Dec 18, 2012. Project objectives.


A Scalable Workflow Scheduling and Gridding of CALIPSO Lidar/Infrared Data

PI: Prof. Yelena Yesha and Prof. Milton Halem

Sponsored by NASA

Presented by: Phuong Nguyen and Frank Harris

IAB Meeting Research Report

Dec 18, 2012

Project objectives
  • Apply the scalable workflow scheduling system developed by CHMPR/MC2 (implemented on top of Hadoop) to a real Big Data scientific use case
    • Analyze global climate change from decadal satellite infrared radiance records stored in two distinct archives, obtained from the AIRS and MODIS instruments.
    • Perform gridding and subsequent monthly, seasonal and annual trend inter-comparisons with surface temperatures from ground station records, and compare with model output reanalysis.
  • Grid other satellite data, such as CALIPSO Lidar aerosols, and deliver gridded data products
A Scalable Workflow Scheduling System
  • We have developed a scalable workflow scheduling system, implemented on top of Apache Hadoop
    • Expresses and dynamically schedules parallel data-intensive workflow computations:
      • data flows in a Directed Acyclic Graph rather than control flow
      • optimizes the level of concurrency
      • shares cluster resources using fine-grained scheduling (HybS)
      • supports scientific data formats (e.g. HDF) and computation using float arrays
      • performance prediction model
  • Available Java APIs: DagJob, DagBuilder, Graphs …
  • Libraries: gridding, statistical routines, statistical model
  • Available HybSHadoop plug-in scheduler, configurable to work as the Hadoop scheduler in the current Hadoop distribution 1.0.1
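
The data-flow idea can be illustrated with a small sketch. It is not the CHMPR/MC2 API (DagJob and DagBuilder are not reproduced here): the job names and the `concurrency_levels` helper are hypothetical, and Python's standard `graphlib` module stands in for the workflow scheduler. It only shows how a Directed Acyclic Graph of data dependencies determines which jobs may run at the same time.

```python
from graphlib import TopologicalSorter

def concurrency_levels(dependencies):
    """Group the jobs of a DAG into waves that may run concurrently.

    dependencies maps each job name to the set of jobs whose output it consumes.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                        # raises CycleError if the graph has a cycle
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every job whose inputs are available
        levels.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return levels

# Hypothetical workflow: two gridding jobs feed a comparison job, then a trend job.
workflow = {
    "grid_airs": set(),
    "grid_modis": set(),
    "compare": {"grid_airs", "grid_modis"},
    "trend": {"compare"},
}
levels = concurrency_levels(workflow)
print(levels)
```

Here the two gridding jobs have no dependencies and land in the same wave, so a scheduler is free to run them in parallel.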
Use case: global climate changes from AIRS and MODIS
  • Atmospheric Infrared Sounder (AIRS)
    • 14-40 km footprint
    • 2378 IR spectral channels
    • 5.5 TB/year (L1B)
    • 55 TB; 876,000 HDF files, each 135 × 90 × 2378 (28,892,700 elements), 60 MB
  • Moderate Resolution Imaging Spectroradiometer (MODIS)
    • 1-4 km footprint (infrared)
    • 16 IR spectral channels
    • 17 TB/year (L1B)
    • 170 TB; 1,051,200 HDF files
  • Data product: 10-year AIRS FCDR anomalies at 0.5° × 1.0° lon-lat (100 km), 2002-2012
AIRS gridding using MapReduce approaches
  • Step 1: Upload AIRS/MODIS HDF files in parallel from NFS/PVFS into Hadoop HDFS
  • Step 2: Grid AIRS/MODIS data using MapReduce jobs; output is written to HBase tables
  • Step 3:
    • Analyze gridded data from HBase tables, or
    • Load data out of HBase/HDFS and store it as HDF files in NFS/PVFS for other analysis
  • Gridding using MapReduce
    • Map function: the input is an HDF file and the output is (key, value) pairs; the key is a grid cell (lat × lon) and the value is an array of the sum and count of radiances for all spectral channels
    • Reduce function: averages all values with the same key and writes the output to HBase tables
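
The Map and Reduce functions described above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's Hadoop code: a granule is represented as a list of `(lat, lon, radiances)` tuples instead of an HDF file, the 1° cell size and all function names are hypothetical, and `run_job` merely imitates the shuffle that the framework would perform.

```python
from collections import defaultdict

CELL_DEG = 1.0  # assumption: 1 degree cells, chosen only for this illustration

def grid_cell(lat, lon):
    """Return the (row, col) index of the grid cell containing a footprint."""
    row = min(int((lat + 90.0) / CELL_DEG), int(180.0 / CELL_DEG) - 1)
    col = min(int((lon + 180.0) / CELL_DEG), int(360.0 / CELL_DEG) - 1)
    return row, col

def map_granule(footprints):
    """Map: emit (cell, (sums, counts)) for every footprint in one granule."""
    for lat, lon, radiances in footprints:
        yield grid_cell(lat, lon), (list(radiances), [1] * len(radiances))

def reduce_cell(values):
    """Reduce: add the per-channel sums and counts, then average."""
    values = list(values)
    n_channels = len(values[0][0])
    sums, counts = [0.0] * n_channels, [0] * n_channels
    for channel_sums, channel_counts in values:
        for ch in range(n_channels):
            sums[ch] += channel_sums[ch]
            counts[ch] += channel_counts[ch]
    return [sums[ch] / counts[ch] for ch in range(n_channels)]

def run_job(granules):
    """Stand-in for the framework: group map output by key, then reduce."""
    shuffled = defaultdict(list)
    for granule in granules:
        for key, value in map_granule(granule):
            shuffled[key].append(value)
    return {key: reduce_cell(vals) for key, vals in shuffled.items()}

# Two made-up granules with two spectral channels each.
granule_a = [(10.2, 20.7, [1.0, 2.0]), (10.8, 20.1, [3.0, 4.0])]
granule_b = [(10.5, 20.5, [5.0, 6.0]), (-45.3, 100.9, [7.0, 8.0])]
result = run_job([granule_a, granule_b])
print(result)
```

Carrying sums and counts rather than averages is what lets partial results from different files be merged correctly before the final division.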
Spatial data locality
  • A bounding box is implemented
  • Reduce locally before the shuffle
  • Output is stored in HBase tables for queries, e.g. monthly, seasonal and annual trend inter-comparisons

Image source: David Chapman
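
A minimal sketch of the "reduce locally before the shuffle" item, assuming single-channel `(sum, count)` values for brevity. The `combine_local` name and the sample records are hypothetical, and the bounding box logic is not covered here. The point is only that merging records with the same grid cell on the node that produced them shrinks what has to cross the network.

```python
from collections import defaultdict

def combine_local(map_output):
    """Merge (sum, count) pairs that share a grid cell on one node."""
    partial = defaultdict(lambda: [0.0, 0])
    for cell, (total, count) in map_output:
        partial[cell][0] += total
        partial[cell][1] += count
    return {cell: (s, c) for cell, (s, c) in partial.items()}

# Made-up map output from one node: three records, two distinct cells.
node_output = [
    ((100, 200), (1.0, 1)),
    ((100, 200), (3.0, 1)),
    ((44, 280), (7.0, 1)),
]
combined = combine_local(node_output)
print(len(node_output), "records before,", len(combined), "after")
```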

Improvement of AIRS gridding
  • Estimated based on daily gridding
  • Used a bounding box for spatial data locality
  • Gridding: compared with the regular, embarrassingly parallel method, a 35% improvement in total processing time
  • Benefits: scaling, failure handling, gridding at high resolutions, queries by random data access on HBase tables.
Gridding CALIPSO Lidar aerosols: Background

Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations (CALIPSO) is an infrared/lidar satellite, a joint project between NASA and CNES (France)

Fourth satellite in the A-train formation, follows CloudSat by 15 s, and Aqua by 165 s

Launched in 2006

Instruments

Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP)

Detects reflectance of 20 ns laser pulses at 1064 nm (IR) and 532 nm (vis)

333 m footprints at full spatial resolution

Imaging Infrared Radiometer (IIR)

Provides a 3-channel infrared product at 8.65, 10.6, and 12.05 μm at 1 km spatial resolution

Wide Field Camera (WFC)

1-channel visible product at 1 km resolution

Progress

Developed serial gridder in C, tested on subset of IIR data

Acquired 14 months of IIR data, 333 days, average 1.5 GB per day, total ~ 500 GB

In addition, 2 months of CALIPSO data were downloaded, 625 GB in total, which corresponds to about 3.7 TB/year

Gridded Product

Full 360x180 degree image

  • At full-resolution, image is 36000x18000 pixels and 2.4 GB in size
  • Shows expected swath path for sun-synchronous satellite
  • Shows limited coverage of nadir imaging
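
The image dimensions quoted above can be checked with a short calculation. The sketch assumes one 32-bit value per pixel (not stated on the slide) and uses a hypothetical `pixel_index` helper with row 0 at the north pole; the actual layout of the serial C gridder is not shown in the source.

```python
WIDTH, HEIGHT = 36000, 18000          # pixels, as quoted on the slide
PIXELS_PER_DEGREE = WIDTH // 360      # 100 pixels per degree, i.e. 0.01 degree cells
BYTES_PER_PIXEL = 4                   # assumption: one 32-bit value per pixel

def pixel_index(lat, lon):
    """Map latitude/longitude in degrees to (row, col); row 0 is at 90 N here."""
    row = min(int((90.0 - lat) * PIXELS_PER_DEGREE), HEIGHT - 1)
    col = min(int((lon + 180.0) * PIXELS_PER_DEGREE), WIDTH - 1)
    return row, col

total_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
print(total_bytes, "bytes")           # roughly 2.4 GiB under the 4-byte assumption
print(pixel_index(0.0, 0.0))
```

Under that assumption the full grid has 648 million cells, which is consistent with the stated file size.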

Subset of gridded image

  • Shows high detail within individual swaths
  • Also shows significant moiré interference as a result of my gridding algorithm
  • Plan to improve gridding via inverse distance weighting interpolation in the near future
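
Inverse distance weighting, the planned improvement, can be sketched as follows. This is a generic textbook form, not the project's implementation: the `idw` name, the power of 2, the search radius, and the use of planar distance in degrees (ignoring spherical geometry) are all assumptions made for illustration.

```python
import math

def idw(samples, x, y, power=2.0, radius=1.0):
    """Inverse distance weighted estimate at (x, y).

    samples is a list of (sx, sy, value); only samples within radius are used.
    Returns None when no sample is close enough.
    """
    numerator = denominator = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(sx - x, sy - y)
        if distance == 0.0:
            return value                  # exact hit: use the sample itself
        if distance <= radius:
            weight = 1.0 / distance ** power
            numerator += weight * value
            denominator += weight
    return numerator / denominator if denominator else None

samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint = idw(samples, 1.0, 0.0, radius=2.0)   # equidistant from both samples
print(midpoint)
```

Because each output cell blends several nearby footprints, neighbouring cells change more smoothly than with nearest-footprint assignment.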
What's Next

Acquire rest of dataset (3 TB IIR, 22 TB CALIOP)

Once the naïve sequential approach is done, process using MapReduce

Interference, sparse coverage and file size problems can be dealt with by significantly lowering resolution of product to 1°x1°

Use NCAR Graphics library instead of reusing built-from-scratch internal code

Produce gridded products, monthly and yearly averages

Possible scientific applications: using solar reflectances to generate cloud maps; using altimetry data from CALIOP as a correction for existing datasets
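
Lowering the product resolution to 1° × 1°, as proposed above, amounts to aggregating blocks of fine cells. The sketch below assumes the fine grid is kept as separate sum and count arrays (nested lists here); the `coarsen` name and the tiny sample grid are hypothetical. For a 0.01° grid the factor would be 100.

```python
def coarsen(sums, counts, factor):
    """Aggregate fine-grid sums and counts into blocks of factor x factor cells."""
    rows, cols = len(sums) // factor, len(sums[0]) // factor
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, n = 0.0, 0
            for i in range(r * factor, (r + 1) * factor):
                for j in range(c * factor, (c + 1) * factor):
                    total += sums[i][j]
                    n += counts[i][j]
            out[r][c] = total / n if n else None   # None marks a block with no observations
    return out

# Made-up 2 x 4 fine grid: only the left block contains observations.
fine_sums = [[1.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]]
fine_counts = [[1, 0, 0, 0], [1, 0, 0, 0]]
coarse = coarsen(fine_sums, fine_counts, factor=2)
print(coarse)
```

Dividing by the observation count instead of the block area keeps sparsely covered blocks from being biased toward zero.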

Project Status
  • Developed AIRS and MODIS gridding and analysis using MapReduce approaches (making use of the workflow system)
  • Demonstrated CALIPSO gridding using a serial approach
  • Future work
    • Grid CALIPSO using the MapReduce approach
    • Test, evaluate and produce data products
    • Phuong Nguyen is working on the open-source workflow system and the HybSHadoop plug-in scheduler.
Publications
  • Phuong Nguyen, Tyler Simon, Milton Halem, David Chapman, and Quang Le, "A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment", The 5th IEEE/ACM International Conference on Utility and Cloud Computing, 2012
  • Phuong Nguyen, David Chapman, Jeff Avery, and Milton Halem, "A near fundamental decadal data record of AIRS Infrared Brightness Temperatures", IEEE Geoscience and Remote Sensing Symposium, 2012
  • Phuong Nguyen, PhD dissertation, "Data intensive scientific compute model for multiple core clusters", submitted to UMBC, Dec 3, 2012
  • Phuong Nguyen and Milton Halem, "A MapReduce Workflow System for Architecting Scientific Data Intensive Applications", ACM International Workshop on Software Engineering for Cloud Computing, proceedings of ICSE 2011

UCL Department of Computer Science

Difference between Scientific Workflows and Business Workflows
  • BPEL is primarily targeted at business workflows
  • Scientific workflows differ in a number of ways
  • The main difference is one of scale along several dimensions
Background: MapReduce/Hadoop
  • Distributed computation on large cluster
  • Each job consists of Map and Reduce tasks
  • Job stages
    • Map tasks run computations in parallel
    • Shuffle combines intermediate Map outputs
    • Reduce tasks run computations in parallel

[Figure: Map (M) and Reduce (R) tasks of a single job]

Source slide: Brian Cho UIUC

Background: MapReduce/Hadoop
  • Distributed computation on large cluster
  • Each job consists of Map and Reduce tasks
  • Job stages
    • Map tasks run computations in parallel
    • Shuffle combines intermediate Map outputs
    • Reduce tasks run computations in parallel
  • Map input/Reduce output stored in distributed file system (e.g. HDFS)
  • Scheduling: Which task to run on empty resources (slots)
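
The scheduling question, which task to run when a slot becomes free, can be made concrete with a toy policy. This is a generic fair-share rule written for illustration; it is not HybS and not any stock Hadoop scheduler, and the `pick_job` and `fill_slots` names and the job counts are hypothetical.

```python
def pick_job(pending, running):
    """Among jobs with pending tasks, pick the one currently running the fewest."""
    candidates = [job for job, n in pending.items() if n > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda job: (running.get(job, 0), job))

def fill_slots(pending, running, free_slots):
    """Assign free slots one at a time; return the chosen jobs in order."""
    pending, running = dict(pending), dict(running)   # work on copies
    chosen = []
    for _ in range(free_slots):
        job = pick_job(pending, running)
        if job is None:
            break
        pending[job] -= 1
        running[job] = running.get(job, 0) + 1
        chosen.append(job)
    return chosen

# Job 1 already holds two slots; four more slots become free.
assignment = fill_slots({"Job 1": 3, "Job 2": 1, "Job 3": 2}, {"Job 1": 2}, free_slots=4)
print(assignment)
```

A FIFO policy would instead hand all four slots to the earliest job, which is the kind of trade-off a scheduler design has to weigh.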

[Figure: Map (M) and Reduce (R) tasks belonging to Job 1, Job 2 and Job 3]

Source slide: Brian Cho UIUC

Why new workflow scheduling system?
  • Characteristics of data intensive scientific apps
    • Repeated experiments on different data
    • Computations on high dimension arrays: spatial, temporal, spectral
    • Variety of data formats, need math libraries
    • Complex components e.g. model prediction
  • Lack of a scientific workflow system to deal with scale
    • scalability, reliability, scheduling, data management, provenance, low overhead
  • Current limitation: Hadoop is a scalable system built to run on large commodity clusters (e.g. runs on 8000 cores)
    • Still need to improve key performance metrics
    • Limited support for scientific apps
HBase design for multiple satellites: gridded data
  • (Table, RowKey, Family, Column, Timestamp) → Value
  • HBase indexes on the rowKey value
  • The rowKey design for multiple satellite instruments
    • <InstrumentID>_<DateTime>_<SpectralChannel>_<SpatialIndex>
    • Column families
      • e.g. Resolution-Statistics: column1_100km, column_1Km
    • Spatial index: lat, lon bounding box
  • Indexed by instrument, date-time, spatial index and spectral channel
  • Scan rows (and selected columns) into MapReduce computation
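
The rowKey layout can be sketched as a pair of helpers. The field formats (an 8-digit date, a zero-padded channel number, a tile label such as `N10E020`) and both function names are assumptions made for illustration; no HBase client call is used. Fixed-width fields matter because row keys sort lexicographically, so a key prefix selects a contiguous range of rows for a scan.

```python
def row_key(instrument, date_time, channel, spatial_index):
    """Compose <InstrumentID>_<DateTime>_<SpectralChannel>_<SpatialIndex>."""
    return f"{instrument}_{date_time}_{channel:04d}_{spatial_index}"

def prefix_scan_range(prefix):
    """Return (start, stop) row keys covering every key that begins with prefix."""
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # smallest key after the prefix range
    return prefix, stop

keys = sorted([
    row_key("AIRS", "20100115", 528, "N10E020"),
    row_key("AIRS", "20100220", 528, "N10E020"),
    row_key("MODIS", "20100115", 31, "N10E020"),
])
start, stop = prefix_scan_range("AIRS_201001")
january_airs = [k for k in keys if start <= k < stop]
print(january_airs)
```

Putting the instrument and date first favours time-range scans per instrument; a different field order would favour different queries.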
