slide1 n.
Download
Skip this Video
Download Presentation
The Python Programming Language and HDF5: H5py.

Loading in 2 Seconds...

play fullscreen
1 / 25

The Python Programming Language and HDF5: H5py. - PowerPoint PPT Presentation


  • 176 Views
  • Uploaded on

The Python Programming Language and HDF5: H5py. Application to Satellite Remote Sensing Data Processing. Needs of satellite remote sensing data processing and programming languages. Introduction to some Python elements. Introduction to Numerical Python with Numpy.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'The Python Programming Language and HDF5: H5py.' - vaughn


Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

The Python Programming Language and HDF5: H5py.

Application to Satellite Remote Sensing Data Processing.

Needs of satellite remote sensing data processing and programming languages.

Introduction to some Python elements.

Introduction to Numerical Python with Numpy.

Tying them together with H5py.

Examples.

Daniel Kahn, Science Systems and Applications Inc.

HDF/HDF-EOS Workshop XIII, Riverdale, MD

3 Nov 2009

slide2

Where credit is due:

Andrew Collette, UCLA, author of h5py

Francesc Alted, author of PyTables and the HDF5 1.6 Pyrex definitions (i.e Python-C defs used in H5py)‏

Andrew also requested acknowledgment for Darren Dale (Cornell) and Armando Sole (ESRF) for their suggestions, patches and encouragement.

slide3

Modern remote sensing satellite data and data processing needs have several characteristics.

1) The data need to be subjected to numerical algorithms. The algorithms vary in their sophistication but few, if any, require supercomputer architectures. E.g. Computing a zonal mean. (Fortran, C, C++)‏

2) The data need to be subjected to non-numerical “bookkeeping” algorithms. Associated metadata needs to be read, matched, sliced, diced, acted upon and committed to databases. (Perl)

3) Modern systems automatically the schedule processing on to lots of inexpensive, commodity CPUs. This complicates using managed (i.e. licensed) commercial runtime environments, e.g. IDL, Matlab, Mathematica.

4) Satellite processing systems are dynamic. They need customization several times over the course of each instrument's life. Development time needs to be short. (Perl, IDL, Matlab, Mathematica)‏

5) Data processing systems need to track data provenance.

slide4

A Remote Sensing Data Processing Challenge...

Problem:

We lack a programming language which spans the breath of our remote sensing data processing needs, i.e. Which can handle both numerical and non-numerical processing (including HDF5) and that has a short development cycle.

Goal:

Find a programming language which allows fast development and is adept at numerical processing as well as non-numerical processing. “Adept” doesn't mean “best”, just pretty good.

slide5

...and the Python Programming Language

Why Python fits the bill better than:

Fortran

As an interpreted language with dynamic variable bindings Python offers more flexibility during development. The multidimensional array extension (numpy) provides efficient and competitive array operations. It is more similar to IDL or Matlab...

IDL or Matlab

No license required which can save money on several levels.

Python language can be more well suited to non-numerical data processing tasks such as are encountered in production processing of remote sensing. It is more similar to Perl...

Perl/PDL

Nicer syntax for array operations.

More complete HDF5 support (h5py vs PDL HDF5).

slide6

Some elements of Python

Python is an interpreted language. Python statements can easily be run from an interactive prompt. Python programs are written in ASCII files.

Python has many of the same elements as programming languages used for numerical processing (I.e. loops, conditionals, etc) plus...

Name spaces

Lists

Dictionaries

slide7

Python has name spaces

Name spaces allow related functions and data to be grouped together in a way that will not conflict with unrelated functions and data which happens to be named the same.

Name space are very useful, but are only relevant here because Python name spaces correspond to Python modules. The numerical and HDF5 routines we will see later are implemented as Python modules.

For example, to open an HDF5 file in Python we first import the h5py module and then we open the file using the File function of the h5py module.

import h5py # First import the module.

FileID = h5py.File('hdf5_test.h5','r')#Now use the File function,

#note the module name in red.

.

.

.

We could import other modules simultaneously which have an unrelated function called File and there would be no conflict or error.

slide8

Python has Lists:

These are 1D arrays which are useful for solving many problems.

List index notation looks like 1D array notation in many languages.

>>> MyList = [1.0,70] # initialize variable to 2 element list

>>> MyList.append("Demo Data!") # Append an element, NOT like fortran

>>> MyList.append(MyList[0] + MyList[1]) # Append sum of first 2 elements

>>> print MyList

[1.0, 70, 'Demo Data!', 71.0]

Python has Dictionaries:

(Aka hash tables) These map symbols to objects in analogy to how a 1D list maps an integer index value to an object.

>>> MyDictionary = {} # Initialize empty Dictionary

>>> MyDictionary[0] = 1

>>> MyDictionary['Gadzooks'] = 70

>>> MyDictionary['Kind of Data'] = 'Demo Data!'

>>> MyDictionary['Result'] = MyDictionary[0] + MyDictionary['Gadzooks']

>>> print MyDictionary

{0: 1, 'Kind of Data': 'Demo Data!', 'Result': 71, 'Gadzooks': 70}

slide9

Python Lists vs Python Dictionaries vs Arrays in Fortran

Index to a List must be an integer, but a Dictionary can be indexed by an integer or string.

MyList[Integer] vs. MyDictionary[Integer or String]

The objects referenced by a List or Dictionary can be any Python object. This is in contrast to more traditional languages, eg. The elements Fortran array must be of a particular type such as real*4.

MyList = [34,”String Datum”]

or

MyDictionary={'FirstKey':34,'SecondKey':”String Datum”}

slide10

The List and Dictionary data structures make it possible to write flexible programs very quickly. However, they are not good for numerical computation. The variety of objects (number, string, etc) which they can reference makes computation slow.

For example, adding array elements: MyList[0] + MyList[1]

Python must check at run time if MyList[0] and MyList[1] are objects that can be added together, and not, say, a number and a string. In a loop this check is performed at every iteration!

What to do?

Enter Numpy

Numpy is a package for use with Python which provides multidimensional arrays of numeric (and other) types and extends Python syntax to provide a familiar interface to the arrays.

Numpy extends Python syntax to allow the expression of vector operations similar to those in IDL, Matlab, or Fortran (>=90).

slide11

Numpy Example: Create, Print, and add two 2D arrays

Build from Python lists of lists of elements (the module name is numpy)...

>>> a = numpy.array([[1,2,3],[4,5,6]])‏

>>> print a

[[1 2 3]

[4 5 6]]

Build from dimension sizes...

>>> b = numpy.ones([2,3])‏

>>> print b

[[ 1. 1. 1.]

[ 1. 1. 1.]]

Print a selected element...

>>> print a[1,2]

6

Example: Add two dimensional arrays.

>>> print a+b

[[ 2. 3. 4.]

[ 5. 6. 7.]]

>>>

slide12

The HDF5 connection: H5py

H5py is an Python-HDF5 interface is a Python module written by Andrew Collette. Its design allows the use of familiar Python structures for working with HDF5 files.

The interface maps Python syntax for familiar Python objects to similar objects in the HDF5 file.

slide13

Here is a familiar example HDF5 file from the HDFView distribution:

Here is how to read the 3D int array using h5py.

>>> import h5py

>>> fid = h5py.File('hdf5_test.h5','r')‏

>>> group = fid['arrays']

>>> The3DArray = group['3D int array'].value

>>> fid.close()‏

>>> The3DArray

array([[[ 174, 27, 0, ..., 102, 71, 100009],

[ 171, 27, 0, ..., 194, 79, 100109],

[ 172, 27, 0, ..., 102, 55, 100209],

...,

slide14

Equivalence of HDF5 Groups and Python Dictionaries

Print value of dictionary entry:

>>> MyDictionary = {'RandArray':numpy.random.random([2,2])}

>>> print MyDictionary['RandArray']

[[ 0.82066938 0.39219683]

[ 0.86546443 0.91276533]]

Print value of HDF5 file entry:

>>> fid = h5py.File('RandomArray.h5','r')‏

>>> print fid['RandArray'].value

[[ 0.1 3.14152908]

[ 2.71828008 0. ]]

slide15

Simple Real world example:

Goal: Retrieve HDF5 file from Configuration Management (CM) and insert CM metadata into HDF5 file.

We want to place the CM path and revision number inside the HDF5 file as an attribute.

slide16

Python script to retrieve file from CM and store Rev number as attribute.

#! /usr/bin/env python

import sys

import os

import h5py

Rev = sys.argv[1] # Specifiy CM path on command line

SVNFilepath = sys.argv[2] # Specify revision number on #comand line.

command = 'svn export -r ' + Rev + ' ' + SVNFilepath #Subversion

# Command

InStream = os.popen(command,'r')‏

ExportString = InStream.read()‏

ExportReturnCode = InStream.close()‏

Elements = SVNFilepath.split('/')‏

# HDF5 code

fid = h5py.File(Elements[-1]) # Elements[-1] is file name

fid.attrs['SVN Path and Revision'] = SVNFilepath + '@' + Rev

fid.close()‏

H5py code in red. Note the minimal effort coding HDF5 calls.

slide17

Another real world example (if NPOESS is your real world).

We have had several occasions to do data aggregation on HDF5 files for the OMPS Limb instrument.

Our retrieval code (Fortran) processes an orbit of data as 480 distinct pieces and places the results into 480 distinct HDF5 files. We wish to aggregate the files such that an N dimensional array in a unaggregated file becomes a N+1 dimensional array in the aggregated HDF5 file.

The Fortran code is (mostly) the product of the retrieval scientist, while the aggregation is a requirement of the production data processing system. It makes sense to aggregate as a post-processing step to the Fortran code so as to minimize changes to the original Fortran code.

slide18

Aggregation Algorithm

1) Input is list of HDF5 files

2) Analyze structure of one file to generate list of fully qualified dataset names,

dimensions, and type (In the code I use a Python Dictionary and not a list).

3) Assume all files have that structure. Read corresponding datasets (of dim N) from each file into aggregation variable (of dim N+1).

4) After corresponding datasets have been read from all files write aggregation variable to HDF5 file.

5) Repeat 3 until all datasets have been aggregated.

Array 4

Array 3

File 1 Array

File 2 Array

File 3 Array

File 4 Array

Array 2

Array 1

+

+

+

Schematic of Array Aggregation for One Dataset

slide19

# Data Aggregator. This script takes a set of HDF5 files as input.

# The files are expected to have identical struture. All the

# corresponding arrays in the input files are combined into an array

# which has dimensions N+1, where N is the number of dimensions in the

# original, constituent arrays.

import sys

import h5py

import numpy

Files = sys.argv[1:] # Get file names from command line

# First step is to select one file and create a map of the shape and

# data type of all datasets. This is naturally done via a recursive

# function, called VisitAllObjects

FirstFid = h5py.File(Files[0],'r') # Open first HDF5 file

FileInfo = {} # Initialize FileInfo to be a dictionary. We will use it to build a mapping from

# dataset name to a tuple containing shape and type of dataset.

# EVALUATE HDF5 HIERARCHY

# Evaluating a a hierarchy is naturally a recursive process so we define a function....

def VisitAllObjects(Group,Path):

for i in Group.items():

if isinstance(i[1],h5py.Group):

VisitAllObjects(i[1],Path + '/' + i[0])‏

else:

DatasetName = Path + '/' + i[0]

FileInfo[DatasetName] = (Group[DatasetName].shape, Group[DatasetName].dtype)‏

VisitAllObjects(FirstFid,'')‏

FirstFid.close()‏

slide20

# Print dataset paths and info to screen

for (k,v) in FileInfo.items():

print k,v

# AGGREGATE DATA

# Now that we understand the file structure we can perform the aggregation.

OutputFileID = h5py.File('AggregatedData.h5','w')‏

NumberOfFiles = len(Files)‏

# Here is the meat of the code. The outer loop is over datasets, the inner over all files.

for Dataset in FileInfo.keys():

AggregatedData = numpy.ndarray(FileInfo[Dataset][0]+(NumberOfFiles,),dtype = FileInfo[Dataset][1])‏

for FileNumber in range(NumberOfFiles):

# Open file, read data into aggregation array, and close

fid = h5py.File(Files[FileNumber],'r')‏

AggregatedData[...,FileNumber] = fid[Dataset].value

fid.close()‏

Path = Dataset.split('/')‏

map((lambda(x): OutputFileID.require_group(x)), Path[1:-1])‏

#OutputFileID[Dataset] = AggregatedData

OutputFileID.create_dataset(Dataset,data=AggregatedData,compression=5,chunks=FileInfo[Dataset][0]+(1,))‏

OutputFileID.close()‏

slide21

Original, Unaggregated Data Field

$ h5dump -H -d /ANCILLARY_DATA/GeopotentialHeight_NCEP \

Data/OMPS_LP_SDR_20041121_55_146_-69_-119.h5

HDF5 "Data/OMPS_LP_SDR_20041121_55_146_-69_-119.h5" {

DATASET "/ANCILLARY_DATA/GeopotentialHeight_NCEP" {

DATATYPE H5T_IEEE_F32LE

DATASPACE SIMPLE { ( 5, 3, 21 ) / ( 5, 3, 21 ) }

ATTRIBUTE "Title" {

DATATYPE H5T_STRING {

STRSIZE 25;

}

DATASPACE SCALAR

}

ATTRIBUTE "Units" {

DATATYPE H5T_STRING {

STRSIZE 2;

}

DATASPACE SCALAR

}

ATTRIBUTE "_FillValue" {

DATATYPE H5T_IEEE_F32LE

DATASPACE SIMPLE { ( 1 ) / ( 1 ) }

}

}

Note 3 dimensions

slide22

Aggregated Data Field

$ h5dump -H -d /ANCILLARY_DATA/GeopotentialHeight_NCEP \

AggregatedData.h5

HDF5 "AggregatedData.h5" {

DATASET "/ANCILLARY_DATA/GeopotentialHeight_NCEP" {

DATATYPE H5T_IEEE_F32LE

DATASPACE SIMPLE { ( 5, 3, 21, 4 ) / ( 5, 3, 21, 4 ) }

}

}

Now we have 4 dimensions. The new dimension has extent 4, corresponding to the number of input files.

Note that none of the attributes were copied. This is a bug, but easily fixed.

slide23

To fix the bug, I assume attributes (like “Units”) are not aggregated, and I take attributes and values from first file.

New code consists of only 2+ additional lines, shown below in green.

def VisitAllObjects(Group,Path):

for i in Group.items():

if isinstance(i[1],h5py.Group):

VisitAllObjects(i[1],Path + '/' + i[0])‏

else:

DatasetName = Path + '/' + i[0]

FileInfo[DatasetName] = (Group[DatasetName].shape,

Group[DatasetName].dtype,

Group[DatasetName].attrs.listitems())‏

And also....

DS=OutputFileID.create_dataset(Dataset,data=AggregatedData,compression=5,chunks=FileInfo[Dataset][0]+(1,))‏

[DS.attrs.__setitem__(Attribute[0],Attribute[1]) for Attribute in

FileInfo[Dataset][2]]

slide24

Fixed output:

$ h5dump -H -d /ANCILLARY_DATA/GeopotentialHeight_NCEP \

AggregatedDataWithAttributes.h5

HDF5 "AggregatedDataWithAttributes.h5" {

DATASET "/ANCILLARY_DATA/GeopotentialHeight_NCEP" {

DATATYPE H5T_IEEE_F32LE

DATASPACE SIMPLE { ( 5, 3, 21, 4 ) / ( 5, 3, 21, 4 ) }

ATTRIBUTE "Title" {

DATATYPE H5T_STRING {

STRSIZE 25;

CTYPE H5T_C_S1;

}

DATASPACE SCALAR

}

ATTRIBUTE "Units" {

DATATYPE H5T_STRING {

STRSIZE 2;

CTYPE H5T_C_S1;

}

DATASPACE SCALAR

}

ATTRIBUTE "_FillValue" {

DATATYPE H5T_IEEE_F32LE

DATASPACE SIMPLE { ( 1 ) / ( 1 ) }

}

}

}

Now we have attributes.

slide25

Summary:

Python offers a high degree of flexibility for code development, combined with the ability to do easy text, numerical array and HDF5 coding make it a good candidate for solving problems in satellite remote sensing data processing.

Few, if any, other computer languages offer this combination of benefits making it uniquely suited for this task.

Future Work:

Tracking module version provenance is likely one of the outstanding questions for Python use in production.

Acknowledgment:

Curt Tilmes at NASA Goddard funded this work via contract NNG06HX18C task 614.5-01-07

ad