General Remarks

HDF5 is a library for storing large numerical data sets. HDF stands for Hierarchical Data Format. It was designed for saving and retrieving data to/from large, structured files, and it supports parallel access to HDF5 files, in particular within an MPI environment. The HDF5 library can be used from C, C++ (with some limitations) and Fortran. Data can be stored in text or binary format, optionally compressed (if zlib is available).

HDF5 is recommended by the developers of FFTW as a means to output data from different MPI processes.

Using HDF5 (and HDF) in Matlab and Mathematica

Some popular numerical packages, like MATLAB, Mathematica, Octave and ROOT, already have native support for the HDF5 format. If you want to quickly experiment with HDF5 files, you can use those programs to see how it works.

For example, in Mathematica:

 m = RandomInteger[255, {5, 5}]
 Export[ "matrix.h5", m]
 mLoad = Import["matrix.h5", {"Datasets", "/Dataset1"}]
 m==mLoad

will create a binary file called "matrix.h5" with the matrix data.

Matlab can read and write HDF5 files with the hdf5read and hdf5write commands (see the note about Matlab 7.0 below). Octave can do the same with its 'load' and 'save' commands.

Warning: There is a consistent 'bug' in both Matlab and Mathematica: when you export complex data (with hdf5write or Export), only the real part of the array entries is saved to the file, and this happens without any warning.

Partial workaround: Mathematica does not seem to support exporting HDF5 with complex (i.e. non-real) data. This seems to be a limitation of Mathematica (blame Wolfram) rather than of HDF5. You can still export complex data by saving the real and imaginary parts separately:

 Export[ "cmatrix.h5", {Re[cm], Im[cm]}];
 cmLoad = Import["cmatrix.h5", {"Datasets", "/Dataset1"}][[1]] + I*Import["cmatrix.h5", {"Datasets", "/Dataset1"}][[2]]
 cm==cmLoad

Something similar can be done in Matlab, e.g. by writing the real and imaginary parts as two separate datasets with hdf5write and recombining them after reading.

Matlab 7.0 vs. HDF5 1.8

There are two major versions of the library, namely HDF (also known as HDF4) and HDF5, which are totally incompatible, and minor versions of each which are, in theory, compatible among themselves.

In practice, Matlab 7's HDF5 interface seems to be unable to read HDF5 files created with (at least) version 1.8 of the HDF5 library. This problem does not exist with HDF5 1.6 (also marked as stable). In general, Matlab 7 will produce the following message when trying to read a file written by the newer versions of the HDF5 library:

 >> a=hdf5read('myfile.hdf5','/array');
 ??? /array is not an attribute or a dataset
 
 Error in ==> hdf5read at 85
 [data, attributes] = hdf5readc(filename, datasetName, readAttributes);

A possible workaround is to convert the HDF5 file to an HDF4 file and then read it with Matlab 7 through the HDF interface. The command-line conversion utility can be downloaded (binary or source) from the HDF Group website. Once installed (or just copied to the PATH), it can be used to convert the file:

 h5toh4 myfile.hdf5 mynewfile.hdf

Later, from Matlab 7, the file can be read without problems:

 >> a=hdfread('mynewfile.hdf','array');

Note the difference in syntax ('hdf5read' vs 'hdfread', and '/array' vs 'array' for the name of the dataset). 'h5toh4' can convert most files with simple structure.

Matlab 7.6 and Mathematica 7 (at least) do not seem to have these issues, so an alternative is to upgrade from Matlab 7 or to stick with the more stable HDF5 1.6.

Install

The following are instructions to install HDF5 on different systems. Note that, as of HDF5 version 1.9, there is no way to use the MPI version and the C++ interface together. The C interface can still be used from C++.

Ubuntu

HDF5 1.6.6 can be installed directly in Ubuntu 8.10 by doing:

 sudo apt-get install libhdf5-serial-dev

This will install the C, C++ and Fortran versions of the library and development (header) files, but it will not include the MPI version.

The MPI version (which will remove the previously installed serial version) can be installed with:

 sudo apt-get install libhdf5-mpich-dev

where 'mpich' can be replaced by 'openmpi' or 'lam'. Unfortunately, the C++ interface is not provided for this MPI version, since the two are incompatible.

Build and Installation from Sources

We will install the parallel version of HDF5 1.9 in our user space. (HDF5 1.8, the official release, does not play well when compiled with gcc4.)

 mkdir $HOME/usr

Then, from a download location:

 mkdir $HOME/soft
 cd $HOME/soft
 export LATEST_V=1.9.43
 wget ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/hdf5/snapshots/v19/hdf5-$LATEST_V.tar.gz
 tar -zxvf hdf5-$LATEST_V.tar.gz
 cd hdf5-$LATEST_V

(see the snapshots directory at ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/hdf5/snapshots/v19/ for the latest version, or use http://www.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.8.3.tar.gz for the latest stable version).

Then we can configure:

 CC=mpicc.mpich ./configure --prefix=$HOME/usr --enable-parallel --enable-shared

Other options are described in ./configure --help. The option --enable-cxx can be specified, but not together with --enable-parallel. For the non-parallel version, setting "CC=mpicc" (or equivalent) is not necessary. Check that your MPI compiler is present by doing:

 which mpicc

Then we can build and install:

 make
 make install

The compilation takes ~5 minutes and several warning messages will appear. A number of header files and libraries will be installed in ~/usr/include and ~/usr/lib:

 ~/usr/include/H5*.h (around 40 files)
 ~/usr/include/hdf5.h and ~/usr/include/hdf5_hl.h
 ~/usr/lib/libhdf5.{a,la} and ~/usr/lib/libhdf5_hl.{a,la}

The most important for us are hdf5.h and libhdf5.a. Some command-line utilities for managing HDF5 files are also installed:

 ~/usr/bin/h5*

Among them, there is 'h5dump', which will be used in the next section.

Test Example

The source distribution contains examples of HDF5 usage, including C++ examples (see the directories ./examples, ./hl/examples, ./c++/examples, and ./hl/c++/examples).

Simple introductory examples are also provided online, but they are outdated and incompatible with this version of HDF5, which can be very confusing. It is better to use the examples contained in the distribution and to consult the online documentation for the details. In any case, here I provide the sources and makefile I used for testing the installation, h5_test.tar.gz. Use the example as follows:

 wget http://micro.stanford.edu/mediawiki-1.11.0/images/H5_test.tar.gz -O h5_test.tar.gz
 tar -zxvf h5_test.tar.gz
 cd h5_test
 make test

Internally, the Makefile performs the compilation with the command line:

 mpicc -I${HOME}/usr/include h5_write.c -L${HOME}/usr/lib -lhdf5 -lm -lpng -o h5_write

Depending on the code, '-lz -lrt' may also have to be added.

A write program and a read program will be compiled. The write program creates a binary file named SDS.h5 containing the data of a small integer array; the read program then loads this array from the file and prints it.
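
For orientation, here is a minimal sketch of what such a write program looks like. It follows the standard h5_write.c example shipped with the library (the dataset name, type and dimensions are chosen to match the h5dump output below, and the seven-argument H5Dcreate is the 1.8-style API used by this snapshot); treat it as a sketch of the C API rather than the exact contents of the tarball:

 #include "hdf5.h"
 
 #define NX 5   /* rows    */
 #define NY 6   /* columns */
 
 int main(void)
 {
     int     data[NX][NY];
     hsize_t dims[2] = {NX, NY};
     hid_t   file, dataspace, dataset;
     int     i, j;
 
     /* fill the array so that data[i][j] = i + j */
     for (i = 0; i < NX; i++)
         for (j = 0; j < NY; j++)
             data[i][j] = i + j;
 
     /* create the file, a 5x6 dataspace, and an integer dataset "IntArray" */
     file      = H5Fcreate("SDS.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
     dataspace = H5Screate_simple(2, dims, NULL);
     dataset   = H5Dcreate(file, "IntArray", H5T_STD_I32LE, dataspace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
 
     /* write the array and release all handles */
     H5Dwrite(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
     H5Dclose(dataset);
     H5Sclose(dataspace);
     H5Fclose(file);
     return 0;
 }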

SDS.h5 is in binary format, which means that it cannot be read directly. However, there is a set of HDF5 utilities (external programs) that allow humans to see what is contained in such files:

 $ ~/usr/bin/h5dump SDS.h5
 HDF5 "SDS.h5" {
 GROUP "/" {
    DATASET "IntArray" {
       DATATYPE  H5T_STD_I32LE
       DATASPACE  SIMPLE { ( 5, 6 ) / ( 5, 6 ) }
       DATA {
       (0,0): 0, 1, 2, 3, 4, 5,
       (1,0): 1, 2, 3, 4, 5, 6,
       (2,0): 2, 3, 4, 5, 6, 7,
       (3,0): 3, 4, 5, 6, 7, 8,
       (4,0): 4, 5, 6, 7, 8, 9
       }
    }
 }
 }

For the moment this document is not a tutorial on HDF5 itself; it only documents the installation. However, we can already mention something about the structure of the file: in the previous h5dump output, 'GROUP "/"' indicates that the dataset sits at the root level ('/') of the file. An HDF5 file can look much like a filesystem, with directories (groups), subdirectories and files (datasets).
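
To illustrate the filesystem analogy, a dataset can be created inside a group rather than at the root level. Here is a small sketch using the C API (the file, group and dataset names are made up for illustration):

 #include "hdf5.h"
 
 int main(void)
 {
     double  t = 300.0;
     hsize_t dims[1] = {1};
     hid_t   file, group, space, dset;
 
     /* create a file and a group "/results", analogous to a directory */
     file  = H5Fcreate("tree.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
     group = H5Gcreate(file, "/results", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
 
     /* a dataset inside the group; its full path is "/results/temperature" */
     space = H5Screate_simple(1, dims, NULL);
     dset  = H5Dcreate(group, "temperature", H5T_NATIVE_DOUBLE, space,
                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
     H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, &t);
 
     H5Dclose(dset); H5Sclose(space); H5Gclose(group); H5Fclose(file);
     return 0;
 }

Running h5dump on the resulting file shows the dataset nested under GROUP "/results", just as a file sits inside a directory.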

MPI tests

I collected these three examples to test the parallel capabilities of the library. They are based on the parallel HDF5 tutorial and the official example files, but since those are outdated and do not compile out of the box, I corrected and modified them and posted them here (a sketch of the basic parallel write pattern is shown after the commands below).

 wget http://micro.stanford.edu/mediawiki-1.11.0/images/H5mpi_test.tar
 tar -xvf H5mpi_test.tar
 cd h5mpi_test
 make all
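
The common pattern in these examples is: open the file through the MPI-IO driver, let each rank select its own hyperslab of a shared dataset, and write collectively. The following is only a sketch of that pattern (assuming the 1.8-style API; the file and dataset names are made up), not the literal content of the tarball:

 #include "hdf5.h"
 #include <mpi.h>
 
 int main(int argc, char *argv[])
 {
     int     rank, nprocs, value;
     hsize_t dims[1], count[1], offset[1];
     hid_t   fapl, file, filespace, memspace, dset, dxpl;
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
 
     /* open the file collectively through the MPI-IO driver */
     fapl = H5Pcreate(H5P_FILE_ACCESS);
     H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
     file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
 
     /* one shared dataset of nprocs integers; rank i owns element i */
     dims[0] = nprocs;
     filespace = H5Screate_simple(1, dims, NULL);
     dset = H5Dcreate(file, "ranks", H5T_NATIVE_INT, filespace,
                      H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
 
     offset[0] = rank; count[0] = 1;
     H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
     memspace = H5Screate_simple(1, count, NULL);
 
     /* collective write: all processes participate in the same call */
     dxpl = H5Pcreate(H5P_DATASET_XFER);
     H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
     value = rank;
     H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, &value);
 
     H5Pclose(dxpl); H5Dclose(dset); H5Sclose(memspace); H5Sclose(filespace);
     H5Pclose(fapl); H5Fclose(file);
     MPI_Finalize();
     return 0;
 }

Compiled with the same mpicc line as above and run with, e.g., mpirun -np 4, every rank writes its own entry of the dataset to parallel.h5.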

Visualization Tools

Note that the h5dump listing quoted above is not the literal content of the file but a human-readable rendering, obtained with one of the tools installed in ~/usr/bin. There are also graphical programs to visualize the contents of an HDF5 file. These programs have the advantage that they do not load the whole data set into memory: you can browse the contents of a 40GB file without hanging the system.

The official one is the Java program HDFView (not tested here because of problems with my Java environment). An alternative that I tested is ViTables.

 sudo apt-get install python-tables python-qt4
 cd ~/soft
 wget http://download.berlios.de/vitables/ViTables-2.0.tar.gz
 tar -zxvf ViTables-2.0.tar.gz
 cd ViTables-2.0
 sudo python setup.py install
 vitables

Note: the installation doesn't work on Ubuntu 9.04 (which ships with Python 2.6) at the moment [1]; as a temporary workaround, install python2.5 and use the following lines instead:

 sudo apt-get install python2.5
 sudo python2.5 setup.py install

After the installation the program runs normally (the command is vitables).

(Screenshot: Hdf5 vitables.png, ViTables browsing an HDF5 file.)