Micro and Nano Mechanics Group
Revision as of 16:37, 4 July 2010 by Correaa

(written by Alfredo Correa)

General Remarks

HDF5 is a library for storing large numerical data sets. HDF stands for Hierarchical Data Format. It was designed for saving data to, and retrieving data from, large structured files. It also supports parallel access to files in HDF5 format, in particular within an MPI environment. The HDF5 library can be used from C, C++ (with some limitations) and Fortran. Files are stored in a binary format and can optionally be compressed (if zlib is available).

HDF5 is recommended by the developers of FFTW as a means to output data from different MPI processes. This format can be read by Mathematica and Matlab.

Using HDF5 (and HDF) in Matlab and Mathematica

Some popular numerical packages, like MATLAB, Mathematica, Octave and ROOT, already have native support for the HDF5 format. If you want to quickly experiment with HDF5 files, you can use those programs to see how it works.

For example, in Mathematica:

 m = RandomInteger[255, {5, 5}]
 Export[ "matrix.h5", m]
 mLoad = Import["matrix.h5", {"Datasets", "/Dataset1"}]
 m==mLoad

will create a binary file called "matrix.h5" with the matrix data, read it back into mLoad, and check that both matrices are equal.

Matlab can read and write HDF5 files with the hdf5read and hdf5write commands (see the note about Matlab 7.0 below). Octave can do the same with the 'load' and 'save' commands.

Warning: There is a consistent 'bug' in Matlab and Mathematica. When you export an array with complex entries (with hdf5write or Export), only the real part of each entry is saved to the file, and this happens without any warning.

Partial workaround: Mathematica does not seem to support exporting complex (i.e. not real) data to HDF5. This seems to be a limitation of the data types supported by HDF5 rather than a bug as such, combined with the fact that Matlab and Mathematica silently ignore the limitation. You can still export complex data by storing the real and imaginary parts separately:

 cm = RandomComplex[{-1 - I, 1 + I}, {5, 5}];
 Export["cmatrix.h5", {Re[cm], Im[cm]}];
 parts = Import["cmatrix.h5", {"Datasets", "/Dataset1"}];
 cmLoad = parts[[1]] + I*parts[[2]]
 cm == cmLoad

Since the data contained in the result is numerical and homogeneous, you can take advantage of Mathematica optimizations such as converting the result with ToPackedArray.

Something similar can be done in Matlab.

Matlab 7.0 vs. HDF5 1.8

There are two major versions of the library, namely HDF (also known as HDF4) and HDF5, which are totally incompatible with each other. The minor versions within each are, in theory, compatible with one another.

In practice, Matlab 7's HDF5 interface seems to be unable to read HDF5 files created with (at least) version 1.8 of the HDF5 library. This problem does not exist with HDF5 1.6 (also marked as stable). In general, Matlab 7 will produce the following message when trying to read a file written by the newer versions of the HDF5 library:

 >> a=hdf5read('myfile.hdf5','/array');
 ??? /array is not an attribute or a dataset
 
 Error in ==> hdf5read at 85
 [data, attributes] = hdf5readc(filename, datasetName, readAttributes);

A possible workaround is to convert the HDF5 file to an HDF4 file and then read it from Matlab 7 through the HDF interface. The command line conversion utility can be downloaded (binary or source) from this link. Once installed (or just copied somewhere in the PATH), it can be used to convert the file:

 h5toh4 myfile.hdf5 mynewfile.hdf

Later, from Matlab 7, the file can be read without problems:

 >> a=hdfread('mynewfile.hdf','array');

Note the difference in syntax ('hdf5read' vs 'hdfread', and '/array' vs 'array' for the name of the dataset). 'h5toh4' can convert most files with simple structure.
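
When several files need converting, the same command can be wrapped in a loop. The following is only a dry-run sketch under assumptions: the helper names h4_name and convert_all are hypothetical, the input files are assumed to end in ".hdf5", and the commands are printed rather than executed.

```shell
# Hypothetical helper: derive the HDF4 output name from an HDF5 file name
# by replacing a trailing ".hdf5" with ".hdf".
h4_name() {
  printf '%s\n' "${1%.hdf5}.hdf"
}

# Dry run: print one conversion command per input file instead of running it.
# Remove the leading "echo" to actually call h5toh4.
convert_all() {
  for f in "$@"; do
    echo h5toh4 "$f" "$(h4_name "$f")"
  done
}

convert_all myfile.hdf5 run2.hdf5
```

Printing the commands first makes it easy to check the output names before any file is written.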

Matlab 7.6 and Mathematica 7 (at least) do not seem to have these issues. So, an alternative is to upgrade Matlab 7 or use the more stable HDF5 1.6.

Install

The following are instructions to install HDF5 on different systems. Note that, as of HDF5 version 1.9, there is no way to use the MPI version and the C++ interface together. The C interface can be used from C++ anyway.

Ubuntu

HDF5 1.6.6 can be installed directly in Ubuntu 8.10 by doing:

 sudo apt-get install libhdf5-serial-dev

This will install the C, C++ and Fortran versions of the library and development (header) files, but it will not include the MPI version.

The MPI version with C and Fortran interfaces (which will remove the previously installed serial version) can be installed by:

 sudo apt-get install libhdf5-mpich-dev

where 'mpich' can be replaced by 'openmpi' or 'lam'. Unfortunately the C++ interface is not provided for this MPI version. The two features cannot coexist: either you have the C++ interface or you have MPI support, but not both.
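
The package naming described above can be summarized in a small lookup. This is a sketch only: the helper name hdf5_dev_package is hypothetical, it covers just the four flavors mentioned in this section, and it prints the install command instead of running it.

```shell
# Hypothetical helper: map a flavor name to the Ubuntu development package.
# Only the flavors named in this section are accepted.
hdf5_dev_package() {
  case "$1" in
    serial|mpich|openmpi|lam) echo "libhdf5-$1-dev" ;;
    *) echo "unknown flavor: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the install command for the OpenMPI flavor.
echo "sudo apt-get install $(hdf5_dev_package openmpi)"
```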

Build and Installation from Sources

We will try to install the parallel version of HDF5 1.9 in our user space. (HDF5 1.8 --official release-- does not play well when compiling with gcc4.)

 mkdir $HOME/usr

Then, from a download location, fetch and unpack the sources:

mkdir $HOME/soft
cd $HOME/soft
export HDF5_VER_MAJOR=19
export HDF5_VER_MINOR=1.9.72
wget ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/hdf5/snapshots/v$HDF5_VER_MAJOR/hdf5-$HDF5_VER_MINOR.tar.gz
tar -zxvf hdf5-$HDF5_VER_MINOR.tar.gz
cd hdf5-$HDF5_VER_MINOR

Other versions (useful for compatibility):

# HDF5 1.8 snapshot
export HDF5_VER_MAJOR=18
export HDF5_VER_MINOR=1.8.4-snap9
# or, alternatively, HDF5 1.6 snapshot
export HDF5_VER_MAJOR=16
export HDF5_VER_MINOR=1.6.10-snap10
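
The two version variables follow a pattern, so the download URL can be derived from a single version string. The sketch below assumes that pattern holds for all the snapshots listed here (the directory is "v" followed by the first two version fields); the helper name hdf5_snapshot_url is hypothetical, and nothing is downloaded.

```shell
# Hypothetical helper: build the snapshot URL from one version string.
# The directory component is "v" plus the first two version fields
# without the dot (1.9.72 -> v19, 1.8.4-snap9 -> v18).
hdf5_snapshot_url() {
  ver=$1
  major=$(printf '%s\n' "$ver" | cut -d. -f1,2 | tr -d .)
  printf 'ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/hdf5/snapshots/v%s/hdf5-%s.tar.gz\n' "$major" "$ver"
}

# Print the URLs only; pass one of them to wget to download.
hdf5_snapshot_url 1.9.72
hdf5_snapshot_url 1.8.4-snap9
```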

then we can configure:

CC=mpicc ./configure --prefix=$HOME/usr --enable-parallel --enable-shared

On su-ahpcrc we use the following configure command:

CC=icc ./configure --prefix=$HOME/usr --enable-shared


Other options are described in ./configure --help. The option --enable-cxx can be specified, but not together with --enable-parallel. For the non-parallel version, setting "CC=mpicc" (or equivalent) is not necessary. Check that your MPI compiler is present by doing:

 which mpicc
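
The choice between the configure variants above can be written down as a small selector. This is a sketch under assumptions: the helper name configure_cmd and the mode names are hypothetical, the prefix is the $HOME/usr used in this document, and the command is printed rather than executed.

```shell
# Hypothetical helper: print the configure command for a given build mode.
# "parallel" and "cxx" are separate modes because --enable-parallel and
# --enable-cxx cannot be combined.
configure_cmd() {
  prefix="$HOME/usr"
  case "$1" in
    parallel) echo "CC=mpicc ./configure --prefix=$prefix --enable-parallel --enable-shared" ;;
    cxx)      echo "./configure --prefix=$prefix --enable-cxx --enable-shared" ;;
    serial)   echo "./configure --prefix=$prefix --enable-shared" ;;
    *)        echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Warn early if the MPI compiler wrapper is missing from the PATH.
command -v mpicc >/dev/null 2>&1 || echo "mpicc not found in PATH" >&2

configure_cmd parallel
```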

Then we can make and install:

NUM_CORES=$(grep -c '^processor' /proc/cpuinfo)
make --jobs=$NUM_CORES
make install
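
Reading /proc/cpuinfo only works on Linux. A somewhat more portable way to pick the number of parallel jobs might look like the sketch below; the helper name count_cores is hypothetical, and the make command is only printed.

```shell
# Count logical CPUs: prefer nproc, then getconf, then /proc/cpuinfo,
# and fall back to a single job if none of them is available.
count_cores() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
    getconf _NPROCESSORS_ONLN
  elif [ -r /proc/cpuinfo ]; then
    grep -c '^processor' /proc/cpuinfo
  else
    echo 1
  fi
}

NUM_CORES=$(count_cores)
# Print the build command that would be used.
echo "make --jobs=$NUM_CORES"
```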

The compilation takes ~5 minutes and several warning messages will appear. The header files and libraries will be installed in ~/usr/include and ~/usr/lib:

 ~/usr/include/H5*.h (around 40 files)
 ~/usr/include/hdf5[|_hl].h
 ~/usr/lib/libhdf5[|_hl].[a|la]

The most important ones for us are hdf5.h and libhdf5.a. Some command line utilities for managing HDF5 files are installed as well:

 ~/usr/bin/h5*

Among them, there is 'h5dump', which will be used in the next section.
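
Since everything was installed under a private prefix, the shell does not find the tools or the shared libraries by default. A possible setup, for instance in ~/.bashrc, is sketched below; the variable name PREFIX is an assumption of this sketch, and LD_LIBRARY_PATH is only relevant because the library was configured with --enable-shared.

```shell
# Install prefix used throughout this document.
PREFIX="$HOME/usr"

# Put the h5* utilities first in the search path.
export PATH="$PREFIX/bin:$PATH"

# Let the dynamic loader find libhdf5; append any previous value only if set.
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Report whether the tools are now reachable.
command -v h5dump >/dev/null 2>&1 || echo "h5dump not found; check the install prefix" >&2
```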

Test Example

The source distribution contains examples of the usage of HDF5, including C++ examples. (See the directories ./examples, ./hl/examples, ./c++/examples and ./hl/c++/examples.)

Simple introductory examples are also provided online, but they are outdated and incompatible with HDF5 1.8 (this issue can be very confusing). It is better to use the examples contained in the distribution file and to consult the online documentation for the details of each example. In any case, here I provide the sources and makefile I used for testing the installation (h5_test.tar.gz). Use the example as follows:

 wget http://micro.stanford.edu/mediawiki/images/b/bc/H5_test.tar.gz -O h5_test.tar.gz
 tar -zxvf h5_test.tar.gz
 cd h5_test
 make test

Internally, in the Makefile, the compilation is performed by the command line:

 mpicc -I${HOME}/usr/include h5_write.c -L${HOME}/usr/lib -lhdf5 -lm -lpng -o h5_write

Depending on the code, in some cases we have to add '-lz -lrt'.
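
To make the optional libraries explicit, the compile line can be assembled by a small function. This is a sketch, not the contents of the actual Makefile: the helper name h5_compile_cmd is hypothetical, the -lpng from the line above is left out (pass it as an extra argument if your code needs it), and the command is printed rather than executed.

```shell
# Hypothetical helper: print the compile line for one C source file.
# Extra arguments (for example -lz -lrt) are appended to the library list.
h5_compile_cmd() {
  src=$1
  shift
  echo "mpicc -I${HOME}/usr/include $src -L${HOME}/usr/lib -lhdf5 -lm${*:+ $*} -o ${src%.c}"
}

h5_compile_cmd h5_write.c
h5_compile_cmd h5_write.c -lz -lrt
```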

A write and read program will be compiled. The write program will create a binary file named SDS.h5 with the data of a certain array, then this array will be loaded from the file by the read program and printed.

The file SDS.h5 is in a compressed binary format, which means that it cannot be read directly. However, there are a number of HDF5 utilities (external programs) that allow humans to see what is contained in the files:

 $ ~/usr/bin/h5dump SDS.h5
 HDF5 "SDS.h5" {
 GROUP "/" {
    DATASET "IntArray" {
       DATATYPE  H5T_STD_I32LE
       DATASPACE  SIMPLE { ( 5, 6 ) / ( 5, 6 ) }
       DATA {
       (0,0): 0, 1, 2, 3, 4, 5,
       (1,0): 1, 2, 3, 4, 5, 6,
       (2,0): 2, 3, 4, 5, 6, 7,
       (3,0): 3, 4, 5, 6, 7, 8,
       (4,0): 4, 5, 6, 7, 8, 9
       }
    }
 }
 }

For the moment this document is not a tutorial on HDF5 itself; it only documents its installation. However, we can already say something about the structure of the file. For example, in the previous h5dump output you can read 'GROUP "/"', which indicates that the dataset is at the root level ('/') of the file. An HDF5 file looks pretty much like a filesystem, with directories, subdirectories and files/datasets.
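
For a quick overview of what a file contains, the h5dump text can be reduced to one line per dataset. The sketch below is illustrative only: the helper name list_datasets is hypothetical, it assumes dataset names without spaces, and it is fed a shortened copy of the dump shown above instead of calling h5dump.

```shell
# Hypothetical helper: read h5dump output on stdin and print one line per
# dataset, "name dims", taken from the DATASET and DATASPACE lines.
list_datasets() {
  awk '
    /DATASET "/ { name = $2; gsub(/"/, "", name) }
    /DATASPACE/ { match($0, /\([^)]*\)/)
                  print name, substr($0, RSTART, RLENGTH) }
  '
}

# Shortened copy of the dump above, used as sample input.
list_datasets <<'EOF'
HDF5 "SDS.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 5, 6 ) / ( 5, 6 ) }
   }
}
}
EOF
```

With the library installed, the same function could be fed from the real tool, for example with h5dump -H SDS.h5 | list_datasets, where -H asks h5dump for the header information only.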

MPI tests

I collected these three examples, which can test the parallel capabilities of the library. They are based on the parallel HDF5 tutorial and the official example file, but since those are outdated and do not compile out of the box, I corrected them and posted them here.

 wget http://micro.stanford.edu/mediawiki-1.11.0/images/H5mpi_test.tar
 tar -xvf H5mpi_test.tar
 cd h5mpi_test
 make all

Inspection Tools

Note that the h5dump output quoted above is not the real content of the file but just a human readable translation, obtained by means of one of the tools installed in ~/usr/bin. There are even graphical programs to visualize the contents of an HDF5 file. These programs have the advantage that they do not load all the data into memory: you can inspect the contents of a 40GB file without hanging the system.

The official viewer is HDFView. Besides showing the data as a table, it has a limited capability to plot (parts of) the data. It is installed as follows:

cd ~/soft
wget http://www.hdfgroup.org/ftp/HDF5/hdf-java/hdfview/hdfview_install_linux32.bin
sudo sh ./hdfview_install_linux32.bin
/usr/local/hdfview/bin/hdfview.sh

[Image: Hdf5 hdview.png, screenshot of HDFView]

An alternative program is ViTables.

sudo apt-get install python-tables python-qt4
cd ~/soft
wget http://download.berlios.de/vitables/ViTables-2.0.tar.gz
tar -zxvf ViTables-2.0.tar.gz
cd ViTables-2.0
sudo python setup.py install
vitables

Note: the installation doesn't work in Ubuntu 9.04 (shipped with python 2.6) at the moment [1]. As a temporary workaround, install python2.5 and use the following lines instead:

sudo apt-get install python2.5
sudo python2.5 setup.py install

After the installation the program runs normally (the command is vitables). The program can display multidimensional arrays (more than 2 dimensions) by opening sub-windows.

[Image: Hdf5 vitables.png, screenshot of ViTables]