Micro and Nano Mechanics Group

(original version by Alfredo Correa)

FFTW3 is a library designed to compute discrete Fourier transforms. As described in the the official FFTW site, there are various versions available, with different features and different levels of maturity. In this tutorial I deal with the installation of version 3, including the experimental MPI version. However the installation instructions seems to be valid (although not tested) also for the more popular version 2.


General Remarks

As usual we would like to install the libraries in the user space, so we will create a couple of directories for that purpose:

mkdir $HOME/usr
mkdir $HOME/soft

To install FFTW3, download the package from the FFTW3 download page and decompress it:

cd ~/soft
wget http://www.fftw.org/fftw-3.3alpha1.tar.gz
tar -zxvf fftw-3.3alpha1.tar.gz
cd fftw-3.3alpha1

Ubuntu only: If you want to install FFTW3 (serial version) in your local Ubuntu you can skip this installation section altogether and just run:

 sudo apt-get install libfftw3-dev libfftw3-doc

However the MPI version (e.g. for testing) will not be available. If you want to have the MPI version follow the instructions in the other sections.

Build and Install

The configure/make/install procedure works well for installation in wcr.stanford.edu. We have the option of building and using static or shared libraries. If you are going to use shared libraries read last section.

Serial version only

Then configure, make and install:

 ./configure --prefix=$HOME/usr --enable-shared=yes
 make --jobs=8
 make install

(1 minute). The following files will be installed in:


The typical compilation options will be

 export LD_RUN_PATH=$HOME/usr/lib  #do this once *before* compiling
 cc -I$HOME/usr/include program.c -L$HOME/usr/lib -lfftw3 -o program

The official tutorial on the usage of FFTW3 (which is different from FFTW 2) is located here.

MPI version

To install the experimental MPI version of FFTW3, make sure you downloaded fftw-3.3alpha1 (and not fftw-3.2 for example). Also make sure that there is an MPI compiler available:

$which mpicc

If it is not available, you can choose one with the command 'mpi-selector-menu' in wcr. I tested this with the 'openmpi_gcc-1.2.2' compiler. If that is not possible then set the variable MPICC, for example: export MPICC=$HOME/usr/bin/mpicc.

Do the same procedure of downloading the file and decompressing it, but add the --enable-mpi flag:

./configure --prefix=$HOME/usr --enable-mpi --enable-shared=yes
make install

now the library will be installed in your home directory, besides the files mentioned above, you will find also:


The typical command line for compilation will be

export LD_RUN_PATH=$HOME/usr/lib #do this *before* compiling
mpicc -I$HOME/usr mpi_program.c -L$HOME/usr/lib -lfftw3_mpi -lfftw3 -o mpi_program

Make sure to link *first* to fftw3_mpi and *later* to fftw3. For some MPI platforms (notably openmpi-gcc) setting LD_RUN_PATH does not do the job of storing the library path inside the executables (see note in previous section), it may be necessary to use the following command:

mpicc -Wl,-rpath=$HOME/usr/lib -I$HOME/usr mpi_program.c -L$HOME/usr/lib -lfftw3_mpi -lfftw3 -o mpi_program

In any case we should always check that the executable is properly linked by doing

ldd ./mpi_program

and checking that all shared libraries are "found".

The official tutorial for the MPI version of FFTW3 can be found here.

Compilation with Shared Libraries

In the command line compilation examples above I set the variables LD_RUN_PATH. Using LD_RUN_PATH saves us from having to set path variables before *running* the program, such as LD_LIBRARY_PATH (which is a bad practice). When LD_RUN_PATH is set before compilation, the created executable will store the search path of the shared library internally (but will not enforce it). I learned this trick from http://gcc.gnu.org/faq.html#rpath and it works well with gcc at least. Setting this variable before compilation can be annoying, but is better than having to set variables *each time we use* the executable. This seems to be the only good option left when using libraries installed in the home directory (does anybody know a better alternative?).

Update on LD_RUN_PATH: It seems that the trick of setting this variable does not work with mpicxx, at least with the openmpi implementation. A more general way to store the path of the libraries in the executable seems to be to use the option '-Wl,-rpath=$HOME/usr/lib' (sic); as described here.

Another option to avoid this issues all together is to use the -static option when compiling to ensure that the static fftw3 library files are embedded in the executable.

Basic Usage


The following is a complete program using FFTW3, for a serial 2D transform. Sources and makefile are attached here. The program will

  • allocate memory using FFTW special allocation
  • specify the transformation type
  • fill the array with data and compute its power
  • execute the transformation
  • normalize and compute the power of the transformed data
 /* cc -I$HOME/usr/include simple_example.c -L$HOME/usr/lib -lfftw3 -o simple_example */
 #include <fftw3.h>
 #include <math.h>
 int main(int argc, char **argv){
   const ptrdiff_t N0 = 18, N1 = 18;
   fftw_plan plan;
   fftw_complex *data;
   data = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * N0 * N1);
   /* create plan for forward DFT */
   plan = fftw_plan_dft_2d(N0, N1, data, data, FFTW_FORWARD, FFTW_ESTIMATE);
   /* initialize data to some function my_function(x,y) */
   int i, j;
   double pdata=0;
   for (i = 0; i < N0; ++i){
     for (j = 0; j < N1; ++j){
       data[i*N1 + j][0]=i; 
       data[i*N1 + j][1]=0;
       pdata+=data[i*N1 + j][0]*data[i*N1 + j][0]+data[i*N1 + j][1]*data[i*N1 + j][1];
   printf("power of original data is %f\n", pdata);
   /* compute transforms, in-place, as many times as desired */
   double normalization=sqrt((double)N0*N1);
   double ptransform = 0;
   /*normalize data and calculate power of transform */
   for (i = 0; i < N0; ++i){
     for (j = 0; j < N1; ++j){
       ptransform+=data[i*N1 + j][0]*data[i*N1 + j][0]+data[i*N1 + j][1]*data[i*N1 + j][1];
   printf("power of transform is %f\n", pdata);
   return 0;

To run the program do the following

 wget http://micro.stanford.edu/mediawiki-1.11.0/images/Simple_example.tar -O simple_example.tar
 tar -xvf simple_example.tar 
 cd simple_example

and it will have the following output

 $ ./simple_example 
 power of original data is 32130.000000
 power of transform is 32130.000000

Parallel MPI

The following compilable example is based on the fftw3 simple mpi example. It can be regarded as the parallel version of the previous serial example. The plain-C sources and makefiles are available from here.

The program does the following

  • MPI environment is initialized
  • MPI FFTW environment is initialized
  • asks fftw how to split a 2D array among several processes by means of , this returns index ranges and local memory needed
  • the needed memory is allocated, some extra memory can be required by the fftw, so the three outputs of fftw_mpi_local_size_2d are not redundant
  • global array is partially filled in each process
  • local power of data is calculated
  • the global transform is performed
  • local power of transformed data is calculated taking into account normalization
  • sum of local power of data and local power of transform should be the same

The program listing is the following:

 #include <fftw3-mpi.h>
 int main(int argc, char **argv){
 const ptrdiff_t N0 = 10000, N1 = 10000;
 fftw_plan plan;
 fftw_complex *data; //local data of course
 ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
 MPI_Init(&argc, &argv);
 /* get local data size and allocate */
 alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                      &local_n0, &local_0_start);
 data = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * alloc_local);
 printf("%i\n", local_n0);
 /* create plan for forward DFT */
 plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                             FFTW_FORWARD, FFTW_ESTIMATE);
 /* initialize data to some function my_function(x,y) */
 for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j){
   data[i*N1 + j][0]=local_0_start; 
   data[i*N1 + j][1]=i;
 /* compute transforms, in-place, as many times as desired */
 return 0;

To run the program, do

 wget http://micro.stanford.edu/mediawiki-1.11.0/images/Simple_mpi_example.tar -O simple_mpi_example.tar
 tar -xf simple_mpi_example.tar 
 cd simple_mpi_example
 mpirun -np 2 ./simple_mpi_example

Important notes about MPI-FFTW3 memory distribution (do not skip)

Although the resulting distribution of the matrix and allocated memory seems to very reasonable in the previous example, this is only because of the simple dimensions and process number used in the example. There are some characteristics of the memory distribution for which the previous example can be misleading, and it is worth noting them:

  • Only FFTW (via fftw_mpi_local_size_2d) knows how to split the data and allocated memory size among processes, so don't try to predict the splitting yourself.
  • The splitting of the global matrix is only done over rows (first dimension, for N>2-dimensional arrays).
  • 1D FFT's usage is totally different, because the splitting depends on the type of transform to be performed (e.g. in-place, out-of-place).
  • The splitting of the matrix can be uneven (for example, #columns is not divisible by #processes).
  • Some processes can be assigned zero columns (for example if there are more processes than columns). Even in this case, the allocated memory in that process can be different from zero! (in general, it is 1*sizeof(complex)).
  • The logical size of the submatrix (local matrix) can be different from the local allocated memory, i.e. local_n0*N1 <= alloc_local. The outputs of local_size are not redundant. FFTW uses the (small) extra memory as a scratch space.