Latest revision as of 05:31, 1 April 2015

Parallelization of the Phase Field Model

Yanming Wang and Wei Cai

Mar 31, 2015

A brief description

The basic formulation of the phase field model is described in our MSMSE 2014 paper. Based on this formulation, a serial C++ code has been written within the MD++ framework. The figure below gives a flow chart of the phase field code to clarify its structure.

Starting from the serial C++ code, we parallelized it using three approaches: OpenMP, MPI, and CUDA. These parallel codes can be downloaded together with the MD++ package from the svn server. The following command checks out the latest MD++ into your current directory.

svn co https://micro.stanford.edu/svn/MD++/trunk/ ./

Once you have the MD++ package (revision later than r478), follow the sections below, which describe in detail how to compile and run these parallel codes.

OpenMP code

src/phasefield_omp.cpp contains our implementation of the OpenMP code. To compile the code, using cluster MC2 as an example, type

   make phasefield build=R SYS=mc2_omp

See src/Makefile.base for the flags we set for compiling with OpenMP. In general, adding “-openmp” for the icc compiler (newer Intel compilers use “-qopenmp” instead) or “-fopenmp” for the gcc compiler enables OpenMP support.

In this example, if the code compiles successfully, the executable is named phasefield_mc2_omp and is placed in the bin/ folder.

You can specify the number of threads you want to use. For example, to run your simulation with 8 threads, type the following line in the command window (or include it in the PBS script).

   export OMP_NUM_THREADS=8

MPI code

The MPI-related files are src/phasefield_mpi.cpp and src/StencilToolkit. Some modifications were also made in src/main.cpp to initialize and finalize MPI.

For the current MPI implementation, we adopted the StencilToolkit library developed by KISTI to divide the 3D arrays into designated chunks, taking care of the periodic boundary conditions and the boundary synchronization. The number of nodes is specified by the following statement in src/phasefield.cpp, where n_x, n_y and n_z give the number of nodes in each dimension.

   _node = new Node3D(n_z, n_y, n_x)

These numbers must be specified before compilation. In addition, the number of grid points must be divisible by the number of nodes in each dimension, for example NX % n_x = 0. To compile the code, again using cluster MC2 as the example, type

  make phasefield build=R SYS=mc2_mpi MPI=yes

Compiling the MPI code generates the executable phasefield_mc2_mpi in bin/ and a library file libstk.so in the same folder. For this shared library to be loaded at run time, the following command is required, assuming the current directory is the MD++ home folder.

 export LD_LIBRARY_PATH="./bin:$LD_LIBRARY_PATH"

CUDA code

The CUDA code is implemented in src/phasefield_cuda.cu. Compiling it requires the nvcc compiler. Here we use cluster Sherlock as an example. You may need to load the CUDA module first by entering

 module load cuda

Then type the following command to compile the code:

 make phasefield build=R SYS=sherlock CUDA=yes

When the compilation is finished, the executable phasefield_sherlock is created in the bin/ folder.

Test cases

We wrote a Tcl input script, pf3d_test.tcl, for code validation and performance evaluation. The initial configuration is a spherical liquid droplet at the solid-vapor interface in a box of size 200x200x200. The simulation is run for 200 steps with dynamics_type = 8, which constrains the liquid volume and the droplet's center-of-mass position in both the x and y directions.

Serial code

The serial code serves as the reference. To run it, type

bin/phasefield_mc2 scripts/work/phasefield/pf3d_test.tcl 0 1

The simulation results are printed to the screen. The output for the first and last steps is shown below.

The output for the first step should look like:

curstep =      0 F =   3.556137471408e+05 Fraw =   3.556137471408e+05 G =   3.160552999005e+01 timestep = 3.16e-05 
rel_vol = ( 1.21,   42, 56.8)% M01=0.0943 M02=0.0443 COM_x=-0.998 COM_y=-0.998

The output for step 200 should look like:

curstep =    200 F =   3.438630312533e+05 Fraw =   3.440320588335e+05 G =   2.139589914093e+01 timestep = 4.67e-05 
rel_vol = ( 1.21,   42, 56.8)% M01=-11.2 M02=-11.2 COM_x=-0.998 COM_y=-0.998

The time cost is 42m26.627s.

OpenMP code

From the MD++ home folder, use the following command to run the phase field simulation on MC2.

 bin/phasefield_mc2_omp scripts/work/phasefield/pf3d_test.tcl 1 1

The output for the first step should look like:

 curstep =      0 F =   3.556137471166e+05 Fraw =   3.556137471166e+05 G =   3.160552999005e+01 timestep = 3.16e-05 
rel_vol = ( 1.21,   42, 56.8)% M01=0.0943 M02=0.0443 COM_x=-0.998 COM_y=-0.998

The output for step 200 should look like:

 curstep =    200 F =   3.438630312384e+05 Fraw =   3.440320588185e+05 G =   2.139589914093e+01 timestep = 4.67e-05 
rel_vol = ( 1.21,   42, 56.8)% M01=-11.2 M02=-11.2 COM_x=-0.998 COM_y=-0.998

The time cost is 11m2.548s.

MPI code

The MPI code can be run interactively with the command

 mpirun -np 32 bin/phasefield_mc2_mpi scripts/work/phasefield/pf3d_test.tcl 2 1

The above command runs the job with 32 CPUs, which must be consistent with the number of processors preset in src/phasefield.cpp.
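The two consistency requirements of the MPI code (the process count given to mpirun must agree with the node grid compiled into the code, and each grid dimension must be divisible by the node count along it) can be checked before launching a job. The following Python sketch is not part of MD++. The function name check_decomposition is hypothetical, the 2 x 4 x 4 node grid is only an illustrative choice because the values of n_x, n_y and n_z used for the 32-process test are not stated here, and the sketch assumes the total process count equals the product n_x * n_y * n_z.

```python
def check_decomposition(grid, nodes, nprocs):
    """Return the per-process chunk shape, or raise ValueError if inconsistent.

    grid   -- (NX, NY, NZ): number of grid points in each dimension
    nodes  -- (n_x, n_y, n_z): number of nodes in each dimension (illustrative)
    nprocs -- the value passed to `mpirun -np`
    """
    n_x, n_y, n_z = nodes
    # Assumption: one MPI process per node of the 3D node grid.
    if n_x * n_y * n_z != nprocs:
        raise ValueError(
            f"node grid {nodes} needs {n_x * n_y * n_z} processes, got {nprocs}"
        )
    # Each grid dimension must split evenly, e.g. NX % n_x == 0.
    for size, n, axis in zip(grid, nodes, "xyz"):
        if size % n != 0:
            raise ValueError(f"{size} grid points along {axis} not divisible by {n}")
    return tuple(size // n for size, n in zip(grid, nodes))


if __name__ == "__main__":
    # 200x200x200 box from the test case, hypothetical 2x4x4 node grid.
    print(check_decomposition((200, 200, 200), (2, 4, 4), 32))
```

With these illustrative values each process would own a 100 x 50 x 50 chunk. A node grid such as 3 x 4 x 4 would be rejected, since 200 is not divisible by 3.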

The output for the first step should look like:

 curstep =      0 F =   3.556137471184e+05 Fraw =   3.556137471184e+05 G =   3.160552999005e+01 timestep = 3.16e-05 
rel_vol = ( 1.21,   42, 56.8)% M01=0.0943 M02=0.0443 COM_x=-0.998 COM_y=-0.998

The output for step 200 should look like:

 curstep =    200 F =   3.438630312383e+05 Fraw =   3.440320588185e+05 G =   2.139589914093e+01 timestep = 4.67e-05 
rel_vol = ( 1.21,   42, 56.8)% M01=-11.2 M02=-11.2 COM_x=-0.998 COM_y=-0.998

The time cost is 170.368100 s.

CUDA code

Assuming the CUDA code has been compiled on Sherlock, you may need to reserve one GPU before running the simulation interactively, by typing the command

salloc -N 1 -p gpu --qos=gpu --gres=gpu:1 --constraint="k20x"

This reserves a K20X GPU. Next, to run the phase field simulation with CUDA, enter the following command from the MD++ home folder.

 srun bin/phasefield_sherlock scripts/work/phasefield/pf3d_test.tcl 3 1

Note that “model_type = 20” must be specified in the input script to call the CUDA multi-phase field function. For the single-phase field function, use model_type = 10.

The output for the first step should look like:

 curstep =      0 F =   3.556137471188e+05 Fraw =   3.556137471188e+05 G =   3.160552999005e+01 timestep = 3.16e-05 
rel_vol = ( 1.21,   42, 56.8)% M01=0.0943 M02=0.0443 COM_x=-0.998 COM_y=-0.998

The output for step 200 should look like:

 curstep =    200 F =   3.438633096736e+05 Fraw =   3.440320815069e+05 G =   2.139663658787e+01 timestep = 4.67e-05 
rel_vol = ( 1.21,   42, 56.8)% M01=-11.2 M02=-11.2 COM_x=-0.998 COM_y=-0.998

The time cost is 60.879 s.
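Since the test script is meant for code validation, it can help to compare the printed values of a parallel run against the serial reference numerically rather than by eye. The sketch below is not part of MD++; the helper names parse_step and relative_difference are hypothetical, and the two strings are the step-200 output lines of the serial and CUDA runs quoted in this section.

```python
import re


def parse_step(line):
    """Extract curstep, F, Fraw and G from one line of phase field output."""
    pattern = r"\b(curstep|Fraw|F|G)\s*=\s*([-+0-9.eE]+)"
    return {key: float(value) for key, value in re.findall(pattern, line)}


def relative_difference(value, reference):
    """Return |value - reference| / |reference|."""
    return abs(value - reference) / abs(reference)


SERIAL_200 = ("curstep =    200 F =   3.438630312533e+05 "
              "Fraw =   3.440320588335e+05 G =   2.139589914093e+01 "
              "timestep = 4.67e-05")
CUDA_200 = ("curstep =    200 F =   3.438633096736e+05 "
            "Fraw =   3.440320815069e+05 G =   2.139663658787e+01 "
            "timestep = 4.67e-05")

if __name__ == "__main__":
    reference = parse_step(SERIAL_200)
    candidate = parse_step(CUDA_200)
    for key in ("F", "Fraw", "G"):
        diff = relative_difference(candidate[key], reference[key])
        print(f"{key}: relative difference = {diff:.2e}")
```

For the lines quoted here, F from the CUDA run differs from the serial value by a relative amount of roughly 8e-7 at step 200, while the OpenMP and MPI values of F agree with the serial one to within about 1e-10.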

Summary

From the above test cases, we find that the CUDA code gives the largest speedup factor, which is over 40. In comparison, the MPI code with 32 cores gives a speedup factor of around 15, but it can be accelerated further with more nodes. In addition, for extremely large simulations, MPI is expected to be the better solution, since memory may become an issue for the CUDA code. Although the OpenMP code does not reach a very large speedup factor (about 4), it is very easy to implement and gives a reasonable acceleration on a personal desktop or laptop.
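The speedup factors follow directly from the time costs reported in the test cases. As a worked example, the Python sketch below converts the reported times to seconds and divides the serial time by each parallel time. The helper names to_seconds and speedups are hypothetical and are not part of MD++.

```python
import re


def to_seconds(text):
    """Convert a time such as '42m26.627s' or '170.368100 s' to seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)m)?\s*([0-9.]+)\s*s\s*", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


# Time costs as reported for the 200-step test case.
TIMINGS = {
    "serial": "42m26.627s",
    "OpenMP": "11m2.548s",
    "MPI (32 processes)": "170.368100 s",
    "CUDA": "60.879 s",
}


def speedups(timings, reference="serial"):
    """Return the reference time divided by each time, keyed by name."""
    ref = to_seconds(timings[reference])
    return {name: ref / to_seconds(t) for name, t in timings.items()}


if __name__ == "__main__":
    for name, factor in speedups(TIMINGS).items():
        print(f"{name:20s} speedup = {factor:5.1f}")
```

The serial run takes 2546.627 s, which gives speedup factors of about 3.8 for OpenMP, 14.9 for MPI with 32 processes, and 41.8 for CUDA.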
