Latest revision as of 05:31, 1 April 2015
Parallelization of the Phase Field Model
Yanming Wang and Wei Cai
A brief description
The basic formulation of the phase field model is described in our MSMSE 2014 paper. Based on this formulation, a serial C++ code has been written under the MD++ framework. The figure below gives a flow chart of the phase field code to clarify its structure.
Starting from the serial C++ code, we parallelized it using three approaches: OpenMP, MPI, and CUDA. These parallel codes can be downloaded together with the MD++ package from the svn server. The following command checks out the latest MD++ into your current directory.
svn co https://micro.stanford.edu/svn/MD++/trunk/ ./
Once you have the MD++ package (revision later than r478), follow the sections below, which describe in detail how to compile and run these parallel codes.
OpenMP code
src/phasefield_omp.cpp contains our implementation of the OpenMP code. To compile the code, using cluster MC2 as an example, type
make phasefield build=R SYS=mc2_omp
You may check src/Makefile.base to see the flags we set up for compiling with OpenMP. Generally, adding “-openmp” for the icc compiler or “-fopenmp” for the gcc compiler is enough to enable OpenMP.
In this example, if the code compiles successfully, the executable is named phasefield_mc2_omp and is placed in the bin/ folder.
You can specify the number of threads to use. For example, to run the simulation with 8 threads, type the following line in the command window (or include it in the PBS script). Note that the shell does not allow spaces around the equals sign.
export OMP_NUM_THREADS=8
MPI code
The MPI-related files are src/phasefield_mpi.cpp and src/StencilToolkit. Some modifications are also made in src/main.cpp to initialize and finalize MPI.
For the current MPI implementation, we adopted the StencilToolkit library developed by KISTI to divide the 3D arrays into designated chunks, taking care of the periodic boundary conditions and the boundary synchronization. The number of nodes is specified with the following line in src/phasefield.cpp, where n_x, n_y and n_z give the number of nodes in each dimension.
_node = new Node3D(n_z, n_y, n_x)
These numbers must be specified before compilation. In addition, the number of grid points must be divisible by the number of nodes in each dimension, for example NX % n_x == 0.
To compile the code, still using cluster MC2 as the example, type
make phasefield build=R SYS=mc2_mpi MPI=yes
After the MPI code is compiled, the executable phasefield_mc2_mpi and a library file libstk.so should be generated in the bin/ folder. So that this shared library can be loaded when the program runs, the following command is required, assuming the current directory is the MD++ home folder.
export LD_LIBRARY_PATH="./bin:$LD_LIBRARY_PATH"
CUDA code
The CUDA code is implemented in src/phasefield_cuda.cu. Compiling it requires the nvcc compiler. Here we use the cluster Sherlock as an example. You may need to load the CUDA module first by entering
module load cuda
Then type the following command to compile the code,
make phasefield build=R SYS=sherlock CUDA=yes
When the compilation is finished, the executable phasefield_sherlock will be created in the bin/ folder.
Test cases
We wrote a Tcl input script, pf3d_test.tcl, for code validation and performance evaluation. The initial configuration is a spherical liquid droplet at the solid-vapor interface in a box of size 200x200x200. The simulation runs for 200 steps with dynamics_type = 8 (which constrains the liquid volume and the droplet’s center-of-mass position in both the x and y directions).
Serial code
The serial code serves as the reference. To run it, type
bin/phasefield_mc2 scripts/work/phasefield/pf3d_test.tcl 0 1
The simulation results are printed to the screen. The output for the first and the last step is given below.
The output for the first step should look like:
curstep = 0 F = 3.556137471408e+05 Fraw = 3.556137471408e+05 G = 3.160552999005e+01 timestep = 3.16e-05
rel_vol = ( 1.21, 42, 56.8)% M01=0.0943 M02=0.0443 COM_x=-0.998 COM_y=-0.998
The output for step 200 should look like:
curstep = 200 F = 3.438630312533e+05 Fraw = 3.440320588335e+05 G = 2.139589914093e+01 timestep = 4.67e-05
rel_vol = ( 1.21, 42, 56.8)% M01=-11.2 M02=-11.2 COM_x=-0.998 COM_y=-0.998
The run time is 42m26.627s.
OpenMP code
Under the MD++ home folder, the following command is used to run the phase field simulation on MC2.
bin/phasefield_mc2_omp scripts/work/phasefield/pf3d_test.tcl 1 1
The output for the first step should look like:
curstep = 0 F = 3.556137471166e+05 Fraw = 3.556137471166e+05 G = 3.160552999005e+01 timestep = 3.16e-05
rel_vol = ( 1.21, 42, 56.8)% M01=0.0943 M02=0.0443 COM_x=-0.998 COM_y=-0.998
The output for step 200 should look like:
curstep = 200 F = 3.438630312384e+05 Fraw = 3.440320588185e+05 G = 2.139589914093e+01 timestep = 4.67e-05
rel_vol = ( 1.21, 42, 56.8)% M01=-11.2 M02=-11.2 COM_x=-0.998 COM_y=-0.998
The run time is 11m2.548s.
MPI code
The MPI code can be run interactively with the command
mpirun -np 32 bin/phasefield_mc2_mpi scripts/work/phasefield/pf3d_test.tcl 2 1
The above command runs the job on 32 CPUs. This number must be consistent with the number of nodes set in src/phasefield.cpp before compilation.
The output for the first step should look like:
curstep = 0 F = 3.556137471184e+05 Fraw = 3.556137471184e+05 G = 3.160552999005e+01 timestep = 3.16e-05
rel_vol = ( 1.21, 42, 56.8)% M01=0.0943 M02=0.0443 COM_x=-0.998 COM_y=-0.998
The output for step 200 should look like:
curstep = 200 F = 3.438630312383e+05 Fraw = 3.440320588185e+05 G = 2.139589914093e+01 timestep = 4.67e-05
rel_vol = ( 1.21, 42, 56.8)% M01=-11.2 M02=-11.2 COM_x=-0.998 COM_y=-0.998
The run time is 170.368100 s.
CUDA code
Assuming the CUDA code has been compiled on Sherlock, to run the simulation interactively you may first need to reserve one GPU by typing the command
salloc -N 1 -p gpu --qos=gpu --gres=gpu:1 --constraint="k20x"
This reserves a K20X GPU. Next, to run the phase field simulation with CUDA, enter the following command from the MD++ home folder.
srun bin/phasefield_sherlock scripts/work/phasefield/pf3d_test.tcl 3 1
Note that “model_type = 20” must be specified in the input script to call the CUDA multi-phase field function. For the single-phase field function, use model_type = 10.
The output for the first step should look like:
curstep = 0 F = 3.556137471188e+05 Fraw = 3.556137471188e+05 G = 3.160552999005e+01 timestep = 3.16e-05
rel_vol = ( 1.21, 42, 56.8)% M01=0.0943 M02=0.0443 COM_x=-0.998 COM_y=-0.998
The output for step 200 should look like:
curstep = 200 F = 3.438633096736e+05 Fraw = 3.440320815069e+05 G = 2.139663658787e+01 timestep = 4.67e-05
rel_vol = ( 1.21, 42, 56.8)% M01=-11.2 M02=-11.2 COM_x=-0.998 COM_y=-0.998
The run time is 60.879 s.
Summary
In the above test cases, the CUDA code gives the largest speedup factor, which is over 40 relative to the serial code. In comparison, the MPI code with 32 cores gives a speedup factor of about 15, but it can be accelerated further with more nodes. In addition, for extremely large simulations, MPI is expected to be the better solution, since memory may become an issue for the CUDA code. Although the OpenMP code does not achieve a very large speedup factor (about 4), it is very easy to implement and gives a reasonable acceleration on a personal desktop or laptop.
