Some useful configurations for OpenMPI

  • Acceleration of large-scale collective communication for openmpi

Each TSUBAME3 node has four OPA units, but only two are used by default.

Explicitly using all four, as shown below, may speed up collective communication that moves large volumes of data, such as MPI_Alltoall(), when the number of nodes is large.

This is especially effective for GPU communication.




    export HFI_UNIT=0
    export HFI_UNIT=1
    export HFI_UNIT=2
    export HFI_UNIT=3
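
Run one after another, the exports above would leave only HFI_UNIT=3 in effect, so the intent is presumably that each process selects one unit. The following is a minimal sketch of one way to do that, not necessarily the original script. It assumes that Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK to every launched process; the wrapper name hfi_wrap.sh, the variable NUM_HFIS_PER_NODE, and the function select_hfi_unit are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper (e.g. hfi_wrap.sh): choose one OPA unit per process
# from the node-local rank, then run the real program.
# Assumption: Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for each process.

NUM_HFIS_PER_NODE=4   # four OPA units per node, as stated above

# Print the HFI unit for a given local rank (round-robin over the units).
select_hfi_unit() {
    echo $(( $1 % NUM_HFIS_PER_NODE ))
}

# Fall back to local rank 0 when the script is run outside mpirun.
HFI_UNIT=$(select_hfi_unit "${OMPI_COMM_WORLD_LOCAL_RANK:-0}")
export HFI_UNIT

# Replace the wrapper with the real program, if one was given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

Such a wrapper would sit between mpirun and the program, for example: mpirun ... ./hfi_wrap.sh ./a.out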



- An example job script (f_node=8, GPUDirect on)

#$ -cwd
#$ -V
#$ -l h_rt=01:00:00
#$ -l f_node=8

. /etc/profile.d/
module purge
module load cuda openmpi/3.1.4-opa10.10-t3

mpirun -x PATH -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np $((4*8)) ./ ./a.out


  • Acceleration of p2p communication for openmpi

Multirail (bundling multiple HFI (OPA) units) may speed up p2p communication.
It does not seem to help GPU communication much, but it does seem to help CPU communication.
To enable multirail with openmpi, do the following:

mpirun ... -x PSM2_MULTIRAIL=2 ... ./a.out

Also, the performance of p2p communication when using multirail seems to depend on the value of PSM2_MQ_RNDV_HFI_WINDOW (default value: 131072, maximum value: 4MB).

For details of each parameter, please refer to here.
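
Since the best PSM2_MQ_RNDV_HFI_WINDOW value is not stated, one option is to try several. The sketch below sweeps from the default (131072) to the 4MB maximum by doubling. The doubling step, the function name window_sizes, and the use of ./a.out as the benchmark are assumptions made for illustration, and the loop only prints the commands (a dry run).

```shell
#!/bin/sh
# Sketch: candidate values for PSM2_MQ_RNDV_HFI_WINDOW with multirail enabled.

# Print window sizes in bytes, one per line, doubling from the default
# (131072) up to the 4MB maximum.
window_sizes() {
    w=131072
    while [ "$w" -le $((4 * 1024 * 1024)) ]; do
        echo "$w"
        w=$((w * 2))
    done
}

# Dry run: print one command per size. Remove "echo" to actually launch;
# other mpirun options (-np and so on) are omitted here.
for size in $(window_sizes); do
    echo mpirun -x PSM2_MULTIRAIL=2 -x PSM2_MQ_RNDV_HFI_WINDOW="$size" ./a.out
done
```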


  • About intra-node GPU communication

As shown in the Hardware Architecture, GPU0<->GPU2 and GPU1<->GPU3 are each connected by two NVLink lines, which doubles the bandwidth.
Where possible, arranging for most GPU-to-GPU communication to occur within these pairs may speed up your program.
Note: This can only be achieved with f_node, since h_node allocates only GPU0,1 or GPU2,3.
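
The pairing above can be written as a small helper. This is only a sketch; it assumes one process per GPU and that the GPU index equals the node-local rank, neither of which is stated in the text. The function name nvlink_partner is hypothetical.

```shell
#!/bin/sh
# Print the GPU that shares the doubled NVLink connection with GPU $1
# (pairs from the text: GPU0<->GPU2 and GPU1<->GPU3).
nvlink_partner() {
    echo $(( ($1 + 2) % 4 ))
}

# Example: show the partner of every GPU on a node.
for gpu in 0 1 2 3; do
    echo "GPU$gpu <-> GPU$(nvlink_partner "$gpu")"
done
```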

  • Some tips for openmpi

* Add the rank number to the output

mpirun -tag-output ...

* Explicitly specify the algorithm of a collective communication (by default, MPI dynamically selects an algorithm based on communicator size and message size)
Example: MPI_Allreduce()

mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allreduce_algorithm <algo #> ...

The algorithm numbers for Allreduce are: 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
To see the details, run either of the following:

ompi_info --param coll tuned --level 9


ompi_info -all
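
To compare the Allreduce algorithms listed above, the same program can be run once per algorithm number. The following is a minimal sketch that only prints the commands (a dry run); the function name allreduce_algo_name and the use of ./a.out as the test program are assumptions for illustration.

```shell
#!/bin/sh
# Print the name of an Allreduce algorithm number (table from the text above).
allreduce_algo_name() {
    case "$1" in
        1) echo "basic linear" ;;
        2) echo "nonoverlapping" ;;
        3) echo "recursive doubling" ;;
        4) echo "ring" ;;
        5) echo "segmented ring" ;;
        *) echo "unknown" ;;
    esac
}

# Dry run: print one labeled command per algorithm. Remove "echo" before
# mpirun to actually launch; other mpirun options are omitted here.
for algo in 1 2 3 4 5; do
    echo "# $(allreduce_algo_name "$algo")"
    echo mpirun -mca coll_tuned_use_dynamic_rules 1 \
        -mca coll_tuned_allreduce_algorithm "$algo" ./a.out
done
```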

To disable the tuned collective module, do the following:

mpirun -mca coll ^tuned

This disables the tuned module and falls back to the basic module.

This may degrade performance, but it can be useful when the tuned module appears to be buggy.


* Use ssh instead of qrsh -inherit ... to launch processes from mpirun (f_node only)

mpirun -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh ...

By default, the processes are launched via qrsh -inherit, but if you are having problems with that, try this.


* Show current MCA parameter configurations

mpirun -mca mpi_show_mca_params 1 ...

Also, MCA parameters can be passed via environment variables as follows.

export OMPI_MCA_param_name=value

where "param_name" is the parameter name and "value" is its value.
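
For instance, the Allreduce settings from the earlier tip could be given through the environment instead of -mca options. This is a minimal sketch; the choice of algorithm 4 (ring) is arbitrary.

```shell
#!/bin/sh
# Same effect as: -mca coll_tuned_use_dynamic_rules 1
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
# Same effect as: -mca coll_tuned_allreduce_algorithm 4
export OMPI_MCA_coll_tuned_allreduce_algorithm=4

# List the MCA settings currently present in the environment.
env | grep '^OMPI_MCA_' | sort
```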


* Obtaining a core file from an openmpi job when a segmentation fault occurs

When you want a core file after a segmentation fault or similar in an openmpi job, it seems that no core file is produced even if you put ulimit -c unlimited in the job script.
You can get the core file by wrapping the program as follows.




ulimit -c unlimited

mpirun ... ./ ./a.out
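
A complete wrapper along these lines might look like the following. This is a sketch, not necessarily the original script: the wrapper name core_wrap.sh and the function enable_core_dumps are hypothetical, and the fallback to the hard limit is an addition for systems where "unlimited" is not permitted.

```shell
#!/bin/sh
# Hypothetical wrapper (e.g. core_wrap.sh): raise the core-file size limit
# inside each launched process, then run the real program.

# Try "unlimited" first; if the hard limit forbids it, raise the soft limit
# as far as the hard limit allows.
enable_core_dumps() {
    ulimit -c unlimited 2>/dev/null || ulimit -S -c "$(ulimit -H -c)"
}

enable_core_dumps

# Replace the wrapper with the real program, if one was given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

It would be used as: mpirun ... ./core_wrap.sh ./a.out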


* CPU bindings

The CPU binding options in openmpi are as follows.

mpirun -bind-to <core, socket, numa, board, etc> ...
mpirun -map-by <foo> ...

For more details, please refer to man mpirun.

To confirm the actual bindings, try the following:

mpirun -report-bindings ...


  • Some other tips

* GPUDirect threshold on the sender side

The sender-side GPUDirect threshold is 30000 bytes by default.
The receiver-side threshold is UINT_MAX (2^32-1), so if you want GPUDirect to always be on even when the buffer size is large, set the following on the sender side.

mpirun -x PSM2_GPUDIRECT_SEND_THRESH=$((2**32-1)) ...

For more details, please refer to here.
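
The ** operator in the command above is a shell extension (bash, ksh, zsh) and is not part of POSIX sh. If the job script runs under a plain POSIX shell, the same value can be computed with a shift, as in this sketch. It assumes 64-bit shell arithmetic, which is usual on 64-bit Linux; the variable name GPUDIRECT_MAX is hypothetical.

```shell
#!/bin/sh
# UINT_MAX for a 32-bit unsigned int (2^32-1), using POSIX shell arithmetic.
GPUDIRECT_MAX=$(( (1 << 32) - 1 ))
echo "$GPUDIRECT_MAX"

# Dry run: print the resulting command. Remove "echo" to actually launch;
# other mpirun options are omitted here.
echo mpirun -x PSM2_GPUDIRECT_SEND_THRESH="$GPUDIRECT_MAX" ./a.out
```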