- Acceleration of large-scale collective communications for openmpi
There are four OPA units per node on TSUBAME3, but only two are used by default.
Explicitly using all four, as shown below, may speed up collective communications with large message sizes, such as MPI_Alltoall(), when the number of nodes is large.
This is especially effective for GPU communication.
wrap.sh (excerpt):
if [ $((OMPI_COMM_WORLD_LOCAL_RANK % NUM_HFIS_PER_NODE)) -eq 0 ]; then
- an example job script (f_node=8, GPUDirect on)
mpirun -x PATH -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np $((4*8)) ./wrap.sh ./a.out
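The wrap.sh script used above is only shown as a fragment. A complete version might look like the following sketch; the round-robin HFI_UNIT assignment is an assumption (PSM2 reads the HFI_UNIT environment variable to choose which OPA unit a process uses), not taken verbatim from the original script.

```shell
#!/bin/sh
# wrap.sh (sketch): spread local ranks round-robin over the four OPA units.
# Assumption: PSM2 honors the HFI_UNIT environment variable for unit selection.
NUM_HFIS_PER_NODE=4
: "${OMPI_COMM_WORLD_LOCAL_RANK:=0}"   # set by openmpi for every local rank
export HFI_UNIT=$((OMPI_COMM_WORLD_LOCAL_RANK % NUM_HFIS_PER_NODE))
# Run the wrapped program (e.g. ./a.out) if one was given on the command line.
if [ $# -gt 0 ]; then exec "$@"; fi
```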
- Acceleration of point-to-point communications for openmpi
Multirail (bundling multiple HFI/OPA units) may speed up point-to-point communication.
This does not seem to be very effective for GPU communication, but does seem to be effective for CPU communication.
To enable multirail in openmpi, do the following:
|mpirun ... -x PSM2_MULTIRAIL=2 ... ./a.out|
Also, the performance of p2p communication with multirail seems to depend on the value of PSM2_MQ_RNDV_HFI_WINDOW (default: 131072 bytes, maximum: 4 MB).
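For example, one might combine multirail with the maximum rendezvous window; the exact values here are illustrative only (PSM2_MQ_RNDV_HFI_WINDOW is in bytes):

```shell
# Illustrative only: multirail plus the maximum 4 MB rendezvous window.
mpirun -x PSM2_MULTIRAIL=2 -x PSM2_MQ_RNDV_HFI_WINDOW=$((4*1024*1024)) ./a.out
```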
For more details on each parameter, please refer to the PSM2 documentation.
- About intra-node GPU communication
As shown in the Hardware Architecture section, GPU0<->GPU2 and GPU1<->GPU3 are each connected by two NVLink links, which doubles the bandwidth between them.
If possible, having these GPU pairs communicate with each other may speed up your program.
Note: this is only possible with f_node, since h_node allocates only GPU0,1 or GPU2,3.
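One way to exploit the pairing is a wrapper that pins each local rank to a GPU via CUDA_VISIBLE_DEVICES so that the ranks which exchange the most data sit on an NVLink-doubled pair. Everything below (the script name, and the assumption that ranks 0/1 and 2/3 are the heavy communicators) is hypothetical; adapt the mapping to your program's traffic pattern.

```shell
#!/bin/sh
# gpu_wrap.sh (hypothetical): give ranks 0 and 1 the NVLink pair GPU0<->GPU2,
# and ranks 2 and 3 the pair GPU1<->GPU3 (f_node only).
case "${OMPI_COMM_WORLD_LOCAL_RANK:-0}" in
  0) dev=0 ;;
  1) dev=2 ;;
  2) dev=1 ;;
  *) dev=3 ;;
esac
export CUDA_VISIBLE_DEVICES=$dev
# Run the wrapped program (e.g. ./a.out) if one was given.
if [ $# -gt 0 ]; then exec "$@"; fi
```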
- Some tips of openmpi
* Add the rank number to each line of output
|mpirun -tag-output ...|
* Explicitly specify the collective communication algorithm (by default, openmpi dynamically selects an algorithm based on communicator size and message size)
|mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allreduce_algorithm <algo #> ...|
The Allreduce algorithm numbers are: 1 = basic linear, 2 = nonoverlapping (tuned reduce + tuned bcast), 3 = recursive doubling, 4 = ring, 5 = segmented ring.
If you want to see the details, invoke the following:
|ompi_info --param coll tuned --level 9|
If you want to disable the tuned collective module, do the following:
|mpirun -mca coll ^tuned|
This disables the tuned module and falls back to the basic module.
This may degrade performance, but it can be useful when the tuned module misbehaves.
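A quick way to compare the Allreduce algorithms listed above is to sweep the algorithm number. This is only a sketch; ./bench stands for any benchmark of your own that times MPI_Allreduce.

```shell
# Sketch: run the same benchmark once per tuned Allreduce algorithm.
# "./bench" is a placeholder for your own MPI_Allreduce timing program.
for algo in 1 2 3 4 5; do
  echo "== allreduce algorithm $algo =="
  mpirun -mca coll_tuned_use_dynamic_rules 1 \
         -mca coll_tuned_allreduce_algorithm "$algo" ./bench
done
```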
* Use ssh instead of the qrsh -inherit ... launched by mpirun (f_node only)
|mpirun -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh ...|
By default, the processes are launched via qrsh -inherit, but if you are having problems with that, try this.
* Show current MCA parameter configurations
|mpirun -mca mpi_show_mca_params 1 ...|
Also, MCA parameters can be passed via environment variables as follows:
|export OMPI_MCA_name=value|
where "name" is the MCA parameter name and "value" is its value.
* Obtaining a core file when a segmentation fault occurs in an openmpi job
When you want to get a core file on a segmentation fault etc. with openmpi, setting ulimit -c unlimited in the job script alone does not seem to produce one.
You can get the core file by wrapping the program in a script (ulimit.sh below) that raises the limit before exec'ing it:
#!/bin/sh
ulimit -c unlimited
exec "$@"
|mpirun ... ./ulimit.sh ./a.out|
* CPU bindings
The CPU binding options in openmpi include the following:
|mpirun -bind-to <core, socket, numa, board, etc> ...|
|mpirun -map-by <foo> ...|
For more detail, please refer to man mpirun.
If you want to confirm actual binding, try the following:
|mpirun -report-bindings ...|
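For example, to bind each rank to a single core, map ranks round-robin over sockets, and print the resulting bindings (the -np value is arbitrary):

```shell
# Illustrative only: per-core binding, socket-wise mapping, bindings printed.
mpirun -np 4 -bind-to core -map-by socket -report-bindings ./a.out
```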
- Some other tips
* Sender-side GPUDirect threshold
The sender-side GPUDirect threshold is 30000 bytes by default.
The receiver-side threshold is UINT_MAX (2^32-1), so if you want GPUDirect to stay on even for large buffers, you can raise the sender-side threshold as follows.
|mpirun -x PSM2_GPUDIRECT_SEND_THRESH=$((2**32-1)) ...|
For more detail, please refer to the PSM2 documentation.