- Job hangs or an error occurs when NCCL is used
Several problems have been reported in which a job hangs or an error occurs when NCCL is used.
In some cases, a kernel panic also happens.
If you suspect that you have hit this problem, try one of the following:
export NCCL_IB_DISABLE=1 or export NCCL_BUFFSIZE=1048576
Note that NCCL_IB_DISABLE=1 may decrease performance; in that case, please use NCCL_BUFFSIZE=1048576 instead.
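For reference, a minimal sketch of applying the workaround in a job script; the process count and the program name (train.py) are placeholders, not part of the original report:
# Prefer NCCL_BUFFSIZE first; fall back to NCCL_IB_DISABLE=1 if the hang persists.
export NCCL_BUFFSIZE=1048576
#export NCCL_IB_DISABLE=1
# When launching across nodes with OpenMPI, pass the variable to all ranks with -x.
mpirun -np 4 -x NCCL_BUFFSIZE python3 train.py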
- Segmentation fault occurs when MPI+OpenACC is executed
A problem has been reported in which a segmentation fault occurs when OpenMPI+OpenACC is used.
As a workaround, please try one of the following:
export PSM2_MEMORY=large or export OMPI_MCA_pml=ob1
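A minimal sketch of how these settings might be applied with an OpenMPI launch; the process count and the binary name (a.out) are placeholders:
# Workaround 1: enlarge the PSM2 memory setting and export it to all ranks.
export PSM2_MEMORY=large
mpirun -np 4 -x PSM2_MEMORY ./a.out
# Workaround 2: use the ob1 pml instead.
#export OMPI_MCA_pml=ob1
#mpirun -np 4 -x OMPI_MCA_pml ./a.out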
- Error occurs when GPUDirect is used
(2021/08/19 updated)
This issue has been fixed by the last maintenance.
Some cases have been reported in which an error occurs when GPUDirect is used and the program exits, either normally or abnormally.
This error happens only rarely.
It also sometimes triggers a kernel panic.
If you suspect this problem, please turn off GPUDirect as follows:
mpirun ... -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=0
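If you still need the workaround (for example, on a system where the fix has not been applied), a sketch of a full launch line might look like the following; the process count and a.out are placeholders:
# Keep CUDA support in PSM2 enabled but turn off GPUDirect.
mpirun -np 4 -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=0 ./a.out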
- Job hangs with a large-scale job
Some problems have been reported in which mpirun hangs with a large-scale job.
This seems to be caused by the qrsh -inherit processes that are fork()'ed by mpirun.
If you suspect this problem, please try the following.
* For OpenMPI
mpirun -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh ...
* For Intel MPI
export I_MPI_HYDRA_BOOTSTRAP=ssh
unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
※Please note that these workarounds are effective only for f_node.
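As a reference, a sketch of each variant in a job script; process counts and the binary name (a.out) are placeholders:
# OpenMPI: launch remote processes via ssh instead of qrsh -inherit.
mpirun -np 56 -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh ./a.out
# Intel MPI: switch the Hydra bootstrap to ssh before launching.
export I_MPI_HYDRA_BOOTSTRAP=ssh
unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
mpirun -np 56 ./a.out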