TSUBAME3: workarounds for known issues

  • Job hangs or error occurs when NCCL is used

Jobs that use NCCL have been reported to hang or fail with errors.

In some cases, a kernel panic also occurs.

If you suspect that you have hit this problem, try one of the following:

export NCCL_IB_DISABLE=1

or

export NCCL_BUFFSIZE=1048576

NCCL_IB_DISABLE=1 may decrease performance. If it does, use NCCL_BUFFSIZE=1048576 instead.
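The choice between the two settings can be sketched as a job-script fragment. This is only an illustration: the WORKAROUND variable is a hypothetical switch introduced for this sketch (it is not an NCCL or TSUBAME setting), and it defaults to the buffer-size workaround because that one is not expected to cost performance.

```shell
#!/bin/sh
# WORKAROUND is a hypothetical selector used only in this sketch:
#   buffsize   -> set NCCL_BUFFSIZE=1048576 (default)
#   ib_disable -> set NCCL_IB_DISABLE=1 (may decrease performance)
WORKAROUND="${WORKAROUND:-buffsize}"

case "$WORKAROUND" in
    buffsize)
        export NCCL_BUFFSIZE=1048576
        ;;
    ib_disable)
        export NCCL_IB_DISABLE=1
        ;;
    *)
        echo "unknown WORKAROUND value: $WORKAROUND" >&2
        ;;
esac

# Show which setting is active before the NCCL program is started.
echo "NCCL_BUFFSIZE=${NCCL_BUFFSIZE:-unset} NCCL_IB_DISABLE=${NCCL_IB_DISABLE:-unset}"
```

Place the fragment before the line that launches your program so that the exported variable is inherited by it.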

 

  • Segmentation fault occurs when MPI+OpenACC is executed

A segmentation fault has been reported when Open MPI is used together with OpenACC.

As a workaround, try one of the following:

export PSM2_MEMORY=large

or

export OMPI_MCA_pml=ob1
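As a rough sketch, the two workarounds could appear on an mpirun command line as follows. The program name ./a.out and the process count -np 4 are placeholders, not values from this page. The sketch assumes Open MPI's -x option to forward an environment variable to the launched processes (as used elsewhere on this page) and the -mca pml ob1 option as the command-line form of OMPI_MCA_pml=ob1. The commands are only printed here, so the fragment can be checked without starting a job.

```shell
#!/bin/sh
# Placeholders for this sketch only: ./a.out and -np 4.

# Workaround 1: set PSM2_MEMORY and forward it to the launched processes with -x.
export PSM2_MEMORY=large
CMD1="mpirun -np 4 -x PSM2_MEMORY ./a.out"

# Workaround 2: select the ob1 PML on the command line,
# intended to have the same effect as "export OMPI_MCA_pml=ob1".
CMD2="mpirun -np 4 -mca pml ob1 ./a.out"

# Print the commands instead of running them.
echo "$CMD1"
echo "$CMD2"
```

Try one workaround at a time so that you can tell which one resolves the segmentation fault.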

 

  • Error occurs when GPUDirect is used

(2021/08/19 updated)

This issue was fixed in the latest maintenance.

Errors have been reported when GPUDirect is used. They occur when the program exits, whether normally or abnormally.

This error occurs rarely.

It can also trigger a kernel panic.

If you suspect this problem, turn off GPUDirect as follows:

mpirun ... -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=0

 

  • Large-scale job hangs

mpirun has been reported to hang in large-scale jobs.

This seems to be caused by qrsh -inherit, which is fork()'ed by mpirun.

If you suspect this problem, try the following:

* For Open MPI

mpirun -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh ...

* For Intel MPI

export I_MPI_HYDRA_BOOTSTRAP=ssh
unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS

※ Please note that these settings are effective only for f_node.
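The two variants above can be combined into one job-script fragment, sketched below. MPI_FLAVOR and APP are hypothetical names introduced for this illustration, and ./a.out is a placeholder program. The launch command is only assembled and printed, not executed.

```shell
#!/bin/sh
# MPI_FLAVOR is a hypothetical selector for this sketch: "openmpi" or "intelmpi".
MPI_FLAVOR="${MPI_FLAVOR:-openmpi}"
# Placeholder program name.
APP="./a.out"

case "$MPI_FLAVOR" in
    openmpi)
        # Launch remote processes through ssh instead of qrsh.
        LAUNCH="mpirun -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh $APP"
        ;;
    intelmpi)
        # Bootstrap Hydra through ssh and clear any extra bootstrap arguments.
        export I_MPI_HYDRA_BOOTSTRAP=ssh
        unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
        LAUNCH="mpirun $APP"
        ;;
    *)
        echo "unknown MPI_FLAVOR: $MPI_FLAVOR" >&2
        LAUNCH=""
        ;;
esac

# Print the command that would be run.
echo "launch command: $LAUNCH"
```

In a real job script, replace the final echo with the launch command itself and add the process count and other options your job needs.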