Problem with collective communication on GPU memory with OpenMPI

2018.05.31

We have confirmed that MPI_Allgather does not work properly (the communication result is not obtained correctly) when it is performed on GPU memory using OpenMPI 2.1.1 and 2.1.2 as installed on TSUBAME 3.0.
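
To illustrate the affected pattern, the following is a minimal sketch, not the actual test program: an MPI_Allgather issued directly on GPU device buffers through a CUDA-aware OpenMPI. Buffer sizes and variable names are placeholders chosen for the example.

/* Minimal sketch (assumed names and sizes): MPI_Allgather on GPU device
 * buffers with a CUDA-aware OpenMPI.  On the affected OpenMPI 2.1.1/2.1.2
 * installations, the gathered data could come back corrupted.
 * Build with mpicc and link the CUDA runtime (e.g. -lcudart); exact flags
 * depend on the local installation. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;                       /* elements per rank */
    double *d_send, *d_recv;                  /* device (GPU) buffers */
    cudaMalloc((void **)&d_send, n * sizeof(double));
    cudaMalloc((void **)&d_recv, (size_t)n * size * sizeof(double));

    /* Prepare rank-specific data on the host and copy it to the GPU. */
    double *h_send = (double *)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        h_send[i] = rank + i * 1e-6;
    cudaMemcpy(d_send, h_send, n * sizeof(double), cudaMemcpyHostToDevice);

    /* With a CUDA-aware MPI, device pointers are passed directly to the
     * collective; this is the call pattern affected by the problem. */
    MPI_Allgather(d_send, n, MPI_DOUBLE,
                  d_recv, n, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Copy the gathered data back and spot-check it on the host. */
    double *h_recv = (double *)malloc((size_t)n * size * sizeof(double));
    cudaMemcpy(h_recv, d_recv, (size_t)n * size * sizeof(double),
               cudaMemcpyDeviceToHost);
    if (rank == 0)
        printf("recv[0] = %f (expected 0.000000)\n", h_recv[0]);

    free(h_send);
    free(h_recv);
    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}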

We are currently investigating the details of the phenomenon, the conditions under which it occurs, and possible workarounds. We will post an update as soon as more information is available.

(2019.04) This issue was resolved by the maintenance at the end of FY2018.

  • This phenomenon does not occur when CPU memory is used
  • The phenomenon is reproducible: data corruption always occurs under the conditions that trigger it


06.12 Postscript

We found that the problem can be avoided by adding the following options to the mpirun command line as a workaround:
mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allgather_algorithm 2
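
These MCA parameters enable dynamic rule selection in OpenMPI's tuned collective component and force a specific allgather algorithm instead of the default choice; presumably the forced algorithm avoids the code path that corrupts GPU data. The same settings can also be applied through environment variables (OMPI_MCA_coll_tuned_use_dynamic_rules=1 and OMPI_MCA_coll_tuned_allgather_algorithm=2) if modifying the mpirun command line is inconvenient.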

01.08 Postscript

For more details, please refer to the following FAQ:

data corruption by collective communications of GPU buffers on OpenMPI