Data corruption in collective communications of GPU buffers on OpenMPI

Collective communications on GPU buffers are known to cause data corruption with OpenMPI.
(2019.04) This issue was resolved by the maintenance at the end of FY2018.

As a workaround, please try the following mpirun options.

  • MPI_Allgather()

mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allgather_algorithm 2

  • MPI_Alltoall()

mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_alltoall_algorithm 3
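The same MCA parameters can also be supplied through environment variables with the OMPI_MCA_ prefix, which may be more convenient in a job script. The following is a minimal sketch rather than a site-recommended recipe; the rank count and the binary name ./my_app in the commented launch line are hypothetical.

```shell
# Enable dynamic rules so the algorithm selections below take effect.
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
# Same algorithm numbers as in the mpirun options above.
export OMPI_MCA_coll_tuned_allgather_algorithm=2
export OMPI_MCA_coll_tuned_alltoall_algorithm=3

# Hypothetical launch line; replace the rank count and binary with your own.
# mpirun -np 4 ./my_app
```

Options given on the mpirun command line with -mca take precedence over these variables, so the two forms should not be mixed with different values.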

 

If the above does not solve the issue, please try the following.

mpirun -mca pml ob1
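The option lines on this page omit the rank count and the executable. As a sketch of how a complete command line might be assembled, assuming a hypothetical 2-rank job and a hypothetical binary ./my_app, the snippet below only prints the command so it can be checked before submission.

```shell
# Hypothetical values; replace with your own rank count and binary.
NP=2
APP=./my_app

# Compose the full command line with the pml ob1 workaround.
CMD="mpirun -np $NP -mca pml ob1 $APP"

# Print the command for inspection; run it with: eval "$CMD"
echo "$CMD"
```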

 

This issue will be fixed by OPA 10.8 (at the end of fiscal year 2018).