Collective communication on GPU buffers is known to cause data corruption with OpenMPI.
(2019.04) This issue was resolved by the maintenance at the end of FY2018.
As a workaround, please try the following mpirun options for the affected collective.
- MPI_Allgather()
mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allgather_algorithm 2
- MPI_Alltoall()
mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_alltoall_algorithm 3
If the above does not solve the issue, please try the following.
mpirun -mca pml ob1
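The same MCA parameters can also be set through environment variables, which may be convenient in a job script. The following is a minimal sketch, assuming a bash-like shell and OpenMPI's OMPI_MCA_<parameter> environment variable convention; the executable ./your_app and the process count 4 are placeholders that do not come from this note.

```shell
# Same effect as "-mca coll_tuned_use_dynamic_rules 1" on the command line
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
# Algorithm selection for MPI_Allgather() and MPI_Alltoall(), as listed above
export OMPI_MCA_coll_tuned_allgather_algorithm=2
export OMPI_MCA_coll_tuned_alltoall_algorithm=3

# Fallback if the settings above do not help (same as "-mca pml ob1"):
# export OMPI_MCA_pml=ob1

# ./your_app and -np 4 are hypothetical; replace them with your own job.
# The guard only keeps this sketch from failing where mpirun is unavailable.
if command -v mpirun >/dev/null 2>&1 && [ -x ./your_app ]; then
    mpirun -np 4 ./your_app
fi
```

Options given directly on the mpirun command line are expected to take precedence over these variables, so the two forms should not be mixed with conflicting values.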
This issue will be fixed by OPA 10.8 (at the end of fiscal year 2018).