Restart of Omni-Path network (on 7/25)


The fabric manager that manages the Omni-Path network has been malfunctioning since around 7/10, and since then nodes have been unable to join or rejoin the network.
For example, if a compute node fails and is rebooted, it leaves the network, cannot rejoin it, and becomes unusable. Communication between compute nodes that are already part of the network is not affected.

Restarting the fabric manager is expected to resolve this problem. Because a restart may significantly affect jobs, it had been scheduled for the power outage in August. However, between yesterday and today several dozen compute nodes left the network due to software failures, and the loss of nodes can no longer be ignored. We will therefore restart the fabric manager on the following schedule.

1. Date

Wednesday, July 25, 12:15-14:15 * Times may change depending on progress

2. Impact

Job submission: Jobs can be submitted as usual.

Running jobs (file I/O, inter-node communication): These are generally not affected, but communication may be temporarily interrupted.

Queued jobs: During the maintenance window, jobs will not be started and will remain queued. After maintenance is complete, jobs will resume starting in order.