【Trouble report】 Job scheduler failures on 2020.2.15 and 2.20 (Updated on May 27)

2020.2.20

 As described below, a failure occurred in the job scheduler, which manages job submission, deletion, and execution order. The job scheduler has been restored, but the fix to prevent recurrence is expected around the end of the fiscal year.

1. Period

Saturday, February 15, 2020, from around 7:03 to 11:50
Thursday, February 20, 2020, from around 10:16 to 10:22

2. Impact

Jobs could not be submitted or checked from the login node during the above periods. Jobs that had already started executing were not affected, and unused points were returned sequentially.

3. Causes and workarounds
 The master daemon sge_qmaster crashed multiple times on jobcon0 and jobcon1, the two redundant hosts that run the job scheduler. After each crash, the scheduler failed over and recovered automatically.
The master daemon was found to crash when an array job consisting of many tasks was submitted. In addition, the investigation of this failure revealed a separate problem: even an array job with a relatively small number of tasks could fail to be submitted because of a character limit. The cause has been identified in both cases, and work is underway to correct them.

(Updated on May 27) This issue was fixed during the year-end maintenance, so the following workaround is no longer needed.

Until the countermeasure is completed, choose the number of tasks using the table below as a guide. Consider reducing the number of tasks by looping inside the shell script (for example, with a for loop), or consider splitting the job into several submissions.

Average digits in task number    Estimated maximum number of tasks
3 digits                         115
4 digits                         92
5 digits                         75
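
The first workaround (looping inside the shell script) might look like the following sketch. It assumes a Grid Engine array job, where the scheduler sets SGE_TASK_ID for each task. The values of CHUNK and TOTAL, the function names, and the per-item work are hypothetical placeholders, and site-specific options such as resource requests are omitted. Each array task handles a block of 100 items, so 1000 items need only 10 array tasks.

```shell
#!/bin/sh
#$ -cwd
#$ -t 1-10
# Hypothetical sketch: each array task processes CHUNK consecutive items,
# so the array job needs far fewer tasks than there are items.
CHUNK=100     # items handled by one array task (assumed value)
TOTAL=1000    # total number of items (assumed value)

# Print the first and last item index handled by array task $1.
task_range() {
    first=$(( ($1 - 1) * CHUNK + 1 ))
    last=$(( $1 * CHUNK ))
    if [ "$last" -gt "$TOTAL" ]; then last=$TOTAL; fi
    echo "$first $last"
}

# Placeholder for the real per-item work (hypothetical).
process_item() {
    :
}

# SGE_TASK_ID is set by Grid Engine for array jobs; default to 1 for a dry run.
set -- $(task_range "${SGE_TASK_ID:-1}")
for i in $(seq "$1" "$2"); do
    process_item "$i"
done
```

Replacing the body of process_item with the actual command keeps the rest of the script unchanged.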

In addition, depending on the execution time and other conditions, submission may fail even with the number of tasks shown above. If you get the error message "Unable to run job: string is longer than 512, this is not allowed for object names.", you need to reduce the number of tasks further.
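
The second workaround (splitting and submitting) can be sketched as below. This is a dry run that only prints the qsub commands. MAX_TASKS is taken from the guideline table (92 for 4-digit task numbers); the function name split_ranges, the task range 1000-1200, and the script name job.sh are hypothetical. If the 512-character error still appears, lowering MAX_TASKS is the adjustment to try.

```shell
#!/bin/sh
# Hypothetical sketch: split one large array job into several submissions.
MAX_TASKS=92    # guideline value for 4-digit task numbers

# Print "start-end" ranges covering tasks $1..$2, at most MAX_TASKS each.
split_ranges() {
    start=$1
    while [ "$start" -le "$2" ]; do
        end=$(( start + MAX_TASKS - 1 ))
        if [ "$end" -gt "$2" ]; then end=$2; fi
        echo "$start-$end"
        start=$(( end + 1 ))
    done
}

# Dry run: show the submissions instead of executing them.
# job.sh is a placeholder for the actual job script.
for range in $(split_ranges 1000 1200); do
    echo qsub -t "$range" job.sh
done
```

Removing the echo in front of qsub would submit the slices for real.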