Automated System Check

(last update 2019.06.27)

We periodically check the following items:

Check item                                       Target            Monitoring period
UGE down/offline                                 computation node  5 min.; notification after 2 consecutive down detections
Link status, speed, error status of the network  computation node  20 min.
System time synchronization status               computation node  60 min.
Permissions of device files and mount points     computation node  60 min.
Status of memory ECC error                       computation node  60 min.
Status of GPU memory ECC error                   computation node  60 min.
Storage free space                               /apps/
Storage accessibility                            computation node  15 min.
                                                 login             5 min.
Status of the remaining process                  computation node  10 min.
Scheduler response                               qstat command     5 min.
Status of load balance                           computation node  UGE aggregates when the load exceeds the threshold value
Thermal state                                    computation node  30 min.
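
The UGE down/offline check notifies only after two consecutive down detections, which avoids alerts for a single missed poll. A minimal sketch of that rule is shown below; it is an illustration under stated assumptions, not the monitoring system actually in use. The class name, method name, and node names are hypothetical.

```python
from collections import defaultdict


class ConsecutiveDownNotifier:
    """Signal a notification only after `threshold` consecutive down observations.

    Hypothetical helper: the real monitoring implementation is not described
    in this document.
    """

    def __init__(self, threshold=2):
        self.threshold = threshold
        # Number of consecutive "down" observations seen so far, per node.
        self._streak = defaultdict(int)

    def observe(self, node, is_down):
        """Record one polling result; return True if a notification is due."""
        if not is_down:
            # Any healthy observation resets the streak for that node.
            self._streak[node] = 0
            return False
        self._streak[node] += 1
        # Notify exactly once, at the moment the streak reaches the threshold.
        return self._streak[node] == self.threshold
```

With a 5 min. polling period and a threshold of 2, a node that stays down would trigger a notification on the second poll, and a node that recovers in between starts a fresh streak.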

We also check the following items at job start. If a problem is detected on a node, another node is assigned to the job.

Check item                             Support status
Number and amount of CPU, Memory, GPU  available
OPA HFI status                         available
Lustre mount, Lustre status            available
GPU health check (dcgmi health -c)     available
NVMe SSD existence check               available
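
The job-start checks can be pictured as a list of pass/fail tests run on a node before a job starts, where any failure marks the node as unsuitable. The sketch below is only an illustration: the function names, the check list, and the NVMe device path are hypothetical, and it assumes that a non-zero exit status from `dcgmi health -c` indicates a problem, which this document does not confirm.

```python
import os
import subprocess


def command_ok(argv, run=subprocess.run):
    """Return True if the command exits with status 0.

    Assumption: exit status 0 means "healthy". A missing executable or a
    timeout is treated as a failed check.
    """
    try:
        return run(argv, capture_output=True, timeout=60).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def failed_checks(checks):
    """Run every (name, callable) pair and return the names that failed."""
    return [name for name, check in checks if not check()]


# Hypothetical check list covering two of the items in the table above.
JOB_START_CHECKS = [
    ("gpu_health", lambda: command_ok(["dcgmi", "health", "-c"])),
    # The device path is an assumption; actual device names vary by system.
    ("nvme_present", lambda: os.path.exists("/dev/nvme0n1")),
]
```

A job prolog script could call `failed_checks(JOB_START_CHECKS)` and report the node as unhealthy when the returned list is not empty, so that the scheduler picks another node.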