(last update 2019.06.27)
We periodically check the following items:
Check item | Target | Monitoring period |
---|---|---|
UGE down/offline | computation node | 5min. (notified after 2 consecutive down detections) |
Network link status, speed, and error status | computation node | 20min. |
System time synchronization status | computation node | 60min. |
Permissions of device files and mount points | computation node | 60min. |
Status of memory ECC error | computation node | 60min. |
Status of GPU memory ECC error | computation node | 60min. |
Storage free space | /apps/ /home/ /gs/hs0/ /gs/hs1/ /gs/hs2/ | 60min. |
Storage accessibility | computation node | 15min. |
Storage accessibility | login node | 5min. |
Status of remaining processes | computation node | 10min. |
Scheduler response | qstat command | 5min. |
Status of load balance | computation node | UGE aggregates when the load exceeds the threshold value |
Thermal state | computation node | 30min. |
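
The "2 consecutive down detections" rule for the UGE down/offline check can be illustrated with a small sketch. This is not the actual monitoring implementation; the class name `ConsecutiveDownNotifier`, its `observe` method, and the node name in the usage note are hypothetical and exist only for illustration. It assumes one poll result per node every 5 minutes and fires a notification once, when the streak of down results reaches the threshold.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ConsecutiveDownNotifier:
    """Hypothetical helper: notify once a node is down N polls in a row."""

    threshold: int = 2  # from the table: 2 consecutive down detections
    _streak: Dict[str, int] = field(default_factory=dict)

    def observe(self, node: str, is_down: bool) -> bool:
        """Record one poll result; return True if a notification should fire."""
        if not is_down:
            # Any healthy poll resets the streak for this node.
            self._streak[node] = 0
            return False
        self._streak[node] = self._streak.get(node, 0) + 1
        # Fire exactly once, at the moment the streak reaches the threshold.
        return self._streak[node] == self.threshold
```

With the default threshold, calling `observe("node-a", True)` twice in a row returns `False` and then `True`; a single transient failure followed by a healthy poll never triggers a notification.
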
We also check the following items at job start. If any problems are detected, another node will be assigned.
Check item | Support status |
---|---|
Number of CPUs and GPUs, amount of memory | available |
OPA HFI status | available |
Lustre mount, Lustre status | available |
GPU health check (dcgmi health -c) | available |
NVMe SSD existence check | available |
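
As a rough illustration of what a job-start Lustre mount check might look like, the sketch below parses mount-table text in the Linux `/proc/mounts` format and reports which required mount points are not mounted as type `lustre`. This is an assumption-laden example, not the check that is actually run: the function name `missing_lustre_mounts` is hypothetical, and treating `/gs/hs0/`, `/gs/hs1/`, and `/gs/hs2/` as the required Lustre mount points is an assumption based on the storage paths listed above.

```python
from typing import Iterable, List


def _normalize(path: str) -> str:
    """Strip trailing slashes so '/gs/hs0/' and '/gs/hs0' compare equal."""
    return path.rstrip("/") or "/"


def missing_lustre_mounts(mounts_text: str, required: Iterable[str]) -> List[str]:
    """Return the required paths that are not mounted with fstype 'lustre'.

    mounts_text is expected in /proc/mounts format:
    <device> <mount point> <fstype> <options> <dump> <pass>
    """
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "lustre":
            mounted.add(_normalize(fields[1]))
    return [p for p in required if _normalize(p) not in mounted]
```

On a Linux node, the function could be fed the contents of `/proc/mounts`, and a non-empty result would mark the node as failing the check so that another node is assigned.
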