【Failure Report】 Occurred on Jan 19, 2024: Large-scale compute node outage due to CDU failure (Fixed)

2024-01-22

A CDU (Cooling Distribution Unit) failed, causing many compute nodes to stop.

  • On Friday, Jan 19, 2024, at around 20:30, a CDU (Cooling Distribution Unit) failed and its pumps stopped. As a result, all 144 compute nodes under the control of that CDU (hostnames starting with r3 or r4) stopped.
  • The failure is still ongoing as of 13:30 on Monday, Jan 22, 2024.
  • Repairs are expected to take several days. Until recovery, only a reduced number of compute nodes will be available.
  • To keep job execution opportunities fair while fewer nodes are available, the following limits are changed until service is restored.
    • Maximum number of nodes available for reservation: 72 nodes
    • Maximum number of jobs per user: 20 jobs (weekdays) / 80 jobs (weekends)
  • (2024-01-23 12:45) TSUBAME points consumed by regular jobs and by reservations that were stopped completely have been refunded.
  • (2024-01-25 16:30) Because of the time needed to arrange the repair work, repairs are expected to be completed on Thursday, Feb 1 at the earliest.
  • (2024-02-01 14:00) CDU repair is complete, and we will boot the compute nodes this afternoon.
  • (2024-02-01 14:35) The affected compute nodes have been booted and are now available.

 

The jobs that may have been affected are as follows:


Regular jobs

15057856, 15057861, 15057866, 15062875, 15062882, 15062887, 15062889, 15062891, 15062893, 15062894, 15062897, 15063823, 15063824, 15063825, 15063826, 15063828, 15063829, 15063830, 15063831, 15063832, 15063833, 15063835, 15063836, 15065873, 15065876, 15065891, 15066034, 15066035, 15066043, 15066078, 15066090, 15066091, 15066439, 15066833, 15066873, 15066884, 15067665, 15068532, 15068568, 15068569, 15068574, 15068579, 15068626, 15068627, 15068634, 15068689, 15068814, 15068815, 15068886, 15068916, 15068917, 15068918, 15068920, 15068928, 15068929, 15068946, 15068949, 15068951, 15068979, 15068983, 15068985, 15068986, 15068987, 15069032, 15069062, 15069063, 15069064, 15069066, 15069075, 15069078, 15069169, 15069640, 15069757, 15069758, 15069761, 15069762, 15069763, 15069946, 15069975, 15069976, 15069978, 15069981, 15069990, 15070008, 15070016, 15070021, 15070030, 15070035, 15070041, 15070042, 15070047, 15070049, 15070051, 15070055, 15070062, 15070063, 15070064, 15070067, 15070070, 15070072, 15070076, 15070098, 15070380, 15070649, 15070662, 15070664, 15070702, 15070716, 15070722, 15070728, 15070730, 15070745, 15070748, 15070771, 15070860, 15070970, 15070975, 15070976, 15070977, 15070978, 15070985, 15070986, 15070990, 15070991, 15071043, 15071044, 15071050, 15071063, 15071073, 15071079, 15071080, 15071081, 15071082, 15071107, 15071120, 15071121, 15071122, 15071123, 15071154, 15071157, 15071162, 15071165, 15071289, 15071339, 15071342, 15071348, 15071359, 15071366, 15071367, 15071397, 15071399, 15071400, 15071403

Reserved jobs

11750, 11845, 11847: Deleted due to the failure.
11811, 11830, 11836, 11842: Number of nodes reduced due to the failure.
11812, 11813, 11814: Included failed nodes, but healthy nodes were reallocated because the reservations had not yet started.