2024.1.22
CDU (Cooling Distribution Unit) failed, and many compute nodes stopped.
2023.8.22
Due to an instantaneous voltage drop in Meguro-ku, Tokyo, at around 13:06 on Tuesday, August 22, the entire Ookayama campus suffered a power outage; power was restored at around 13:22. The outage mainly affected the TSUBAME compute nodes, and restoration work was still in progress as of 15:00. The restoration status will be posted here.
Recovery was completed at around 16:20, and job processing has resumed. (Added at 16:30)
2022.8.25
A system failure occurred and has been resolved.
1.summary
・Paid jobs could not be submitted
・The TSUBAME portal could not be operated
2.period
2022/8/21 (Sun) from around 00:03 to around 17:04
3.detail
A failure occurred and has been resolved.
1.overview
The interactive queue was unavailable.
2.period
2021/10/21 (Thu) from around 13:05 to around 14:12
3.detail
The following failure occurred and has been restored.
1.summary
Access to /gs/hs1 stalled or failed.
2.period
2021/5/22 (Sat) from around 10:00 to around 17:10; after that, the service continued in a degraded state.
3.detail(5/25 updated)
2020.9.10
The following failure occurred and has been resolved.
1.overview
Access to /gs/hs0 was slow or stalled.
2.period
2020/9/8 (Tue) from around 11:15 to around 14:00 (on the login nodes, until around 17:25; on some compute nodes, until around 9/9 13:15)
3.detail
4.affected jobs
The jobs likely to be affected by the failure are listed below. TSUBAME points spent on these jobs will be returned automatically at a later date.
2020.2.20
As shown below, a failure occurred in the "Job Scheduler" that manages job submission, deletion, and execution order. The job scheduler has been restored, but corrections to prevent recurrence are expected around the end of the fiscal year.
1.period
Saturday, February 15, 2020, from around 7:03 to 11:50
Thursday, February 20, 2020, from around 10:16 to 10:22
2.impact
2020.2.7
Users reported that, for some jobs, the TSUBAME points settled temporarily after execution (*1) were not settled correctly (*2), and all points were returned. (In some cases, a job's point consumption was treated as 0 points regardless of its execution time.)
2019.9.20
The compute nodes stopped and recovered as follows.
1.period
from 2019/9/19 14:08 to 2019/9/20 10:00
2.content
While we were carrying out recovery work on Unit 2 of the cooling tower, which had been stopped due to the failure on 9/9, all the cooling towers stopped abnormally. As a result, all compute nodes stopped as well, and the TSUBAME portal was also stopped manually.
2019.9.20
Compute nodes stopped and recovered as follows.
We apologize for the delay in posting this information.
1.period
from 2019/9/9 4:56 to 2019/9/10 9:00
2.content
Around 4:56 on September 9, 2019, the rooftop cooling tower leaked due to the typhoon and cooling stopped.
As a result, the compute nodes were stopped one after another to protect the system from rising temperatures, and all running jobs were stopped.