2022.8.25
A system failure occurred and has been restored.
1.summary
・Paid jobs could not be submitted
・The TSUBAME portal could not be operated
2.period
2022/8/21(Sun) from around 00:03 to around 17:04
3.detail
A failure occurred and has been restored.
1.overview
The interactive queue was unavailable.
2.period
2021/10/21(Thu) from around 13:05 to around 14:12
3.detail
The following failure occurred and has been restored.
1.summary
Access to /gs/hs1 stalled or failed.
2.period
2021/5/22(Sat) from around 10:00 to around 17:10; after that, operation continued in a degraded state.
3.detail(5/25 updated)
2020.9.10
The following failure occurred and has been restored.
1.overview
Access to /gs/hs0 was slow or stalled.
2.period
2020/9/8(Tue) from around 11:15 to around 14:00 (until around 17:25 on the login node, and until around 9/9 13:15 on some compute nodes)
3.detail
4.affected jobs
The jobs likely to be affected by the failure are listed below. TSUBAME points spent on these jobs will be returned automatically at a later date.
2020.2.20
As shown below, a failure occurred in the "Job Scheduler", which manages job submission, deletion, and execution order. The job scheduler has been restored, but a fix to prevent recurrence is not expected until around the end of the fiscal year.
1. period
Saturday, February 15, 2020, from around 7:03 to 11:50
Thursday, February 20, 2020, from around 10:16 to 10:22
2.impact
2020.2.7
Users reported that, for some jobs, the TSUBAME points settled temporarily after execution (*1) were not settled correctly (*2), and all points were returned. (In some cases, a job's point consumption was treated as 0 points regardless of its execution time.)
2019.9.20
The compute nodes stopped and recovered as follows.
1.period
from 2019/9/19 14:08 to 2019/9/20 10:00
2.content
During recovery work on Unit 2 of the cooling tower, which had been stopped due to the 9/9 failure, all the cooling towers stopped abnormally. As a result, all compute nodes stopped as well, and the TSUBAME portal was stopped manually.
2019.9.20
Compute nodes stopped and recovered as follows.
We apologize for the delay in posting this information.
1.period
from 2019/9/9 4:56 to 2019/9/10 9:00
2.content
Around 4:56 on September 9, 2019, the rooftop cooling tower leaked due to the typhoon, and cooling stopped.
As a result, the compute nodes were stopped sequentially to protect the system from rising temperatures, and all running jobs were stopped.
2019.6.12
A network failure occurred as follows and has been recovered.
1. summary
A network failure occurred between the campus network and login0.
2. period
June 11 2019 around 14:10 - 17:00
3. detail
From around 14:10 on June 11, 2019, the login node (login0) was not accessible from the internet.
It was recovered by restarting login0 at 18:49.
2018.12.25
A storage failure occurred and has now been temporarily recovered.
1.Summary