Problems | TSUBAME Computing Services

【Failure Report】 Occurred on Jan 19, 2024: Large scale compute node outage due to CDU failure. (Fixed)
2024-01-22
A CDU (Cooling Distribution Unit) failed, and many compute nodes stopped.
【failure report】 2023.8.22 : System failure due to power outage at Ookayama Campus
2023.8.22

Due to an instantaneous voltage drop in Meguro-ku, Tokyo, at around 13:06 on Tuesday, August 22, the entire Ookayama campus suffered a power outage. Power was restored at around 13:22. The outage mainly affected the TSUBAME compute nodes, ~~and restoration work is still in progress at 15:00. The restoration status will be posted here.~~

Recovery was completed at around 16:20 and job processing resumed. (Added at 16:30)
【failure report】2022.8.21 : paid jobs unavailable, TSUBAME portal unavailable
2022.8.25

A system failure occurred and service has been restored.

１．summary

　・Paid jobs could not be submitted
　・The TSUBAME portal could not be operated

２．period

　2022/8/21 (Sun) from around 00:03 to around 17:04

３．detail
【failure report】2021.10.21：interactive queue failure
A failure occurred and has been restored.

１．overview

　The interactive queue was unavailable.

２．period

　2021/10/21 (Thu) from around 13:05 to around 14:12

３．detail
【failure report】2021.5.22 : /gs/hs1 failure(5/25 update)
The following failure occurred and has been restored.

１．summary

　Access to /gs/hs1 stalled or failed.

２．period

　2021/5/22 (Sat) from around 10:00 to around 17:10; after that, the file system operated in a degraded state.

３．detail(5/25 updated)

　
【failure report】2020.9.8：/gs/hs0 trouble
2020.9.10

The following failure occurred and has been restored.

１．overview

　The access to /gs/hs0 was slow/stalled.

２．period

　2020/9/8 (Tue) from around 11:15 to around 14:00 (on the login node, until around 17:25; on some compute nodes, until around 13:15 on 9/9)

３．detail

４．affected jobs

　The jobs likely to be affected by the failure are listed below. TSUBAME points spent on these jobs will be returned automatically at a later date.
【Trouble report】 Occurred on 2020.2.15 and 2.20: job scheduler failure (Updated on May 27)
2020.2.20

　As shown below, a failure occurred in the "Job Scheduler" that manages job submission, deletion, and execution order. The job scheduler has been restored, but corrections to prevent recurrence are expected around the end of the fiscal year.

1. period

Saturday, February 15, 2020, from around 7:03 to 11:50
Thursday, February 20, 2020, from around 10:16 to 10:22

２．impact

【Failure report】 TSUBAME points not consumed after execution of some jobs

2020.2.7

Users reported that, for some jobs, the TSUBAME points temporarily settled after execution (*1) were not settled correctly (*2), and all points were returned.
(In some cases, the point consumption of a job was treated as 0 points regardless of the execution time.)

"trouble report": 2019.9.19 : all of the compute nodes stopped due to cooling system trouble
2019.9.20

The compute nodes stopped and recovered as follows.

１．period

from 2019/9/19 14:08 to 2019/9/20 10:00

２．content

While we were carrying out recovery work on Unit 2 of the cooling tower, which had been stopped since the failure on 9/9, all the cooling towers stopped abnormally. As a result, all compute nodes stopped as well, and the TSUBAME portal was also stopped manually.
"trouble report": 2019.9.9 : all of the compute nodes stopped due to cooling system trouble
2019.9.20

Compute nodes stopped and recovered as follows.

We apologize for the late posting of this information.

１．period

from 2019/9/9 4:56 to 2019/9/10 9:00

２．content

Around 4:56 on September 9, 2019, the rooftop cooling tower leaked due to the typhoon and cooling stopped.

As a result, the compute nodes were stopped sequentially to protect the system from rising temperatures, and all running jobs were stopped.
