Past batch queue information

2024.2.1 14:35 The CDU repair has been completed and the system has recovered.

2024.1.22 11:30 It turned out that the CDU repair will take a long time; we need several days to get it fixed.

2024.1.19 22:40 144 nodes are down due to a CDU (cooling unit) failure. They will be fixed on Jan 22 or later.

2024.1.19 21:35 Some compute nodes are down and we are investigating the issue. The estimated downtime is not yet known.

2022.10.21 12:05 The issue below was resolved at around 11:50 today.

2022.10.21 11:45 Since 11:00 a.m. today, new jobs cannot be submitted or started because of a scheduler failure. Existing jobs appear to continue running.

2022.8.21 20:15 The scheduler trouble was fixed at around 17:04. All jobs can now be submitted.

2022.8.21 12:00 Since 00:00 today, jobs that consume TSUBAME points cannot be submitted because of job scheduler trouble.

2021.8.18 18:20 The problem with external connections from the compute nodes has been resolved.

2021.8.18 18:00 Compute nodes are currently unable to access destinations outside TSUBAME.

2021.8.18 17:20 Paid access to the interactive-job-only queue is now available to users outside the university. For details, see here.

2020.11.13 13:30 Today at around 10:35 a.m., the job scheduler stopped responding on all compute nodes. We restarted the job scheduler master at around 11:40 and the situation has improved; jobs can now be submitted without any problems. We are currently investigating the detailed cause and the impact on jobs.

2020.9.10 14:30 An announcement about the /gs/hs0 storage and batch queue failure, including an overview and our response, has been posted.

2020.9.9 17:30 We finished rebooting the affected compute nodes at around 13:15 on 9.9. We will announce how affected jobs will be handled as soon as they have been identified.

2020.9.8 17:30 We configured the scheduler so that nodes having trouble accessing /gs/hs0 are not assigned to new jobs.

2020.9.8 16:30 We will reboot the login nodes from 17:00 to fix the access problem to /gs/hs0.

2020.4.10 13:50 The node reservation issue has been resolved and the reservation service has resumed.

2020.4.9 15:00 New node reservation requests are temporarily suspended again while we investigate the issue.

2020.4.9 15:00 Some reservation queues reject job submissions prior to their start time.

2019.9.20 10:00 Service resumed at 10:00.

2019.9.20 8:50 Service is scheduled to resume at around 10:00 today.

2019.9.19 16:25 Recovery today is unlikely; we are aiming to resume service tomorrow.

2019.9.19 15:50 There is currently no estimate for recovery. TSUBAME points will be refunded later for jobs affected by this failure.

2019.9.19 14:00 All compute nodes are stopped while the cooling system damaged by the typhoon is repaired. We are currently working on recovery.

2019.09.09 18:10 The compute nodes and the TSUBAME portal will be recovered at 9:00 a.m. on 2019.9.10.

2019.09.09 13:30 Recovery of the compute nodes is expected tomorrow.

2019.09.09 09:45 Due to cooling system trouble caused by the typhoon, all compute nodes are stopped. We currently have no estimate for the recovery time.

2019.09.09 09:30 Due to cooling system trouble caused by the typhoon, some compute nodes are stopped.

2018.3.9 10:50   Compute resources are being occupied by a specific user's jobs, and we have contacted that user to adjust the submission volume. Because we currently cannot impose a limit at the system level, we ask for your cooperation in adjusting submissions so that the total number of running nodes stays at around 72.

2018.2.21 9:30 There may be an error when submitting a job.

2018.2.19 9:27 A failure occurred in the power supply system and multiple nodes stopped at around 16:05 on 2/18. Service was restored by 0:35 on 2/19, but the cause is still under investigation. We will post details in a notice later.

2018.2.9 18:00  The problem in which resources were not allocated properly when q_core was specified with 2 or more nodes in parallel was resolved today.

2018.2.9 18:00  Job scheduler commands such as qsub and qdel were temporarily unavailable from 17:00 to 17:45 today. We are investigating the root cause; the scheduler currently appears to be working normally.

2018.2.1 17:00  If 2 or more nodes are specified for q_core, resource allocation is not performed correctly. Please see the Announcements for details.

2018.1.26 10:10  At around 0:32 today, a water leak was detected in a water-cooled rack.
After we confirmed the leakage at Rack 1 at 7:39 today, we performed an emergency shutdown of 72 compute nodes, the smallest shutdown unit.
Jobs running on these nodes were forcibly terminated. Details will be posted in the Announcements later.

2017.12.29 6:00 Some data on the job monitoring page are shown as 0, but the job scheduler itself appears to be working normally. (This problem was fixed at 11:30 on 12.29.)

2017.12.20 16:30  We found a problem in which no GPU is assigned for the resource type s_gpu. We will stop starting new jobs that use s_gpu. The estimated recovery time is not yet determined.

2017.12.1 18:00 Since around 15:45, jobs have intermittently failed to be submitted. Submission appears to work at the moment, but the issue may recur over the weekend.

2017.11.1 14:30 Today's maintenance was completed at noon.

2017.10.31 10:00 The Omni-Path network failure has been resolved.

2017.10.24 10:00 Service is suspended for TSUBAME3.0 Grand Challenge execution.

2017.9.23 19:55   By switching the current group to a TSUBAME group other than tsubame-users with the newgrp command, you can execute UGE commands such as qsub.

2017.9.23 19:30   It seems that a failure has occurred in the batch scheduler. It is not possible to submit new jobs, display job status, or delete jobs.

2017.9.12 12:00 The Omni-Path network recovered at around 10:50.

2017.9.12 9:30 An Omni-Path network problem has occurred. The Fabric Manager and compute nodes will be restarted between 9:30 and 12:00 on 9/12.

2017.9.11 20:30 An Omni-Path network problem has occurred. About 200 compute nodes cannot access the storage (Lustre, NFS) normally.

2017.9.1 9:30 The batch queue has restarted. You need to set your TSUBAME points again.

2017.8.31 15:30 Scheduled maintenance started.

2017.8.17 12:00 We are regulating the amount of batch queue submissions so that resources are not monopolized. For details, see here.