【Recovered】 Significant Delay of Resumption of TSUBAME

A major problem (*) has occurred in the cooling system, and the restart work has been delayed considerably.
We cannot restart on schedule, and a concrete restart time has not yet been decided. Please watch for further announcements.
* The cause has been identified.

We apologize for the last-minute notice.

(8/17 13:45 Additional notes)

Regarding cooling: TSUBAME 3 has two water cooling systems, one introduced with T2 and an evaporative cooling tower system introduced with T3. The problem occurred in the water cooling system introduced with T2, which is mainly used to cool equipment other than the compute nodes, such as the storage.

Although the cooling function is gradually recovering, we cannot predict when /gs/hs0 and /gs/hs1 will return to service, because the Lustre file system as a whole is not yet in a sufficient operating state.

However, TSUBAME service excluding /gs/hs0 and /gs/hs1 is expected to resume within 1 or 2 hours (although we cannot promise this), because the cooling is sufficient if only the home directory and authentication servers are running.

(8/17 14:00 Additional notes)

In order to resume service excluding /gs/hs0 and /gs/hs1, we are currently checking the consistency between the clearing of jobs submitted before the power failure and the billing (point consumption) information. If operation were resumed with the job information retained, a large number of jobs would fail and points would be consumed.

The load is expected to concentrate once the restart of operation is decided. We ask for your cooperation in not putting a heavy load on the login nodes and the home directory.

(8/17 14:10 Additional notes)

The jobs have been cleared. We are doing the final check for partial resumption.

(8/17 14:15 Additional notes)

As for the cause of the cooling problem: because the chiller and the pump stopped during the power outage, a large amount of impurities was generated in the cooling water, which clogs the filter and stops the circulation. Currently the filter clogs in about 2 hours, and we clean it each time.

It was already known that impurities accumulate in the cooling water, but until now the accumulation had never been enough to stop operation. Regular removal is an issue to address in the future.

(8/17 14:20 Additional notes)

We began partial resumption at 14:15. However, /gs/hs0, /gs/hs1, and the TSUBAME portal cannot be used.

(8/17 14:25 Additional notes)

As for returning /gs/hs0 and /gs/hs1 to service: at the moment we have no means of efficiently removing the impurities from the cooling water, so there is no prospect of resumption.

(8/17 14:30 Additional notes)

Regarding the batch queue: we are currently checking it.

(8/17 14:40 Additional notes)

We released the batch queue and all the compute nodes at 14:37.

(8/17 14:55 Additional notes)

We changed the title from "The resumption of TSUBAME is expected to be delayed greatly" to "Significant Delay of Resumption of TSUBAME".

(8/17 16:00 Additional notes)

We are considering an emergency discharge (and resupply) of the cooling water.

(8/17 16:50 Additional notes)

Restoration within today is impossible. It is highly likely that the staff will take turns replacing the cooling water over Saturday and Sunday, and that the storage restoration work will be done on Monday.

(8/17 18:30 Additional notes)

Today's updates to this page have ended.

(8/20 10:30 Additional notes)

We have confirmed that water quality has improved through simple flushing, by replacing the cooling water a total of six times over Saturday and Sunday. At the moment the water gauge runs out, but we plan to decide tomorrow morning whether to start Lustre, while checking whether the filter clogs again over time.

(8/21 11:30 Additional notes)

The storage service will resume at 13:00 today, the 21st.

(8/21 13:30 Additional notes)

We resumed the storage service at 13:15. The portal is still under maintenance.