Known issues

 

Issues

Control ID Confirmed Updated Detail
T3KI-20180817 2018/8  

In Abaqus/Explicit, the following error may occur during parallel execution:

Abaqus Error: Abaqus/Explicit Packager exited with an error - Please see the 
status file for possible error messages if the file exists.
Begin MFS->SFS and SIM cleanup
Fri 17 Aug 2018 10:40:23 AM JST
Run SMASimUtility
Fri 17 Aug 2018 10:40:24 AM JST
End MFS->SFS and SIM cleanup
Abaqus/Analysis exited with errors

As a workaround, please load the module that contains the countermeasure, as follows:
module load abaqus/2017_explicit

T3KI-20180731 2018/7  

The job fails with the following error:

xxx:yyy terminated with signal 11 at PC=0 SP=7fffffffa558.  Backtrace:
/usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaabee125a8]
/lib64/libpthread.so.0(+0x10b20)[0x2aaaaacdeb20]

There appear to be several possible triggers for this error.

As a workaround, please try one of the following.

1. export I_MPI_FABRICS=shm:tcp

2. If the error occurs with 28 processes per node (ppn = 28), try ppn = 16.

3. If the error occurs with mpirun, try mpiexec.hydra instead.
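
The three workarounds above can be sketched together in a job script fragment. This is only an illustration: the entry asks you to try one of them, not necessarily all three, and the node count and the program name ./a.out are hypothetical placeholders. The launch line is printed rather than executed so that the fragment can be checked outside a job.

```shell
#!/bin/sh
# Workaround 1: select the shm:tcp fabric setting given in the entry above.
export I_MPI_FABRICS=shm:tcp

# Workaround 2: lower the number of processes per node from 28 to 16.
PPN=16
NODES=2                 # hypothetical node count, for illustration only
NP=$((PPN * NODES))     # total number of MPI processes

# Workaround 3: launch with mpiexec.hydra instead of mpirun.
# ./a.out is a placeholder for your own executable.
LAUNCH="mpiexec.hydra -ppn $PPN -n $NP ./a.out"

# Print the command; replace echo with the command itself in a real job.
echo "$LAUNCH"
```

Applying the workarounds one at a time makes it easier to tell which of them actually avoids the error for a given application.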

T3KI-20180629 2018/6  

Timeout errors occur randomly in MPI collective functions with both Intel MPI and OpenMPI in large-scale runs.

As a workaround, set the following option before mpirun/mpiexec.hydra in the job script.

export HFI_UNIT=0

T3KI-20180531 2018/5   MPI_Allgather has been confirmed not to work properly (the communication result is not obtained correctly) when it is performed on GPU memory using OpenMPI 2.1.1 or 2.1.2 as installed on TSUBAME 3.0.

As a workaround, pass the following options to mpirun.
mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allgather_algorithm 2

Detailed information: Problem with collective communication for GPU memory with OpenMPI
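
As an alternative way of passing the same two settings, Open MPI also reads MCA parameters from environment variables prefixed with OMPI_MCA_. The sketch below sets them that way in a job script. It has not been verified on TSUBAME 3.0, and the process count and ./a.out are hypothetical placeholders.

```shell
#!/bin/sh
# Same two MCA parameters as the mpirun options above, given through
# the environment instead of the command line.
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_allgather_algorithm=2

# Hypothetical launch line; -np 4 and ./a.out are placeholders.
MPIRUN_CMD="mpirun -np 4 ./a.out"

# Print the command; replace echo with the command itself in a real job.
echo "$MPIRUN_CMD"
```

Setting the parameters in the environment keeps the mpirun line unchanged, which may be convenient when the launch command is generated by another tool.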
 
T3KI-20180420 2018/4  

If OpenMPI sends and receives a large number of 2-byte messages, a segmentation fault may occur. As a workaround, set the following environment variable.

export PSM2_MQ_RNDV_HFI_THRESH=128000

T3KI-20180301 2017/12 2018/4/5 Points may be consumed excessively. We will return them automatically in sequence (individual users will not be notified). (Apr 5, 2018) We regularly detect affected cases and return the points while working toward a fundamental solution.
T3KI-20171222A 2017/12/22  

The login shell cannot be changed from csh to another shell with the chsh command.
Workaround:
 1. Switch to bash by running the following command in csh:
  % bash
 2. Run chsh in bash:
  $ chsh /bin/bash

T3KI-10171207A 2017/11 2018/4/5 When job submission fails, points are still placed on temporary hold, and the hold is not displayed on the portal (a problem that consumes excess points). We have started returning the points automatically in sequence (individual users will not be notified). (Apr 5, 2018) We regularly detect affected cases and return the points while working toward a fundamental solution.
T3KI-20171130A 2017/11 2018/4/5 In the portal, "Processing" is displayed even for a finished job, and the temporary hold on the points is not released (a problem that consumes excess points). (Apr 5, 2018) We regularly detect affected cases and return the points while working toward a fundamental solution.
T3KI-20171031A 2017/10 2018/4/5 The TSUBAME point usage status may show a negative value in the usage history. (Apr 5, 2018) We regularly detect affected cases and return the points while working toward a fundamental solution.
T3KI-20170926A 2017/9/26 2017/12/21

When fnode was specified, a problem occurred in the resource map (CPU core and GPU topology) on the batch system. For example, only 21 of the 28 physical cores could be used.

We recognize this as a batch system malfunction and are working with the vendor on a fix.
(28 Sep.) The vendor has also confirmed the problem, and the workaround it suggested is being tested.
(29 Sep.) A temporary workaround was implemented at 15:30 on the 28th.
(21 Dec.) The problem was fixed by the job scheduler update on December 19. We are monitoring whether it operates correctly.

T3KI-20170914A 2017/8/1   Because there is a problem with the operation of the reservation function, we are not making it available for now.
T3KI-20170913A 2017/9/13  

From Ubuntu 16.04, logging in to TSUBAME 3.0 with SSH key authentication fails with the error "sign_and_send_pubkey: signing failed: agent refused operation."

This can be resolved by registering the key in advance with the ssh-add command in the terminal.
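
A minimal sketch of that step, wrapped in a small helper so that a missing key file is reported before ssh-add is called. The function name register_key and the key path in the usage comment are hypothetical. The host name is the login host that appears elsewhere on this page.

```shell
#!/bin/sh
# Register a private key with the running ssh-agent.
# Returns 1 without calling ssh-add if the key file does not exist.
register_key() {
    key="$1"
    if [ ! -f "$key" ]; then
        echo "no such key file: $key" >&2
        return 1
    fi
    ssh-add "$key"
}

# Example usage (key path and USER-ID are placeholders):
#   register_key "$HOME/.ssh/id_rsa" && ssh USER-ID@login.t3.gsic.titech.ac.jp
```

The key stays registered only while the agent is running, so the step may need to be repeated after logging out of the Ubuntu session.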

T3KI-20170829D  2017/8/29  

When starting Ansys Fluent with Cygwin/X, a segmentation fault may occur.

Please try another combination, such as PuTTY + Xming.

T3KI-20170829C 2017/8/25  

COMSOL fails to start and shows an error message when connecting to TSUBAME 3.0 from XQuartz on macOS Sierra 10.12.

This is very likely a malfunction caused by a compatibility problem with OpenGL on the Mac. Please connect with the following options: "$ ssh -YC login.t3.gsic.titech.ac.jp -l USER-ID".

T3KI-20170829B 2017/8/29  

A TSUBAME account cannot be created unless a TITECH common mail address has been created.

Please obtain a TITECH mail address first. We will also fix the portal to indicate this.

T3KI-20170824A 2017/8/24  

Modules may not be loaded on the second and subsequent nodes.

We recognize that LD_LIBRARY_PATH is not passed on to the second and subsequent nodes because of a problem in the job scheduler. Please add the -v option, which specifies an environment variable, to the UGE script and set the library path required for the calculation.
Example: #$ -v LD_LIBRARY_PATH=/apps/t3/sles12sp2/cuda/8.0/lib64
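
The following job script fragment sketches where the -v line goes and adds a small check that the variable actually arrived. The resource request lines (f_node=2, h_rt) and the helper name has_lib_dir are assumptions for illustration. Only the -v line is taken from the entry above.

```shell
#!/bin/sh
#$ -cwd
#$ -l f_node=2
#$ -l h_rt=0:10:00
#$ -v LD_LIBRARY_PATH=/apps/t3/sles12sp2/cuda/8.0/lib64

# Return 0 if directory $1 is listed in LD_LIBRARY_PATH, 1 otherwise.
has_lib_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether the library directory is visible on this host.
if has_lib_dir /apps/t3/sles12sp2/cuda/8.0/lib64; then
    echo "$(uname -n): CUDA library path is set"
else
    echo "$(uname -n): CUDA library path is missing" >&2
fi
```

As written, the check runs only on the node that executes the script. To inspect the other nodes, it would have to be started through the same launcher as the application.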

T3KI-20170822A 2017/8/18  

The terminal freezes when using qrsh.

This is because flow control is enabled in the terminal settings, so a specific key operation (Ctrl + s) cannot be used when working on the remote host through rsh. Execute the following command before running qrsh.
stty -ixon
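
If the command is placed in a shell startup file so that it does not have to be typed each time, it may help to guard it, because stty reports an error when standard input is not a terminal. A minimal sketch follows. The function name disable_flow_control is hypothetical, and putting it in a startup file is an assumption, not something this page prescribes.

```shell
#!/bin/sh
# Turn off XON/XOFF flow control, but only when standard input is a
# terminal, so that non-interactive shells are not affected.
disable_flow_control() {
    if [ -t 0 ]; then
        stty -ixon
    fi
}

disable_flow_control
# After this, start qrsh as usual.
```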

T3KI-20170818B 2017/8/18   Jobs exceeding 24 hours can be submitted even if not on reserved nodes.
T3KI-20170818A 2017/8/18 2017/9/14

LAMMPS cannot be executed on multiple nodes.

This is a bug in the job scheduler UGE.
(14 Sep) At the beginning of the job script, specify the -v option like below to avoid the bug.

 #$ -v LD_LIBRARY_PATH=/apps/t3/sles12sp2/cuda/8.0/lib.375.66:/apps/t3/sles12sp2/cuda/8.0/lib64 

T3KI-20170802B 2017/8/1 2017/8/3

Clicking the link in the group invitation mail does not work properly.

Depending on the e-mail program, the "=" at the end of the link text may not be included in the link. In that case, copy the full text including the "=" and paste it into the browser.

 

Resolved

Control ID Confirmed Updated Detail
T3KI-20171221A 2017/12/20 2018/4/5 Since the job scheduler update on December 19, a GPU cannot be secured with s_gpu. Supply of resource type_s is suspended because the cause cannot be identified. (2018/1/11) Although we have reproduced the defect in verification environments, no concrete solution has been found and the resolution time is undecided. (2018/4/5) It was solved during the year-end maintenance.
T3KI-20180201 2017/2/1 2017/2/13 If q_core=2 or more is specified, 4 cores cannot be assigned properly on each node. There is no problem with q_core=1.
(2/13) Fixed.
T3KI-20170925A 2017/9/23 2017/11/1

qsub, qstat, and qdel have been failing since 23 Sep. This is a bug in the batch scheduler.

As a temporary workaround, specify your group with the newgrp command. Click here for details.

(11/1) It was fixed in today's scheduler version upgrade.

T3KI-20170829A 2017/8/28 2017/9/14

In some cases the job usage history on the portal differs from that on the login node. For example, STATUS may be displayed as "処理中(r)[being processed]" even though the job has ended.

The information on the login node is correct. We will also fix the cumulative usage points. (xx Sep.) Fixed.

 

T3KI-20170825A 2017/8/23 2017/8/25

In some cases the local scratch area is not created when multiple nodes are used.

Resolved by fixing Batch Scheduler UGE.

T3KI-20170822B 2017/8/1 2017/8/23

When applying for TSUBAME 3.0 on the portal, some users cannot create an account and see the error "An unexpected error has occured, please contact the system administrator". Changing the browser did not help.

(23 Aug.) Cause identified and fixed.

T3KI-20170803A 2017/8/3 2017/9/14

The following error is displayed at job submission and qrsh execution.
"Unable to run job: master got unknown command from JSV: "ERROR".Exiting."

This is a temporary error. Please run the command again.
(14 Sep.) It seems to have been resolved now. Please contact us if it recurs.

T3KI-20170802F 2017/8/2 2017/9/14

An error occurs when switching languages on the TSUBAME portal.

(xx Sep.) It has been fixed. Please contact us in case of problems. 

T3KI-20170802D 2017/8/1 2017/9/14

Even if I change the password on the portal, it does not show whether it succeeded or failed.

(xx xx) A dialog for notifying the change results is now displayed.

T3KI-20170802C 2017/8/1 2017/9/14

I cannot connect to the storage service (CIFS).

(xx Sep.) Connections are now possible.

T3KI-20170802A 2017/8/1 2017/8/7

New applications cannot be submitted from some browsers.

(7 Aug.) We have fixed the portal on 3 August, 2017. If it does not work, please use another browser as a workaround.

Windows 10 + IE -> Firefox

MacOS Sierra + Chrome 57 -> Safari

Also, please confirm that JavaScript is enabled. 

T3KI-20170803B 2017/8/3 2017/8/4

The group invitation mail expires in 30 minutes, which is too short.

The expiration period will be changed to one week.

(4 Aug.) The expiration date of group invitation mail has been revised to one week. Please also refer to T3KI-20170802B.

Not a bug, by design

Control ID Confirmed Updated Detail
T3KI-20170802E 2017/8/1 2017/9/14

An error may occur in the inquiry form.

An error may occur when a character string that matches a system command, such as "chmod", is detected. For now, please replace such strings with double-byte (full-width) characters.

T3Ki-20170926A 2017/8/1  

Applications from holders of an access card with an 8-digit number beginning with A are not approved.

Because anyone who wants such an access card can obtain one, you need to submit a document proving your identity. Please visit the account acquisition page.