General


  • Warning: If your SSH private key is leaked, your account may be misused by a third party. Please protect your private key by setting a passphrase.

    This article describes how to create an SSH key pair for TSUBAME3 using PuTTYgen, which is installed together with PuTTY.
    MobaKeyGen, bundled with MobaXterm, has the same functionality and UI.

    When you launch PuTTYgen, a dialog similar to the following appears:
    (PuTTYgen screen)

    1. Press "Generate" to create SSH key-pair
      You can adjust key-pair configurations using "Parameters" box, but you don't have to do so in most cases.
    2. Press "Save private key" to save a private key file of generated key-pair for future login.
      After registering public key to TSUBAME3, anyone who can read this private-key file can log in to TSUBAME3 with your account. Please keep this file safe, DO NOT carry with USB stick, or send via e-mail etc.
      You can force to enter passphrase to use this key file, by inputting them in "Key passphrase" and "Confirm passphrase" boxes before saving.
    3. Copy all texts in "Public key for pasting..." box, paste it into "Enter SSH public key code" in "Register SSH public key" menu in TSUBAME portal, and then submit this.

    When you want to log in to TSUBAME, either load the file saved in step 2 into Pageant beforehand (.ppk files are associated with Pageant if you installed PuTTY with the installer's default options), or specify the file in the "Private key file for authentication" box under the "Connection" - "SSH" - "Auth" menu of the PuTTY connection settings dialog.




  • TSUBAME 3.0 is a supercomputer operated and managed by the Global Scientific Information and Computing Center (GSIC) of the Tokyo Institute of Technology. TSUBAME 3.0 has a theoretical peak performance of 47.2 PFlops (half precision) and is expected to be the largest supercomputer in Japan, handling a wide range of workloads including big data and AI in addition to conventional High Performance Computing. In addition, by pursuing high density and power saving, it achieves a theoretical PUE of 1.033.




  • OpenMPI can be used in combination with either the GNU or the Intel compiler.

    Load the module of the compiler you want to use before loading the OpenMPI module.

    The GNU compiler is the one provided by the OS, version 4.8.5.
    Please check the available version with the following command.

    $ module av


    Below is the usage method.

    1. OpenMPI for Intel

    $ module load intel 
    $ module load cuda 
    $ module load openmpi/2.1.1 
    $ mpicc -V 
    Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.4.196 Build 20170411 Copyright (C) 1985-2017 Intel Corporation. All rights reserved.

     

    2. OpenMPI for GNU

    $ module purge    # unload the modules that are already loaded
    $ module load cuda
    $ module load openmpi/2.1.1
    $ mpicc -v
    Using built-in specs. COLLECT_GCC=/usr/bin/gcc COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.8/lto-wrapper Target: x86_64-suse-linux Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8 --enable-ssp --disable-libssp --disable-plugin --
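
    After loading the modules as shown above, you can compile and run an MPI program in the usual way. A minimal hedged sketch (the source file hello.c and the process count are illustrative; run MPI programs inside a job, not on the login nodes):

    $ mpicc hello.c -o hello    # compiled with the compiler selected by the loaded modules
    $ mpirun -np 4 ./hello      # launch 4 MPI processes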




  • This is caused by flow control triggered by specific input characters, which is enabled in the default terminal settings.

    Flow control is a mechanism that temporarily suspends data transfer to prevent the receiving side from being overwhelmed, for example when the transmission rate exceeds the rate at which packets can be received. In general, Ctrl+S is the control character used to suspend the transfer and Ctrl+Q the one used to resume it.

    When you edit a file interactively with Emacs and save it, you press Ctrl+S, but since this is also a flow control character, packets stop being transferred and the terminal appears to be frozen. To recover, press Ctrl+Q.

    To disable flow control, execute the following command before running an interactive job.

    stty -ixon

    If you want to always disable flow control, add the above command to .bashrc in your home directory.
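
    For example, the following appends a guarded version of the command to .bashrc (a sketch; the [ -t 0 ] test avoids errors in non-interactive shells):

    $ echo '[ -t 0 ] && stty -ixon' >> ~/.bashrc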




  • Use of TSUBAME is limited to education, research, clerical work, and social contribution purposes only. It cannot be used for applications that directly lead to private financial gain, for example mining cryptocurrency based on blockchain technology.




  • The basic structure of module files is described below.


    - Module files are listed as [Application name]/[Version].


    - If you do not specify a version, the preset default version is loaded.

    - If more than one version exists and the default is set, "(default)" is displayed after the version.

    Example:

    $ module load intel
    $ module list

       Currently Loaded Modulefiles:

        1) intel/17.0.4.196


    - Applications with dependencies, such as MPI, can be used after loading the required modules in advance.

    Example: namd

    If required modules are missing, an error is displayed:

    $ module load namd

       namd/2.12(3):ERROR:151: Module 'namd/2.12' depends on one of the module(s) 'intel/17.0.4.196 intel/16.0.4.258'
       namd/2.12(3):ERROR:102: Tcl command execution failed: prereq intel

    After loading the intel and cuda modules, load namd:

    $ module load intel
    $ module load cuda
    $ module load namd
    $ module list

      Currently Loaded Modulefiles:

      1) intel/17.0.4.196   2) cuda/8.0.44        3) namd/2.12
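
    To check in advance which modules an application depends on, you can display its module file with the module show command (a standard Environment Modules command; the output format may vary):

    $ module show namd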




  • UGE assigns virtual CPU IDs / GPU IDs according to the specified number of resources, except for f_node.

    • In case of CPU

    Take as examples the resource type s_core, which reserves only one CPU core, and the resource type q_core, which reserves 4 CPU cores:
    When s_core=7 is specified, seven nodes are allocated, with 1 core allocated on each node.
    When q_core=7 is specified, seven nodes are allocated, with 4 cores allocated on each node.

    • In case of GPU

    In the case of the resource type s_gpu, which reserves only one GPU:
    When s_gpu=4 is specified, 4 nodes are reserved, and the GPU on each node is virtually assigned as GPU 0.
    Reserving 4 GPUs does not mean they appear as GPU 0, 1, 2, 3.

    With h_node, a resource type that reserves 2 GPUs, the 2 GPUs are allocated within one node, and in this case they appear as GPU 0 and 1.
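
    To see which GPU IDs your job actually received, you can run nvidia-smi inside the job; a minimal hedged sketch of a job script (the resource type and wall time are illustrative):

    #!/bin/bash
    #$ -cwd
    #$ -l s_gpu=1
    #$ -l h_rt=0:10:00
    nvidia-smi -L    # lists the GPU(s) visible to this job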




  • The difference between login node and compute node is as follows.

    1. Hardware

                     Login node                                  Compute node
      # of nodes     2                                           540
      CPU            Intel Xeon E5-2637 v4 3.50GHz 16core x 2    Intel Xeon E5-2680 v4 2.40GHz 14core x 2
      Memory         64GiB                                       256GiB
      GPU            -                                           NVIDIA Tesla P100 x4
      Interconnect   Intel Omni-Path HFI 100Gbps x2              Intel Omni-Path HFI 100Gbps x4
      NVMe SSD       -                                           Intel SSD DC P3500 2TB

     

    2. Software

                     Login node                        Compute node
      OS             SUSE Linux Enterprise 12 SP2      SUSE Linux Enterprise 12 SP2
      Kernel         4.4.74-92.29-default              4.4.74-92.29-default
      GPU Driver     -                                 nvidia-375.66

    The login nodes are shared servers and are not intended to be used for computation. Please avoid high-load processing such as program execution on the login nodes; execute it on compute nodes through the job scheduler.
    If you accidentally run a high-load program on a login node, please terminate it immediately with the kill command.




  • Warning: If your SSH private key is leaked, your account may be misused by a third party. Please protect your private key by setting a passphrase.

    The procedure to create an SSH key pair on Linux / Mac / Windows (Cygwin or OpenSSH) is as follows.
    Please check man ssh-keygen for the differences between key types.
    Some key types may be unsupported depending on your version of OpenSSH.

    ecdsa key type:

    $ ssh-keygen -t ecdsa

    RSA key type:

    $ ssh-keygen -t rsa

    ed25519 key type:

    $ ssh-keygen -t ed25519

    When you execute one of the above commands, you will be asked for the save location as follows.
    Unless there is a special reason to avoid it, such as the same file name already being used for another purpose, just press the Enter key to use the default value.
    (If you are already using an SSH key pair for other sites, you can reuse the same file for TSUBAME.)

    Generating public/private keytype key pair.
    Enter file in which to save the key $HOME/.ssh/id_keytype: (No need to type filename)[Enter]

    Then you will be prompted for a passphrase, so enter it.

    Enter passphrase (empty for no passphrase): (Set passphrase; What you type will not appear in screen) [Enter]

    Re-enter your passphrase for confirmation.

    Enter same passphrase again: (Enter the same passphrase again for confirmation; What you type will not appear in screen) [Enter]

    A key pair is created and saved to two files. The upper line shows the location of the private key, and the lower line shows that of the public key. Register the public key via the TSUBAME portal.

    Your identification has been saved in $HOME/.ssh/id_keytype
    Your public key has been saved in $HOME/.ssh/id_keytype.pub.
    The key fingerprint is:
    SHA256:(fingerprint) username@hostname
    The key's randomart image is:
    (Some text specific to the generated key pair will be shown)

    Check the file with the following command.

    $ ls ~/.ssh/ -la
    drwx------  2 user group     512 Oct  6 10:50 .
    drwx------ 31 user group    4096 Oct 6 10:41 ..
    -rw-------  1 user group     411 Oct 6 10:50 private_key
    -rw-r--r--  1 user group      97 Oct 6 10:50 public_key

    If the permissions are not correct, fix them with the following commands.

    $ chmod 700 ~/.ssh
    $ chmod 600 ~/.ssh/private_key



  • Please go through the following checklist before contacting us for help.

    1. Is your account name correct?
    Please confirm that you are using your TSUBAME3 account.
    We receive an increasing number of inquiries from users who fail to log in because they are using their TSUBAME2 account.
    Please refer to the following link for how to get a TSUBAME3 account.
    http://www.t3.gsic.titech.ac.jp/en/getting-account 


    2. Did you register a public key in the correct format?
    Please confirm that you have registered a public key in OpenSSH format to TSUBAME portal.
    You cannot log in to TSUBAME3 if you registered a public key in PuTTY format.

    Please refer to the following links for how to create a key pair.
    http://www.t3.gsic.titech.ac.jp/en/node/37
    http://www.t3.gsic.titech.ac.jp/en/node/79

    Please refer to the following for how to register the public key.
    https://helpdesk.t3.gsic.titech.ac.jp/manuals/portal.ja/prepare/#ssh_key 


    3. Is the command you entered correct? (for Linux / Mac / Windows (Cygwin))
    Please make sure your login name and the path to your private key file (*) are specified correctly on the command line.

    $ ssh <TSUBAME3 account>@login.t3.gsic.titech.ac.jp -i <private key>

    Example)
    When your login name is gsic_user and your private key is located at ~/.ssh/t3-key:

    $ ssh gsic_user@login.t3.gsic.titech.ac.jp -i ~/.ssh/t3-key

    *: If your private key file is stored in one of the following default locations under your home directory (i.e., you have not changed the location from the default), you can omit the "-i <private key>" part; alternatively, you can set the key location in ~/.ssh/config as shown below.

    • .ssh/id_rsa, .ssh/id_dsa, .ssh/id_ecdsa, .ssh/id_ed25519
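
    For reference, a hedged example of an OpenSSH client configuration (~/.ssh/config) that lets you omit both the login name and the -i option; the host alias t3 and the key path are illustrative:

    Host t3
        HostName login.t3.gsic.titech.ac.jp
        User gsic_user
        IdentityFile ~/.ssh/t3-key

    With this configuration, you can log in simply with "ssh t3".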

    Please refer to the following man command for options of ssh command.

    $ man ssh 


    4. Does the symptom reproduce in another terminal environment?
    There are various types of terminal software for Windows.
    Please check whether it reproduces even with another terminal software.
    If it does not reproduce, it may be a problem specific to that software.
    Please understand that we cannot respond to inquiries about such software-specific problems.


    If none of the above solves your problem, please contact us with the following information.


    ■ Operating system (Windows 10, Debian 10, macOS Sierra 10.12.6, and so on)
    ■ Terminal software and its version (Cygwin, PuTTY, RLogin, and so on)
      For details on how to check the version, please refer to the terminal software manual.
      For Linux / macOS, please send the SSH version. You can check it with the following command.

    $ ssh -V

    ■ The operation you tried. If you get an error, please send the details.
      For Linux / macOS, please send the output of the ssh command with the -v option (debug mode), including the command line itself.
      Example)
        When your account name is gsic_user and your private key is located at ~/.ssh/t3-key:

    $ ssh gsic_user@login.t3.gsic.titech.ac.jp -i ~/.ssh/t3-key -v




  • If you accidentally executed a program on a login node, where program execution is prohibited, terminate it according to the following procedure.
     

    See "How to terminate the job submitted to the batch job scheduler" for the deletion of the jobs submitted to the batch job scheduler.

    1. Confirm the PID of the process

    Display information about the process you want to terminate with the top and/or ps commands.

    The following example checks the PID with the top command.

    $ top

    Tasks: 1457 total,   1 running, 1441 sleeping,  11 stopped,   3 zombie
    %Cpu(s):  78.8 us,  1.3 sy,  0.0 ni, 96.8 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
    KiB Mem:  65598488 total, 18563160 used, 47035328 free,        8 buffers
    KiB Swap:  7812092 total,  7422860 used,   389232 free.  6553100 cached Mem
      PID USER   PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
    20680 GSIC   20   0 1157756 5.056g  20628 R 1467.0 1.688   0:01.88 python
        1 root   20   0  479464 294444   2940 S  0.000 0.449  76:02.24 systemd
        2 root   20   0       0      0      0 S  0.000 0.000  11:25.50 kthreadd
        3 root   20   0       0      0      0 S  0.000 0.000   9:48.70 ksoftirqd/0
        9 root   20   0       0      0      0 S  0.000 0.000   0:00.00 rcu_bh
       10 root   rt   0       0      0      0 S  0.000 0.000   0:45.45 migration/0
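
    Alternatively, you can look up the PID with the ps command; a minimal hedged example (the output columns are PID, CPU usage, elapsed time, and command name):

    $ ps -u $USER -o pid,pcpu,etime,comm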

    2. Termination of the process

    Terminate the process with kill command.
    Specify the PID as the argument. (20680 in this example)

    $ kill 20680

    3. Confirmation that the process is terminated
    Use the top and ps commands to confirm that the process has terminated.
    If the process is no longer displayed, it has terminated correctly.
    Proceed to step 4 if the process does not terminate.

    4. Force termination of the process
    Execute the command below if the process does not terminate.

    $ kill -9 20680

    After that, use the top and ps commands to confirm that the process has terminated.




  • This section shows the flow of setting up an environment for running programs.

    There are 6 steps necessary to use TSUBAME3.

    When steps 1 and 2 are done, you can log in.
    To submit jobs, you also need to complete steps 3 - 5.
    If you need storage beyond the 25GiB home directory, also do step 6.

    1. Getting an account [each user]
    2. SSH key pair generation and the public key registration [each user]
    3. Creation of a group [group administrator only]
    4. Addition of users to the group [group administrator and its members]
    5. Point purchase [group administrator only]
    6. Setup of group disk [group administrator only]

    * The brackets indicate who needs to perform each step.

    For details, refer to TSUBAME Portal User's Guide.

    For storages provided on TSUBAME3, refer to TSUBAME3.0 User's Guide "3. Storage system".

    If you can not login, please troubleshoot according to Cannot login to TSUBAME3.




  • File transfer by rsync, scp, and sftp is available on TSUBAME3.
    As with login, you need to authenticate with the SSH private key that pairs with the SSH public key registered in the TSUBAME3 portal.
    Also, please check the settings of the application you are using carefully, as some applications may time out.

    To install a file transfer application

    If you are using MobaXterm or RLogin, it is easier to use the built-in file transfer function of these software.

    If you are using other software such as PuTTY for connection, you need to install a file transfer application such as FileZilla or WinSCP that supports sftp and rsync protocols.
    In this case as well, you need to authenticate using the SSH private key that pairs with the SSH public key registered in the TSUBAME3 portal.
    For Filezilla and WinSCP, you can use the .ppk format key files that you usually use for PuTTY.
    For details on how to use each software, please refer to the manual of each software.

    If the optional feature "OpenSSH Client" in Windows 10 is enabled, you can use the scp and sftp commands from Command Prompt or PowerShell.

    If you are using Linux/Mac/Cygwin (Windows) etc. (rsync, scp, sftp commands)

    In these environments, rsync, scp, and sftp commands are available.

    The three methods, rsync, scp, and sftp, are described below.

    rsync:

    To transfer from the local to the remote host, execute the following command.
    If you set the standard path/file name as the key pair location, the -i option is not required.

    $ rsync -av --progress -e "ssh -i Private_Key_File -l Login_Name" Local_Directory Remote_Host:Remote_Directory

    Local_Directory is the transfer source and Remote_Host:Remote_Directory is the transfer destination. For example, the command for the user with login name "GSICUSER00" to copy the current directory to /gs/hs0/GSIC on TSUBAME 3.0 using the private key ~/.ssh/ecdsa is as follows.

    $ rsync -av --progress -e "ssh -i ~/.ssh/ecdsa -l GSICUSER00" ./ login.t3.gsic.titech.ac.jp:/gs/hs0/GSIC

    For details such as how to specify the transfer source and transfer destination, please run the following command:

    $ man rsync

    scp:

    To transfer from the remote host to the local machine, execute the following command (add the -r option to copy a directory recursively).
    If the key pair is saved at the standard path/file name, the -i option is not required.

    $ scp -i Private_Key_File Login_Name@Remote_Host:Remote_Directory Local_directory

    Replace the placeholders with values suitable for your situation. For example, the command for the user with login name "GSICUSER00" to copy /gs/hs0/GSIC on TSUBAME 3.0 to the current directory using the private key ~/.ssh/ecdsa is as follows.

    $ scp -r -i ~/.ssh/ecdsa GSICUSER00@login.t3.gsic.titech.ac.jp:/gs/hs0/GSIC .

    For details such as how to specify the transfer source and transfer destination, please run the following command:

    $ man scp

    sftp:

    To transfer interactively, execute the following command.
    If you set the standard path/file name as the key pair location, the -i option is not required.

    $ sftp -i Private_Key_File Login_Name@Remote_Host

    For example, the command for the user with login name "GSICUSER00" to start an interactive transfer session using the private key ~/.ssh/ecdsa is as follows.

    $ sftp -i ~/.ssh/ecdsa GSICUSER00@login.t3.gsic.titech.ac.jp

    For details such as how to specify the transfer source and transfer destination, please run the following command:

    $ man sftp

     

    To use CIFS access

    In addition, the group disks can be accessed via CIFS, but only from on-campus terminals.

    The CIFS address is \\gshs.t3.gsic.titech.ac.jp.
    Please note that even within the campus network, CIFS may be blocked by routers along the path, in which case it cannot be used.


    For the storage provided on TSUBAME3, refer to "3.3. Storage service (CIFS)" in the TSUBAME3.0 User's Guide.




  • The answer depends on what you are a beginner at.

    1. Beginners of UNIX/Linux
    To use TSUBAME3, users are expected to be proficient with UNIX/Linux; the handbooks are written on this assumption.

    If you do not understand the content of the handbooks, please read a UNIX/Linux beginner's book at the library and learn how to use UNIX shells and commands.

    You should find this helpful.

    There are also various publications on how to operate "terminal emulator" software. Please check them according to the software you use.

    First, learn how to operate UNIX/Linux, then read our guidebooks, and then check section 3 below.


    2. Beginners of supercomputers
    If you have used UNIX/Linux but have never used a job scheduler, please read section 5 "Job Scheduler" of the TSUBAME 3.0 User's Guide.

    3. Beginners of TSUBAME3 (users who used TSUBAME 2.5 in the past)
    Technical specifications differ between TSUBAME 3.0 and TSUBAME 2.5.

    TSUBAME 2.5 binaries do not work on TSUBAME 3.0, and a binary obtained by simply recompiling source code written for TSUBAME 2.5 on TSUBAME 3.0 may not work either.

    Please check the technical specifications of TSUBAME 3.0 and, if your program does not work, recompile it with appropriate modifications.

    The TSUBAME 3.0 system configuration is described in Hardware, System Software, Application Software, and "Overview of TSUBAME3" in "Introduction to TSUBAME (Linux basics)".

    Please also check the "Migration from TSUBAME 2.5" section of the FAQ and the TSUBAME 3.0 User's Guide.

    4. Beginners of ISV application software
    Please check the guide of each application. In addition, workshops for TSUBAME 3.0 are held regularly; please check the seminar page.




  • For TSUBAME3, a session timeout is set as a security measure.
    Sessions that have no input for a certain period of time are disconnected.
    Even if a GUI application is started and being operated, the session will be disconnected if there is no input to the terminal.

    If you want to avoid this, please enable keep-alive on the terminal side.
    Please check your terminal software's user guide for the keep-alive setting.
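
    For OpenSSH clients, a hedged example of a keep-alive setting in ~/.ssh/config (the 60-second interval is illustrative):

    Host login.t3.gsic.titech.ac.jp
        ServerAliveInterval 60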




  • When trying to access a group disk from Windows Explorer, the following may be displayed.
    This problem may occur for accounts created early (before September 2017).

    (CIFS connection error dialog)

    This is because the password of the TSUBAME account has expired.

    Please update your password from the TSUBAME Portal Page.




  • This message indicates there is no space left in either your home directory or a group disk.

    When you face it, you should delete unused files or purchase an additional group disk to keep enough free disk space.

     

    Please note that in some cases temporary files are generated in the home directory, and an application sometimes needs more than 25 GB of disk space for its temporary files (25 GB is the capacity of the home directory).

    To avoid running out of disk space, we recommend using the local scratch area or the shared scratch area, rather than the home directory, as the location for temporary files.

     

    Related FAQ




  • About group disk

    The group disk is shared storage whose capacity is set via the TSUBAME portal for each TSUBAME group, as described in the TSUBAME3.0 User's Guide "3.2 High-speed storage area".

    • Usage period: From purchase date to the end of the fiscal year (end of March).

     For example, if you purchase 10TB on 1 May, 3,960,000 points are required (36,000 points x 10TB x 11 months (until the end of March)).
        Even if you buy 10TB on 31 May, the last day of the month, it costs the same 3,960,000 points as a purchase on the 1st of the month.

    • Purchase unit: 1TB (2,000,000 inodes per TB)
      Up to 300TB per group.

     If the capacity is unused, you can get points back by reducing the purchased capacity.
     For example, if you purchase 1TB in April and delete all data in May to reduce the capacity, 396,000 points (36,000 points x 1TB x 11 months (until the end of March)) will be returned.

        Reference:TSUBAME Portal User's Guide "10. Management of Group Disk"


    What is the group disk grace period ?

    Group disks are reset once at the end of the fiscal year, after which all group disks enter a grace state in which data can only be read or deleted.
    This period is called the grace period, and it usually lasts until around the middle of April (17 April this year).

    Reference:TSUBAME Portal User's Guide "10.2. About the validity period of the group disk" 

    If data from the previous year remains and you purchase capacity after the grace period, the following applies.
    For example, suppose you purchased 50TB in the previous year and used 45TB of it.

    1) When 45TB is deleted during the grace period and the used capacity is 0:
         You can purchase from 1TB, which is the minimum capacity.

    2) When 25TB is deleted during the grace period and the used capacity is 20TB:
        You can purchase from 20TB or more.

    3) When nothing is deleted during the grace period (the used capacity remains 45TB):
        You can purchase from 45TB or more.

    *If you do not need the previous year's data, please delete it during the grace period.


    Related FAQ

    Checking the usage of group disks with command
    Can not establish CIFS connection to the group disk
    "Disk quota exceeded" error is output




  • The GPU clock can be changed only with f_node.

     

    • Display available clock frequencies

    $ nvidia-smi -q -d SUPPORTED_CLOCKS

    • Changing the clock

    $ nvidia-smi -ac <memory clock>,<graphics clock>

    ex.)
    $ nvidia-smi -ac 715,999

    • Resetting the clock

    $ nvidia-smi -rac

     

    The target device is specified with the -i option.
    For details, see the command help.
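
    For example, to apply the clock setting shown above only to device 0 (a usage sketch combining the options above):

    $ nvidia-smi -i 0 -ac 715,999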




  • The IP address range of the compute node gateway server is as follows.
    131.112.3.250-131.112.3.253

    When running computations on TSUBAME that use a campus or university license server, please configure the license server so that communication from the above address range is permitted.

    Please keep in mind that the above addresses may be changed without notice due to operational circumstances.

    If your software requires communication with a license server outside of TSUBAME (e.g., in a laboratory), please first confirm that the license server can be reached from a network outside of both TSUBAME and the license server's own network, and then contact us with the following information.

    • Global IP address of the license server
    • Port number of the license server (or all ports if there are more than one)
    • IP address of the host where the communication test was performed



  • Please consider the following topics to improve the performance of data transfer between TSUBAME and external computers.

    Pack the files to appropriate size

    Large numbers of small files reduce the transfer speed. Pack such files into archives of about 1GB each using the tar command, as in the example below.
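
    A minimal hedged example (the directory and archive names are illustrative):

    $ tar czf part-001.tar.gz data/part-001/    # pack one batch of small files into a single archive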

    Change transfer protocols

    If you do not get enough speed with scp / sftp, consider using the rsync or CIFS (Tokyo Tech users only) protocols.

    For more details on the CIFS connection, please refer to the "Storage Service (CIFS)" section of the TSUBAME User's Guide.

    Remove the bottleneck on the network route

    • If you have old LAN cables (CAT-3 or CAT-5 (not CAT-5e)), switching hubs, or routers whose link speed is lower than 1000 Mbps, replace them with newer ones.

    • When using a router (WiFi router, NAT router, broadband router, etc.), connect your computer to the external network (in Tokyo Tech, IP address starting with 131.112 or 172.16-31) directly.

    For details of the network at Tokyo Tech, please contact the network administrator of the laboratory. If you are not sure, please contact the branch manager for each building or organization.

    (Tokyo Tech Users Only) Use the iMac terminal of Education Computer Systems

    If it is difficult to change the network configuration, you can bring your HDD to the GSIC and connect it to the iMac terminal of the Education Computer Systems in the exercise room to transfer the data. Please check the opening hours.

    Terminal room location and Opening Hours (in Japanese)

    Hardware (in Japanese)




  • Here is an FAQ on common Linux errors.
    For details on how to use the commands described here, please check the man command, etc.

    1.No such file or directory

    The required file or directory does not exist.
    It occurs when a nonexistent file or directory name is specified, when there is a typo, or when the path is specified incorrectly.
    Also, depending on the application, it may occur when the line feed code is CR+LF (Windows format).

    Measures
    Please review the file and directory names carefully.
    For the newline character issue, please also check the FAQ "The job status is "Eqw" and it is not executed.".


    There is a related error as follows:
    error while loading shared libraries: ****.so: cannot open shared object file: No such file or directory
    This error occurs when a library required by the program does not exist or cannot be read.

    Measures
    Please check the required libraries with the ldd command, as shown below.
    Possible remedies include setting the environment variable LD_LIBRARY_PATH, explicitly specifying the library path at compile time, and so on.
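
    A minimal hedged example (a.out is an illustrative binary name); libraries that cannot be resolved are reported as "not found":

    $ ldd ./a.out | grep "not found"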


    2.command not found

    The command you entered cannot be found.
    This happens when the environment variable PATH is not set correctly or when the command does not exist.
    On TSUBAME, it typically occurs when the module command is used but the /etc/profile.d/modules.sh file has not been loaded.

    Measures
    If the module command is not available, execute the following command beforehand.

    $ . /etc/profile.d/modules.sh

    If the software is installed by yourself, check the environment variable PATH.


    3.Permission denied

    You are not authorized to perform the operation you attempted.
    On Linux, user and group permissions are set on a per-file / per-directory basis.
    Check the permissions of the file or directory you want to read, write, or execute with the following command.
    (This example checks the file hoge.)

    $ ls -l hoge

    Measures
    If you are trying to create files in system directories such as /etc or /lib, create them in your own directories instead.
    If the error occurs in a user directory such as a group disk, check the permissions and correct them as needed.


    4.Disk quota exceeded

    Please check the FAQ "How to solve "Disk quota exceeded" error".


    5. Out Of Memory 

    This error occurs when memory runs out.

    Measures
    Change the resource type to one with more memory capacity.
    Distribute the memory usage across multiple nodes with MPI, etc.

    Related FAQ: Check the detail of an error message printed in the log file


    Related FAQ
    "Disk quta exceeded" error is output
    The error when executing the qrsh command
    Check the detail of an error message printed the log file
    "Warning: Permanently added ECDSA host key for IP address 'XXX.XXX.XXX.XXX' to the list of known hosts." in the error log
    The range of support by T3 Helpdesk about the program error such as segmentation fault
    Error handling for each ISV application




  • The following SSH clients on Windows are available to connect TSUBAME.

    OpenSSH client (Windows 10 functionality)

    OpenSSH client can be installed via [Apps]-[Manage optional features] section in Settings app.

    The ssh, ssh-keygen, etc. commands (same as on Linux) are available after the installation.

    PuTTY

    Official Site

    PuTTY is a free SSH Client software.
    Please refer to this article to generate the SSH key.

    Windows Subsystem for Linux

    Linux environments can be set up on Windows by downloading a Linux distribution (such as Ubuntu or openSUSE) from the Windows 10 Store.

    The ssh and ssh-keygen commands are available in that environment.

     

    Cygwin

    Official Site

    Cygwin provides a pseudo-Linux environment on Windows.

    The ssh and ssh-keygen commands are available in that environment.




  • The login nodes have a limit of 50 processes per user.
    Therefore, if you create processes exceeding the limit, you will get an error like this.
    For more information, please refer to "Please refrain from occupying the CPU on the login nodes".




  • Note: This article is about group disks (/gs/hsX/); do not run the following samples in your home directory.

    Users are not allowed to change the owner of their files. Therefore, change the group permissions so that the files can be read and written by group members. The points are:

    • Change permissions for all files and directories below the directory, not just the top-level directory.
    • Add read (r) as well as write (w) permissions to the files. Without write (w) permission, they cannot be deleted later.
    • Directories need not only read (r) but also write (w) and execute (x) permissions. A directory cannot be accessed without execute (x) permission.
       

     Some example commands are shown below. Depending on the original permissions of the file, some errors may occur, in which case, try re-running the command until the output no longer changes.

    Find your own directories under /gs/hsX/tgX-XXXXXX/ and make them readable and writable by group members.

    find /gs/hsX/tgX-XXXXXX/ -type d -user $USER ! -perm -2770 -print0 | xargs -r0 chmod -v ug+rwx,g+s

    Find your own files under /gs/hsX/tgX-XXXXXX/ and make them readable and writable by group members.

    find /gs/hsX/tgX-XXXXXX/ -type f -user $USER ! -perm -660 -print0 | xargs -r0 chmod -v ug+rw

    Find your own files under /gs/hsX/tgX-XXXXXX/ and match the ownership group to the TSUBAME group.

    find /gs/hsX/tgX-XXXXXX/ -user $USER ! -group (TSUBAME group name) -print0 | xargs -r0 chgrp -v (TSUBAME group name)

     

    We have prepared a script that automatically executes the above commands. Please note that we do not guarantee the operation of this script, and you use it at your own risk.

    module load takeovertool
    cd /gs/hsX/tgX-XXXXXX
    fixperm

     




  • The advantage of the rsync command is that it transfers only the differences. If the transfer is interrupted for any reason, you can simply start it again, and if you run it again after some time, only the files whose content has changed are transferred. Data deleted from the source can also be deleted at the destination for complete synchronization (see the --delete example below).

     An example command is shown below. It's a good idea to check the log or run it multiple times, in case the command fails along the way.

     

    Synchronize TSUBAME with the data of the terminal on your local PC.

    rsync -auv (source directory) (your login name)@login.t3.gsic.titech.ac.jp:(full path of the destination directory)

    Synchronize TSUBAME data to the terminal on your local PC.

    rsync -auv (your login name)@login.t3.gsic.titech.ac.jp:(full path of the source directory) (destination directory)
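
    If you also want files deleted from the source to be removed at the destination (the complete synchronization mentioned above), the --delete option can be added; use it with care, since it removes files at the destination.

    rsync -auv --delete (source directory) (your login name)@login.t3.gsic.titech.ac.jp:(full path of the destination directory)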



  • Please refer to the following page for an example of how to write an acknowledgement.
    Please note that this is just an example, and you may adjust the description to match the description of other supercomputers or research funds.

    Please mention TSUBAME usage in acknowledgement of publications

    In addition, please submit reports on your use of TSUBAME, such as bibliographic information, through TSUBAME Portal to help us understand how TSUBAME is being used.
    Please refer to the following User's Guide for how to submit usage reports.

    TSUBAME portal User's Guide 12.Management of TSUBAME usage report




  • As the number of files per directory increases, the processing time for metadata operations (file creation, deletion, and opening) on the files under the directory increases, or the file system may generate errors, making it impossible to create files.

    Even when using a group disk, it is recommended to arrange files in a hierarchical manner so that there are no more than 100,000 files per directory.

    Example:

     

    • NG: 000000.dat ~ 999999.dat
      • If a million files are placed flat in one directory, the load during file access will increase, causing performance degradation and failure.
    • OK: 000/000000.dat ~ 000/000999.dat, 001/001000.dat ~ 001/001999.dat, …
      • The hierarchical arrangement minimizes the cost of file system operations by limiting the number of files per directory to about 1000.
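
    A hedged sketch of reorganizing such flat files into subdirectories of about 1000 files each, following the naming of the example above (run it in the directory containing the .dat files; adjust the pattern to your actual file names):

    for f in ??????.dat; do
      d=${f:0:3}                      # the first three digits select the subdirectory (000, 001, ...)
      mkdir -p "$d" && mv "$f" "$d/"
    done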


Account


  • If the target user can not be searched on the user addition screen of the portal, there are the following reasons.

    • TSUBAME account application has not been completed
      A TSUBAME account is not created automatically upon enrollment; each user needs to apply for an account via the Tokyo Tech Portal. Until the account application is completed, the user will not appear in the account search by the group administrator.
      In addition, if the applicant is an access card holder, the account application from the TSUBAME portal is only valid for those with a status equivalent to Tokyo Tech staff/students. Also, because such applications are not approved automatically, a document certifying your identity must be sent, so please complete the procedure by referring to the explanation on that page.
      For details of account application, please refer to "Getting accounts".

    • Attribute of user and attribute of TSUBAME group do not match

      When a TSUBAME group is created, the group division determines what kind of users can belong to it, and users outside that condition cannot be added. For example, external users cannot be added to a group of the division "Only teachers and students at Tokyo Institute of Technology". In such a case, a message like "This user is not eligible to participate" is displayed. Because the group division cannot be changed afterwards, please create a group with an appropriate division that allows that user to join.
      In addition, if an existing account gains a new entitlement, for example when an HPCI user newly starts joint research with Tokyo Institute of Technology under competitive funding, specifying the existing account name when applying under the new qualification makes it possible to belong to that group.




  • If your login name needs to be changed, for example due to advancing to a higher degree program:
    GSIC uses the information as of the first day of each month to make a batch change around the 10th of the month.

    Therefore, there is no need for you to apply for or re-register for a new TSUBAME account.
    Until the change, please use your old TSUBAME login name.

    Please note that the timing of the change may not coincide with the timing of the Tokyo Tech IC card, which is the source of the information, and it may take a long time for the change to be completed.

    If you wish to change your login name as soon as possible for some special reason, please contact us using the contact form.

    Related Link:

     How to get an account/account login name
     https://www.t3.gsic.titech.ac.jp/getting-account#login_name

     

     



TSUBAME point and TSUBAME portal


  • - If JavaScript is disabled, please enable it and try again.
    - In some environments using Internet Explorer, there was a problem in which the TSUBAME portal account application screen was not displayed properly. This was fixed on August 3, 2017, but please let us know if you still have any problems.
    - If it does not work with your browser, please try another browser such as Edge, Chrome, Firefox, or Safari.




  • If clicking the URL in the mail titled "TSUBAME 3.0 TSUBAME group user invitation" only displays the login screen and you cannot join the group, please check the following.

    - Log in first, and with the logged-in browser still open, click the mail link again.
    - Depending on the mailer, the trailing "=" character may not be included in the link. In that case, copy and paste the URL into the browser including the "=".
    - The invitation URL expires after one week. If it has expired, request the group invitation again.




  • It may take up to 5 minutes for information to be reflected on the login nodes after adding users or groups from the portal system. If the information has not been updated on the login node, please wait about 5 minutes and try again.




  • TSUBAME points are consumed when you submit a job or purchase group disk capacity.

    1. How to check points
    You can check the following information on the TSUBAME3 portal and from the command line.
    On the portal, you can check the following:

    • The number of points consumed by each job submitted to the batch job scheduler
    • The points consumed when purchasing group disk capacity
       

    For the command line, please check the following FAQ:
    How to check TSUBAME points, group disk usage, home directory usage

    2. Consumption of points
    Points will be consumed based on Article 13 here.
    The points consumed depend on the user and usage.
    Please confirm here for details.


    3. Purchase points
    Points can be purchased at the TSUBAME 3 portal.
    For details, please check TSUBAME Portal User's Guide.

    4. Point expiration date
    Points purchased within the fiscal year will expire at the end of that fiscal year. For details, please refer to the following (in Japanese).

    8.1.     Points available for purchase and validity period
    https://helpdesk.t3.gsic.titech.ac.jp/manuals/portal.en/point/#point_expiration

    課金等に関する取扱い「第6条 ポイントは,購入した年度に限り有効とする」("Treatment concerning billing. 6. points are valid only for purchased fiscal year")
    http://www.somuka.titech.ac.jp/reiki_int/reiki_honbun/x385RG00001339.html




  • "予算責任者の追認待ち" is a state waiting for approval from the budget manager of applied payment code.
    Please ask the budget manager to approve in TSUBAME portal.




  • TSUBAME points, group disk usage, and home directory usage can be checked with a command.
    The command "t3-user-info" can be used only on the login nodes (login0, login1).
    It cannot be executed on compute nodes.

    Specify the appropriate command and sub-command depending on what you want to check.

    GSICUSER@login1:~> t3-user-info 
    usage: t3-user-info [command] [sub command] [option]
        [command] [sub_command] [option]
        group      point        : Output the points of the Tsubame Group.
                                : Without option  : show all belonging groups.
                                [-g] <group name> : extract the specified group.
        disk       group        : Output the purchase amount and use amount of Tsubame Group disk.
                                : Without option  : show all belonging groups.
                                [-g] <group name> : extract the specified group.
                   home         : Output the use limit and used of the home disk.
    • Command examples for checking group points.

    In the following examples, it is assumed that the user "GSICUSER", who belongs to the groups "GSIC_GROUP" and "GSIC", executes the commands. Use your own user name and group names when actually executing the commands.

    1. Checking the status of all groups you belong to

    You can see that the TSUBAME point balances of the groups "GSIC_GROUP" and "GSIC" are 17218631 and 995680000, and their deposits are 0 and 124000, respectively.

    GSICUSER@login1:~> t3-user-info group point

    gid     group_name                        deposit      balance
    --------------------------------------------------------------
    0007    GSIC_GROUP                              0    17218631
    0451    GSIC                                124000  995680000

    2. Checking the status of a specified group

    You can see that the TSUBAME point balance of the specified group GSIC_GROUP is 17218631 and its deposit is 0.

    GSICUSER@login1:~> t3-user-info group point -g GSIC_GROUP
    gid     group_name                        deposit      balance
    --------------------------------------------------------------
    0007    GSIC_GROUP                              0    17218631
    • To check the usage status of the group disk

    For the specified group GSIC_GROUP, only /gs/hs1 has been purchased; about 60 TB of the 100 TB quota is used,
    and for the inode limit, about 7.5 million of the 200 million inode quota are used.

    GSICUSER@login1:~> t3-user-info disk group -g GSIC_GROUP
                                                      /gs/hs0                                 /gs/hs1                                 /gs/hs2                
      gid group_name                 size(TB) quota(TB)   file(M)  quota(M)  size(TB) quota(TB)   file(M)  quota(M)  size(TB) quota(TB)   file(M)  quota(M)
    --------------------------------------------------------------------------------------------------------------------------------------------------
    0007 GSIC_GROUP                0.00         0      0.00         0     59.78       100      7.50       200      0.00         0      0.00         0
    • When checking the use status of the home directory

     You can see that 7 GB of the 25 GB quota is used,
    and that approximately 100,000 of the 2 million inode quota are used.

    GSICUSER@login1:~> t3-user-info disk home
      uid name         b_size(GB) b_quota(GB)    i_files    i_quota
    ---------------------------------------------------------------
     0177 GSICUSER              7          25     101446    2000000



  • TSUBAME3 collects TSUBAME points as "temporary points" that are expected to be needed when submitting a job, and settles the actual consumption points after the job is finished.
    The timing of TSUBAME point consumption and return is as follows.

    1. When a job is submitted or qrsh command is executed
      The maximum TSUBAME points that a job can consume are collected as temporary points.
    2. When a job is terminated
      The timing of settling the provisional points differs among the following three cases:
      1. When a job finishes as usual, or when a job is canceled by the qdel command after it has started running
        Recalculate the consumption points based on the time actually used by the job, and return the difference immediately.
      2. When a job is canceled by the qdel command before execution starts, or when the qrsh command fails to start
        The provisional points remain collected for a while and are automatically returned within three days of the cancellation or execution failure.
        Please contact us if the display on the TSUBAME portal remains "Processing" for more than 3 days.
      3. When a job is deleted by the scheduler or the system administrator due to a system error
        When a job is deleted by the scheduler or the system administrator due to a system error, the points are basically settled based on the time when the job was deleted, as in case 1.
        If the cancellation is clearly due to system reasons, we will compensate you for the wasted TSUBAME points if you contact us.

    If you have any questions about TSUBAME points, please contact us using the inquiry form with the following information.

    • The TSUBAME3 group that submitted the job
    • User who submitted the job
    • Job ID

    Reference:
     FAQ: The error when executing the qrsh command




  • Some links automatically generated by the TSUBAME Portal, such as links to approval pages of TSUBAME group invitations, may not function properly depending on your browser or mailer environment. Specifically,

    • The login page of the TSUBAME Portal is displayed instead of being redirected to the correct page.
    • The message like "The referenced URL has expired" is displayed.

    If you see this kind of symptom, copy the address sent by e-mail, paste it into the address bar of the browser window in which you are already logged in to the TSUBAME Portal via the Tokyo Tech Portal, and then press the Enter key.

    If you still think the system is not working properly, please contact us using the contact form with the following information.

    • Your TSUBAME account login name
    • The date and time of the email sent to you
    • The approximate date and time of the first click on the link
    • Other related TSUBAME Group information, etc.



  • When you apply for a payment code on the TSUBAME Portal, you will receive a rejection notice from the TSUBAME Portal if your application is incomplete.
    Please check the reason for the rejection in the "Comments from the system administrator" section of the notice, and make corrections as necessary.
    The following is a list of typical reasons for rejection and what you need to check when you resubmit.

    1. The existence of the corresponding budget could not be confirmed (該当する予算の存在を確認できませんでした)

    The above message is sent when the person in charge has checked the existence of the budget code and budget name pair entered in the payment code application on the financial accounting system, but could not find the corresponding budget, including the possibility of a typo.

    If the budget is for external funds or Grant-in-Aid for Scientific Research, please make sure that the budget code for the current fiscal year has already been created in the Request for Supplies system.
    Also, please make sure that the budget code, budget name, budget department, budget category, and person in charge of the budget are entered correctly, because if they are not, we will not be able to search them.
    In particular, the budget code (32 digits) and budget name have a large number of digits in the financial accounting system from FY2020, and some applications may be missing some of them when transcribing, so we strongly recommend that you follow the procedure below to obtain them.

    • Log in to the New Goods Request System
    • Select Budget Management
    • Select Check Budget Execution Status
    • Select Create CSV
    • Post the "Budget Code" (32 digits) and "Budget Name" in the output CSV file. 

    2. I was able to confirm that I had the budget code, but XXX was different. (予算コードがあることまでは確認できましたが、XXXが異なっていました)

    The above message is sent when a budget that seems to be applicable is found based on the information entered in the payment code application, but some items do not match the registered contents on the financial accounting system and are incorrect beyond the scope that can be corrected at the discretion of the person in charge as a typo.

    Please refer to the previous section, check the registration details displayed in the requisition system, and then reapply with the correct details.
    If the XXX part is the name of the person responsible for the budget, please refer to the next section instead.

    3. I was able to confirm that there was a budget code, but the name of the person responsible for the budget was different. (予算コードがあることまでは確認できましたが、予算責任者氏名が異なっていました)

    The above message will be sent when the budget manager's account information on the payment code application does not match the budget manager's information on the financial accounting system.

    TSUBAME requires budget managers, including those who do not directly use TSUBAME's computing resources, to obtain a TSUBAME account, agree to the Terms of Service, and fill out their account information in order to ensure that we can communicate with them about their billing needs.
    In addition, if the user of the payment code (the main administrator of the TSUBAME group) is different from the budget manager, the budget manager is required to approve the application for the payment code on the TSUBAME portal in order to confirm that the use of the budget is allowed in advance.

    4. Rejected because the expense category is not "Other" (費目がその他ではないため、却下します) (Grant-in-Aid for Scientific Research)

    You will receive the above message when you specify a budget code for an expense category other than "Other" of the Grant-in-Aid for Scientific Research (goods, travel, and personnel expenses) in the payment code application.

    There are four types of budget codes for Grants-in-Aid for Scientific Research, and TSUBAME's computer usage fees are classified as "Other" expenses. (This corresponds to "Facility and equipment usage fees within the research institution" in the table of expense categories common to all government ministries and agencies.)
    For this reason, applications for payment codes for expenses other than "Other" will be rejected as an incorrect expense category.

    5. If you are not a faculty or staff member, you must be responsible for your own budget. (教職員以外の方は、自身が予算責任者となっている申請に限らせていただいております)

    The above message is sent to non-faculty members when they apply for a payment code for a budget for which they are not the budget manager.

    Non-faculty members are not allowed to be the payer of TSUBAME for budgets other than those for which they are the budget manager (e.g., Grant-in-Aid for Scientific Research by JSPS Postdoctoral Fellows).
    If you need to use such a budget, please ask an appropriate faculty or staff member to be the payer and reapply for the payment code.
    If you have already created a TSUBAME group, please change the group administrator (main) if necessary.

    6. The period in which new claims can be generated has already passed. (既に新規請求事項発生可能期間を過ぎています)

    The above message will be sent when you cannot purchase new TSUBAME points for your requested budget.

    In TSUBAME, purchase operations on the TSUBAME portal are grouped together on a monthly basis, and the budget is transferred to the next month or later.
    We have established a period of availability for payment codes so that this transfer process can be done within the budget's available period (within the fiscal year and research period), but we will not accept applications for budgets that have already passed this period and cannot be used to purchase points.
    Please note that during the period from January to March, only corporate operating expenses and scholarship donations can be registered and used for payment codes, but the amount used from January to March will be charged to the same budget for the following year.

    7. The system administrator has already approved the application for the same fiscal year and the same budget code. (既に同じ利用年度かつ同じ予算コードの申請をシステム管理者が承認済みです)

    You will receive the above message when a payment code has already been approved for the same budget.
    Please use the approved payment code.



Job Execution (Scheduler)


  • The correspondence varies depending on the error.

    • qsub: Unknown option

    The "qsub: Unknown option" error also occurs when there is an error in the line description starting with "#$" in the job script, besides the option of the qsub command. A common mistake is putting a space before and after the character "=". Please try deleting the space around "=".

    • Job is rejected, h_rt can not be longer than 10 mins with this group

    If you do not specify a TSUBAME group with the -g option or newgrp, the job is treated as a "Trial run".
    A "Trial run" is limited to 10 minutes, and this error occurs when the h_rt option specifies more than 10 minutes.

    For a "Trial run", set the h_rt option to 0:10:0 or less.
    If you want to run something other than a "Trial run", specify the TSUBAME group with the -g option or newgrp, as in the example below.
    * In this case, please confirm that you belong to the appropriate TSUBAME group and that the group has points.
    For TSUBAME groups, please check the TSUBAME portal usage guidance.
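
    A hedged example of the two cases (the group name tga-example and the script name job.sh are illustrative):

    $ qsub -g tga-example -l h_rt=1:00:00 job.sh    # normal run, charged to the specified TSUBAME group
    $ qsub -l h_rt=0:10:00 job.sh                   # trial run, limited to 10 minutes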

    • Unable to run job: Job is rejected. h_rt must be specified.

    The job cannot be executed because the h_rt option is not specified. Please set the time and submit again.

    • Unable to run job: the job duration is longer than duration of the advance reservation id AR-ID.

    This error occurs because you specified a time longer than the reserved duration.
    Please refer to the FAQ below regarding reservations.

    Related FAQ
    About specification of batch job scheduler

     

    • error: commlib error: can't set CA chain 

    This error occurs when the automatically generated certificate files in your home directory, which are required for job submission, do not exist or are broken.

    If you face this error, execute the following commands and then log in to TSUBAME again so that the files are regenerated.

    $ cd $HOME

    $ mv .sge .sge.back

     




  • There is a possibility of a system failure, but it may also be due to a mistake in the job script.

    Please confirm with the following command.

    $ qstat -j <job ID> | grep  error

    Please check the following points. After checking, delete the jobs in "Eqw" status with the qdel command.

    Example)
    When there is a problem with file permission.

    error reason    1:    time of occurrence [5226:17074]: error: can't open stdout output file "<file of the cause>": Permission denied


    The following appears when there is a line feed code problem, the specified directory does not exist, or the job script is invalid:

    error reason    1:          time of occurrence [5378:990988]: execvp(/var/spool/uge/<hostname>/job_scripts/<jobID>, "/var/spool/uge/<hostname>/job_scripts/<jobID>") failed: No such file or directory

    1. The line feed code of the job script is not in UNIX format.

    This also occurs if the line feed code was set to CR + LF on Windows, so please check the actual script as well.

    You can confirm with the file command.

     $ file <script file name>

    #Output in case of  CR + LF

    <Script file name>: ASCII text, with CRLF line terminators

    #Output in case of  LF

    <Script file name>: ASCII text

    You can also confirm it with the cat command.

    $ cat -e <script file name>

    #Output in case of CR + LF

    The end of line is displayed as ^M$

    #!/bin/bash^M$
    #$ -cwd^M$
    #$ -l f_node=1^M$
    #$ -l h_rt=0:10:00^M$
    . /etc/profile.d/modules.sh^M$
    module load intel^M$

    #Output in case of LF

    The end of line is displayed as $

    #!/bin/bash$
    #$ -cwd$
    #$ -l f_node=1$
    #$ -l h_rt=0:10:00$
    . /etc/profile.d/modules.sh$
    module load intel$

    • Do not edit scripts on Windows.
    • If you must edit a script on Windows, use an editor that can handle line feed codes and check the line feed code.
    • Correct the line feed code to LF with the nkf command.
      • If the line feed code is something other than LF, execute the command below.

        $ nkf -Lu file1.sh > file2.sh 

    Note: file1.sh is the original file (before conversion) and file2.sh is the converted file. The two file names must be different; if they are identical, the file will be corrupted.

    2. The execution directory does not exist

    This occurs when the execution directory described in the job script does not exist.

    Please confirm with the following command.

    $ qstat -j <job ID> | grep error

    error reason 1: 09/13/2017 12:00:00 [2222:19999]: error: can't chdir to /gs/hs0/test-g/user00/no-dir: No such file or directory

    3. The job script starts the program as a background job (with "&")

    The program will not be executed if it is started as a background job (with "&") and the script ends without a wait command, as shown below.

    Example)

    #!/bin/sh
    #$ -cwd
    #$ -l f_node=1
    #$ -l h_rt=1:00:00
    #$ -N test

    . /etc/profile.d/modules.sh

    module load intel

    ./a.out &


    4. The file does not have the required permissions

    Please set permissions appropriately.

    Example) Grant read and execute permission to myself

    $ chmod u+rx script_file

    5. Disk Quota

    Please check the group disk quota.
    It is about 2 million inodes per 1 TB.

    Please refer to FAQ below.

    Checking the usage of group disks with command

     




  • See "How to terminate the programs executed accidentally" for the deletion of the processes running on the login nodes.

    When the job-ID is known

    Terminate the job with qdel command as follows.

    $ qdel job-ID

    If job-ID is 10056, type

    $ qdel 10056

    When the job-ID is unknown

    Confirm the job-ID with the qstat command; the user's unfinished jobs are displayed.

    Example: When user GSIC checks the unfinished jobs, they are displayed as follows.

    $ qstat
    job-ID  prior  name user  state submit/start at     queue jclass slots ja-task-ID 
    ------------------------------------------------------------------------------------------
    10053 0.555 ts1     GSIC   r     08/28/2017 22:53:44 all.q          28
    10054 0.555 ts2     GSIC  qw     08/28/2017 22:53:44 all.q         112
    10055 0.555 ts3     GSIC  hqw    08/28/2017 22:53:45 all.q          56
    10056 0.555 eq1     GSIC  Eqw    08/28/2017 22:58:42 all.q           7

    TIPS. Status of jobs

    state  Description
    r      Running
    qw     Waiting in the queue
    hqw    Waiting for other jobs to finish because of a dependency
    Eqw    Error for some reason

    Delete jobs with Eqw by yourself. See here for the cause of it.
    Refer to here if you want to change the status of a job to hqw.




  • Please check the stacked line chart on the Job monitoring page.

    To check whether there are free compute nodes, see the green area of the chart.




  • TSUBAME 3.0 provides the following scratch area.
    For details, please refer to Storage use on Compute Nodes in the TSUBAME3.0 User's Guide 

    1. Local scratch area
    The directory pointed to by the environment variable $TMPDIR, allocated on each compute node, is the local scratch area.

    $TMPDIR is usually a unique directory for each job under /scr.

    You cannot write directly under /scr.
     

    2. Shared scratch area
    It is available only for batch jobs using f_node (resource type F). Please specify "#$ -v USE_BEEOND=1" in the job script.
    The /beeond directory is then allocated.

    3. /tmp directory
    There is a 2 GB capacity limit for the /tmp directory.
    Creating large scratch files there may cause problems such as the program hanging.
    Please consider using the scratch areas described in 1. and 2. instead; see the example below.
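
    The following is a minimal sketch of a job script that uses the local scratch area; the resource values, file names, and program name are only examples.

    #!/bin/sh
    #$ -cwd
    #$ -l f_node=1
    #$ -l h_rt=1:00:00

    # copy the input to the job-local scratch directory and work there
    cp ~/input.dat $TMPDIR/
    cd $TMPDIR
    ~/a.out input.dat

    # files under $TMPDIR are usually removed when the job ends,
    # so copy back any results you need
    cp result.dat ~/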




  • SSH login to compute nodes is possible only with f_node.
    Please use f_node when executing applications that use SSH for MPI communication.

    For details, please check section 5.7 "SSH login" of the TSUBAME 3.0 User's Guide.
     




  • If you want to execute batch job A-2, as soon as the batch job named A-1 finishes, please use the -hold_jid option to submit the job as shown below.

    $ qsub -N A-1 MM.sh
    $ qsub -N A-2 -hold_jid A-1 MD.sh

    If you issue the qstat command after submission, the status will be "hqw".

     




  • If you want to run multiple calculations in one batch job, for example executing the four commands exec1, exec2, exec3, and exec4 at once, write the batch script as follows.

    #!/bin/sh
    #$ -cwd
    #$ -l f_node=1
    #$ -l h_rt=1:00:00
    . /etc/profile.d/modules.sh
    module load cuda/8.0.61
    module load intel/17.0.4.196

    exec1 &
    exec2 &
    exec3 &
    exec4 &
    wait

    The above is only an example.

    If you want to execute programs located in different directories at once, you need to specify each executable file with its path. For example, to execute a.out in folder1 of the home directory directly, specify it as below.

    ~/folder1/a.out &

    If you need to move to the directory of the executable file and execute it there:

    cd ~/folder1
    ./a.out &

    Or,

    cd ~/folder1 ; ./a.out &

    If the last line of the script file ends with "&", the job will not run.

    Do not forget to write the last wait command of the script.
     




  • If you type the commands as described in the manual directly into the shell, the calculation starts on the login node without the qsub command ever being executed.

    GSICUSER@login1:~> #!/bin/bash
    GSICUSER@login1:~> #$ -cwd
    GSICUSER@login1:~> #$ -l f_node=2
    GSICUSER@login1:~> #$ -l h_rt=0:30:0
    GSICUSER@login1:~> . /etc/profile.d/modules.sh
    GSICUSER@login1:~> module load matlab/R2017a
    GSICUSER@login1:~> matlab -nodisplay -r AlignMultipleSequencesExample

    This is because you are directly executing commands on the shell that you need to write in the batch script.
    Instead of executing them directly, create a batch script file and specify it with the qsub command.
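
    For example, the lines above would be saved into a script file and submitted with qsub; this is only a sketch, and the file name job.sh and the group name are placeholders.

    job.sh:

    #!/bin/bash
    #$ -cwd
    #$ -l f_node=2
    #$ -l h_rt=0:30:0
    . /etc/profile.d/modules.sh
    module load matlab/R2017a
    matlab -nodisplay -r AlignMultipleSequencesExample

    Submission:

    GSICUSER@login1:~> qsub -g GSICGROUP job.sh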

    If you are not familiar with terms such as batch script and shell,  please see "1.Beginners of UNIX/LINUX" at "I'm a beginner, I don't know what to do."




  • This section explains the errors that occur when running qrsh.

    1.Your "qrsh" request could not be scheduled, try again later.
    The error above indicates that there are no available vacant resource for interactive job. 
    Please retry it after the resource become available.

    See "I'd like to check the congestion status of compute node" for the status of the use of compute node.

    2.Job is rejected. You do NOT have enough point to finish this job
    This error indicates that there is no TSUBAME points required for assuring the node.
    Please check the point balance.

    Reference: FAQ "How long will it take for TSUBAME points to be returned?*
     

    3.Unable to run job: unable to send message to qmaster using port 6444 on host "jobconX": got send error.
    Exiting.

    This error occurs when the UGE server side is under heavy load.
    Please wait a while and try again.

     




  • The following message may be printed to the log file in some cases.

    /var/spool/uge/hostname/job_scripts/JOB-ID: line XX: Process-ID Killed Program_Name

    In this case, type the qacct command to check the job in detail.

    $ qacct -j JOB-ID

    The following is an output example of the qacct command. (Excerpt)

    ==============================================================
    

    1.Example when the memory resource is exceeded

    $ qacct -j 4500000

    qname        all.q               
    hostname     r0i0n0              
    group        GSIC          
    owner        GSICUSER00            
    project      NONE                
    department   defaultdepartment   
    jobname      SAMPLE.sh
    jobnumber    4500000             
    taskid       undefined
    account      0 0 1 0 0 0 3600 0 0 0 0 0 0
    priority     0      
    cwd          /path-to-current
    submit_host  login0 or login1    
    submit_cmd   qsub -A GSICGROUP SAMPLE.sh
    qsub_time    %M/%D/%Y %H:%M:%S.%3N
    start_time   %M/%D/%Y %H:%M:%S.%3N
    end_time     %M/%D/%Y %H:%M:%S.%3N
    granted_pe   mpi_q_node          
    slots        7                   
    failed       0    
    deleted_by   NONE
    exit_status  137                              
    maxvmem      120.000G
    maxrss       0.000
    maxpss       0.000
    arid         undefined
    jc_name      NONE

    Pay attention to exit_status, account, and maxvmem in this example.
    exit_status gives the cause of the error as an exit code. exit_status 137 means 128 + 9 (SIGKILL), but since this status occurs for various problems, the cause cannot be determined from it alone.

    Then check granted_pe, account, and maxvmem.

    The "0 0 1 0 0 0" part of account shows which resource type was used and how much.
    The space-separated fields correspond to the resource types f_node, h_node, q_node, s_core, q_core, and s_gpu, and the number indicates the resource amount. In this example, one q_node was used.
    maxvmem shows the maximum memory usage.
    It can be estimated that the job was about to use 120 GB of memory, although only up to 60 GB is available with q_node according to the User's Guide.
     

    In TSUBAME, a job is killed automatically if it uses more memory than assigned.


    2.Example when the reserved time is exceeded

    $ qacct -j 50000000
    qname        all.q               
    hostname     r0i0n0              
    group        GSIC          
    owner        GSICUSER00            
    project      NONE                
    department   defaultdepartment   
    jobname      SAMPLE.sh
    jobnumber    50000000             
    taskid       undefined
    account      0 0 1 0 0 0 600 0 0 0 0 0 0
    priority     0      
    cwd          /path-to-current
    submit_host  login0 or login1    
    submit_cmd   qsub -A GSICGROUP SAMPLE.sh
    qsub_time    %M/%D/%Y %H:%M:%S.%3N
    start_time   %M/%D/%Y %H:%M:%S.%3N
    end_time     %M/%D/%Y %H:%M:%S.%3N
    granted_pe   mpi_f_node          
    slots        7                   
    failed       0    
    deleted_by   NONE
    exit_status  137
    wallclock    614.711                              
    maxvmem      12.000G
    maxrss       0.000
    maxpss       0.000
    arid         undefined
    jc_name      NONE

    Pay attention to exit_status and wallclock in this example.
    exit_status gives the cause of the error as an exit code. exit_status 137 means 128 + 9 (SIGKILL), but since this status occurs for various problems, the cause cannot be determined from it alone.

    So focus on account and wallclock.
    The seventh space-separated field of account indicates the time (in seconds) for which the resources were secured.
    In this example it is 600 seconds.

    wallclock shows the elapsed time, which is about 614 seconds in this example.

    Since the calculation did not finish within the time for which the resources were secured, it can be inferred that the job was forcibly terminated.



    Related URL: About common errors in Linux




  • This FAQ explains how to forward X with qrsh.
    With this method, you can use GUI applications on resource types other than f_node.
    Please follow the procedure below.

    (Preliminary Work)
    Enable X forwarding and ssh to the login node.
    Reference: FAQ "X application (GUI) doesn't work" section 1 and 2.

    1.After logging in to the login node, execute the following command.
    In the example below, GSICUSER uses one s_core from login1 for 10 minutes.
    Please change the group, resource type, and time according to what you want to use.

    With the scheduler update implemented in April 2020, you no longer need to specify -pty yes -display "$DISPLAY" -v TERM /bin/bash when executing qrsh.

    GSICUSER@login1:~> qrsh -g GSICGROUP -l s_core=1,h_rt=0:10:00

    2.Run the X application you want to use.
    The following is an example with imagemagick.

    GSICUSER@r1i6n3:~> . /etc/profile.d/modules.sh
    GSICUSER@r1i6n3:~> module load imagemagick
    GSICUSER@r1i6n3:~> display


    Notes
    - Depending on the GUI application, some applications cannot be started or cannot compute because of memory or SSH restrictions.
    - For memory, please use an appropriate resource type.
    - Fluent cannot be launched due to the SSH restriction. To avoid this, use the -ncheck option (not supported by the manufacturer).
    - Schrodinger can be launched but cannot compute due to the SSH restriction. You can use it on f_node only.




  • This message means that the system added the certificate of IP address XXX.XXX.XXX.XXX to the known_hosts file (the list of SSH server certificates). It appears when a node is connected to for the first time, or when the certificate of a previously connected host has changed. This is normal behavior; it does not affect the calculation results and can be ignored.




  •  

    This section explains the error messages that occur after executing the qsub command and their remedies.

     

    Unable to run job: Job is rejected because too few parameters are specified.

    A required parameter is not specified. You need to specify the resource type, the number of resources, and the execution time.

     

    qsub: Unknown option

    There is an error in the qsub option specification. Please refer to this.

     

    Unable to run job: Job is rejected. core must be between 1 and 2.

    Three or more resources per job cannot be used for a trial run. Specify 1 or 2 as the number of resources.

     

    Unable to run job: Job is rejected, h_rt can not be longer than 10 mins with this group.

    For a trial run, you cannot submit jobs whose execution time exceeds 10 minutes. Please refer to this.

     

    Unable to run job: Job is rejected. You do NOT have enough point to finish this job.

    The points needed to secure the specified resources and time are insufficient. Please check the point status on the TSUBAME portal page.

     

    Unable to run job: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).  or  Unable to run job: got no response from JSV script"/apps/t3/sles12sp2/uge/customize/jsv.pl".

    Communication with the job scheduler times out and the above error messages may be displayed when the management node is under high load, for example due to a large number of jobs submitted in a short time. The high load is temporary; please wait a while and try again.




  • TSUBAME 3 uses the UNIVA Grid Engine (UGE) batch job scheduler.

    Resource Type

     

    There are six available resource types as follows. Specify the resource type with "-l" option. (The "-pe" and "-q" options are not available.)

    Resource Type  Name    CPU cores  Memory (GB)  GPUs
    F              f_node  28         235          4
    H              h_node  14         120          2
    Q              q_node  7          60           1
    C1             s_core  1          7.5          0
    C4             q_core  4          30           0
    G1             s_gpu   2          15           1
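
    For example, to request one Q resource type (q_node) for one hour in a job script, the header would contain lines like the following; the values are only illustrative.

    #$ -l q_node=1
    #$ -l h_rt=1:00:00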


    Job submission method

    Job can be submitted from the login node with the following command.
     - Submission by job script (when user belonging to GSICGROUP executes train.sh)

    qsub -g GSICGROUP train.sh

     

     - When executing an interactive job (when a user belonging to GSICGROUP uses s_core under X environment for 2 hours)

    qrsh -g GSICGROUP -l s_core=1,h_rt=2:: -pty yes -display $DISPLAY -v TERM /bin/bash


    For details of how to input jobs, such as how to specify the resource type by submission by job script, please refer to the usage guide.
    User's Guide 5.2. Job submission

    Also, please check the related FAQ below for the items not explained here.
    Related FAQ
    How to use scratch area
    About submission method of dependent job
    How to transfer X with qrsh
     

    About job limit

    Please check "Various limit value list" about the current limit.
    If a submitted job exceeds the per-user limit, it is kept in the waiting state "qw" even if there are enough idle nodes in TSUBAME3.
    Once other jobs terminate and the job fits within the per-user limit, it becomes the running state "r", provided there are enough idle nodes.

     

    About reservation

    Reservations can be made in units of one hour, and the nodes can be used until 5 minutes before the reservation end time.
    When submitting a job, execute the following command. The AR ID can be confirmed on the portal.

    $ qsub -g GSICGROUP -ar ARID YOURSCRIPTFILENAME

    Since the nodes can only be used until 5 minutes before the reservation end time, you need to adjust the -l option of the job script accordingly.
    Example) Resource specification when reservation period is 2 days

    #$ -l h_rt=47:55:00

    "Reservation" does not apply to the above "Job limits", and has the "Reservation" restriction.
    Please check "Various limit value list" about the current limit.

    Please check the related FAQ below for coping with error.
    Related FAQ
    "qsub: Unknown option" error occurs when submitting the job, but I do not know which option is bad
    The job status is "Eqw" and it is not executed.
    The error when executing the qrsh command
    Check the detail of an error message printed the log file

     




  • This section summarizes troubleshooting when jobs cannot be submitted to a reservation.
    The following commands are examples where the GSIC group uses AR number 20190108, reserved for 2 days.

    1.Forgot to add ARID

    NG example
    When the following command is executed, the job runs as a normal job, not in the reservation.

    $ qsub -g GSIC hoge.sh

    OK example
    Be sure to use the -ar option when submitting a job to a reservation.

    $ qsub -g GSIC -ar 20190108 hoge.sh

     

    2.h_rt longer than the reserved time

    If the time specified with the h_rt option is longer than the reserved time, the job will not run.
    Also, because the nodes can only be used until 5 minutes before the reservation end time, please shorten the specified time by 5 minutes from the reservation period.

    NG example
    The job is not executed because h_rt fills the entire reservation period.

    $ grep h_rt hoge.sh
    #$ -l h_rt=48:00:00
    $ qsub -g GSIC -ar 20190108 hoge.sh

    OK example (h_rt is 5 minutes shorter than the reservation period)

    $ grep h_rt hoge.sh
    #$ -l h_rt=47:55:00
    $ qsub -g GSIC -ar 20190108 hoge.sh


    If you start a job after the reservation start time, for example because the program terminated abnormally or the job could not be submitted before the reservation started, you also need to take the elapsed time into account.
    For example, if you submit a job 2 hours after the reservation start time, the script would look as follows (assuming one minute of internal processing time from executing the qsub command to the allocation of compute nodes).

    $ grep h_rt hoge.sh
    #$ -l h_rt=45:54:00
    $ qsub -g GSIC -ar 20190108 hoge.sh

    Related URL

    TSUBAME3.0 User’s Guide "5.3. Reserve compute nodes"

    TSUBAME portal User's Guide "9. Reserving compute nodes"

    About specification of batch job scheduler

    Main differences between TSUBAME 2.5 and TSUBAME 3.0 ( node reservation )




  • The time specified by h_rt also includes the preparation time needed to start the job submitted by the user. Therefore, the time specified by h_rt is not the actual job execution time.

    The points consumed are calculated based on the job execution time excluding the preparation time. The preparation time is not constant, because it depends on the status of the node where the job is executed.
    



  • This error occurs when the module command has not been initialized.

     

    The module command can be initialized by adding ". /etc/profile.d/modules.sh" before "module load XXXX".

     

    When the execution shell is sh or bash and you load the intel module:

    . /etc/profile.d/modules.sh
    module load intel

     

    When the execution shell is csh or tcsh and you load the intel module:

    source /etc/profile.d/modules.csh
    module load intel

     

    If a "command not found" error occurs when executing a command installed by an external installer such as pip from a job script or qrsh, please try the following on the login node:

    $ type <command>
    <command> is hashed (/path/to/<command>)

     

    Then you can confirm the path, and add the following to the job script:

    export PATH=$PATH:/path/to

    Here, /path/to is the directory where the command is located.

     


    related URLs

     

    About common errors in Linux

    User Guide




  • It is possible to run several programs on different CPUs/GPUs as follows.

    In this example, a.out uses CPU0-6+GPU0, b.out uses CPU7-13+GPU1, c.out uses CPU14-20+GPU2, d.out uses CPU21-27+GPU3.

    #!/bin/sh
    #$ -cwd
    #$ -V
    #$ -l f_node=1
    #$ -l h_rt=00:30:00

     

    a[0]=./a.out

    a[1]=./b.out

    a[2]=./c.out

    a[3]=./d.out

     

    for i in $(seq 0 3)
    do
        export CUDA_VISIBLE_DEVICES=$i
        numactl -C $((i*7))-$((i*7+6)) ${a[$i]} &
    done
    wait




  • When the web service (Jupyter Lab) cannot start, please check the following points.

    check the log and investigate what is happening

    Output files of web services are saved under ~/.t3was/.
    There might be some hints in the files.

    Initialize the environment

    Initialize the Python 3.6 environment, for example by moving ~/.local/lib/python3.6/ to another directory, and then run

    $ python3 -m pip install --user (modulename)

    which may resolve the module conflict.

    If the web service can now start, install the necessary modules from the Jupyter Lab console, and then run

    $ python3 -m pip check

    to check the dependencies. Updating the problematic modules with

    $ python3 -m pip install -U --user (modulename)

    may solve the problem.

    Check and resolve module dependencies, avoiding initialization of the environment

    This is almost the same as the above. After SSH'ing to TSUBAME, run

    $ module load jupyterlab/3.0.9

    which loads the Jupyter Lab environment used by the web service (version 3.0.9 is required).

    After this, execute

    $ python3 -m pip check

    and check whether there are problematic modules in the dependencies.

    If there is, do

    $ python3 -m pip install -U --user (modulename)

    and update them.



Application Usage


  • You can install modules into your home directory (Example: Theano case)

    $ module load python-extension/2.7
    $ pip install --user theano

    If you want to use the modules from your compute job, add the following lines to your job script before the python commands.

    . /etc/profile.d/modules.sh
    module load python-extension/2.7

    related URL
    How to install numpy, mpi4py, chainer etc. using python/3.6.5




  • In this page, "X application" means an application installed on TSUBAME3 that works in the X environment, that is, a GUI application.

    Please check the troubleshooting below.

    1. An X server application is installed and active on the client PC

    ■Windows
    There are many X server applications for Windows.
    Please confirm that one of them is installed and active.

    ■Mac
    Please confirm XQuartz is installed and configured.
    https://support.apple.com/ja-jp/HT201341

    ■Linux
    Please confirm both of the X11 server application and its libraries are installed.


    2. The X forwarding option in the terminal is enabled.
    ■ A Terminal on Windows (Except for Cygwin)
    The setting method differs depending on your terminal and X server application.
    Please check the manual of each application.

    ■Linux/Mac/Windows(Cygwin)
    Please confirm that the ssh command contains the -Y and -C options (X forwarding and compression).

    $ ssh account_name@login.t3.gsic.titech.ac.jp -i key -YC

    Example: in case of gsic_user as account_name and ~/.ssh/t3-key as key, then

    $ ssh gsic_user@login.t3.gsic.titech.ac.jp -i ~/.ssh/t3-key -YC

    Please refer to the output of the following command for ssh option.

    $ man ssh 


    3. The error reproduces in another terminal/X server
    There are various free terminal software and X server applications for Windows.
    Please check whether the same error occurs with another terminal/X server.
    It may be due to compatibility between the terminal and the X server,
    or to a compatibility issue with the ISV application.
    If the error does not reproduce in other applications, it is probably an application-specific problem.
    In that case we cannot respond even if you contact us; please understand.

    In addition, depending on the X application, command options may be required.
    Please check the manual of X application you want to use.

     

    Some GL applications that do not work with normal X forwarding/VNC connection may work with VirtualGL, so please give it a try if needed.

    For the detail of VirtualGL, please refer to User's Guide.

     

    4. Operation check
    On an interactive node, the standard terminal emulator of the X Window System can be started with the following command. Please confirm whether it starts.

    $ xterm 

    If xterm works but the X application you want to use does not, please try "3. The error reproduces in another terminal/X server".

    Example of failure

    xterm: Xt error: Can't open display: 
    xterm: DISPLAY is not set

    Please check 1 and 2 if the error occurs.

    5. Application use
    Do not execute programs that occupy the CPU on the login nodes.
    Please use compute nodes for full-scale use, including visualization.

    Please refer to the FAQ below for information on using the GUI application at the compute node.
    Reference: FAQ "How to transfer X with qrsh"

    When using f_node, X transfer can be performed with the ssh -Y command.

    Please inform us of the following when you inquire
    ■Operating System you use (e.g. Windows 10, Debian 10, macOS Sierra 10.12.6)

    ■Terminal environment that the error occurs (Cygwin, PuTTY/VcXsrv, Rlogin/Xming)

    ■Version
    For Windows, both the version of the terminal and that of the X server application;
    see the manuals of the applications for how to check their versions.
    For Linux/Mac, please tell us the version of SSH, obtained with the command below.

    $ ssh -V

    ■Please tell us what you have tried so far, and if you get an error, please describe it.




  • Please check if it applies to the following items.
    If applicable, you can install it freely at your own risk.
    Please check the installation manual and the license agreement of the application.

    • It works with the OS installed on TSUBAME (SUSE Linux Enterprise Server 12 SP5). Software requiring Windows or macOS will not work.
    • It does not require administrator privileges (root) to install.
    • It can be installed in your own home directory or on the group disk. (Installing it on the local disk of specific nodes is not allowed.)
    • It has a valid license.
    • It does not require changes to the kernel, libraries, or system settings.
    • If all of these conditions are met, you can install and use it at your own risk.
    • GSIC support is not needed.


    As described above, GSIC cannot help with applications brought in by users, as we do not know anything about them.

    In case of problems, users themselves must distinguish whether it comes from the application itself or the general issue of TSUBAME, and ask application vendors for application-specific problems.

    The versions of libraries and drivers may be changed at the time of the regular maintenance of TSUBAME etc. In that case, you might need to reconfigure the application you had used. Please be aware of the risk of losing compatibility in the future.

     




  • Since most troubles with the distributed software are caused by the user environment, we do not support them individually.
    Please solve them yourself, as you agreed at the time of application.

    Even if you contact us, we can not respond.
    Please read the following carefully.

    Distribution of software

    Application Software on your PC after 1 Aug

    The cause of such inquiries is currently one of the following.
    Both are caused by the user environment.
    · An old license setting
    · A network problem in the laboratory or building where the client is located

    For the time being, we will continue with the TSUBAME 2.5 distribution rules.
    Distribution rules for COMSOL and Schrodinger, which are newly introduced in TSUBAME 3, are still being prepared.




  • When using the ISV application on TSUBAME3, there are the following two cases.

    • Perform all processing of pre / solver / post in TSUBAME3
    • Perform pre / post processing on client and perform solver processing with TSUBAME3

     

    1. When performing processing of pre / solver / post in TSUBAME3

    In TSUBAME 3, basically all of the pre/solver/post functions are installed, so when running on an interactive node it is possible to perform all of the pre, solver, and post processing.
    How to run interactive jobs and how to use each process depend on the ISV application. Please check the manual and user's guide of each application.

    2. When performing pre / post processing on client and perform solver processing with TSUBAME3

    Operation on TSUBAME may be unstable due to X server compatibility. This problem can be avoided by performing pre/post processing on the client, which is why we distribute the software.
    The software is provided for convenience. Please note that distribution may be cancelled depending on the situation.

    The following procedure is necessary to perform pre/post processing on the client and solver processing on TSUBAME3.

    Step 1: Apply for software usage and obtain it
    Step 2: Install the software on the client
    Step 3: Perform pre processing with software on the client
    Step 4: Transfer the data created in Step 3 to TSUBAME
    Step 5: Create a batch script for submitting the job scheduler
    Step 6: Execute the qsub command in TSUBAME and execute the batch script created in Step 5
    Step 7: Transfer the result data of Step 6 to the client
    Step 8: Perform post processing with software installed on the client

    Refer to Distribution of software. (As of November 15, 2017)
    Applications newly introduced in TSUBAME3 are under preparation of distribution rules.
    (Distributed application is out of support range)




  • General
    Please check the following related FAQ first
    About common errors in Linux
    "Disk quta exceeded" error is output
    Error handling for each ISV application


    1. For ISV applications
    Supported. Please inform the following information through inquiry.

    ■ Application name
     Eg) Abaqus/Explicit
    ■ Error message
     Eg) buffer overflow detected
    ■ JOB_ID
     Eg) 181938
    ■ Host name where the error occurred
     Eg) r6i7n5
    ■The situation in detail
     Eg) The error occurred when I logged in to r6i7n5 interactively with qrsh and executed the following command. Details are as follows:

     $ module load abaqus intel-mpi
     $ abq2017 interactive job=TEST input=Job1 cpus=6 scratch=$TMPDIR mp_mode=mpi
     

    #Error#
     Run package
    *** buffer overflow detected ***: 
    /pathto/package terminated
    ======= Backtrace: =========
    /lib64/libc.so.6(+0x721af)[0x2aaab0c001af]

    (The rest is omitted)

    ABAQUS is provided under an academic license, so there is no technical support.
    You need to register on the SIMULIA documentation site and resolve the problem yourself.
    For information on the documentation site, please contact us from "Contact Us".

    2. For the application compiled yourself
    Not supported. Please resolve it yourself.
    See "I would like to use an application not provided by TSUBAME 3".

    Error information is output when the program is compiled with the traceback option.

    If you used Intel or PGI for compilation, please refer to the corresponding user guide.
    Debugging by Allinea FORGE is also possible.




  • General

    -The following error occurs immediately after the program runs:

    unable to connect to forwarded X server: Network error: Connection refused
    Error: Can't open display: localhost:13.0.
    Application name: Xt error: Can't open display: 
    Application name: DISPLAY is not set

    The X server configuration may be wrong. See sections 1 and 2 of the FAQ "X application doesn't work."
     

    -The GUI program suddenly terminates
    Please check the keep alive setting in the terminal you use. See the FAQ "Session suddenly disconnected while working on TSUBAME3."

    -A job abruptly aborted
    Although various reasons can be considered, please check the following.
     -Check the batch error file (usually script_name.e.$JOBID)
     -Check the program-specific log file
     -Check the free space of the directory

    reference: FAQ

    About common errors in Linux
    "Disk quta exceeded" error is output
    The range of support by T3 Helpdesk about the program error such as segmentation fault
     


    Maple

    -The following error occurs immediately after the program runs:

    Exception in thread "Request id 1" java.lang.UnsupportedOperationException:PERPIXEL_TRANSLUCENT translucency is not supported

    It is due to compatibility between the application and the X server. See section 3 of the FAQ "X application doesn't work."
    It has been confirmed that the error occurs with Xming but does not occur with MobaXterm.


    ANSYS

    -The GUI program freezes

    It is due to compatibility between the application and the X server. See section 3 of the FAQ "X application doesn't work."
    The trouble occurs with X servers that do not support GL, and does not occur with X servers that support GL, such as ASTEC-X.
    With ANSYS R18.2, operation with the GL version of Xming (Xming-mesa) has been confirmed.

     Fluent

    -The following error occurs immediately after the program runs:
    Reduce the number of nodes/processes used so that it is within the license limit.

    In the example below, 94 HPC license tokens are requested (f_node=3).
    With 2 f_nodes, the error does not occur because the request is within the limit.

     Unable to spawn node: license not available.
     ANSYS LICENSE MANAGER ERROR:The request for 94 tasks of feature aa_r_hpc cannot be granted.  Only 16 tasks are available.
     Request name aa_r_hpc does not exist in the licensing pool.
     Checkout request denied as it exceeds the MAX limit specified in the options file.
     Feature:       aa_r_hpc
     License path:  27001@lice0:27001@remote:27001@t3ldap1:
     FlexNet Licensing error:-194,147
     For further information, refer to the FlexNet Licensing documentation,
     available at "www.flexerasoftware.com".

     Hit return to exit.
     The fluent process could not be started.


    ex:Notification to apply the license restriction for ANSYS (Jan. 31)


    ABAQUS

    -The following error occurs immediately after the program runs:

    Error in job ***: Error checking out Abaqus license.

    It is the normal handling at the login node. The use of the solver license is prohibited at the login node.
    ex:Notice of the restriction of using ABAQUS analysis on login node and temporary unavailable of it due to maintenance of the license server

    Please run the job via the batch job scheduler.
    See TSUBAME3.0 User's Guide for the batch job scheduling system.

    -The following message is displayed while the program runs:

    Analysis initiated from SIMULIA established products
    Abaqus JOB intel_int
    Abaqus 3DEXPERIENCE R2017x
    Successfully checked out QEX/103 from DSLS server remote
    Queued for QXT/103
    "QXT" license request queued for the License Server on remote.
    Total time in queue: 60 seconds.
    Position in the queue: 1
    Total time in queue: 30 seconds.
    Position in the queue: 2
    Total time in queue: 91 seconds.

    The job is waiting because of a license shortage.
    Calculation starts as soon as sufficient licenses are secured.
    Note that TSUBAME points are consumed even while waiting.

    -Abaqus / Explicit generates the following error at parallel execution

    Abaqus Error: Abaqus/Explicit Packager exited with an error - Please see the 
    status file for possible error messages if the file exists.
    Begin MFS->SFS and SIM cleanup
    Fri 17 Aug 2018 10:40:23 AM JST
    Run SMASimUtility
    Fri 17 Aug 2018 10:40:24 AM JST
    End MFS->SFS and SIM cleanup
    Abaqus/Analysis exited with errors

    As a workaround, please load the countermeasure module as follows.
    If abaqus/2017 is already loaded, execute the module purge command before loading it.

    $ module load abaqus/2017_explicit

    COMSOL Multiphysics 

    -The following message is displayed while the program runs
    It occurs with X servers that do not support GL. Please use an X server that supports GL, or run in software rendering mode.

    $ comsol
    function is no-op
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00007fff0f0252bf, pid=10485,
    tid=0x00007fff0e7f9700
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_112-b15)
    (build1.8.0_112-b15)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed
    modelinux-amd64 compressed oops)
    # Problematic frame:
    # C  [libcs3d_ogl.so+0x2a2bf]
    #
    # Failed to write core dump. Core dumps have been disabled. To enable
    coredumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /home/XX/XXXXXX/hs_err_pidXXXXXX.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    #
    /apps/t3/sles12sp2/isv/comsol/comsol53/multiphysics/bin/comsol: line
    1615:10485 Aborted                 (core dumped)
    ${MPICMD}${FLROOT}/bin/${ARCH}/comsollauncher --launcher.ini
    ${LAUNCHERINIFILE}${LAUNCHERARGS} ${MPILAUNCHERARGS} 


    -The following message is displayed while the program runs

    It is a license error. Wait until the license becomes available and try again.
    Please check the manual for how to check the license status.

     /******************/
     /*****Error********/
     /******************/
     Could not obtain license for COMSOL Multiphysics. License error: -4.
     Licensed number of users already reached.
     Feature: COMSOL License path:
     /apps/t3/sles12sp2/isv/comsol/comsol53/multiphysics/license/license.dat:
     FlexNet Licensing error:-4,132 For further information,
     refer to the FlexNet Licensing documentation,
     available at "www.flexerasoftware.com".
     Total time: 4 s.


    Materials Studio

    -The following message is displayed while the program runs
    Since multiple causes are possible for the error below, please be sure to check the output file.

    The job has failed.
    Download any results generated so far?
    
    (Results files will be permanently removed from Server)

    ex: License error.
    In the case of the example below it is a license error. Wait until the license is available and try again.

    Since the summer maintenance in 2018, this error always occurs when Materials Studio is executed from a machine other than TSUBAME, such as a laboratory PC.
    Reference:License restriction on ISV application usage

    As a countermeasure, please use the batch job scheduler.
    Please check the job scheduling system of TSUBAME3.0 User's Guide.

    Output file (example of CASTEP)

     Job started on host GSIC
     at Thu Aug 16 13:20:06 2018
    
     +-------------------------------------------------+
     |                                                 |
     |      CCC   AA    SSS  TTTTT  EEEEE  PPPP        |
     |     C     A  A  S       T    E      P   P       |
     |     C     AAAA   SS     T    EEE    PPPP        |
     |     C     A  A     S    T    E      P           |
     |      CCC  A  A  SSS     T    EEEEE  P           |
     |                                                 |
     +-------------------------------------------------+
    
     This version was compiled for x86_64-windows-msvc2013 on Dec 07 2016
     Code version: 7217
     Intel(R) Math Kernel Library Version 11.3.1 
     Fundamental constants values: CODATA 2010
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Licensing Error !
    Error: Manual heartbeat setup for MS_castep license failed
    Failed to check out licenses
    Trace stack not available

    Output file (example of Dmol3)

         ===============================================================
         Materials Studio DMol^3 version 2017 R2         
         compiled on Dec  7 2016 22:56:21
         ===============================================================
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    DATE:     Aug 16 13:20:06 2018
     
    Job started on host GSIC
    
    This run uses    1 processors
    Licensing Error !
    Error: Failed to checkout MS_dmol license
    Message: DMol3 job failed
    Error: DMol3 exiting
    Message: License checkin of MS_dmol successful

    ARM

    ex: License error.

    In the case of the example below it is a license error. Wait until the license is available and try again.


    MAP: Your licence does not currently have enough processes available.
    MAP: Requested Processes: 4
    MAP: Available Processes: 3
    Arm Forge 19.0.5 - Arm MAP

    MAP: Your licence does not currently have enough processes available.
    MAP: Requested Processes: 4
    MAP: Available Processes: 2
    MAP: Unable to obtain a valid licence.
    Unable to obtain a valid licence.
    Waiting for seat




  • First of all, please refer to the following.

    FAQ "I would like to use an application not provided by TSUBAME 3"

    1.Installation directory
    You can install in the following two places.
    Please choose the one suited to your operation.
    If you need to share the application within the TSUBAME group, for example with members of your laboratory, please use the high speed storage area.
    *Even if you change the permissions with chmod or similar commands in your home directory, you cannot share it.

     -Home directory (/home/[0-9]/user_account/)
     -High speed storage area, also known as the group disk (/gs/hs[0-1]/TSUBAME_group/)

    Reference: TSUBAME 3.0 User's Guide "3. Storage system"

    2.Installation method

    Please install the application according to its manual, README, or community forum.
    Depending on the application, you may need to compile a library, module, or other component from the source files yourself.
    The following are typical installation examples.
    *Package management software such as zypper cannot be used; you basically have to compile from the source files.

    example 1) executing configure script, generating makefile, then make, make test and make install:
    
    

    $ ./configure --prefix=$HOME/install
    $ make && make test
    $ make install

    example 2) creating a directory for the build, then cmake and make install:
    

    $ mkdir build && cd build
    $ cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/install
    $ make install

    example 3) installing with an install script:
    

    $ ./install.sh

     

    3.How to install python module
    For the installation of the python module, please check the related URL below.

    related URL
    How to install numpy, mpi4py, chainer etc. using python/3.6.5




  • If you want to install numpy, mpi4py, chainer etc. using python/3.6.5, do as follows.

    $ module purge
    $ module load python/3.6.5
    $ module load intel cuda openmpi
    $ python3 -m pip install --user python_modules

    If you want to specify the version, do:

    $ python3 -m pip install --user python_modules==version

     

    ※When installing a module that uses the GPU, such as CuPy, please allocate a compute node with the qrsh command before installing it.

    ※For CuPy, installation can be faster by specifying the package matching your CUDA version, such as cupy-cuda102, when invoking pip install.

    How to install numpy linking intel MKL

    Copy https://github.com/numpy/numpy/blob/master/site.cfg.example to ~/.numpy-site.cfg and edit the item of [mkl] as follows. 

    [mkl]
    library_dirs = /apps/t3/sles12sp2/isv/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64
    include_dirs = /apps/t3/sles12sp2/isv/intel/compilers_and_libraries_2018.1.163/linux/mkl/include
    mkl_libs = mkl_rt

    Then do the following

    $ module load intel python/3.6.5
    $ python3 -m pip install --no-binary :all: --user numpy
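
    To confirm that the installed numpy is actually linked against MKL, you can, for example, run the following check (the exact output format depends on the numpy version):

    $ python3 -c "import numpy; numpy.show_config()"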




  • The following usage restrictions apply when using the applications on campus.
    (Excluding Gaussian and AMBER, which are unrestricted applications.)

    Please do not occupy applications that have a small number of licenses.
    Be sure to stop the application after the end of application use.

    If a license is found to be occupied for a long period, it may be forcibly reclaimed without warning. This will result in unstable operation of the application. In some cases, the connection may be blocked (only from laboratory PCs).

    General

     Longest continuous use period: 1 week
     Restrictions are applied per user (excluding Materials Studio)

    ABAQUS

    Execution location / Limitation
    TSUBAME compute node: Solver limited to 140 tokens
    TSUBAME login node:   Restrictions on execution other than CAE
    Laboratory terminal:  Same as the login node

    ANSYS

    Execution location / Limitation
    TSUBAME compute node: Base license limited to 4 tokens; HPC license limited to 64 tokens
    TSUBAME login node:   Base license limited to 4 tokens; HPC license cannot be executed
    Laboratory terminal:  Same as the login node

    Materials Studio

    Execution location / Limitation
    TSUBAME compute node: CASTEP and DMol3 limited to 20 simultaneously used tokens (total of all users)
    TSUBAME login node:   Unavailable
    Laboratory terminal:  CASTEP and DMol3 unavailable; COMPASS limited to 4 simultaneously used tokens; multiple start of the visualizer is prohibited

    Discovery Studio

    Execution location / Limitation
    TSUBAME compute node: Unavailable
    TSUBAME login node:   Unavailable
    Laboratory terminal:  CHARMM and CHARMM Lite limited to 20 simultaneously used tokens; multiple start of the visualizer is prohibited; limit of 6 tokens per user (total with Materials Studio)

    Related URL
    On suspension of license addition in the busy season of 2018

    Notice of the restriction of using ABAQUS analysis on login node and temporary unavailable of it due to maintenance of the license server

    Notification to apply the license restriction for ANSYS (Jan. 31)

    Software distribution service

    Application software




  • There is a known problem of data corruption in collective communications on GPU buffers with OpenMPI.
    (2019.04) This issue was resolved with the maintenance at the end of FY2018.

    As a workaround, please give the following a try.

    • MPI_Allgather()

    mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allgather_algorithm 2

    • MPI_Alltoall()

    mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_alltoall_algorithm 3
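
    These -mca options are added to your usual mpirun command line; for example (a sketch, where the process count and program name are only illustrative):

    mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allgather_algorithm 2 -np 8 ./a.out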

     

    If the above does not solve the issue, please try the following.

     

    mpirun -mca pml ob1

     

    This issue will be fixed by OPA 10.8 (in the end of the fiscal 2018)

     




  • In TSUBAME3, R 3.4.1 is available.
    In addition to the basic package, the libraries available as default are as follows.
    Rmpi, rpud, rpudplus

    Please use library() command to check other available libraries.

    If you wish to use a library other than the above, you need to install it yourself.
    Since you cannot write into the installation directory of R because of permissions, install and manage your own libraries after specifying a library path. The procedure is as follows.

    Assuming that the library path is $HOME/Rlib, the library name is testlib, and testlib.tar.gz is the source package, operate as follows.

    Load modules:
    >module load cuda openmpi r

    Create library installation directory (if nothing):
    >mkdir ~/Rlib

    Download package:
    >cd ~/Rlib
    >wget https://cran.r-project.org/src/contrib/testlib.tar.gz

    Install library
    > R CMD INSTALL -l $HOME/Rlib testlib.tar.gz

    Your own installation library settings:
    > export R_LIBS_USER=$HOME/Rlib

    Use your library:
    > R
      library(testlib)




  • Sometimes an error like the following occurs when mpi4py.futures.MPIPoolExecutor is used with OpenMPI.

    [r5i7n2:26205] [[60041,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87

    If you face this error, please try either of the following:

    1. mpirun -np <NP> python3 -m mpi4py.futures ./test.py

    2. use mpi4py with intel MPI

     




  • Port forwarding can be configured in each terminal software as follows.

    Please try the following after allocating a compute node with qrsh/qsub.

    As an example, suppose a compute node r7i7n7 is allocated, and connect local PC port 5901 to r7i7n7 port 5901.

     

    1. MobaXterm

    Tunneling -> New SSH Tunnel. Under "My computer with MobaXterm", enter 5901 as the "Forwarded port". Under "SSH server", enter login.t3.gsic.titech.ac.jp as the "SSH server", your username, and 22 as the "SSH port". Under "Remote server", enter r7i7n7 as the "Remote server" and 5901 as the "Remote port", then save. Choose the key icon under the Settings tab and start the configured tunnel.


     

    2. OpenSSH/WSL

     

    $ ssh -L 5901:r7i7n7:5901 -i <private key> -f -N <username>@login.t3.gsic.titech.ac.jp

    3. PuTTY

    PuTTY Configuration -> Connection -> SSH -> Tunnels: enter 5901 as "Source port" and r7i7n7:5901 as "Destination", click "Add", and then Open.


     4. teraterm

    Setup -> SSH forwarding -> Add: enter 5901 as "Forward local port", r7i7n7 as "to remote machine", and 5901 as "port", then click "OK".

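
    Once the tunnel is established, point your local client at localhost:5901. For example, if a VNC server is running on the compute node (see the TurboVNC-related FAQ), a local VNC viewer would typically connect like this (the client command name depends on the VNC software you use):

    $ vncviewer localhost:5901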

     




  • With Intel MPI, the output may stop when the program is run in the background like the following.

    mpirun ... ./a.out >& log.txt &

    In this case, it can be avoided by the following:

    mpirun ... ./a.out < /dev/null >& log.txt &




  • If you want to link Intel MKL ScaLAPACK, please fill in the appropriate settings at https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html and take the link options from "Use this link line".

     

    example:link with LP64 + dynamic linking + intel MPI + ScaLAPACK

     -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl
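
    For example, a C source file could be compiled and linked with these options using Intel MPI's compiler wrapper (a sketch; the source file name is only illustrative):

    $ module load intel intel-mpi
    $ mpiicc sample.c -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl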

     

    If you want to use BLACS with a custom MPI, do the following:

     

    $ module load intel

    $ cp -pr $MKLROOT/interfaces/mklmpi .

    $ cd mklmpi

    $ make libintel64 INSTALL_DIR=.

     

    Then you will get the custom BLACS library; replace -lmkl_blacs_intelmpi_lp64 with it.




  • There is a known issue where a segmentation fault sometimes occurs in mpirun/mpiexec.hydra with intel-mpi/19.6.166 or intel-mpi/19.7.217 when using a shared resource type such as h_node or q_node.

     

    $ mpirun -np 1 hostname
    /apps/t3/sles12sp2/isv/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun: line 103: 24862 Segmentation fault      mpiexec.hydra -machinefile $machinefile "$@" 0<&0

     

    To avoid this issue, please use f_node resource type or use intel-mpi/19.0.117.
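
    For example, the older module could be selected in a job script as follows (a sketch; combine it with the compiler and other modules you need):

    . /etc/profile.d/modules.sh
    module purge
    module load intel intel-mpi/19.0.117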




  • Sometimes an error like the following occurs by invoking singularity build with qrsh.

     

    INFO:   Starting build...
    FATAL:   While performing build: conveyor failed to get: error getting username and password: error reading JSON file"/run/user/0/containers/auth.json": open /run/user/0/containers/auth.json:permission denied

     

    As a workaround, do the following after qrsh:

     

    unset XDG_RUNTIME_DIR

     

    and then do singularity build

     

    2020/12/03 update:

    Unsetting XDG_RUNTIME_DIR is now done by the job scheduler, therefore the above is no longer required.




  • If you have a GUI application on TSUBAME and X forwarding fails to draw or the performance is insufficient, TurboVNC may improve the situation.
    Since MobaXterm has a built-in VNC client function, it is relatively easy to use.
    Please refer to User's Guide for how to start a VNC server on a compute node and how to connect it from MobaXterm.




  • If you want to install V8, rstan R packages on TSUBAME, please try the following procedures.

    ※ Confirmed with V8 3.4.2 and rstan 2.19.3.

    There are a lot of dependency packages, but they are easy to install, so please download and install the dependencies with R CMD INSTALL before trying to install V8 and rstan.

     

    * V8

     

    $ module load gcc cuda openmpi r v8

    $ R CMD INSTALL -l ~/Rlib /path/to/V8_3.4.2.tar.gz

     

    * rstan

     

    $ module load gcc cuda openmpi r

    $ mkdir ~/.R/

    $ vi ~/.R/Makevars  <----edit as follows

    CXX14FLAGS=-O3 -Wno-unused-variable -Wno-unused-function
    CXX14 = g++ -std=c++1y -fPIC

    $ R CMD INSTALL -l ~/Rlib rstan_2.19.3.tar.gz




  • A newer gcc module such as gcc/10.2.0 is required to use C++17 parallel algorithms.

    The following error occurs during compilation (e.g. nvc++ -autopar=gpu ...) when using C++ parallel algorithms with the nvhpc module together with a newer gcc module such as gcc/10.2.0.

     

    "/apps/t3/sles12sp2/isv/nvidia/hpc_sdk/Linux_x86_64/21.7/compilers/include-stdpar/thrust/mr/new.h", line 44: error: namespace "std" has no member "align_val_t"
              return ::operator new(bytes, std::align_val_t(alignment));
                                                ^

    "/apps/t3/sles12sp2/isv/nvidia/hpc_sdk/Linux_x86_64/21.7/compilers/include-stdpar/thrust/mr/new.h", line 66: error: namespace "std" has no member "align_val_t"
              ::operator delete(p, bytes, std::align_val_t(alignment));

    If you encounter this error, please try this:

    $ makelocalrc -x -d . -gcc `which gcc` -gpp `which g++`

    $ export NVLOCALRC=$PWD/localrc

     




  • This seems to be a known bug with the combination of macOS Safari and jupyter lab.

     

    https://xchop.blogspot.com/2019/03/macos-jupyter-labterminal.html

    https://qiita.com/qasa/items/b5a6dce179efbf8760dc

     

    Please use another browser (Chrome, Firefox, etc.).




  • An error such that "X fatal error. ***ABAQUS/ABQcaeG rank 0 terminated by signal 6 " occurs at modeling.

    The error seems to occur when ABAQUS CAE is started and modeling is performed over X forwarding with MobaXterm, etc.

    You can avoid this error by using VNC + VirtualGL, so please use that.

     

    Please refer here for more information on how to use VNC from MobaXterm.

    Please refer here for more information on how to use VNC via noVNC.

    Please refer here for more information on how to use VirtualGL from VNC.




  • If the following error:

    Script Error: Microsoft JScript runtime error; Index out of bounds;

    occurs, please try below.

    1. Quit all of Ansys related programs.
    2. Log in to the relevant Linux machine as the relevant general user and execute the following command.

    $ mv ~/.ansys ~/.ansys.bak
    $ mv ~/.config/Ansys ~/.config/Ansys.bak
    $ mv ~/.mw ~/.mw.bak
    $ mv ~/.mw.old ~/.mw.old.bak 

     For Fluent
    $ mv ~/.fluentconf ~/.fluentconf.bak

     For CFX
    $ mv ~/.cfx ~/.cfx.bak


    3. Restart Ansys (after restarting, new folders such as ".ansys" or "Ansys" corresponding to those renamed in step 2 will be created, and your personal settings will be reset).



Tips


    • Acceleration of large-scale collective communications for openmpi

    There are four OPA units per node on TSUBAME3, but only two are used by default.

    Explicitly using all four of them, as shown below, may speed up collective communications with large message sizes, such as MPI_Alltoall(), when the number of nodes is large.

    This is especially effective for GPU communication.

    - wrap.sh

    #!/bin/sh
    # wrap.sh: assign each local MPI rank to one of the four HFI (OPA) units
    # in round-robin order, then execute the program given as arguments.

    export NUM_HFIS_PER_NODE=4
    export HFI_UNIT=$((OMPI_COMM_WORLD_LOCAL_RANK % NUM_HFIS_PER_NODE))

    exec "$@"

     

    - an example job script (f_node=8, GPUDirect on)

    #!/bin/sh
    #$ -cwd
    #$ -V
    #$ -l h_rt=01:00:00
    #$ -l f_node=8

    . /etc/profile.d/modules.sh
    module purge
    module load cuda openmpi/3.1.4-opa10.10-t3

    mpirun -x PATH -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np $((4*8)) ./wrap.sh ./a.out

     

    • Acceleration of point-to-point communications for openmpi

    Multirailing (bundling) multiple HFIs (OPA units) may speed up point-to-point communication.
    This does not seem to be very effective for GPU communication, but it does seem to help CPU communication.
    To enable multirail with openmpi, do the following:

    mpirun ... -x PSM2_MULTIRAIL=2 ... ./a.out

    Also, the performance of point-to-point communication when using multirail seems to depend on the value of PSM2_MQ_RNDV_HFI_WINDOW (default: 131072 bytes, maximum: 4 MB).
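
    For example, the window size could be raised toward that maximum together with multirail as follows (4194304 bytes = 4 MB is an illustrative value, not a recommendation):

    mpirun ... -x PSM2_MULTIRAIL=2 -x PSM2_MQ_RNDV_HFI_WINDOW=4194304 ... ./a.out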

    For more details on each parameter, please refer to here.

     

    • About Intra node GPU communication

    As shown in the Hardware Architecture, GPU0<->GPU2 and GPU1<->GPU3 are each connected by two NVLink links, which doubles the bandwidth.
    If possible, having these GPU pairs communicate with each other may speed up your program.
    ※ This can only be exploited with f_node, since h_node allocates only GPU0,1 or GPU2,3.
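
    To confirm the GPU interconnect topology on an allocated node, you can run the following (the exact output format depends on the driver version):

    nvidia-smi topo -m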

    • Some tips of openmpi

    * Add rank numbers to the output

    mpirun -tag-output ...

    * Explicitly specify the algorithm for a collective communication (by default, openmpi dynamically selects an algorithm based on communicator size and message size).
    Example: MPI_Allreduce()

    mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allreduce_algorithm <algo #> ...

    Algorithm numbers of Allreduce are: 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
    If you want to see the details, run the following:

    ompi_info --param coll tuned --level 9

    or:

    ompi_info -all

    If you want to disable the tuned collective module, run:

    mpirun -mca coll ^tuned

    This disables the tuned module and falls back to the basic module.

    This may degrade performance, but it can be useful when the tuned module misbehaves.

     

    * Use ssh instead of the qrsh -inherit ... launched by mpirun (f_node only)

    mpirun -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh ...

    By default, the processes are launched via qrsh -inherit, but if you are having problems with that, try this.

     

    * Show current MCA parameter configurations

    mpirun -mca mpi_show_mca_params 1 ...

    Also, MCA parameters can be passed via environment variables as follows.

    export OMPI_MCA_param_name=value

    where "name" is the variable name, "value" is its value.

     

    * Obtaining a core file from an openmpi job when a segmentation fault occurs

    When you want to get a core file after a segmentation fault etc. with openmpi, the core file does not seem to be produced even if you set ulimit -c unlimited in the job script.
    You can get the core file by wrapping the program as follows.

     

    - ulimit.sh

    #!/bin/sh
    # ulimit.sh: raise the core file size limit, then execute the program given as arguments.

    ulimit -c unlimited
    exec "$@"

    mpirun ... ./ulimit.sh ./a.out

     

    * CPU bindings

    The CPU binding options in openmpi are as follows.

    mpirun -bind-to <core, socket, numa, board, etc> ...
    mpirun -map-by <foo> ...

    For more details, please refer to man mpirun.

    If you want to confirm the actual binding, try the following:

    mpirun -report-bindings ...
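
    As a combined sketch (the mapping and binding policies here are examples, not recommendations):

    mpirun -map-by socket -bind-to core -report-bindings ... ./a.out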

     

    • Some other tips

    * GPUDirect threshold on the sender side

    The GPUDirect threshold on the sender side is 30000 bytes by default.
    On the receiver side it is UINT_MAX (2^32-1), so if you want GPUDirect to stay on even when the buffer size is large, you can set the following on the send side.

    mpirun -x PSM2_GPUDIRECT_SEND_THRESH=$((2**32-1)) ...

    For more detail, please refer to here.




    • Job hangs or error occurs when NCCL is used

    Several problems have been reported where jobs hang or errors occur when NCCL is used.

    In some cases, a kernel panic also happens.

    If you suspect that you have hit this problem, try the following:

    export NCCL_IB_DISABLE=1

    or

    export NCCL_BUFFSIZE=1048576

    NCCL_IB_DISABLE=1 may decrease performance; in that case, please use NCCL_BUFFSIZE=1048576 instead.

     

    • Segmentation fault when MPI+OpenACC is executed

    A problem has been reported where a segmentation fault occurs when openmpi+OpenACC is used.

    As a workaround, please try one of the following:

    export PSM2_MEMORY=large

    or

    export OMPI_MCA_pml=ob1
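
    When launching with mpirun, these environment variables can also be passed to every rank with -x, as in the other examples above:

    mpirun ... -x PSM2_MEMORY=large ... ./a.out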

     

    • Error occurs when GPUDirect is used.

    (2021/08/19 updated)

    This issue has been fixed by the last maintenance.

    Some cases have been reported where an error occurs when GPUDirect is used and the program exits, whether normally or abnormally.

    This error happens rarely.

    It also sometimes triggers a kernel panic.

    If you suspect this, please turn off GPUDirect as follows.

    mpirun ... -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=0

     

    • Job hangs with large-scale jobs

    Hangs of mpirun in large-scale jobs have been reported before.

    This seems to be caused by the qrsh -inherit processes fork()'ed by mpirun.

    If you suspect this, please try the following.

    * For openmpi

    mpirun -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh ...

    * For intel MPI

    export I_MPI_HYDRA_BOOTSTRAP=ssh
    unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS

    ※ Please note that these are effective only for f_node.



Migration from TSUBAME2.5


  • Because the compilers, MPI, and various libraries differ between TSUBAME 3.0 and TSUBAME 2.5, programs built on TSUBAME 2.5 cannot be executed as they are; they must be recompiled on TSUBAME 3.0.




  • Because the storage connected to TSUBAME 3.0 is different from that of TSUBAME 2.5, the data stored in /home, /work0, /work1, and /data0 of TSUBAME 2.5 cannot be accessed directly.

    The data migration procedure from TSUBAME 2.5 to TSUBAME 3.0 will be published as soon as it is ready.




  • This article describes the main differences between TSUBAME 2.5 and TSUBAME 3.0.
    Please refer to the "TSUBAME 3.0 User's Guide" and "TSUBAME portal User's Guide" from here for details.

    The flat-rate option has been abolished

    Two options were selectable for TSUBAME 2.5: "flat-rate usage (campus only)" and "measured-rate usage (purchase of TSUBAME points)".

    For TSUBAME 3.0, the flat-rate option has been abolished. Whether on campus or not, usage has been unified into the measured-rate option.

    Payment method has been unified into prepayment

    There were two payment methods in the measured-rate option for TSUBAME 2.5: "prepayment" and "automatic charging (campus only)".

    For TSUBAME 3.0, automatic charging has been abolished. Whether on campus or not, the payment method has been unified into prepayment. Points purchased by prepayment are not refunded in principle. When purchasing points, please double-check the group, payment code, and number of purchased units (price) to avoid mistakes.

    Password login to both TSUBAME3 and portal from inside campus is no longer in use

    Until TSUBAME 2.5, it was possible to use password authentication when logging in from inside the campus.

    In TSUBAME 3.0, to improve security, password authentication for login is no longer available even from inside the campus, except for some terminals; SSH public key registration is required. In addition, when logging in to the TSUBAME portal, you can use either single sign-on from the Tokyo Tech Portal or a temporary login URL sent by e-mail; a password is not used in either case.

    Operations requiring passwords with TSUBAME 3.0 are as follows. If you do not use the following functions, you do not need to set a password.

    • Connection to TSUBAME 3 high-speed storage by CIFS
    • Changing login shell with chsh command
    • Login to educational computer system
    • Terminals in some training rooms / Login without using SSH key authentication from TSUBAME 2 (for data migration)
    • Use of some ISV application licenses

    While TSUBAME 2.5 had a password expiration period of about half a year, TSUBAME 3.0 does not have a password expiration date. Please pay attention to password management and change your password as necessary.

    The price and value of TSUBAME points have changed

    Since TSUBAME 2.5 and TSUBAME 3.0 have different performance per node, the usage fee per unit of computation time differs.
    In TSUBAME 2.5, 1 point bought roughly 1 node-hour (e.g. 2 nodes for 30 minutes cost 1 point); in TSUBAME 3.0, 1 point buys roughly 1 node-second (e.g. 2 nodes for 30 minutes cost 3600 points).

    For more information, please refer to rules / usage details.

    ID verification is required when adding members to a group or using another faculty member's budget


    In TSUBAME 3.0, approval of the relevant user is required when adding members to a group or creating a payment code for a budget for which another faculty member is responsible. When an authorized user performs these operations, an approval request e-mail is sent to the relevant user (in some cases it is sent after confirmation by GSIC). Please log in to the TSUBAME portal, access the URL written in the e-mail, and approve according to the displayed instructions.




  • This article describes the main differences between TSUBAME 2.5 and TSUBAME 3.0. Please refer to the "TSUBAME 3.0 User's Guide" and "TSUBAME portal User's Guide" for details.

    What you can do with the login node

    When you logged in to TSUBAME 2.5, the terminal was connected to an interactive node, which had the same configuration as a compute node, so you could compile and debug applications on it.

    In TSUBAME 3.0, the login node has a different configuration from the compute nodes (e.g. no GPU is attached), so it is not intended for running applications, including debugging.
    Although it is fine to transfer and deploy files and compile small programs on the login node, avoid putting heavy loads on the login node, such as debugging or running large programs. Instead, use compute nodes interactively as described in the next section.

    How to use interactive execution

    To use interactive execution, connecting directly to a compute node and entering commands, please use the following command:
     $ qrsh -g [TSUBAME3 group] -l [resource type]=[number] -l h_rt=[elapsed time]
    Once a compute node is secured, a shell logged in to that compute node is opened. When you exit this shell, the interactive session ends and the compute node is released.

    If -g is omitted, TSUBAME points are not consumed, but restrictions such as a limited execution time (within 10 minutes) are applied as "trial execution".

    GUI applications cannot be executed from the shell started by qrsh. Therefore, after starting interactive use of the resource type f_node as above, connect with ssh -Y from the login node to the compute node in another terminal, different from the one where qrsh is running, and run the GUI application there. Please be aware that you need to specify the -Y option both for the ssh connection from your terminal to the login node and for the ssh connection from the login node to the compute node, as in the sketch below.
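
    A minimal sketch of the procedure (the group name, run time, and host names are placeholders):

     # terminal 1: secure a compute node interactively
     $ qrsh -g [TSUBAME3 group] -l f_node=1 -l h_rt=1:00:00

     # terminal 2: connect with X forwarding and start the GUI application there
     $ ssh -Y [login node]
     $ ssh -Y [compute node]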

    How to run applications (setting of PATH etc.)

    In TSUBAME 2.5, most applications were ready to run right after login. When switching the version or MPI environment of some applications, the environment was changed by executing the specified environment setting script.

    When you log in to TSUBAME 3.0, environment variables for most applications are not set. An application can be used by explicitly loading the module file corresponding to it.
    For example, when using the Intel compiler, CUDA, and OpenMPI, the modules must be loaded before compiling the application and before executing the job, as described below:
     $ module load intel cuda openmpi

    Job restrictions
    Please check "Various limit value list" about the current limit.
     

    In TSUBAME 2.5, instead of paying more TSUBAME points, it was permitted to run jobs longer than 24 hours as a premier option. The option to extend execution time has been abolished in TSUBAME 3.0. The maximum execution time for all jobs is 24 hours.

    In TSUBAME 2.5, large-scale reserved execution of 16 nodes or more could be performed in units of one day by using the H queue.

    In TSUBAME 3.0, preparations are currently underway so that reserved execution of one node or more, in units of one hour, can be performed.




  • This article describes the main differences between TSUBAME 2.5 and TSUBAME 3.0.
    Please refer to the "TSUBAME portal User's Guide" for the node reservation method and the "TSUBAME 3.0 User's Guide" for how to submit jobs to reserved nodes.

    In addition, some settings and limit values may be updated in consideration of the system usage situation. We will make an announcement when settings change, so please periodically check the latest notices.
     

    Small reservations are easier to make than before.

    In the H queue of TSUBAME 2.5, reservations could only be made for 16 nodes or more, in units of one day, for large-scale execution. In TSUBAME 3.0, it is possible to reserve one node or more in units of one hour, so in addition to large-scale execution, reservations can also be used for long-term execution, etc.

    Reservation-related limit values of TSUBAME 3.0
    Please check "Various limit value list" about the current limit.

    Maximum number of reserved nodes: 135 nodes (October-March), 270 nodes (April-September)
    Reservation time length: 1 hour to 96 hours (4 days) (October-March), 1 hour to 168 hours (7 days) (April-September)
    Total amount of reservation slots that one group can hold at once: 6480 node-hours (October-March), 12960 node-hours (April-September)

    About the time that a job can actually be executed

    In TSUBAME 2.5, we were able to occupy the nodes from 10 am on the reservation start date until 9 am on the reservation end date.
    In TSUBAME 3.0, the nodes can be used from the reservation start time until 5 minutes before the reservation end time, and all jobs are stopped 5 minutes before the end time.

    On submitting jobs to reserved nodes

    By adding " -ar reservation number "to the arguments of qsub, qrsh etc., you can submit the job to the reserved node.(You can submit a job before the start time of reservation slots)
    Please note that if you do not specify " -ar reservation number ", you will consume points and execute the job outside the reservation slots."
    Even if you are using a resource type other than f_node, please be aware that you can not submit jobs with more than parallel number of reserved nodes.For example, h_node 40 parallelism can not be executed when 20 nodes are reserved. You can run two parallel jobs at the same time.
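
    For example, a submission to a reserved slot might look like this (12345 is a hypothetical reservation number):

    $ qsub -g [TSUBAME3 group] -ar 12345 job.sh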

    About SSH / direct login to reserved node

    In TSUBAME 2.5, members of the TSUBAME group that made the reservation could log in to the reserved nodes with SSH and run programs directly without going through the scheduler.
    In TSUBAME 3.0, only users who have submitted jobs can log in with SSH, and only for f_node jobs. To execute a program directly, create a job with the required number of f_node nodes, or log in with qrsh.

    Attention on reservations made just before the start time

    For TSUBAME 2.5, the point consumption of a reservation starting within one week was constant.
    In TSUBAME 3.0, a reservation starting within 24 hours costs four times as many points as a reservation starting more than 24 hours ahead (within 2 weeks).
    This is to avoid affecting already-submitted jobs outside the reservation.
    In addition, since the nodes used for reservations are shared with the nodes used for non-reserved jobs, there is a high possibility that a reservation within 24 hours cannot be secured, depending on the job execution status.
    For large-scale execution, please prepare in advance; early reservation is recommended.

    Note on cancelling a reservation

    In TSUBAME 2.5, the TSUBAME points consumed for a reservation were fully returned when the reservation was deleted. TSUBAME 3.0 returns at most half of the points when a reservation is cancelled, except in the following cases:

    • Cancellation within 5 minutes after making the reservation
    • Cancellation for reasons outside the user's responsibility, such as system maintenance

    Reserving nodes makes it harder for other jobs to run, because compute nodes must be kept free for the reserved time. Please confirm the reservation contents carefully when making a reservation, and reserve only the amount you need.