General


  • Warning: If your SSH private key is leaked, your account may be misused by a third party. Please protect your private key by setting a passphrase.

    This article describes how to create an SSH key pair for TSUBAME3 using PuTTYgen, which is installed together with PuTTY.
    MobaKeyGen, bundled with MobaXterm, has the same functionality and UI.

    When you launch PuTTYgen, a dialog similar to the following appears:
    (PuTTYgen screen)

    1. Press "Generate" to create SSH key-pair
      You can adjust key-pair configurations using "Parameters" box, but you don't have to do so in most cases.
    2. Press "Save private key" to save a private key file of generated key-pair for future login.
      After registering public key to TSUBAME3, anyone who can read this private-key file can log in to TSUBAME3 with your account. Please keep this file safe, DO NOT carry with USB stick, or send via e-mail etc.
      You can force to enter passphrase to use this key file, by inputting them in "Key passphrase" and "Confirm passphrase" boxes before saving.
    3. Copy all texts in "Public key for pasting..." box, paste it into "Enter SSH public key code" in "Register SSH public key" menu in TSUBAME portal, and then submit this.

    When you want to log in to TSUBAME, either load the file saved in step 2 into Pageant beforehand (.ppk files are associated with Pageant if you installed PuTTY with the installer's default options), or specify the file in the "Private key file for authentication" box under the "Connection" - "SSH" - "Auth" menu of the PuTTY connection settings dialog.




  • TSUBAME 3.0 is a supercomputer operated and managed by the Global Scientific Information and Computing Center (GSIC) of the Tokyo Institute of Technology. TSUBAME 3.0 has a theoretical peak performance of 47.2 PFlops (half precision) and is expected to be the largest supercomputer in Japan, handling a wide range of workloads including big data and AI in addition to conventional High Performance Computing. In addition, by pursuing high density and power saving, it achieves a theoretical PUE of 1.033.




  • OpenMPI can be used in combination with either the GNU or the Intel compiler.

    Load the module of the compiler you want to use before loading the OpenMPI module.

    The GNU compiler is the one provided by the OS, version 4.8.5.
    Please check the available version with the following command.

    $ module av


    Below is the usage method.

    1. OpenMPI for Intel

    $ module load intel 
    $ module load cuda 
    $ module load openmpi/2.1.1 
    $ mpicc -V 
    Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.4.196 Build 20170411 Copyright (C) 1985-2017 Intel Corporation. All rights reserved.

     

    2. OpenMPI for GNU

    $ module purge    # unload the modules that are already loaded
    $ module load cuda
    $ module load openmpi/2.1.1
    $ mpicc -v
    Using built-in specs. COLLECT_GCC=/usr/bin/gcc COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.8/lto-wrapper Target: x86_64-suse-linux Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8 --enable-ssp --disable-libssp --disable-plugin --
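
    After loading the modules as shown above, you can compile and run an MPI program in the usual way. A minimal hedged sketch (the source file hello.c and the process count are illustrative; run MPI programs inside a job, not on the login nodes):

    $ mpicc hello.c -o hello    # compiled with the compiler selected by the loaded modules
    $ mpirun -np 4 ./hello      # launch 4 MPI processes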




  • This is caused by flow control triggered by specific input characters, which is enabled in the default terminal settings.

    Flow control is a mechanism that temporarily suspends data transfer to prevent the receiving side from being overwhelmed, for example when the transmission rate exceeds the rate at which packets can be received. In general, Ctrl+S is the control character used to suspend the transfer and Ctrl+Q the one used to resume it.

    When you edit a file interactively with Emacs and save it, you press Ctrl+S, but since this is also a flow control character, packets stop being transferred and the terminal appears to be frozen. To recover, press Ctrl+Q.

    To disable flow control, execute the following command before running an interactive job.

    stty -ixon

    If you want to always disable flow control, add the above command to .bashrc in your home directory.
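
    For example, the following appends a guarded version of the command to .bashrc (a sketch; the [ -t 0 ] test avoids errors in non-interactive shells):

    $ echo '[ -t 0 ] && stty -ixon' >> ~/.bashrc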




  • Use of TSUBAME is limited to education, research, clerical work, and social contribution purposes only. It cannot be used for applications that directly lead to private financial gain, for example mining cryptocurrency based on blockchain technology.




  • The basic structure of module files is described below.


    - Module files are listed as [Application name]/[Version].


    - If you do not specify a version, the preset default version is loaded.

    - If more than one version exists and the default is set, "(default)" is displayed after the version.

    Example:

    $ module load intel
    $ module list

       Currently Loaded Modulefiles:

        1) intel/17.0.4.196


    - Applications with dependencies, such as MPI, can be used after loading the required modules in advance.

    Example: namd

    If required modules are missing, an error is displayed:

    $ module load namd

       namd/2.12(3):ERROR:151: Module 'namd/2.12' depends on one of the module(s) 'intel/17.0.4.196 intel/16.0.4.258'
       namd/2.12(3):ERROR:102: Tcl command execution failed: prereq intel

    After loading the intel and cuda modules, load namd:

    $ module load intel
    $ module load cuda
    $ module load namd
    $ module list

      Currently Loaded Modulefiles:

      1) intel/17.0.4.196   2) cuda/8.0.44        3) namd/2.12
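
    To check in advance which modules an application depends on, you can display its module file with the module show command (a standard Environment Modules command; the output format may vary):

    $ module show namd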




  • UGE assigns virtual CPU IDs / GPU IDs according to the specified number of resources, except for f_node.

    • In case of CPU

    Take as examples the resource type s_core, which reserves only one CPU core, and the resource type q_core, which reserves 4 CPU cores:
    When s_core=7 is specified, seven nodes are allocated, with 1 core allocated on each node.
    When q_core=7 is specified, seven nodes are allocated, with 4 cores allocated on each node.

    • In case of GPU

    In the case of the resource type s_gpu, which reserves only one GPU:
    When s_gpu=4 is specified, 4 nodes are reserved, and the GPU on each node is virtually assigned as GPU 0.
    Reserving 4 GPUs does not mean they appear as GPU 0, 1, 2, 3.

    With h_node, a resource type that reserves 2 GPUs, the 2 GPUs are allocated within one node, and in this case they appear as GPU 0 and 1.
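
    To see which GPU IDs your job actually received, you can run nvidia-smi inside the job; a minimal hedged sketch of a job script (the resource type and wall time are illustrative):

    #!/bin/bash
    #$ -cwd
    #$ -l s_gpu=1
    #$ -l h_rt=0:10:00
    nvidia-smi -L    # lists the GPU(s) visible to this job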




  • The difference between login node and compute node is as follows.

    1. Hardware

                     Login node                                  Compute node
      # of nodes     2                                           540
      CPU            Intel Xeon E5-2637 v4 3.50GHz 16core x 2    Intel Xeon E5-2680 v4 2.40GHz 14core x 2
      Memory         64GiB                                       256GiB
      GPU            -                                           NVIDIA Tesla P100 x4
      Interconnect   Intel Omni-Path HFI 100Gbps x2              Intel Omni-Path HFI 100Gbps x4
      NVMe SSD       -                                           Intel SSD DC P3500 2TB

     

    2. Software

                     Login node                        Compute node
      OS             SUSE Linux Enterprise 12 SP2      SUSE Linux Enterprise 12 SP2
      Kernel         4.4.74-92.29-default              4.4.74-92.29-default
      GPU Driver     -                                 nvidia-375.66

    The login nodes are shared servers and are not intended to be used for computation. Please avoid high-load processing such as program execution on the login nodes; execute it on compute nodes through the job scheduler.
    If you accidentally run a high-load program on a login node, please terminate it immediately with the kill command.




  • Warning: If your SSH private key is leaked, your account may be misused by a third party. Please protect your private key by setting a passphrase.

    The procedure to create an SSH key pair on Linux / Mac / Windows (Cygwin or OpenSSH) is as follows.
    Please check man ssh-keygen for the differences between key types.
    Some key types may be unsupported depending on your version of OpenSSH.

    ecdsa key type:

    $ ssh-keygen -t ecdsa

    RSA key type:

    $ ssh-keygen -t rsa

    ed25519 key type:

    $ ssh-keygen -t ed25519

    When you execute one of the above commands, you will be asked for the save location as follows.
    Unless there is a special reason to avoid it, such as the same file name already being used for another purpose, just press the Enter key to use the default value.
    (If you are already using an SSH key pair for other sites, you can reuse the same file for TSUBAME.)

    Generating public/private keytype key pair.
    Enter file in which to save the key $HOME/.ssh/id_keytype: (No need to type filename)[Enter]

    Then you will be prompted for a passphrase, so enter it.

    Enter passphrase (empty for no passphrase): (Set passphrase; What you type will not appear in screen) [Enter]

    Re-enter your passphrase for confirmation.

    Enter same passphrase again: (Enter the same passphrase again for confirmation; What you type will not appear in screen) [Enter]

    A key pair is created and saved to two files. The upper line shows the location of the private key, and the lower line shows that of the public key. Register the public key via the TSUBAME portal.

    Your identification has been saved in $HOME/.ssh/id_keytype
    Your public key has been saved in $HOME/.ssh/id_keytype.pub.
    The key fingerprint is:
    SHA256:(fingerprint) username@hostname
    The key's randomart image is:
    (Some text specific to the generated key pair will be shown)

    Check the file with the following command.

    $ ls ~/.ssh/ -la
    drwx------  2 user group     512 Oct  6 10:50 .
    drwx------ 31 user group    4096 Oct 6 10:41 ..
    -rw-------  1 user group     411 Oct 6 10:50 private_key
    -rw-r--r--  1 user group      97 Oct 6 10:50 public_key

    If the permissions are not correct, fix them with the following commands.

    $ chmod 700 ~/.ssh
    $ chmod 600 ~/.ssh/private_key



  • Please go through the following checklist before contacting us for help.

    1. Is your account name correct?
    Please confirm that you are using your TSUBAME3 account.
    We receive an increasing number of inquiries from users who fail to log in because they are using their TSUBAME2 account.
    Please refer to the following link for how to get a TSUBAME3 account.
    http://www.t3.gsic.titech.ac.jp/en/getting-account 


    2. Did you register a public key in the correct format?
    Please confirm that you have registered a public key in OpenSSH format to TSUBAME portal.
    You cannot log in to TSUBAME3 if you registered a public key in PuTTY format.

    Please refer to the following links for how to create a key pair.
    http://www.t3.gsic.titech.ac.jp/en/node/37
    http://www.t3.gsic.titech.ac.jp/en/node/79

    Please refer to the following for how to register the public key.
    https://helpdesk.t3.gsic.titech.ac.jp/manuals/portal.ja/prepare/#ssh_key 


    3. Is the command you entered correct? (for Linux / Mac / Windows (Cygwin))
    Please make sure your login name and the path to your private key file (*) are specified correctly on the command line.

    $ ssh <TSUBAME3 account>@login.t3.gsic.titech.ac.jp -i <private key>

    Example)
    When your login name is gsic_user and your private key is located at ~/.ssh/t3-key:

    $ ssh gsic_user@login.t3.gsic.titech.ac.jp -i ~/.ssh/t3-key

    *: If your private key file is stored in one of the following default locations under your home directory (i.e., you have not changed the location from the default), you can omit the "-i <private key>" part; alternatively, you can set the key location in ~/.ssh/config as shown below.

    • .ssh/id_rsa, .ssh/id_dsa, .ssh/id_ecdsa, .ssh/id_ed25519
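
    For reference, a hedged example of an OpenSSH client configuration (~/.ssh/config) that lets you omit both the login name and the -i option; the host alias t3 and the key path are illustrative:

    Host t3
        HostName login.t3.gsic.titech.ac.jp
        User gsic_user
        IdentityFile ~/.ssh/t3-key

    With this configuration, you can log in simply with "ssh t3".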

    Please refer to the following man command for options of ssh command.

    $ man ssh 


    4. Does the symptom reproduce in another terminal environment?
    There are various types of terminal software for Windows.
    Please check whether it reproduces even with another terminal software.
    If it does not reproduce, it may be a problem specific to that software.
    Please understand that we cannot respond to inquiries about such software-specific problems.


    If none of the above solves your problem, please contact us with the following information.


    ■ Operating system (Windows 10, Debian 10, macOS Sierra 10.12.6, and so on)
    ■ Terminal software and its version (Cygwin, PuTTY, RLogin, and so on)
      For details on how to check the version, please refer to the terminal software manual.
      For Linux / macOS, please send the SSH version. You can check it with the following command.

    $ ssh -V

    ■ The operation you tried. If you get an error, please send the details.
      For Linux / macOS, please send the output of the ssh command with the -v option (debug mode), including the command line itself.
      Example)
        When your account name is gsic_user and your private key is located at ~/.ssh/t3-key:

    $ ssh gsic_user@login.t3.gsic.titech.ac.jp -i ~/.ssh/t3-key -v




  • If you accidentally executed a program on a login node, where program execution is prohibited, terminate it according to the following procedure.
     

    See "How to terminate the job submitted to the batch job scheduler" for the deletion of the jobs submitted to the batch job scheduler.

    1. Confirm the PID of the process

    Display information about the process you want to terminate with the top and/or ps commands.

    The following example checks the PID with the top command.

    $ top

    Tasks: 1457 total,   1 running, 1441 sleeping,  11 stopped,   3 zombie
    %Cpu(s):  78.8 us,  1.3 sy,  0.0 ni, 96.8 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
    KiB Mem:  65598488 total, 18563160 used, 47035328 free,        8 buffers
    KiB Swap:  7812092 total,  7422860 used,   389232 free.  6553100 cached Mem
      PID USER   PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
    20680 GSIC   20   0 1157756 5.056g  20628 R 1467.0 1.688   0:01.88 python
        1 root   20   0  479464 294444   2940 S  0.000 0.449  76:02.24 systemd
        2 root   20   0       0      0      0 S  0.000 0.000  11:25.50 kthreadd
        3 root   20   0       0      0      0 S  0.000 0.000   9:48.70 ksoftirqd/0
        9 root   20   0       0      0      0 S  0.000 0.000   0:00.00 rcu_bh
       10 root   rt   0       0      0      0 S  0.000 0.000   0:45.45 migration/0
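
    Alternatively, you can look up the PID with the ps command; a minimal hedged example (the output columns are PID, CPU usage, elapsed time, and command name):

    $ ps -u $USER -o pid,pcpu,etime,comm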

    2. Termination of the process

    Terminate the process with kill command.
    Specify the PID as the argument. (20680 in this example)

    $ kill 20680

    3. Confirmation that the process is terminated
    Use the top and ps commands to confirm that the process has terminated.
    If the process is no longer displayed, it has terminated correctly.
    Proceed to step 4 if the process does not terminate.

    4. Force termination of the process
    Execute the command below if the process does not terminate.

    $ kill -9 20680

    After that, use the top and ps commands to confirm that the process has terminated.




  • This section shows the flow of setting up an environment for running programs.

    There are 6 steps necessary to use TSUBAME3.

    When steps 1 and 2 are done, you can log in.
    To submit jobs, you also need to complete steps 3 - 5.
    If you need storage beyond the 25GiB home directory, also do step 6.

    1. Getting an account [each user]
    2. SSH key pair generation and the public key registration [each user]
    3. Creation of a group [group administrator only]
    4. Addition of users to the group [group administrator and its members]
    5. Point purchase [group administrator only]
    6. Setup of group disk [group administrator only]

    * The brackets indicate who needs to perform each step.

    For details, refer to TSUBAME Portal User's Guide.

    For storages provided on TSUBAME3, refer to TSUBAME3.0 User's Guide "3. Storage system".

    If you can not login, please troubleshoot according to Cannot login to TSUBAME3.




  • File transfer by rsync, scp, and sftp is available on TSUBAME3.
    As with login, you need to authenticate with the SSH private key that pairs with the SSH public key registered in the TSUBAME3 portal.
    Also, please check the settings of the application you are using carefully, as some applications may time out.

    To install a file transfer application

    If you are using MobaXterm or RLogin, it is easier to use the built-in file transfer function of these software.

    If you are using other software such as PuTTY for connection, you need to install a file transfer application such as FileZilla or WinSCP that supports sftp and rsync protocols.
    In this case as well, you need to authenticate using the SSH private key that pairs with the SSH public key registered in the TSUBAME3 portal.
    For Filezilla and WinSCP, you can use the .ppk format key files that you usually use for PuTTY.
    For details on how to use each software, please refer to the manual of each software.

    If the optional feature "OpenSSH Client" in Windows 10 is enabled, you can use the scp and sftp commands from Command Prompt or PowerShell.

    If you are using Linux/Mac/Cygwin (Windows) etc. (rsync, scp, sftp commands)

    In these environments, rsync, scp, and sftp commands are available.

    The three methods, rsync, scp, and sftp, are described below.

    rsync:

    To transfer from the local to the remote host, execute the following command.
    If you set the standard path/file name as the key pair location, the -i option is not required.

    $ rsync -av --progress -e "ssh -i Private_Key_File -l Login_Name" Local_Directory Remote_Host:Remote_Directory

    Local_Directory is the transfer source and Remote_Host:Remote_Directory is the transfer destination. For example, the command for the user with login name "GSICUSER00" to copy the current directory to /gs/hs0/GSIC on TSUBAME 3.0 using the private key ~/.ssh/ecdsa is as follows.

    $ rsync -av --progress -e "ssh -i ~/.ssh/ecdsa -l GSICUSER00" ./ login.t3.gsic.titech.ac.jp:/gs/hs0/GSIC

    For details such as how to specify the transfer source and transfer destination, please run the following command:

    $ man rsync

    scp:

    To transfer from the remote host to the local machine, execute the following command (add the -r option to copy a directory recursively).
    If the key pair is saved at the standard path/file name, the -i option is not required.

    $ scp -i Private_Key_File Login_Name@Remote_Host:Remote_Directory Local_directory

    Replace the placeholders with values suitable for your situation. For example, the command for the user with login name "GSICUSER00" to copy /gs/hs0/GSIC on TSUBAME 3.0 to the current directory using the private key ~/.ssh/ecdsa is as follows.

    $ scp -r -i ~/.ssh/ecdsa GSICUSER00@login.t3.gsic.titech.ac.jp:/gs/hs0/GSIC .

    For details such as how to specify the transfer source and transfer destination, please run the following command:

    $ man scp

    sftp:

    To transfer interactively, execute the following command.
    If you set the standard path/file name as the key pair location, the -i option is not required.

    $ sftp -i Private_Key_File Login_Name@Remote_Host

    For example, the command for the user with login name "GSICUSER00" to start an interactive transfer session using the private key ~/.ssh/ecdsa is as follows.

    $ sftp -i ~/.ssh/ecdsa GSICUSER00@login.t3.gsic.titech.ac.jp

    For details such as how to specify the transfer source and transfer destination, please run the following command:

    $ man sftp

     

    To use CIFS access

    In addition, the group disks can be accessed via CIFS, but only from on-campus terminals.

    The CIFS address is \\gshs.t3.gsic.titech.ac.jp.
    Please note that even within the campus network, CIFS may be blocked by routers along the path, in which case it cannot be used.


    For the storage provided on TSUBAME3, refer to "3.3. Storage service (CIFS)" in the TSUBAME3.0 User's Guide.




  • The answer depends on what you are a beginner at.

    1. Beginners of UNIX/Linux
    To use TSUBAME3, users are expected to be proficient with UNIX/Linux; the handbooks are written on this assumption.

    If you do not understand the content of the handbooks, please read a UNIX/Linux beginner's book at the library and learn how to use UNIX shells and commands.

    You should find this helpful.

    There are also various publications on how to operate "terminal emulator" software. Please check them according to the software you use.

    First, learn how to operate UNIX/Linux, then read our guidebooks, and then check section 3 below.


    2. Beginners of supercomputers
    If you have used UNIX/Linux but have never used a job scheduler, please read section 5 "Job Scheduler" of the TSUBAME 3.0 User's Guide.

    3. Beginners of TSUBAME3 (users who used TSUBAME 2.5 in the past)
    Technical specifications differ between TSUBAME 3.0 and TSUBAME 2.5.

    TSUBAME 2.5 binaries do not work on TSUBAME 3.0, and a binary obtained by simply recompiling source code written for TSUBAME 2.5 on TSUBAME 3.0 may not work either.

    Please check the technical specifications of TSUBAME 3.0 and, if your program does not work, recompile it with appropriate modifications.

    The TSUBAME 3.0 system configuration is described in Hardware, System Software, Application Software, and "Overview of TSUBAME3" in "Introduction to TSUBAME (Linux basics)".

    Please also check the "Migration from TSUBAME 2.5" section of the FAQ and the TSUBAME 3.0 User's Guide.

    4. Beginners of ISV application software
    Please check the guide of each application. In addition, workshops for TSUBAME 3.0 are held regularly; please check the seminar page.




  • For TSUBAME3, a session timeout is set as a security measure.
    Sessions that have no input for a certain period of time are disconnected.
    Even if a GUI application is started and being operated, the session will be disconnected if there is no input to the terminal.

    If you want to avoid this, please enable keep-alive on the terminal side.
    Please check your terminal software's user guide for the keep-alive setting.
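
    For OpenSSH clients, a hedged example of a keep-alive setting in ~/.ssh/config (the 60-second interval is illustrative):

    Host login.t3.gsic.titech.ac.jp
        ServerAliveInterval 60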




  • When trying to access a group disk from Windows Explorer, the following may be displayed.
    This problem may occur for accounts created early (before September 2017).

    (CIFS connection error dialog)

    This is because the password of the TSUBAME account has expired.

    Please update your password from the TSUBAME Portal Page.




  • This message indicates there is no space left in either your home directory or a group disk.

    When you face it, you should delete unused files or purchase an additional group disk to keep enough free disk space.

     

    Please note that in some cases temporary files are generated in the home directory, and an application sometimes needs more than 25 GB of disk space for its temporary files (25 GB is the capacity of the home directory).

    To avoid running out of disk space, we recommend using the local scratch area or the shared scratch area, rather than the home directory, as the location for temporary files.

     

    Related FAQ




  • About group disk

    The group disk is shared storage whose capacity is set via the TSUBAME portal for each TSUBAME group, as described in the TSUBAME3.0 User's Guide "3.2 High-speed storage area".

    • Usage period: From purchase date to the end of the fiscal year (end of March).

     For example, if you purchase 10TB on 1 May, 3,960,000 points are required (36,000 points x 10TB x 11 months (until the end of March)).
        Even if you buy 10TB on 31 May, the last day of the month, it costs the same 3,960,000 points as a purchase on the 1st of the month.

    • Purchase unit: 1TB (2,000,000 inodes per TB)
      Up to 300TB per group.

     If the capacity is unused, you can get points back by reducing the purchased capacity.
     For example, if you purchase 1TB in April and delete all data in May to reduce the capacity, 396,000 points (36,000 points x 1TB x 11 months (until the end of March)) will be returned.

        Reference:TSUBAME Portal User's Guide "10. Management of Group Disk"


    What is the group disk grace period ?

    Group disks are reset once at the end of the fiscal year, after which all group disks enter a grace state in which data can only be read or deleted.
    This period is called the grace period, and it usually lasts until around the middle of April (17 April this year).

    Reference:TSUBAME Portal User's Guide "10.2. About the validity period of the group disk" 

    If data from the previous year remains and you purchase capacity after the grace period, the following applies.
    For example, suppose you purchased 50TB in the previous year and used 45TB of it.

    1) When 45TB is deleted during the grace period and the used capacity is 0:
         You can purchase from 1TB, which is the minimum capacity.

    2) When 25TB is deleted during the grace period and the used capacity is 20TB:
        You can purchase from 20TB or more.

    3) When nothing is deleted during the grace period (the used capacity remains 45TB):
        You can purchase from 45TB or more.

    *If you do not need the previous year's data, please delete it during the grace period.


    Related FAQ

    Checking the usage of group disks with command
    Can not establish CIFS connection to the group disk
    "Disk quota exceeded" error is output




  • The GPU clock can be changed only with f_node.

     

    • Display available clock frequencies

    $ nvidia-smi -q -d SUPPORTED_CLOCKS

    • Changing the clock

    $ nvidia-smi -ac <memory clock>,<graphics clock>

    ex.)
    $ nvidia-smi -ac 715,999

    • Resetting the clock

    $ nvidia-smi -rac

     

    The target device is specified with the -i option.
    For details, see the command help.
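
    For example, to apply the clock setting shown above only to device 0 (a usage sketch combining the options above):

    $ nvidia-smi -i 0 -ac 715,999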




  • The IP address range of the compute node gateway server is as follows.
    131.112.3.250-131.112.3.253

    When running computations on TSUBAME that use a campus or university license server, please configure the license server so that communication from the above address range is permitted.

    Please keep in mind that the above addresses may be changed without notice due to operational circumstances.

    If your software requires communication with a license server outside of TSUBAME (e.g., in a laboratory), please first confirm that the license server can be reached from a network outside of both TSUBAME and the license server's own network, and then contact us with the following information.

    • Global IP address of the license server
    • Port number of the license server (or all ports if there are more than one)
    • IP address of the host where the communication test was performed



  • Please consider the following topics to improve the performance of data transfer between TSUBAME and external computers.

    Pack the files to appropriate size

    Large numbers of small files reduce the transfer speed. Pack such files into archives of about 1GB each using the tar command, as in the example below.
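
    A minimal hedged example (the directory and archive names are illustrative):

    $ tar czf part-001.tar.gz data/part-001/    # pack one batch of small files into a single archive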

    Change transfer protocols

    If you do not get enough speed with scp / sftp, consider using the rsync or CIFS (Tokyo Tech users only) protocols.

    For more details on the CIFS connection, please refer to the "Storage Service (CIFS)" section of the TSUBAME User's Guide.

    Remove the bottleneck on the network route

    • If you have old LAN cables (CAT-3 or CAT-5 (not CAT-5e)), switching hubs, or routers whose link speed is lower than 1000 Mbps, replace them with newer ones.

    • When using a router (WiFi router, NAT router, broadband router, etc.), connect your computer to the external network (in Tokyo Tech, IP address starting with 131.112 or 172.16-31) directly.

    For details of the network at Tokyo Tech, please contact the network administrator of the laboratory. If you are not sure, please contact the branch manager for each building or organization.

    (Tokyo Tech Users Only) Use the iMac terminal of Education Computer Systems

    If it is difficult to change the network configuration, you can bring your HDD to the GSIC and connect it to the iMac terminal of the Education Computer Systems in the exercise room to transfer the data. Please check the opening hours.

    Terminal room location and Opening Hours (in Japanese)

    Hardware (in Japanese)




  • Here is an FAQ on common Linux errors.
    For details on how to use the commands described here, please check the man command, etc.

    1.No such file or directory

    The required file or directory does not exist.
    It occurs when a nonexistent file or directory name is specified, when there is a typo, or when the path is specified incorrectly.
    Also, depending on the application, it may occur when the line feed code is CR+LF (Windows format).

    Measures
    Please review the file and directory names carefully.
    For the newline character issue, please also check the FAQ "The job status is "Eqw" and it is not executed.".


    There is a related error as follows:
    error while loading shared libraries: ****.so: cannot open shared object file: No such file or directory
    This error occurs when a library required by the program does not exist or cannot be read.

    Measures
    Please check the required libraries with the ldd command, as shown below.
    Possible remedies include setting the environment variable LD_LIBRARY_PATH, explicitly specifying the library path at compile time, and so on.
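
    A minimal hedged example (a.out is an illustrative binary name); libraries that cannot be resolved are reported as "not found":

    $ ldd ./a.out | grep "not found"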


    2.command not found

    The command you entered cannot be found.
    This happens when the environment variable PATH is not set correctly or when the command does not exist.
    On TSUBAME, it typically occurs when the module command is used but the /etc/profile.d/modules.sh file has not been loaded.

    Measures
    If the module command is not available, execute the following command beforehand.

    $ . /etc/profile.d/modules.sh

    If the software is installed by yourself, check the environment variable PATH.


    3.Permission denied

    You are not authorized to perform the operation you attempted.
    On Linux, user and group permissions are set on a per-file / per-directory basis.
    Check the permissions of the file or directory you want to read, write, or execute with the following command.
    (This example checks the file hoge.)

    $ ls -l hoge

    Measures
    If you are trying to create files in system directories such as /etc or /lib, create them in your own directories instead.
    If the error occurs in a user directory such as a group disk, check the permissions and correct them as needed.


    4.Disk quota exceeded

    Please check the FAQ "How to solve "Disk quota exceeded" error".


    5. Out Of Memory 

    This error occurs when memory runs out.

    Measures
    Change the resource type to one with more memory capacity.
    Distribute the memory usage across multiple nodes with MPI, etc.

    Related FAQ: Check the detail of an error message printed in the log file


    Related FAQ
    "Disk quta exceeded" error is output
    The error when executing the qrsh command
    Check the detail of an error message printed the log file
    "Warning: Permanently added ECDSA host key for IP address 'XXX.XXX.XXX.XXX' to the list of known hosts." in the error log
    The range of support by T3 Helpdesk about the program error such as segmentation fault
    Error handling for each ISV application




  • The following SSH clients on Windows are available to connect TSUBAME.

    OpenSSH client (Windows 10 functionality)

    OpenSSH client can be installed via [Apps]-[Manage optional features] section in Settings app.

    The ssh, ssh-keygen, etc. commands (same as on Linux) are available after the installation.

    PuTTY

    Official Site

    PuTTY is a free SSH Client software.
    Please refer to this article to generate the SSH key.

    Windows Subsystem for Linux

    Linux environments can be set up on Windows by downloading a Linux distribution (such as Ubuntu or openSUSE) from the Windows 10 Store.

    The ssh and ssh-keygen commands are available in that environment.

     

    Cygwin

    Official Site

    Cygwin provides a pseudo-Linux environment on Windows.

    The ssh and ssh-keygen commands are available in that environment.




  • The login nodes have a limit of 50 processes per user.
    Therefore, if you create processes exceeding the limit, you will get an error like this.
    For more information, please refer to "Please refrain from occupying the CPU on the login nodes".




  • Note: This article is about group disks (/gs/hsX/); do not run the following samples in your home directory.

    Users are not allowed to change the owner of their files. Therefore, change the group permissions so that the files can be read and written by group members. The points are:

    • Change permissions for all files and directories below the directory, not just the top-level directory.
    • Add read (r) as well as write (w) permissions to the files. Without write (w) permission, they cannot be deleted later.
    • Directories need not only read (r) but also write (w) and execute (x) permissions. A directory cannot be accessed without execute (x) permission.
       

     Some example commands are shown below. Depending on the original permissions of the file, some errors may occur, in which case, try re-running the command until the output no longer changes.

    Find your own directories under /gs/hsX/tgX-XXXXXX/ and make them readable and writable by group members.

    find /gs/hsX/tgX-XXXXXX/ -type d -user $USER ! -perm -2770 -print0 | xargs -r0 chmod -v ug+rwx,g+s

    Find your own files under /gs/hsX/tgX-XXXXXX/ and make them readable and writable by group members.

    find /gs/hsX/tgX-XXXXXX/ -type f -user $USER ! -perm -660 -print0 | xargs -r0 chmod -v ug+rw

    Find your own files under /gs/hsX/tgX-XXXXXX/ and match the ownership group to the TSUBAME group.

    find /gs/hsX/tgX-XXXXXX/ -user $USER ! -group (TSUBAME group name) -print0 | xargs -r0 chgrp -v (TSUBAME group name)

     

    We have prepared a script that automatically executes the above commands. Please note that we do not guarantee the operation of this script, and you use it at your own risk.

    module load takeovertool
    cd /gs/hsX/tgX-XXXXXX
    fixperm

     




  • The advantage of the rsync command is that it transfers only the differences. If the transfer is interrupted for any reason, you can simply start it again, and if you run it again after some time, only the files whose content has changed are transferred. Data deleted from the source can also be deleted at the destination for complete synchronization (see the --delete example below).

     An example command is shown below. It's a good idea to check the log or run it multiple times, in case the command fails along the way.

     

    Synchronize TSUBAME with the data of the terminal on your local PC.

    rsync -auv (source directory) (your login name)@login.t3.gsic.titech.ac.jp:(full path of the destination directory)

    Synchronize TSUBAME data to the terminal on your local PC.

    rsync -auv (your login name)@login.t3.gsic.titech.ac.jp:(full path of the source directory) (destination directory)
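
    If you also want files deleted from the source to be removed at the destination (the complete synchronization mentioned above), the --delete option can be added; use it with care, since it removes files at the destination.

    rsync -auv --delete (source directory) (your login name)@login.t3.gsic.titech.ac.jp:(full path of the destination directory)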



  • Please refer to the following page for an example of how to write an acknowledgement.
    Please note that this is just an example, and you may adjust the description to match the description of other supercomputers or research funds.

    Please mention TSUBAME usage in acknowledgement of publications

    In addition, please submit reports on your use of TSUBAME, such as bibliographic information, through TSUBAME Portal to help us understand how TSUBAME is being used.
    Please refer to the following User's Guide for how to submit usage reports.

    TSUBAME portal User's Guide 12.Management of TSUBAME usage report




  • As the number of files per directory increases, the processing time for metadata operations (file creation, deletion, and opening) on the files under the directory increases, or the file system may generate errors, making it impossible to create files.

    Even when using a group disk, it is recommended to arrange files in a hierarchical manner so that there are no more than 100,000 files per directory.

    Example:

     

    • NG: 000000.dat ~ 999999.dat
      • If a million files are placed flat in one directory, the load during file access will increase, causing performance degradation and failure.
    • OK: 000/000000.dat ~ 000/000999.dat, 001/001000.dat ~ 001/001999.dat, …
      • The hierarchical arrangement minimizes the cost of file system operations by limiting the number of files per directory to about 1000.
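
    A hedged sketch of reorganizing such flat files into subdirectories of about 1000 files each, following the naming of the example above (run it in the directory containing the .dat files; adjust the pattern to your actual file names):

    for f in ??????.dat; do
      d=${f:0:3}                      # the first three digits select the subdirectory (000, 001, ...)
      mkdir -p "$d" && mv "$f" "$d/"
    done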


Account


  • If the target user can not be searched on the user addition screen of the portal, there are the following reasons.

    • TSUBAME account application has not been completed
      A TSUBAME account is not created automatically upon enrollment; each user needs to apply for an account via the Tokyo Tech Portal. Until the account application is completed, the user will not appear in the account search by the group administrator.
      In addition, if the applicant is an access card holder, the account application from the TSUBAME portal is only valid for those with a status equivalent to Tokyo Tech staff/students. Also, because such applications are not approved automatically, a document certifying your identity must be sent, so please complete the procedure by referring to the explanation on that page.
      For details of account application, please refer to "Getting accounts".

    • Attribute of user and attribute of TSUBAME group do not match

      When a TSUBAME group is created, the group division determines what kind of users can belong to it, and users outside that condition cannot be added. For example, external users cannot be added to a group of the division "Only teachers and students at Tokyo Institute of Technology". In such a case, a message like "This user is not eligible to participate" is displayed. Because the group division cannot be changed afterwards, please create a group with an appropriate division that allows that user to join.
      In addition, if an existing account gains a new entitlement, for example when an HPCI user newly starts joint research with Tokyo Institute of Technology under competitive funding, specifying the existing account name when applying under the new qualification makes it possible to belong to that group.




  • If your login name needs to be changed, for example due to advancing to a higher degree program:
    GSIC uses the information as of the first day of each month to make a batch change around the 10th of the month.

    Therefore, there is no need for you to apply for or re-register for a new TSUBAME account.
    Until the change, please use your old TSUBAME login name.

    Please note that the timing of the change may not coincide with the timing of the Tokyo Tech IC card, which is the source of the information, and it may take a long time for the change to be completed.

    If you wish to change your login name as soon as possible for some special reason, please contact us using the contact form.

    Related Link:

     How to get an account/account login name
     https://www.t3.gsic.titech.ac.jp/getting-account#login_name

     

     



TSUBAME point and TSUBAME portal


  • - If JavaScript is disabled, please enable it and try again.
    - In some environments using Internet Explorer, there was a problem in which the TSUBAME portal account application screen was not displayed properly. This was fixed on August 3, 2017, but please let us know if you still have any problems.
    - If it does not work with your browser, please try another browser such as Edge, Chrome, Firefox, or Safari.




  • If clicking the URL in the mail titled "TSUBAME 3.0 TSUBAME group user invitation" only displays the login screen and you cannot join the group, please check the following.

    - Log in first, and with the logged-in browser still open, click the mail link again.
    - Depending on the mailer, the trailing "=" character may not be included in the link. In that case, copy and paste the URL into the browser including the "=".
    - The invitation URL expires after one week. If it has expired, request the group invitation again.




  • It may take up to 5 minutes for information to be reflected on the login nodes after adding users or groups from the portal system. If the information has not been updated on the login node, please wait about 5 minutes and try again.




  • TSUBAME points are consumed when you submit a job or purchase group disk capacity.

    1. How to check points
    You can check the following information on the TSUBAME3 portal and from the command line.
    On the portal, you can check the following:

    • The number of points consumed by each job submitted to the batch job scheduler
    • The points consumed when purchasing group disk capacity
       

    For the command line, please check the following FAQ:
    How to check TSUBAME points, group disk usage, home directory usage

    2. Consumption of points
    Points will be consumed based on Article 13 here.
    The points consumed depend on the user and usage.
    Please confirm here for details.


    3. Purchase points
    Points can be purchased at the TSUBAME 3 portal.
    For details, please check TSUBAME Portal User's Guide.

    4. Point expiration date
    Points purchased within the fiscal year will expire at the end of that fiscal year. For details, please refer to the following (in Japanese).

    8.1.     Points available for purchase and validity period
    https://helpdesk.t3.gsic.titech.ac.jp/manuals/portal.en/point/#point_expiration

    課金等に関する取扱い「第6条 ポイントは,購入した年度に限り有効とする」("Treatment concerning billing. 6. points are valid only for purchased fiscal year")
    http://www.somuka.titech.ac.jp/reiki_int/reiki_honbun/x385RG00001339.html




  • "予算責任者の追認待ち" is a state waiting for approval from the budget manager of applied payment code.
    Please ask the budget manager to approve in TSUBAME portal.




  • TSUBAME points, group disk usage, and home directory usage can be checked with a command.
    The command "t3-user-info" can be used only on the login nodes (login0, login1).
    It cannot be executed on compute nodes.

    Specify the appropriate command and sub-command depending on what you want to check.

    GSICUSER@login1:~> t3-user-info 
    usage: t3-user-info [command] [sub command] [option]
        [command] [sub_command] [option]
        group      point        : Output the points of the Tsubame Group.
                                : Without option  : show all belonging groups.
                                [-g] <group name> : extract the specified group.
        disk       group        : Output the purchase amount and use amount of Tsubame Group disk.
                                : Without option  : show all belonging groups.
                                [-g] <group name> : extract the specified group.
                   home         : Output the use limit and used of the home disk.
    • Command examples for checking group points.

    In the following examples, it is assumed that the user "GSICUSER", who belongs to the groups "GSIC_GROUP" and "GSIC", executes the commands. Use your own user name and group names when actually executing the commands.

    1. Checking the status of all groups you belong to

    You can see that the TSUBAME point balances of the groups "GSIC_GROUP" and "GSIC" are 17218631 and 995680000, and their deposits are 0 and 124000, respectively.

    GSICUSER@login1:~> t3-user-info group point

    gid     group_name                        deposit      balance
    --------------------------------------------------------------
    0007    GSIC_GROUP                              0    17218631
    0451    GSIC                                124000  995680000

    2. Checking the status of a specified group

    You can see that the TSUBAME point balance of the specified group GSIC_GROUP is 17218631 and its deposit is 0.

    GSICUSER@login1:~> t3-user-info group point -g GSIC_GROUP
    gid     group_name                        deposit      balance
    --------------------------------------------------------------
    0007    GSIC_GROUP                              0    17218631
    • To check the usage status of the group disk

    For the specified group GSIC_GROUP, only /gs/hs1 has been purchased; about 60 TB of the 100 TB quota is used,
    and for the inode limit, about 7.5 million of the 200 million inode quota are used.

    GSICUSER@login1:~> t3-user-info disk group -g GSIC_GROUP
                                                      /gs/hs0                                 /gs/hs1                                 /gs/hs2                
      gid group_name                 size(TB) quota(TB)   file(M)  quota(M)  size(TB) quota(TB)   file(M)  quota(M)  size(TB) quota(TB)   file(M)  quota(M)
    --------------------------------------------------------------------------------------------------------------------------------------------------
    0007 GSIC_GROUP                0.00         0      0.00         0     59.78       100      7.50       200      0.00         0      0.00         0
    • When checking the use status of the home directory

     You can see that 7 GB of the 25 GB quota is used,
    and that approximately 100,000 of the 2 million inode quota are used.

    GSICUSER@login1:~> t3-user-info disk home
      uid name         b_size(GB) b_quota(GB)    i_files    i_quota
    ---------------------------------------------------------------
     0177 GSICUSER              7          25     101446    2000000



  • TSUBAME3 collects TSUBAME points as "temporary points" that are expected to be needed when submitting a job, and settles the actual consumption points after the job is finished.
    The timing of TSUBAME point consumption and return is as follows.

    1. When a job is submitted or qrsh command is executed
      The maximum TSUBAME points that a job can consume are collected as temporary points.
    2. When a job is terminated
      The timing of settling the provisional points differs among the following three cases:
      1. When a job finishes as usual, or when a job is canceled by the qdel command after it has started running
        Recalculate the consumption points based on the time actually used by the job, and return the difference immediately.
      2. When a job is canceled by the qdel command before execution starts, or when the qrsh command fails to start
        The provisional points remain collected for a while and are automatically returned within three days of the cancellation or execution failure.
        Please contact us if the display on the TSUBAME portal remains "Processing" for more than 3 days.
      3. When a job is deleted by the scheduler or the system administrator due to a system error
        When a job is deleted by the scheduler or the system administrator due to a system error, the points are basically settled based on the time when the job was deleted, as in case 1.
        If the cancellation is clearly due to system reasons, we will compensate you for the wasted TSUBAME points if you contact us.

    If you have any questions about TSUBAME points, please contact us using the inquiry form with the following information.

    • The TSUBAME3 group that submitted the job
    • User who submitted the job
    • Job ID

    Reference:
     FAQ: The error when executing the qrsh command




  • Some links automatically generated by the TSUBAME Portal, such as links to approval pages of TSUBAME group invitations, may not function properly depending on your browser or mailer environment. Specifically,

    • The login page of the TSUBAME Portal is displayed instead of being redirected to the correct page.
    • The message like "The referenced URL has expired" is displayed.

    If you see this kind of symptom, copy the address sent by e-mail, paste it into the address bar of the browser window in which you are already logged in to the TSUBAME Portal via the Tokyo Tech Portal, and then press the Enter key.

    If you still think the system is not working properly, please contact us using the contact form with the following information.

    • Your TSUBAME account login name
    • The date and time of the email sent to you
    • The approximate date and time of the first click on the link
    • Other related TSUBAME Group information, etc.



  • When you apply for a payment code on the TSUBAME Portal, you will receive a rejection notice from the TSUBAME Portal if your application is incomplete.
    Please check the reason for the rejection in the "Comments from the system administrator" section of the notice, and make corrections as necessary.
    The following is a list of typical reasons for rejection and what you need to check when you resubmit.

    1. The existence of the corresponding budget could not be confirmed (該当する予算の存在を確認できませんでした)

    The above message is sent when the person in charge has checked the existence of the budget code and budget name pair entered in the payment code application on the financial accounting system, but could not find the corresponding budget, including the possibility of a typo.

    If the budget is for external funds or Grant-in-Aid for Scientific Research, please make sure that the budget code for the current fiscal year has already been created in the Request for Supplies system.
    Also, please make sure that the budget code, budget name, budget department, budget category, and person in charge of the budget are entered correctly, because if they are not, we will not be able to search them.
    In particular, the budget code (32 digits) and budget name have a large number of digits in the financial accounting system from FY2020, and some applications may be missing some of them when transcribing, so we strongly recommend that you follow the procedure below to obtain them.

    • Log in to the New Goods Request System
    • Select Budget Management
    • Select Check Budget Execution Status
    • Select Create CSV
    • Post the "Budget Code" (32 digits) and "Budget Name" in the output CSV file. 

    2. I was able to confirm that I had the budget code, but XXX was different. (予算コードがあることまでは確認できましたが、XXXが異なっていました)

    The above message is sent when a budget that seems to be applicable is found based on the information entered in the payment code application, but some items do not match the registered contents on the financial accounting system and are incorrect beyond the scope that can be corrected at the discretion of the person in charge as a typo.

    Please refer to the previous section, check the registration details displayed in the requisition system, and then reapply with the correct details.
    If the XXX part is the name of the person responsible for the budget, please refer to the next section instead.

    3. I was able to confirm that there was a budget code, but the name of the person responsible for the budget was different. (予算コードがあることまでは確認できましたが、予算責任者氏名が異なっていました)

    The above message will be sent when the budget manager's account information on the payment code application does not match the budget manager's information on the financial accounting system.

    TSUBAME requires budget managers, including those who do not directly use TSUBAME's computing resources, to obtain a TSUBAME account, agree to the Terms of Service, and fill out their account information in order to ensure that we can communicate with them about their billing needs.
    In addition, if the user of the payment code (the main administrator of the TSUBAME group) is different from the budget manager, the budget manager is required to approve the application for the payment code on the TSUBAME portal in order to confirm that the use of the budget is allowed in advance.

    4. Rejected because the expense category is not "Other" (費目がその他ではないため、却下します) (Grant-in-Aid for Scientific Research)

    You will receive the above message when you specify a budget code for an expense category other than "Other" of the Grant-in-Aid for Scientific Research (goods, travel, and personnel expenses) in the payment code application.

    There are four types of budget codes for Grants-in-Aid for Scientific Research, and TSUBAME's computer usage fees are classified as "Other" expenses. (This corresponds to "Facility and equipment usage fees within the research institution" in the table of expense categories common to all government ministries and agencies.)
    For this reason, applications for payment codes for expenses other than "Other" will be rejected as an incorrect expense category.

    5. If you are not a faculty or staff member, you must be responsible for your own budget. (教職員以外の方は、自身が予算責任者となっている申請に限らせていただいております)

    The above message is sent to non-faculty members when they apply for a payment code for a budget for which they are not the budget manager.

    Non-faculty members are not allowed to be the payer of TSUBAME for budgets other than those for which they are the budget manager (e.g., Grant-in-Aid for Scientific Research by JSPS Postdoctoral Fellows).
    If you need to use such a budget, please ask an appropriate faculty or staff member to be the payer and reapply for the payment code.
    If you have already created a TSUBAME group, please change the group administrator (main) if necessary.

    6. The period in which new claims can be generated has already passed. (既に新規請求事項発生可能期間を過ぎています)

    The above message will be sent when you cannot purchase new TSUBAME points for your requested budget.

    In TSUBAME, purchase operations on the TSUBAME portal are grouped together on a monthly basis, and the budget is transferred to the next month or later.
    We have established a period of availability for payment codes so that this transfer process can be done within the budget's available period (within the fiscal year and research period), but we will not accept applications for budgets that have already passed this period and cannot be used to purchase points.
    Please note that during the period from January to March, only corporate operating expenses and scholarship donations can be registered and used for payment codes, but the amount used from January to March will be charged to the same budget for the following year.

    7. The system administrator has already approved the application for the same fiscal year and the same budget code. (既に同じ利用年度かつ同じ予算コードの申請をシステム管理者が承認済みです)

    You will receive the above message when a payment code has already been approved for the same budget.
    Please use the approved payment code.



Job Execution (Scheduler)


  • The correspondence varies depending on the error.

    • qsub: Unknown option

    The "qsub: Unknown option" error also occurs when there is an error in the line description starting with "#$" in the job script, besides the option of the qsub command. A common mistake is putting a space before and after the character "=". Please try deleting the space around "=".

    • Job is rejected, h_rt can not be longer than 10 mins with this group

    If you do not specify a TSUBAME group with the -g option or newgrp, the job is treated as a "Trial run".
    A "Trial run" is limited to 10 minutes, and this error occurs when the h_rt option specifies more than 10 minutes.

    For a "Trial run", set the h_rt option to 0:10:0 or less.
    If you want to run something other than a "Trial run", specify the TSUBAME group with the -g option or newgrp, as in the example below.
    * In this case, please confirm that you belong to the appropriate TSUBAME group and that the group has points.
    For TSUBAME groups, please check the TSUBAME portal usage guidance.
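
    A hedged example of the two cases (the group name tga-example and the script name job.sh are illustrative):

    $ qsub -g tga-example -l h_rt=1:00:00 job.sh    # normal run, charged to the specified TSUBAME group
    $ qsub -l h_rt=0:10:00 job.sh                   # trial run, limited to 10 minutes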

    • Unable to run job: Job is rejected. h_rt must be specified.

    The job cannot be executed because the h_rt option is not specified. Please set the time and submit again.

    • Unable to run job: the job duration is longer than duration of the advance reservation id AR-ID.

    This error occurs because you specified a time longer than the reserved duration.
    Please refer to the FAQ below regarding reservations.

    Related FAQ
    About specification of batch job scheduler

     

    • error: commlib error: can't set CA chain 

    This error occurs when the automatically generated certificate files in your home directory, which are required for job submission, do not exist or are broken.

    If you face this error, execute the following commands and then log in to TSUBAME again so that the files are regenerated.

    $ cd $HOME

    $ mv .sge .sge.back

     




  • There is a possibility of a system failure, but it may also be due to a mistake in the job script.

    Please confirm with the following command.

    $ qstat -j <job ID> | grep  error

    Please check the following points. After checking, delete the jobs in "Eqw" status with the qdel command.

    Example)
    When there is a problem with file permission.

    error reason    1:    time of occurrence [5226:17074]: error: can't open stdout output file "<file of the cause>": Permission denied


    The following appears when there is a line feed code problem, the specified directory does not exist, or the job script is invalid:

    error reason    1:          time of occurrence [5378:990988]: execvp(/var/spool/uge/<hostname>/job_scripts/<jobID>, "/var/spool/uge/<hostname>/job_scripts/<jobID>") failed: No such file or directory

    1. The line feed code of the job script is not in UNIX format.

    This also occurs if the line feed code was set to CR + LF on Windows, so please check the actual script as well.

    You can confirm with the file command.

     $ file <script file name>

    #Output in case of  CR + LF

    <Script file name>: ASCII text, with CRLF line terminators

    #Output in case of  LF

    <Script file name>: ASCII text

    You can also confirm it with the cat command.

    $ cat -e <script file name>

    #Output in case of CR + LF

    The end of line is displayed as ^M$

    #!/bin/bash^M$
    #$ -cwd^M$
    #$ -l f_node=1^M$
    #$ -l h_rt=0:10:00^M$
    . /etc/profile.d/modules.sh^M$
    module load intel^M$

    #Output in case of LF

    The end of line is displayed as $

    #!/bin/bash$
    #$ -cwd$
    #$ -l f_node=1$
    #$ -l h_rt=0:10:00$
    . /etc/profile.d/modules.sh$
    module load intel$

    • Do not edit scripts on Windows.
    • If you must edit a script on Windows, use an editor that can handle line feed codes and check the line feed code.
    • Correct the line feed code to LF with the nkf command.
      • If the line feed code is something other than LF, execute the command below.

        $ nkf -Lu file1.sh > file2.sh 

    Note: file1.sh is the original file (before conversion) and file2.sh is the converted file. The two file names must be different; if they are identical, the file will be corrupted.

    2. The execution directory does not exist

    This occurs when the execution directory described in the job script does not exist.

    Please confirm with the following command.

    $ qstat -j <job ID> | grep error

    error reason 1: 09/13/2017 12:00:00 [2222:19999]: error: can't chdir to /gs/hs0/test-g/user00/no-dir: No such file or directory

    3. The job script starts the program as a background job (with "&")

    The program will not be executed if it is started as a background job (with "&") and the script ends without a wait command, as shown below.

    Example)

    #!/bin/sh
    #$ -cwd
    #$ -l f_node=1
    #$ -l h_rt=1:00:00
    #$ -N test

    . /etc/profile.d/modules.sh

    module load intel

    ./a.out &


    4. The file does not have the required permissions

    Please set permissions appropriately.

    Example) Grant read and execute permission to myself

    $ chmod u+rx script_file

    5. Disk Quota

    Please check the group disk quota.
    It is about 2 million inodes per 1 TB.

    Please refer to FAQ below.

    Checking the usage of group disks with command

     




  • See "How to terminate the programs executed accidentally" for the deletion of the processes running on the login nodes.

    When the job-ID is known

    Terminate the job with qdel command as follows.

    $ qdel job-ID

    If job-ID is 10056, type

    $ qdel 10056

    When the job-ID is unknown

    Confirm the job-ID with the qstat command; the user's unfinished jobs are displayed.

    Example: When user GSIC checks the unfinished jobs, they are displayed as follows.

    $ qstat
    job-ID  prior  name user  state submit/start at     queue jclass slots ja-task-ID 
    ------------------------------------------------------------------------------------------
    10053 0.555 ts1     GSIC   r     08/28/2017 22:53:44 all.q          28
    10054 0.555 ts2     GSIC  qw     08/28/2017 22:53:44 all.q         112
    10055 0.555 ts3     GSIC  hqw    08/28/2017 22:53:45 all.q          56
    10056 0.555 eq1     GSIC  Eqw    08/28/2017 22:58:42 all.q           7

    TIPS. Status of jobs

    state  Description
    r      Running
    qw     Waiting in the queue
    hqw    Waiting for other jobs to finish because of a dependency
    Eqw    Error for some reason

    Delete jobs with Eqw by yourself. See here for the cause of it.
    Refer to here if you want to change the status of a job to hqw.




  • Please check the stacked line chart on the Job monitoring page.

    To check whether there are free compute nodes, see the green area of the chart.




  • TSUBAME 3.0 provides the following scratch area.
    For details, please refer to Storage use on Compute Nodes in the TSUBAME3.0 User's Guide 

    1. Local scratch area
    The directory pointed to by the environment variable $TMPDIR, allocated on each compute node, is the local scratch area.

    $TMPDIR is usually a unique directory for each job under /scr.

    You cannot write directly under /scr.
     

    2. Shared scratch area
    It is available only for batch jobs using f_node (resource type F). Please specify "#$ -v USE_BEEOND=1" in the job script.
    The /beeond directory is then allocated.

    3. /tmp directory
    There is a 2 GB capacity limit for the /tmp directory.
    Creating large scratch files there may cause problems such as the program hanging.
    Please consider using the scratch areas described in 1. and 2. instead; see the example below.
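
    The following is a minimal sketch of a job script that uses the local scratch area; the resource values, file names, and program name are only examples.

    #!/bin/sh
    #$ -cwd
    #$ -l f_node=1
    #$ -l h_rt=1:00:00

    # copy the input to the job-local scratch directory and work there
    cp ~/input.dat $TMPDIR/
    cd $TMPDIR
    ~/a.out input.dat

    # files under $TMPDIR are usually removed when the job ends,
    # so copy back any results you need
    cp result.dat ~/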




  • SSH login to compute nodes is possible only with f_node.
    Please use f_node when executing applications that use SSH for MPI communication.

    For details, please check section 5.7 "SSH login" of the TSUBAME 3.0 User's Guide.
     




  • If you want to execute batch job A-2, as soon as the batch job named A-1 finishes, please use the -hold_jid option to submit the job as shown below.

    $ qsub -N A-1 MM.sh
    $ qsub -N A-2 -hold_jid A-1 MD.sh

    If you issue the qstat command after submission, the status will be "hqw".

     




  • If you want to run multiple calculations in one batch job, for example executing the four commands exec1, exec2, exec3, and exec4 at once, write the batch script as follows.

    #!/bin/sh
    #$ -cwd
    #$ -l f_node=1
    #$ -l h_rt=1:00:00
    . /etc/profile.d/modules.sh
    module load cuda/8.0.61
    module load intel/17.0.4.196

    exec1 &
    exec2 &
    exec3 &
    exec4 &
    wait

    The above is only an example.

    If you want to execute programs located in different directories at once, you need to specify each executable file with its path. For example, to execute a.out in folder1 of the home directory directly, specify it as below.

    ~/folder1/a.out &

    If you need to move to the directory of the executable file and execute it there:

    cd ~/folder1
    ./a.out &

    Or,

    cd ~/folder1 ; ./a.out &

    If the last line of the script file ends with "&", the job will not run.

    Do not forget to write the last wait command of the script.
     




  • If you type the commands as described in the manual directly into the shell, the calculation starts on the login node without the qsub command ever being executed.

    GSICUSER@login1:~> #!/bin/bash
    GSICUSER@login1:~> #$ -cwd
    GSICUSER@login1:~> #$ -l f_node=2
    GSICUSER@login1:~> #$ -l h_rt=0:30:0
    GSICUSER@login1:~> . /etc/profile.d/modules.sh
    GSICUSER@login1:~> module load matlab/R2017a
    GSICUSER@login1:~> matlab -nodisplay -r AlignMultipleSequencesExample

    This is because you are directly executing commands on the shell that you need to write in the batch script.
    Instead of executing them directly, create a batch script file and specify it with the qsub command.
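
    For example, the lines above would be saved into a script file and submitted with qsub; this is only a sketch, and the file name job.sh and the group name are placeholders.

    job.sh:

    #!/bin/bash
    #$ -cwd
    #$ -l f_node=2
    #$ -l h_rt=0:30:0
    . /etc/profile.d/modules.sh
    module load matlab/R2017a
    matlab -nodisplay -r AlignMultipleSequencesExample

    Submission:

    GSICUSER@login1:~> qsub -g GSICGROUP job.sh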

    If you are not familiar with terms such as batch script and shell,  please see "1.Beginners of UNIX/LINUX" at "I'm a beginner, I don't know what to do."




  • This section explains the errors that occur when running qrsh.

    1.Your "qrsh" request could not be scheduled, try again later.
    The error above indicates that there are no available vacant resource for interactive job. 
    Please retry it after the resource become available.

    See "I'd like to check the congestion status of compute node" for the status of the use of compute node.

    2.Job is rejected. You do NOT have enough point to finish this job
    This error indicates that there is no TSUBAME points required for assuring the node.
    Please check the point balance.

    Reference: FAQ "How long will it take for TSUBAME points to be returned?*
     

    3.Unable to run job: unable to send message to qmaster using port 6444 on host "jobconX": got send error.
    Exiting.

    This error occurs when the UGE server side is under heavy load.
    Please wait a while and try again.

     




  • The following message may be printed to the log file in some cases.

    /var/spool/uge/hostname/job_scripts/JOB-ID: line XX: Process-ID Killed Program_Name

    In this case, type the qacct command to check the job in detail.

    $ qacct -j JOB-ID

    The following is an output example of the qacct command. (Excerpt)

    ==============================================================
    

    1.Example when the memory resource is exceeded

    $ qacct -j 4500000

    qname        all.q               
    hostname     r0i0n0              
    group        GSIC          
    owner        GSICUSER00            
    project      NONE                
    department   defaultdepartment   
    jobname      SAMPLE.sh
    jobnumber    4500000             
    taskid       undefined
    account      0 0 1 0 0 0 3600 0 0 0 0 0 0
    priority     0      
    cwd          /path-to-current
    submit_host  login0 or login1    
    submit_cmd   qsub -A GSICGROUP SAMPLE.sh
    qsub_time    %M/%D/%Y %H:%M:%S.%3N
    start_time   %M/%D/%Y %H:%M:%S.%3N
    end_time     %M/%D/%Y %H:%M:%S.%3N
    granted_pe   mpi_q_node          
    slots        7                   
    failed       0    
    deleted_by   NONE
    exit_status  137                              
    maxvmem      120.000G
    maxrss       0.000
    maxpss       0.000
    arid         undefined
    jc_name      NONE

    Pay attention to exit_status, account, and maxvmem in this example.
    exit_status gives the cause of the error as an exit code. exit_status 137 means 128 + 9 (SIGKILL), but since this status occurs for various problems, the cause cannot be determined from it alone.

    Then check granted_pe, account, and maxvmem.

    The "0 0 1 0 0 0" part of account shows which resource type was used and how much.
    The space-separated fields correspond to the resource types f_node, h_node, q_node, s_core, q_core, and s_gpu, and the number indicates the resource amount. In this example, one q_node was used.
    maxvmem shows the maximum memory usage.
    It can be estimated that the job was about to use 120 GB of memory, although only up to 60 GB is available with q_node according to the User's Guide.
     

    In TSUBAME, a job is killed automatically if it uses more memory than assigned.


    2.Example when the reserved time is exceeded

    $ qacct -j 50000000
    qname        all.q               
    hostname     r0i0n0              
    group        GSIC          
    owner        GSICUSER00            
    project      NONE                
    department   defaultdepartment   
    jobname      SAMPLE.sh
    jobnumber    50000000             
    taskid       undefined
    account      0 0 1 0 0 0 600 0 0 0 0 0 0
    priority     0      
    cwd          /path-to-current
    submit_host  login0 or login1    
    submit_cmd   qsub -A GSICGROUP SAMPLE.sh
    qsub_time    %M/%D/%Y %H:%M:%S.%3N
    start_time   %M/%D/%Y %H:%M:%S.%3N
    end_time     %M/%D/%Y %H:%M:%S.%3N
    granted_pe   mpi_f_node          
    slots        7                   
    failed       0    
    deleted_by   NONE
    exit_status  137
    wallclock    614.711                              
    maxvmem      12.000G
    maxrss       0.000
    maxpss       0.000
    arid         undefined
    jc_name      NONE

    Pay attention to exit_status and wallclock in this example.
    exit_status gives the cause of the error as an exit code. exit_status 137 means 128 + 9 (SIGKILL), but since this status occurs for various problems, the cause cannot be determined from it alone.

    So focus on account and wallclock.
    The seventh space-separated field of account indicates the time (in seconds) for which the resources were secured.
    In this example it is 600 seconds.

    wallclock shows the elapsed time, which is about 614 seconds in this example.

    Since the calculation did not finish within the time for which the resources were secured, it can be inferred that the job was forcibly terminated.



    Related URL: About common errors in Linux




  • This FAQ explains how to forward X with qrsh.
    With this method, you can use GUI applications on resource types other than f_node.
    Please follow the procedure below.

    (Preliminary Work)
    Enable X forwarding and ssh to the login node.
    Reference: FAQ "X application (GUI) doesn't work" section 1 and 2.

    1.After logging in to the login node, execute the following command.
    In the example below, GSICUSER uses one s_core from login1 for 10 minutes.
    Please change the group, resource type, and time according to what you want to use.

    With the scheduler update implemented in April 2020, you no longer need to specify -pty yes -display "$DISPLAY" -v TERM /bin/bash when executing qrsh.

    GSICUSER@login1:~> qrsh -g GSICGROUP -l s_core=1,h_rt=0:10:00

    2.Run the X application you want to use.
    The following is an example with imagemagick.

    GSICUSER@r1i6n3:~> . /etc/profile.d/modules.sh
    GSICUSER@r1i6n3:~> module load imagemagick
    GSICUSER@r1i6n3:~> display


    Notes
    - Depending on the GUI application, some applications cannot be started or cannot compute because of memory or SSH restrictions.
    - For memory, please use an appropriate resource type.
    - Fluent cannot be launched due to the SSH restriction. To avoid this, use the -ncheck option (not supported by the manufacturer).
    - Schrodinger can be launched but cannot compute due to the SSH restriction. You can use it on f_node only.




  • This message means that the system added the certificate of IP address XXX.XXX.XXX.XXX to the known_hosts file (the list of SSH server certificates). It appears when a node is connected to for the first time, or when the certificate of a previously connected host has changed. This is normal behavior; it does not affect the calculation results and can be ignored.




  •  

    This section explains the error messages that occur after executing the qsub command and their remedies.

     

    Unable to run job: Job is rejected because too few parameters are specified.

    A required parameter is not specified. You need to specify the resource type, the number of resources, and the execution time.

     

    qsub: Unknown option

    There is an error in the qsub option specification. Please refer to this.

     

    Unable to run job: Job is rejected. core must be between 1 and 2.

    Three or more resources per job cannot be used for a trial run. Specify 1 or 2 as the number of resources.

     

    Unable to run job: Job is rejected, h_rt can not be longer than 10 mins with this group.

    For a trial run, you cannot submit jobs whose execution time exceeds 10 minutes. Please refer to this.

     

    Unable to run job: Job is rejected. You do NOT have enough point to finish this job.

    The points needed to secure the specified resources and time are insufficient. Please check the point status on the TSUBAME portal page.

     

    Unable to run job: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).  or  Unable to run job: got no response from JSV script"/apps/t3/sles12sp2/uge/customize/jsv.pl".

    Communication with the job scheduler times out and the above error messages may be displayed when the management node is under high load, for example due to a large number of jobs submitted in a short time. The high load is temporary; please wait a while and try again.




  • TSUBAME 3 uses the UNIVA Grid Engine (UGE) batch job scheduler.

    Resource Type

     

    There are six available resource types as follows. Specify the resource type with "-l" option. (The "-pe" and "-q" options are not available.)

    Resource Type  Name    CPU cores  Memory (GB)  GPUs
    F              f_node  28         235          4
    H              h_node  14         120          2
    Q              q_node  7          60           1
    C1             s_core  1          7.5          0
    C4             q_core  4          30           0
    G1             s_gpu   2          15           1
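
    For example, to request one Q resource type (q_node) for one hour in a job script, the header would contain lines like the following; the values are only illustrative.

    #$ -l q_node=1
    #$ -l h_rt=1:00:00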


    Job submission method

    Job can be submitted from the login node with the following command.
     - Submission by job script (when user belonging to GSICGROUP executes train.sh)

    qsub -g GSICGROUP train.sh

     

     - When executing an interactive job (when a user belonging to GSICGROUP uses s_core under X environment for 2 hours)

    qrsh -g GSICGROUP -l s_core=1,h_rt=2:: -pty yes -display $DISPLAY -v TERM /bin/bash


    For details of how to input jobs, such as how to specify the resource type by submission by job script, please refer to the usage guide.
    User's Guide 5.2. Job submission

    Also, please check the related FAQ below for the items not explained here.
    Related FAQ
    How to use scratch area
    About submission method of dependent job
    How to transfer X with qrsh
     

    About job limit

    Please check "Various limit value list" about the current limit.
    If a submitted job exceeds the per-user limit, it is kept in the waiting state "qw" even if there are enough idle nodes in TSUBAME3.
    Once other jobs terminate and the job fits within the per-user limit, it becomes the running state "r", provided there are enough idle nodes.

     

    About reservation

    Reservations can be made in units of one hour, and the nodes can be used until 5 minutes before the reservation end time.
    When submitting a job, execute the following command. The AR ID can be confirmed on the portal.

    $ qsub -g GSICGROUP -ar ARID YOURSCRIPTFILENAME

    Since the nodes can only be used until 5 minutes before the reservation end time, you need to adjust the -l option of the job script accordingly.
    Example) Resource specification when reservation period is 2 days

    #$ -l h_rt=47:55:00

    "Reservation" does not apply to the above "Job limits", and has the "Reservation" restriction.
    Please check "Various limit value list" about the current limit.

    Please check the related FAQ below for coping with error.
    Related FAQ
    "qsub: Unknown option" error occurs when submitting the job, but I do not know which option is bad
    The job status is "Eqw" and it is not executed.
    The error when executing the qrsh command
    Check the detail of an error message printed the log file

     




  • This section summarizes troubleshooting when jobs cannot be submitted to a reservation.
    The following commands are examples where the GSIC group uses AR number 20190108, reserved for 2 days.

    1.Forgot to add ARID

    NG example
    When the following command is executed, the job runs as a normal job, not in the reservation.

    $ qsub -g GSIC hoge.sh

    OK example
    Be sure to use the -ar option when submitting a job to a reservation.

    $ qsub -g GSIC -ar 20190108 hoge.sh

     

    2.h_rt longer than the reserved time

    If the time specified with the h_rt option is longer than the reserved time, the job will not run.
    Also, because the nodes can only be used until 5 minutes before the reservation end time, please shorten the specified time by 5 minutes from the reservation period.

    NG example
    The job is not executed because h_rt fills the entire reservation period.

    $ grep h_rt hoge.sh
    #$ -l h_rt=48:00:00
    $ qsub -g GSIC -ar 20190108 hoge.sh

    OK example (h_rt is 5 minutes shorter than the reservation period)

    $ grep h_rt hoge.sh
    #$ -l h_rt=47:55:00
    $ qsub -g GSIC -ar 20190108 hoge.sh


    If you start a job after the reservation start time, for example because the program terminated abnormally or the job could not be submitted before the reservation started, you also need to take the elapsed time into account.
    For example, if you submit a job 2 hours after the reservation start time, the script would look as follows (assuming one minute of internal processing time from executing the qsub command to the allocation of compute nodes).

    $ grep h_rt hoge.sh
    #$ -l h_rt=45:54:00
    $ qsub -g GSIC -ar 20190108 hoge.sh

    Related URL

    TSUBAME3.0 User’s Guide "5.3. Reserve compute nodes"

    TSUBAME portal User's Guide "9. Reserving compute nodes"

    About specification of batch job scheduler

    Main differences between TSUBAME 2.5 and TSUBAME 3.0 ( node reservation )




  • The time specified by h_rt also includes the preparation time needed to start the job submitted by the user. Therefore, the time specified by h_rt is not the actual job execution time.

    The points consumed are calculated based on the job execution time excluding the preparation time. The preparation time is not constant, because it depends on the status of the node where the job is executed.
    



  • This error occurs when the module command has not been initialized.

     

    The module command can be initialized by adding ". /etc/profile.d/modules.sh" before "module load XXXX".

     

    When the execution shell is sh or bash and you load the intel module:

    . /etc/profile.d/modules.sh
    module load intel

     

    When the execution shell is csh or tcsh and you load the intel module:

    source /etc/profile.d/modules.csh
    module load intel

     

    If a "command not found" error occurs when executing a command installed by an external installer such as pip from a job script or qrsh, please try the following on the login node:

    $ type <command>
    <command> is hashed (/path/to/<command>)

     

    Then you can confirm the path, and add the following to the job script:

    export PATH=$PATH:/path/to

    Here, /path/to is the directory where the command is located.

     


    related URLs

     

    About common errors in Linux

    User Guide




  • It is possible to run several programs on different CPUs/GPUs as follows.

    In this example, a.out uses CPU0-6+GPU0, b.out uses CPU7-13+GPU1, c.out uses CPU14-20+GPU2, d.out uses CPU21-27+GPU3.

    #!/bin/sh
    #$ -cwd
    #$ -V
    #$ -l f_node=1
    #$ -l h_rt=00:30:00

     

    a[0]=./a.out

    a[1]=./b.out

    a[2]=./c.out

    a[3]=./d.out

     

    for i in $(seq 0 3)
    do
        export CUDA_VISIBLE_DEVICES=$i
        numactl -C $((i*7))-$((i*7+6)) ${a[$i]} &
    done
    wait




  • When the web service (Jupyter Lab) cannot start, please check the following points.

    check the log and investigate what is happening

    Output files of web services are saved under ~/.t3was/.
    There might be some hints in the files.

    Initialize the environment

    Initialize the Python 3.6 environment, for example by moving ~/.local/lib/python3.6/ to another directory, and then run

    $ python3 -m pip install --user (modulename)

    which may resolve the module conflict.

    If the web service can now start, install the necessary modules from the Jupyter Lab console, and then run

    $ python3 -m pip check

    to check the dependencies. Updating the problematic modules with

    $ python3 -m pip install -U --user (modulename)

    may solve the problem.

    Check and resolve module dependencies, avoiding initialization of the environment

    This is almost the same as the above. After SSH'ing to TSUBAME, run

    $ module load jupyterlab/3.0.9

    which loads the Jupyter Lab environment used by the web service (version 3.0.9 is required).

    After this, execute

    $ python3 -m pip check

    and check whether there are problematic modules in the dependencies.

    If there is, do

    $ python3 -m pip install -U --user (modulename)

    and update them.



Application Usage


  • You can install modules into your home directory (Example: Theano case)

    $ module load python-extension/2.7
    $ pip install --user theano

    If you want to use the modules from your compute job, add the following lines to your job script before the python commands.

    . /etc/profile.d/modules.sh
    module load python-extension/2.7

    related URL
    How to install numpy, mpi4py, chainer etc. using python/3.6.5




  • In this page, "X application" means an application installed on TSUBAME3 that works in the X environment, that is, a GUI application.

    Please check the troubleshooting below.

    1. An X server application is installed and active on the client PC

    ■Windows
    There are many X server applications for Windows.
    Please confirm that one of them is installed and active.

    ■Mac
    Please confirm XQuartz is installed and configured.
    https://support.apple.com/ja-jp/HT201341

    ■Linux
    Please confirm both of the X11 server application and its libraries are installed.


    2. The X forwarding option in the terminal is enabled.
    ■ A Terminal on Windows (Except for Cygwin)
    The setting method differs depending on your terminal and X server application.
    Please check the manual of each application.

    ■Linux/Mac/Windows(Cygwin)
    Please confirm that the ssh command contains the -Y and -C options (X forwarding and compression).

    $ ssh account_name@login.t3.gsic.titech.ac.jp -i key -YC

    Example: in case of gsic_user as account_name and ~/.ssh/t3-key as key, then

    $ ssh gsic_user@login.t3.gsic.titech.ac.jp -i ~/.ssh/t3-key -YC

    Please refer to the output of the following command for ssh option.

    $ man ssh 


    3. The error reproduces in another terminal/X server
    There are various free terminal software and X server applications for Windows.
    Please check whether the same error occurs with another terminal/X server.
    It may be due to compatibility between the terminal and the X server,
    or to a compatibility issue with the ISV application.
    If the error does not reproduce in other applications, it is probably an application-specific problem.
    In that case we cannot respond even if you contact us; please understand.

    In addition, depending on the X application, command options may be required.
    Please check the manual of X application you want to use.

     

    Some GL applications that do not work with normal X forwarding/VNC connection may work with VirtualGL, so please give it a try if needed.

    For the detail of VirtualGL, please refer to User's Guide.

     

    4. Operation check
    On an interactive node, the standard terminal emulator of the X Window System can be started with the following command. Please confirm whether it starts.

    $ xterm 

    If xterm works but the X application you want to use does not, please try "3. The error reproduces in another terminal/X server".

    Example of failure

    xterm: Xt error: Can't open display: 
    xterm: DISPLAY is not set

    Please check 1 and 2 if the error occurs.

    5. Application use
    Do not execute programs that occupy the CPU on the login nodes.
    Please use compute nodes for full-scale use, including visualization.

    Please refer to the FAQ below for information on using the GUI application at the compute node.
    Reference: FAQ "How to transfer X with qrsh"

    When using f_node, X transfer can be performed with the ssh -Y command.

    Please inform us of the following when you inquire
    ■Operating System you use (e.g. Windows 10, Debian 10, macOS Sierra 10.12.6)

    ■Terminal environment that the error occurs (Cygwin, PuTTY/VcXsrv, Rlogin/Xming)

    ■Version
    For Windows, both the version of the terminal and that of the X server application;
    see the manuals of the applications for how to check their versions.
    For Linux/Mac, please tell us the version of SSH, obtained with the command below.

    $ ssh -V

    ■Please tell us what you have tried so far, and if you get an error, please describe it.




  • Please check if it applies to the following items.
    If applicable, you can install it freely at your own risk.
    Please check the installation manual and the license agreement of the application.

    • It works with the OS installed on TSUBAME (SUSE Linux Enterprise Server 12 SP5). Software requiring Windows or macOS will not work.
    • It does not require administrator privileges (root) to install.
    • It can be installed in your own home directory or on the group disk. (Installing it on the local disk of specific nodes is not allowed.)
    • It has a valid license.
    • It does not require changes to the kernel, libraries, or system settings.
    • If all of these conditions are met, you can install and use it at your own risk.
    • GSIC support is not needed.


    As described above, GSIC cannot help with applications brought in by users, as we do not know anything about them.

    In case of problems, users themselves must distinguish whether it comes from the application itself or the general issue of TSUBAME, and ask application vendors for application-specific problems.

    The versions of libraries and drivers may be changed at the time of the regular maintenance of TSUBAME etc. In that case, you might need to reconfigure the application you had used. Please be aware of the risk of losing compatibility in the future.

     




  • Since most troubles with the distributed software are caused by the user environment, we do not support them individually.
    Please solve them yourself, as you agreed at the time of application.

    Even if you contact us, we can not respond.
    Please read the following carefully.

    Distribution of software

    Application Software on your PC after 1 Aug

    The cause of such inquiries is currently one of the following.
    Both are caused by the user environment.
    · An old license setting
    · A network problem in the laboratory or building where the client is located

    For the time being, we will continue with the TSUBAME 2.5 distribution rules.
    Distribution rules for COMSOL and Schrodinger, which are newly introduced in TSUBAME 3, are still being prepared.




  • When using the ISV application on TSUBAME3, there are the following two cases.

    • Perform all processing of pre / solver / post in TSUBAME3
    • Perform pre / post processing on client and perform solver processing with TSUBAME3

     

    1. When performing processing of pre / solver / post in TSUBAME3

    In TSUBAME 3, basically all of the pre/solver/post functions are installed, so when running on an interactive node it is possible to perform all of the pre, solver, and post processing.
    How to run interactive jobs and how to use each process depend on the ISV application. Please check the manual and user's guide of each application.

    2. When performing pre / post processing on client and perform solver processing with TSUBAME3

    Operation on TSUBAME may be unstable due to X server compatibility. This problem can be avoided by performing pre/post processing on the client, which is why we distribute the software.
    The software is provided for convenience. Please note that distribution may be cancelled depending on the situation.

    The following procedure is necessary to perform pre/post processing on the client and solver processing on TSUBAME3.

    Step 1: Apply for software usage and obtain it
    Step 2: Install the software on the client
    Step 3: Perform pre processing with software on the client
    Step 4: Transfer the data created in Step 3 to TSUBAME
    Step 5: Create a batch script for submitting the job scheduler
    Step 6: Execute the qsub command in TSUBAME and execute the batch script created in Step 5
    Step 7: Transfer the result data of Step 6 to the client
    Step 8: Perform post processing with software installed on the client

    Refer to Distribution of software. (As of November 15, 2017)
    Applications newly introduced in TSUBAME3 are under preparation of distribution rules.
    (Distributed application is out of support range)




  • General
    Please check the following related FAQ first
    About common errors in Linux
    "Disk quta exceeded" error is output
    Error handling for each ISV application


    1. For ISV applications
    Supported. Please inform the following information through inquiry.

    ■ Application name
     Eg) Abaqus/Explicit
    ■ Error message
     Eg) buffer overflow detected
    ■ JOB_ID
     Eg) 181938
    ■ Host name where the error occurred
     Eg) r6i7n5
    ■The situation in detail
     Eg) The error occurred when I logged in to r6i7n5 interactively with qrsh and executed the following command. Details are as follows:

     $ module load abaqus intel-mpi
     $ abq2017 interactive job=TEST input=Job1 cpus=6 scratch=$TMPDIR mp_mode=mpi
     

    #Error#
     Run package
    *** buffer overflow detected ***: 
    /pathto/package terminated
    ======= Backtrace: =========
    /lib64/libc.so.6(+0x721af)[0x2aaab0c001af]

    (The rest is omitted)

    ABAQUS is provided under an academic license, so there is no technical support.
    You need to register on the SIMULIA documentation site and resolve the problem yourself.
    For information on the documentation site, please contact us from "Contact Us".

    2. For the application compiled yourself
    Not supported. Please resolve it yourself.
    See "I would like to use an application not provided by TSUBAME 3".

    Error information is output when the program is compiled with the traceback option.

    If you used Intel or PGI for compilation, please refer to the corresponding user guide.
    Debugging by Allinea FORGE is also possible.




  • General

    -The following error occurs immediately after the program runs:

    unable to connect to forwarded X server: Network error: Connection refused
    Error: Can't open display: localhost:13.0.
    Application name: Xt error: Can't open display: 
    Application name: DISPLAY is not set

    The X server configuration may be wrong. See sections 1 and 2 of the FAQ "X application doesn't work."
     

    -The GUI program suddenly terminates
    Please check the keep alive setting in the terminal you use. See the FAQ "Session suddenly disconnected while working on TSUBAME3."

    -A job abruptly aborted
    Although various reasons can be considered, please check the following.
     -Check the batch error file (usually script_name.e.$JOBID)
     -Check the program-specific log file
     -Check the free space of the directory

    reference: FAQ

    About common errors in Linux
    "Disk quta exceeded" error is output
    The range of support by T3 Helpdesk about the program error such as segmentation fault
     


    Maple

    -The following error occurs immediately after the program runs:

    Exception in thread "Request id 1" java.lang.UnsupportedOperationException:PERPIXEL_TRANSLUCENT translucency is not supported

    It is due to compatibility between the application and the X server. See section 3 of the FAQ "X application doesn't work."
    It has been confirmed that the error occurs with Xming but does not occur with MobaXterm.


    ANSYS

    -The GUI program freezes

    It is due to compatibility between the application and the X server. See section 3 of the FAQ "X application doesn't work."
    The trouble occurs with X servers that do not support GL, and does not occur with X servers that support GL, such as ASTEC-X.
    With ANSYS R18.2, operation with the GL version of Xming (Xming-mesa) has been confirmed.

     Fluent

    -The following error occurs immediately after the program runs:
    Reduce the number of nodes/processes used so that it is within the license limit.

    In the example below, 94 HPC license tokens are requested (f_node=3).
    With 2 f_nodes, the error does not occur because the request is within the limit.

     Unable to spawn node: license not available.
     ANSYS LICENSE MANAGER ERROR:The request for 94 tasks of feature aa_r_hpc cannot be granted.  Only 16 tasks are available.
     Request name aa_r_hpc does not exist in the licensing pool.
     Checkout request denied as it exceeds the MAX limit specified in the options file.
     Feature:       aa_r_hpc
     License path:  27001@lice0:27001@remote:27001@t3ldap1:
     FlexNet Licensing error:-194,147
     For further information, refer to the FlexNet Licensing documentation,
     available at "www.flexerasoftware.com".

     Hit return to exit.
     The fluent process could not be started.


    ex:Notification to apply the license restriction for ANSYS (Jan. 31)


    ABAQUS

    -The following error occurs immediately after the program runs:

    Error in job ***: Error checking out Abaqus license.

    It is the normal handling at the login node. The use of the solver license is prohibited at the login node.
    ex:Notice of the restriction of using ABAQUS analysis on login node and temporary unavailable of it due to maintenance of the license server

    Please run the job via the batch job scheduler.
    See TSUBAME3.0 User's Guide for the batch job scheduling system.

    -The following message is displayed while the program runs:

    Analysis initiated from SIMULIA established products
    Abaqus JOB intel_int
    Abaqus 3DEXPERIENCE R2017x
    Successfully checked out QEX/103 from DSLS server remote
    Queued for QXT/103
    "QXT" license request queued for the License Server on remote.
    Total time in queue: 60 seconds.
    Position in the queue: 1
    Total time in queue: 30 seconds.
    Position in the queue: 2
    Total time in queue: 91 seconds.

    The job is waiting because of a license shortage.
    Calculation starts as soon as sufficient licenses are secured.
    Note that TSUBAME points are consumed even while waiting.

    -Abaqus / Explicit generates the following error at parallel execution

    Abaqus Error: Abaqus/Explicit Packager exited with an error - Please see the 
    status file for possible error messages if the file exists.
    Begin MFS->SFS and SIM cleanup
    Fri 17 Aug 2018 10:40:23 AM JST
    Run SMASimUtility
    Fri 17 Aug 2018 10:40:24 AM JST
    End MFS->SFS and SIM cleanup
    Abaqus/Analysis exited with errors

    As a workaround, please load the countermeasure module as follows.
    If abaqus/2017 is already loaded, execute the module purge command before loading it.

    $ module load abaqus/2017_explicit

    COMSOL Multiphysics 

    -The following message is displayed while the program runs
    It occurs with X servers that do not support GL. Please use an X server that supports GL, or run in software rendering mode.

    $ comsol
    function is no-op
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00007fff0f0252bf, pid=10485,
    tid=0x00007fff0e7f9700
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_112-b15)
    (build1.8.0_112-b15)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed
    modelinux-amd64 compressed oops)
    # Problematic frame:
    # C  [libcs3d_ogl.so+0x2a2bf]
    #
    # Failed to write core dump. Core dumps have been disabled. To enable
    coredumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /home/XX/XXXXXX/hs_err_pidXXXXXX.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    #
    /apps/t3/sles12sp2/isv/comsol/comsol53/multiphysics/bin/comsol: line
    1615:10485 Aborted                 (core dumped)
    ${MPICMD}${FLROOT}/bin/${ARCH}/comsollauncher --launcher.ini
    ${LAUNCHERINIFILE}${LAUNCHERARGS} ${MPILAUNCHERARGS} 


    -The following message is displayed while the program runs

    It is a license error. Wait until the license becomes available and try again.
    Please check the manual for how to check the license status.

     /******************/
     /*****Error********/
     /******************/
     Could not obtain license for COMSOL Multiphysics. License error: -4.
     Licensed number of users already reached.
     Feature: COMSOL License path:
     /apps/t3/sles12sp2/isv/comsol/comsol53/multiphysics/license/license.dat:
     FlexNet Licensing error:-4,132 For further information,
     refer to the FlexNet Licensing documentation,
     available at "www.flexerasoftware.com".
     Total time: 4 s.


    Materials Studio

    -The following message is displayed while the program runs
    Since multiple causes are possible for the error below, please be sure to check the output file.

    The job has failed.
    Download any results generated so far?
    
    (Results files will be permanently removed from Server)

    ex: License error.
    In the case of the example below it is a license error. Wait until the license is available and try again.

    Since the summer maintenance in 2018, this error always occurs when Materials Studio is executed from a machine other than TSUBAME, such as a laboratory PC.
    Reference:License restriction on ISV application usage

    As a countermeasure, please use the batch job scheduler.
    Please check the job scheduling system of TSUBAME3.0 User's Guide.

    Output file (example of CASTEP)

     Job started on host GSIC
     at Thu Aug 16 13:20:06 2018
    
     +-------------------------------------------------+
     |                                                 |
     |      CCC   AA    SSS  TTTTT  EEEEE  PPPP        |
     |     C     A  A  S       T    E      P   P       |
     |     C     AAAA   SS     T    EEE    PPPP        |
     |     C     A  A     S    T    E      P           |
     |      CCC  A  A  SSS     T    EEEEE  P           |
     |                                                 |
     +-------------------------------------------------+
    
     This version was compiled for x86_64-windows-msvc2013 on Dec 07 2016
     Code version: 7217
     Intel(R) Math Kernel Library Version 11.3.1 
     Fundamental constants values: CODATA 2010
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Licensing Error !
    Error: Manual heartbeat setup for MS_castep license failed
    Failed to check out licenses
    Trace stack not available

    Output file (example of Dmol3)

         ===============================================================
         Materials Studio DMol^3 version 2017 R2         
         compiled on Dec  7 2016 22:56:21
         ===============================================================
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    DATE:     Aug 16 13:20:06 2018
     
    Job started on host GSIC
    
    This run uses    1 processors
    Licensing Error !
    Error: Failed to checkout MS_dmol license
    Message: DMol3 job failed
    Error: DMol3 exiting
    Message: License checkin of MS_dmol successful

    ARM

    ex: License error.

    In the case of the example below it is a license error. Wait until the license is available and try again.


    MAP: Your licence does not currently have enough processes available.
    MAP: Requested Processes: 4
    MAP: Available Processes: 3
    Arm Forge 19.0.5 - Arm MAP

    MAP: Your licence does not currently have enough processes available.
    MAP: Requested Processes: 4
    MAP: Available Processes: 2
    MAP: Unable to obtain a valid licence.
    Unable to obtain a valid licence.
    Waiting for seat




  • First of all, please refer to the following.

    FAQ "I would like to use an application not provided by TSUBAME 3"

    1.Installation directory
    You can install in the following two places.
    Please choose the one suited to your operation.
    If you need to share the application within the TSUBAME group, for example with members of your laboratory, please use the high speed storage area.
    *Even if you change the permissions with chmod or similar commands in your home directory, you cannot share it.

     -Home directory (/home/[0-9]/user_account/)
     -High speed storage area, also known as the group disk (/gs/hs[0-1]/TSUBAME_group/)

    Reference: TSUBAME 3.0 User's Guide "3. Storage system"

    2.Installation method

    Please install the application according to its manual, README, or community forum.
    Depending on the application, you may need to compile a library, module, or other component from the source files yourself.
    The following are typical installation examples.
    *Package management software such as zypper cannot be used; you basically have to compile from the source files.

    example 1) executing configure script, generating makefile, then make, make test and make install:
    
    

    $ ./configure --prefix=$HOME/install
    $ make && make test
    $ make install

    example 2) creating a directory for the build, then cmake and make install:
    

    $ mkdir build && cd build
    $ cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/install
    $ make install

    example 3) installing with an install script:
    

    $ ./install.sh

     

    3.How to install python module
    For the installation of the python module, please check the related URL below.

    related URL
    How to install numpy, mpi4py, chainer etc. using python/3.6.5




  • If you want to install numpy, mpi4py, chainer etc. using python/3.6.5, do as follows.

    $ module purge
    $ module load python/3.6.5
    $ module load intel cuda openmpi
    $ python3 -m pip install --user python_modules

    If you want to specify the version, do:

    $ python3 -m pip install --user python_modules==version

     

    ※When installing a module that uses the GPU, such as CuPy, please allocate a compute node with the qrsh command before installing it.

    ※For CuPy, installation can be faster by specifying the package matching your CUDA version, such as cupy-cuda102, when invoking pip install.

    How to install numpy linking intel MKL

    Copy https://github.com/numpy/numpy/blob/master/site.cfg.example to ~/.numpy-site.cfg and edit the item of [mkl] as follows. 

    [mkl]
    library_dirs = /apps/t3/sles12sp2/isv/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64
    include_dirs = /apps/t3/sles12sp2/isv/intel/compilers_and_libraries_2018.1.163/linux/mkl/include
    mkl_libs = mkl_rt

    Then do the following

    $ module load intel python/3.6.5
    $ python3 -m pip install --no-binary :all: --user numpy
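
    To confirm that the installed numpy is actually linked against MKL, you can, for example, run the following check (the exact output format depends on the numpy version):

    $ python3 -c "import numpy; numpy.show_config()"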




  • The following usage restrictions apply when using the applications on campus.
    (Excluding Gaussian and AMBER, which are unrestricted applications.)

    Please do not occupy applications that have a small number of licenses.
    Be sure to stop the application after the end of application use.

    If a license is found to be occupied for a long period, it may be forcibly reclaimed without warning. This will result in unstable operation of the application. In some cases, the connection may be blocked (only from laboratory PCs).

    General

     Longest continuous use period: 1 week
     Restrictions are applied per user (excluding Materials Studio)

    ABAQUS

    Execution location / Limitation
    TSUBAME compute node: Solver limited to 140 tokens
    TSUBAME login node:   Restrictions on execution other than CAE
    Laboratory terminal:  Same as the login node

    ANSYS

    Execution location / Limitation
    TSUBAME compute node: Base license limited to 4 tokens; HPC license limited to 64 tokens
    TSUBAME login node:   Base license limited to 4 tokens; HPC license cannot be executed
    Laboratory terminal:  Same as the login node

    Materials Studio

    Execution location / Limitation
    TSUBAME compute node: CASTEP and DMol3 limited to 20 simultaneously used tokens (total of all users)
    TSUBAME login node:   Unavailable
    Laboratory terminal:  CASTEP and DMol3 unavailable; COMPASS limited to 4 simultaneously used tokens; multiple start of the visualizer is prohibited

    Discovery Studio

    Execution location / Limitation
    TSUBAME compute node: Unavailable
    TSUBAME login node:   Unavailable
    Laboratory terminal:  CHARMM and CHARMM Lite limited to 20 simultaneously used tokens; multiple start of the visualizer is prohibited; limit of 6 tokens per user (total with Materials Studio)

    Related URL
    On suspension of license addition in the busy season of 2018

    Notice of the restriction of using ABAQUS analysis on login node and temporary unavailable of it due to maintenance of the license server

    Notification to apply the license restriction for ANSYS (Jan. 31)

    Software distribution service

    Application software




  • There is a known problem of data corruption in collective communications on GPU buffers with OpenMPI.
    (2019.04) This issue was resolved with the maintenance at the end of FY2018.

    As a workaround, please give the following a try.

    • MPI_Allgather()

    mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allgather_algorithm 2

    • MPI_Alltoall()

    mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_alltoall_algorithm 3
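
    These -mca options are added to your usual mpirun command line; for example (a sketch, where the process count and program name are only illustrative):

    mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allgather_algorithm 2 -np 8 ./a.out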

     

    If the above does not solve the issue, please try the following.

     

    mpirun -mca pml ob1

     

    This issue will be fixed by OPA 10.8 (in the end of the fiscal 2018)

     




  • In TSUBAME3, R 3.4.1 is available.
    In addition to the basic package, the libraries available as default are as follows.
    Rmpi, rpud, rpudplus

    Please use library() command to check other available libraries.

    If you wish to use a library other than the above, you need to install it yourself.
    Since you cannot write into the installation directory of R because of permissions, install and manage your own libraries after specifying a library path. The procedure is as follows.

    Assuming that the library path is $HOME/Rlib, the library name is testlib, and testlib.tar.gz is the source package, operate as follows.

    Load modules:
    >module load cuda openmpi r

    Create library installation directory (if nothing):
    >mkdir ~/Rlib

    Download package:
    >cd ~/Rlib
    >wget https://cran.r-project.org/src/contrib/testlib.tar.gz

    Install library
    > R CMD INSTALL -l $HOME/Rlib testlib.tar.gz

    Your own installation library settings:
    > export R_LIBS_USER=$HOME/Rlib

    Use your library:
    > R
      library(testlib)




  • Sometimes an error like the following occurs when mpi4py.futures.MPIPoolExecutor is used with OpenMPI.

    [r5i7n2:26205] [[60041,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87

    If you face this error, please try either of the following:

    1. mpirun -np <NP> python3 -m mpi4py.futures ./test.py

    2. use mpi4py with intel MPI

     




  • Port forwarding can be configured in each terminal software as follows.

    Please try the following after allocating a compute node with qrsh/qsub.

    As an example, suppose a compute node r7i7n7 is allocated, and connect local PC port 5901 to r7i7n7 port 5901.

     

    1. MobaXterm

    Tunneling -> New SSH Tunnel. Under "My computer with MobaXterm", enter 5901 as the "Forwarded port". Under "SSH server", enter login.t3.gsic.titech.ac.jp as the "SSH server", your username, and 22 as the "SSH port". Under "Remote server", enter r7i7n7 as the "Remote server" and 5901 as the "Remote port", then save. Choose the key icon under the Settings tab and start the configured tunnel.


     

    2. OpenSSH/WSL

     

    $ ssh -L 5901:r7i7n7:5901 -i <private key> -f -N <username>@login.t3.gsic.titech.ac.jp

    3. PuTTY

    PuTTY Configuration -> Connection -> SSH -> Tunnels: enter 5901 as "Source port" and r7i7n7:5901 as "Destination", click "Add", and then Open.


     4. teraterm

    Setup -> SSH forwarding -> Add: enter 5901 as "Forward local port", r7i7n7 as "to remote machine", and 5901 as "port", then click "OK".

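
    Once the tunnel is established, point your local client at localhost:5901. For example, if a VNC server is running on the compute node (see the TurboVNC-related FAQ), a local VNC viewer would typically connect like this (the client command name depends on the VNC software you use):

    $ vncviewer localhost:5901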

     




  • With Intel MPI, the output may stop when the program is run in the background like the following.

    mpirun ... ./a.out >& log.txt &

    In this case, it can be avoided by the following:

    mpirun ... ./a.out < /dev/null >& log.txt &




  • If you want to link Intel MKL ScaLAPACK, please fill in the appropriate settings at https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html and take the link options from "Use this link line".

     

    example:link with LP64 + dynamic linking + intel MPI + ScaLAPACK

     -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl
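
    For example, a C source file could be compiled and linked with these options using Intel MPI's compiler wrapper (a sketch; the source file name is only illustrative):

    $ module load intel intel-mpi
    $ mpiicc sample.c -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl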

     

    If you want to use BLACS with a custom MPI, do the following:

     

    $ module load intel

    $ cp -pr $MKLROOT/interfaces/mklmpi .

    $ cd mklmpi

    $ make libintel64 INSTALL_DIR=.

     

    Then you will get the custom BLACS library; replace -lmkl_blacs_intelmpi_lp64 with it.




  • There is a known issue where a segmentation fault sometimes occurs in mpirun/mpiexec.hydra with intel-mpi/19.6.166 or intel-mpi/19.7.217 when using a shared resource type such as h_node or q_node.

     

    $ mpirun -np 1 hostname
    /apps/t3/sles12sp2/isv/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun: line 103: 24862 Segmentation fault      mpiexec.hydra -machinefile $machinefile "$@" 0<&0

     

    To avoid this issue, please use f_node resource type or use intel-mpi/19.0.117.
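
    For example, the older module could be selected in a job script as follows (a sketch; combine it with the compiler and other modules you need):

    . /etc/profile.d/modules.sh
    module purge
    module load intel intel-mpi/19.0.117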




  • Sometimes an error like the following occurs by invoking singularity build with qrsh.

     

    INFO:   Starting build...
    FATAL:   While performing build: conveyor failed to get: error getting username and password: error reading JSON file"/run/user/0/containers/auth.json": open /run/user/0/containers/auth.json:permission denied

     

    As a workaround, do the following after qrsh:

     

    unset XDG_RUNTIME_DIR

     

    and then do singularity build

     

    2020/12/03 update:

    Unsetting XDG_RUNTIME_DIR is now done by the job scheduler, therefore the above is no longer required.




  • If you have a GUI application on TSUBAME and X forwarding fails to draw or the performance is insufficient, TurboVNC may improve the situation.
    Since MobaXterm has a built-in VNC client function, it is relatively easy to use.
    Please refer to User's Guide for how to start a VNC server on a compute node and how to connect it from MobaXterm.




  • If you want to install V8, rstan R packages on TSUBAME, please try the following procedures.

    ※ Confirmed with V8 3.4.2 and rstan 2.19.3.

    There are a lot of dependency packages, but they are easy to install, so please download and install the dependencies with R CMD INSTALL before trying to install V8 and rstan.

     

    * V8

     

    $ module load gcc cuda openmpi r v8

    $ R CMD INSTALL -l ~/Rlib /path/to/V8_3.4.2.tar.gz

     

    * rstan

     

    $ module load gcc cuda openmpi r

    $ mkdir ~/.R/

    $ vi ~/.R/Makevars  <----edit as follows

    CXX14FLAGS=-O3 -Wno-unused-variable -Wno-unused-function
    CXX14 = g++ -std=c++1y -fPIC

    $ R CMD INSTALL -l ~/Rlib rstan_2.19.3.tar.gz




  • A newer gcc module such as gcc/10.2.0 is required to use C++17 parallel algorithms.

    The following error occurs during compilation (e.g. nvc++ -autopar=gpu ...) when using C++ parallel algorithms with the nvhpc module together with a newer gcc module such as gcc/10.2.0.

     

    "/apps/t3/sles12sp2/isv/nvidia/hpc_sdk/Linux_x86_64/21.7/compilers/include-stdpar/thrust/mr/new.h", line 44: error: namespace "std" has no member "align_val_t"
              return ::operator new(bytes, std::align_val_t(alignment));
                                                ^

    "/apps/t3/sles12sp2/isv/nvidia/hpc_sdk/Linux_x86_64/21.7/compilers/include-stdpar/thrust/mr/new.h", line 66: error: namespace "std" has no member "align_val_t"
              ::operator delete(p, bytes, std::align_val_t(alignment));

    If you encounter this error, please try this:

    $ makelocalrc -x -d . -gcc `which gcc` -gpp `which g++`

    $ export NVLOCALRC=$PWD/localrc

     




  • This seems to be a known bug with the combination of macOS Safari and jupyter lab.

     

    https://xchop.blogspot.com/2019/03/macos-jupyter-labterminal.html

    https://qiita.com/qasa/items/b5a6dce179efbf8760dc

     

    Please use another browser (Chrome, Firefox, etc.).




  • An error such that "X fatal error. ***ABAQUS/ABQcaeG rank 0 terminated by signal 6 " occurs at modeling.

    The error seems to occur when ABAQUS CAE is started and modeling is performed over X forwarding with MobaXterm, etc.

    You can avoid this error by using VNC + VirtualGL, so please use that.

     

    Please refer here for more information on how to use VNC from MobaXterm.

    Please refer here for more information on how to use VNC via noVNC.

    Please refer here for more information on how to use VirtualGL from VNC.




  • If the following error:

    Script Error: Microsoft JScript runtime error; Index out of bounds;

    occurs, please try below.

    1. Quit all of Ansys related programs.
    2. Log in to the relevant Linux machine as the relevant general user and execute the following command.

    $ mv ~/.ansys ~/.ansys.bak
    $ mv ~/.config/Ansys ~/.config/Ansys.bak
    $ mv ~/.mw ~/.mw.bak
    $ mv ~/.mw.old ~/.mw.old.bak 

     For Fluent
    $ mv ~/.fluentconf ~/.fluentconf.bak

     For CFX
    $ mv ~/.cfx ~/.cfx.bak


    3. Restart Ansys (after restarting, new folders such as ".ansys" or "Ansys" corresponding to those renamed in step 2 will be created, and your personal settings will be reset).



Tips


    • Acceleration of large-scale collective communications for openmpi

    There are four OPA units per node on TSUBAME3, but only two are used by default.

    Explicitly using all four of them, as shown below, may speed up collective communications with large message sizes, such as MPI_Alltoall(), when the number of nodes is large.

    This is especially effective for GPU communication.

    - wrap.sh

    #!/bin/sh
    # wrap.sh: assign each local MPI rank to one of the four HFI (OPA) units
    # in round-robin order, then execute the program given as arguments.

    export NUM_HFIS_PER_NODE=4
    export HFI_UNIT=$((OMPI_COMM_WORLD_LOCAL_RANK % NUM_HFIS_PER_NODE))

    exec "$@"

     

    - an example job script (f_node=8, GPUDirect on)

    #!/bin/sh
    #$ -cwd
    #$ -V
    #$ -l h_rt=01:00:00
    #$ -l f_node=8

    . /etc/profile.d/modules.sh
    module purge
    module load cuda openmpi/3.1.4-opa10.10-t3

    mpirun -x PATH -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np $((4*8)) ./wrap.sh ./a.out

     

    • Acceleration of point-to-point communications for openmpi

    Multirailing (bundling) multiple HFIs (OPA units) may speed up point-to-point communication.
    This does not seem to be very effective for GPU communication, but it does seem to help CPU communication.
    To enable multirail with openmpi, do the following:

    mpirun ... -x PSM2_MULTIRAIL=2 ... ./a.out

    Also, the performance of point-to-point communication when using multirail seems to depend on the value of PSM2_MQ_RNDV_HFI_WINDOW (default: 131072 bytes, maximum: 4 MB).
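
    For example, the window size could be raised toward that maximum together with multirail as follows (4194304 bytes = 4 MB is an illustrative value, not a recommendation):

    mpirun ... -x PSM2_MULTIRAIL=2 -x PSM2_MQ_RNDV_HFI_WINDOW=4194304 ... ./a.out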

    For more details on each parameter, please refer to here.

     

    • About Intra node GPU communication

    As shown in the Hardware Architecture, GPU0<->GPU2 and GPU1<->GPU3 are each connected by two NVLink links, which doubles the bandwidth.
    If possible, having these GPU pairs communicate with each other may speed up your program.
    ※ This can only be exploited with f_node, since h_node allocates only GPU0,1 or GPU2,3.
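
    To confirm the GPU interconnect topology on an allocated node, you can run the following (the exact output format depends on the driver version):

    nvidia-smi topo -m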

    • Some tips of openmpi

    * Add rank numbers to the output

    mpirun -tag-output ...

    * Explicitly specify the algorithm for a collective communication (by default, openmpi dynamically selects an algorithm based on communicator size and message size).
    Example: MPI_Allreduce()

    mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_allreduce_algorithm <algo #> ...

    Algorithm numbers of Allreduce are: 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
    If you want to see the details, run the following:

    ompi_info --param coll tuned --level 9

    or:

    ompi_info -all

    If you want to disable the tuned collective module, run:

    mpirun -mca coll ^tuned

    This disables the tuned module and falls back to the basic module.

    This may degrade performance, but it can be useful when the tuned module misbehaves.

     

    * Use ssh instead of the qrsh -inherit ... launched by mpirun (f_node only)

    mpirun -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh ...

    By default, the processes are launched via qrsh -inherit, but if you are having problems with that, try this.

     

    * Show current MCA parameter configurations

    mpirun -mca mpi_show_mca_params 1 ...

    Also, MCA parameters can be passed via environment variables as follows.

    export OMPI_MCA_param_name=value

    where "name" is the variable name, "value" is its value.

     

    * Obtaining a core file from an openmpi job when a segmentation fault occurs

    When you want to get a core file after a segmentation fault etc. with openmpi, the core file does not seem to be produced even if you set ulimit -c unlimited in the job script.
    You can get the core file by wrapping the program as follows.

     

    - ulimit.sh

    #!/bin/sh
    # ulimit.sh: raise the core file size limit, then execute the program given as arguments.

    ulimit -c unlimited
    exec "$@"

    mpirun ... ./ulimit.sh ./a.out

     

    * CPU bindings

    The CPU binding options in openmpi are as follows.

    mpirun -bind-to <core, socket, numa, board, etc> ...
    mpirun -map-by <foo> ...

    For more details, please refer to man mpirun.

    If you want to confirm the actual binding, try the following:

    mpirun -report-bindings ...
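
    As a combined sketch (the mapping and binding policies here are examples, not recommendations):

    mpirun -map-by socket -bind-to core -report-bindings ... ./a.out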

     

    • Some other tips

    * GPUDirect threshold on the sender side

    The GPUDirect threshold on the sender side is 30000 bytes by default.
    On the receiver side it is UINT_MAX (2^32-1), so if you want GPUDirect to stay on even when the buffer size is large, you can set the following on the send side.

    mpirun -x PSM2_GPUDIRECT_SEND_THRESH=$((2**32-1)) ...

    For more detail, please refer to here.




    • Job hangs or error occurs when NCCL is used

    Several problems have been reported where jobs hang or errors occur when NCCL is used.

    In some cases, a kernel panic also happens.

    If you suspect that you have hit this problem, try the following:

    export NCCL_IB_DISABLE=1

    or

    export NCCL_BUFFSIZE=1048576

    NCCL_IB_DISABLE=1 may decrease performance; in that case, please use NCCL_BUFFSIZE=1048576 instead.

     

    • Segmentation fault when MPI+OpenACC is executed

    A problem has been reported where a segmentation fault occurs when openmpi+OpenACC is used.

    As a workaround, please try one of the following:

    export PSM2_MEMORY=large

    or

    export OMPI_MCA_pml=ob1
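
    When launching with mpirun, these environment variables can also be passed to every rank with -x, as in the other examples above:

    mpirun ... -x PSM2_MEMORY=large ... ./a.out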

     

    • Error occurs when GPUDirect is used.

    (2021/08/19 updated)

    This issue has been fixed by the last maintenance.

    Some cases have been reported where an error occurs when GPUDirect is used and the program exits, whether normally or abnormally.

    This error happens rarely.

    It also sometimes triggers a kernel panic.

    If you suspect this, please turn off GPUDirect as follows.

    mpirun ... -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=0

     

    • Job hangs with large-scale jobs

    Hangs of mpirun in large-scale jobs have been reported before.

    This seems to be caused by the qrsh -inherit processes fork()'ed by mpirun.

    If you suspect this, please try the following.

    * For openmpi

    mpirun -mca plm_rsh_disable_qrsh true -mca plm_rsh_agent ssh ...

    * For intel MPI

    export I_MPI_HYDRA_BOOTSTRAP=ssh
    unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS

    ※ Please note that these are effective only for f_node.



Migration from TSUBAME2.5


  • Because the compilers, MPI, and various libraries differ between TSUBAME 3.0 and TSUBAME 2.5, programs built on TSUBAME 2.5 cannot be executed as they are; they must be recompiled on TSUBAME 3.0.




  • Because the storage connected to TSUBAME 3.0 is different from that of TSUBAME 2.5, the data stored in /home, /work0, /work1, and /data0 of TSUBAME 2.5 cannot be accessed directly.

    The data migration procedure from TSUBAME 2.5 to TSUBAME 3.0 will be published as soon as it is ready.




  • This article describes the main differences between TSUBAME 2.5 and TSUBAME 3.0.
    Please refer to the "TSUBAME 3.0 User's Guide" and "TSUBAME portal User's Guide" from here for details.

    The flat-rate option has been abolished

    Two options were selectable for TSUBAME 2.5: "flat-rate usage (campus only)" and "measured-rate usage (purchase of TSUBAME points)".

    For TSUBAME 3.0, the flat-rate option has been abolished. Whether on campus or not, usage has been unified into the measured-rate option.

    Payment method has been unified into prepayment

    There were two payment methods in the measured-rate option for TSUBAME 2.5: "prepayment" and "automatic charging (campus only)".

    For TSUBAME 3.0, automatic charging has been abolished. Whether on campus or not, the payment method has been unified into prepayment. Points purchased by prepayment are not refunded in principle. When purchasing points, please double-check the group, payment code, and number of purchased units (price) to avoid mistakes.

    Password login to both TSUBAME3 and portal from inside campus is no longer in use

    Until TSUBAME 2.5, it was possible to use password authentication when logging in from inside the campus.

    In TSUBAME 3.0, to improve security, password authentication for login is no longer available even from inside the campus, except for some terminals; SSH public key registration is required. In addition, when logging in to the TSUBAME portal, you can use either single sign-on from the Tokyo Tech Portal or a temporary login URL sent by e-mail; a password is not used in either case.

    Operations requiring passwords with TSUBAME 3.0 are as follows. If you do not use the following functions, you do not need to set a password.

    • Connection to TSUBAME 3 high-speed storage by CIFS
    • Changing login shell with chsh command
    • Login to educational computer system
    • Terminals in some training rooms / Login without using SSH key authentication from TSUBAME 2 (for data migration)
    • Use of some ISV application licenses

    While TSUBAME 2.5 had a password expiration period of about half a year, TSUBAME 3.0 does not have a password expiration date. Please pay attention to password management and change your password as necessary.

    The price and value of TSUBAME points have changed

    Since TSUBAME 2.5 and TSUBAME 3.0 have different performance per node, the usage fee per unit of computation time differs.
    In TSUBAME 2.5, 1 point bought roughly 1 node-hour (e.g. 2 nodes for 30 minutes cost 1 point); in TSUBAME 3.0, 1 point buys roughly 1 node-second (e.g. 2 nodes for 30 minutes cost 3600 points).

    For more information, please refer to rules / usage details.

    ID verification is required when adding members to a group or using another faculty member's budget


    In TSUBAME 3.0, approval of the relevant user is required when adding members to a group or creating a payment code for a budget for which another faculty member is responsible. When an authorized user performs these operations, an approval request e-mail is sent to the relevant user (in some cases it is sent after confirmation by GSIC). Please log in to the TSUBAME portal, access the URL written in the e-mail, and approve according to the displayed instructions.




  • This article describes the main differences between TSUBAME 2.5 and TSUBAME 3.0. Please refer to the "TSUBAME 3.0 User's Guide" and "TSUBAME portal User's Guide" for details.

    What you can do with the login node

    When you logged in to TSUBAME 2.5, the terminal was connected to an interactive node, which had the same configuration as a compute node, so you could compile and debug applications on it.

    In TSUBAME 3.0, the login node has a different configuration from the compute nodes (e.g. no GPU is attached), so it is not intended for running applications, including debugging.
    Although it is fine to transfer and deploy files and compile small programs on the login node, avoid putting heavy loads on the login node, such as debugging or running large programs. Instead, use compute nodes interactively as described in the next section.

    How to use interactive execution

    To use interactive execution, connecting directly to a compute node and entering commands, please use the following command:
     $ qrsh -g [TSUBAME3 group] -l [resource type]=[number] -l h_rt=[elapsed time]
    Once a compute node is secured, a shell logged in to that compute node is opened. When you exit this shell, the interactive session ends and the compute node is released.

    If -g is omitted, TSUBAME points are not consumed, but restrictions such as a limited execution time (within 10 minutes) are applied as "trial execution".

    GUI applications cannot be executed from the shell started by qrsh. Therefore, after starting interactive use of the resource type f_node as above, connect with ssh -Y from the login node to the compute node in another terminal, different from the one where qrsh is running, and run the GUI application there. Please be aware that you need to specify the -Y option both for the ssh connection from your terminal to the login node and for the ssh connection from the login node to the compute node, as in the sketch below.
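
    A minimal sketch of the procedure (the group name, run time, and host names are placeholders):

     # terminal 1: secure a compute node interactively
     $ qrsh -g [TSUBAME3 group] -l f_node=1 -l h_rt=1:00:00

     # terminal 2: connect with X forwarding and start the GUI application there
     $ ssh -Y [login node]
     $ ssh -Y [compute node]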

    How to run applications (setting of PATH etc.)

    In TSUBAME 2.5, most applications were ready to run right after login. When switching the version or MPI environment of some applications, the environment was changed by executing the specified environment setting script.

    When you log in to TSUBAME 3.0, environment variables for most applications are not set. An application can be used by explicitly loading the module file corresponding to it.
    For example, when using the Intel compiler, CUDA, and OpenMPI, the modules must be loaded before compiling the application and before executing the job, as described below:
     $ module load intel cuda openmpi

    Job restrictions
    Please check "Various limit value list" about the current limit.
     

    In TSUBAME 2.5, instead of paying more TSUBAME points, it was permitted to run jobs longer than 24 hours as a premier option. The option to extend execution time has been abolished in TSUBAME 3.0. The maximum execution time for all jobs is 24 hours.

    In TSUBAME 2.5, large-scale reserved execution of 16 nodes or more could be performed in units of one day by using the H queue.

    In TSUBAME 3.0, preparations are currently underway so that reserved execution of one node or more, in units of one hour, can be performed.




  • This article describes the main differences between TSUBAME 2.5 and TSUBAME 3.0.
    Please refer to the "TSUBAME portal User's Guide" for the node reservation method and the "TSUBAME 3.0 User's Guide" for how to submit jobs to reserved nodes.

    In addition, some settings and limit values may be updated in consideration of the system usage situation. We will make an announcement when settings change, so please periodically check the latest notices.
     

    Small reservations are easier to make than before.

    In the H queue of TSUBAME 2.5, reservations could only be made for 16 nodes or more, in units of one day, for large-scale execution. In TSUBAME 3.0, it is possible to reserve one node or more in units of one hour, so in addition to large-scale execution, reservations can also be used for long-term execution, etc.

    Reservation-related limit values of TSUBAME 3.0
    Please check "Various limit value list" about the current limit.

    Maximum number of reserved nodes: 135 nodes (October-March), 270 nodes (April-September)
    Reservation time length: 1 hour to 96 hours (4 days) (October-March), 1 hour to 168 hours (7 days) (April-September)
    Total amount of reservation slots that one group can hold at once: 6480 node-hours (October-March), 12960 node-hours (April-September)

    About the time that a job can actually be executed

    In TSUBAME 2.5, we were able to occupy the nodes from 10 am on the reservation start date until 9 am on the reservation end date.
    In TSUBAME 3.0, the nodes can be used from the reservation start time until 5 minutes before the reservation end time, and all jobs are stopped 5 minutes before the end time.

    On submitting jobs to reserved nodes

    By adding " -ar reservation number "to the arguments of qsub, qrsh etc., you can submit the job to the reserved node.(You can submit a job before the start time of reservation slots)
    Please note that if you do not specify " -ar reservation number ", you will consume points and execute the job outside the reservation slots."
    Even if you are using a resource type other than f_node, please be aware that you can not submit jobs with more than parallel number of reserved nodes.For example, h_node 40 parallelism can not be executed when 20 nodes are reserved. You can run two parallel jobs at the same time.
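
    For example, a submission to a reserved slot might look like this (12345 is a hypothetical reservation number):

    $ qsub -g [TSUBAME3 group] -ar 12345 job.sh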

    About SSH / direct login to reserved node

    In TSUBAME 2.5, members of the TSUBAME group that made the reservation could log in to the reserved nodes with SSH and run programs directly without going through the scheduler.
    In TSUBAME 3.0, only users who have submitted jobs can log in with SSH, and only for f_node jobs. To execute a program directly, create a job with the required number of f_node nodes, or log in with qrsh.

    Attention on reservations made just before the start time

    For TSUBAME 2.5, the point consumption of a reservation starting within one week was constant.
    In TSUBAME 3.0, a reservation starting within 24 hours costs four times as many points as a reservation starting more than 24 hours ahead (within 2 weeks).
    This is to avoid affecting already-submitted jobs outside the reservation.
    In addition, since the nodes used for reservations are shared with the nodes used for non-reserved jobs, there is a high possibility that a reservation within 24 hours cannot be secured, depending on the job execution status.
    For large-scale execution, please prepare in advance; early reservation is recommended.

    Note on cancelling a reservation

    In TSUBAME 2.5, the TSUBAME points consumed for a reservation were fully returned when the reservation was deleted. TSUBAME 3.0 returns at most half of the points when a reservation is cancelled, except in the following cases:

    • Cancellation within 5 minutes after making the reservation
    • Cancellation for reasons outside the user's responsibility, such as system maintenance

    Reserving nodes makes it harder for other jobs to run, because compute nodes must be kept free for the reserved time. Please confirm the reservation contents carefully when making a reservation, and reserve only the amount you need.