In some cases, the following message is printed to the log file.
/var/spool/uge/hostname/job_scripts/JOB-ID: line XX: Process-ID Killed Program_Name
In this case, run the qacct command to check the job details.
$ qacct -j JOB-ID
The following are examples of qacct command output (excerpts).
==============================================================
1. Example when the memory resource is exceeded
$ qacct -j 4500000
qname all.q
Pay attention to exit_status, account, and maxvmem in this example.
exit_status lets you identify the cause of the error from the exit code. An exit_status of 137 means 128 + 9, that is, the process was terminated by signal 9 (SIGKILL). Because this status occurs in many different situations, it is not enough on its own to determine the cause.
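The 128 + 9 arithmetic can be checked with a short Python sketch. It assumes the usual shell convention that an exit status above 128 means the process was terminated by signal number (status - 128). The function name decode_exit_status is only illustrative, and the signal name lookup assumes a Linux system.

```python
import signal

def decode_exit_status(status):
    """Return (signal_number, signal_name) if status encodes a signal, else None.

    Assumes the shell convention: exit status = 128 + signal number.
    """
    if status <= 128:
        return None  # ordinary exit code, not a signal
    signum = status - 128
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown"  # no signal with this number on this platform
    return signum, name

print(decode_exit_status(137))
```

On Linux, 137 decodes to signal 9 (SIGKILL), which tells you the job was killed but not why.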
So check account and maxvmem next.
The "0 0 1 0 0 0" in account shows which resource type was used and how many.
The space-separated positions correspond, in order, to the resource types f_node, h_node, q_node, s_core, q_core, and s_gpu, and the number at each position is the amount of that resource. In this example, one q_node is used.
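As a minimal sketch of how the account value maps to resource types, the following Python snippet pairs each of the first six space-separated values with the resource type order given above. The function name parse_account_resources is hypothetical and not part of qacct or any TSUBAME tool.

```python
# Order of resource types in the account field, as listed above.
RESOURCE_TYPES = ["f_node", "h_node", "q_node", "s_core", "q_core", "s_gpu"]

def parse_account_resources(account):
    """Map the first six space-separated account values to resource types.

    Only non-zero amounts are returned.
    """
    values = account.split()
    if len(values) < len(RESOURCE_TYPES):
        raise ValueError("expected at least six space-separated values")
    amounts = (int(v) for v in values[:len(RESOURCE_TYPES)])
    return {name: n for name, n in zip(RESOURCE_TYPES, amounts) if n > 0}

print(parse_account_resources("0 0 1 0 0 0"))  # {'q_node': 1}
```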
maxvmem shows the maximum memory usage.
According to the User's Guide, up to 60 GB of memory is available in a q_node, but this job is estimated to have tried to use about 120 GB.
On TSUBAME, a job is killed automatically if it uses more memory than it was assigned.
2. Example when the reserved time is exceeded
$ qacct -j 50000000
Pay attention to exit_status and wallclock in this example.
exit_status lets you identify the cause of the error from the exit code. An exit_status of 137 means 128 + 9, that is, the process was terminated by signal 9 (SIGKILL). Because this status occurs in many different situations, it is not enough on its own to determine the cause.
So check account and wallclock next.
The seventh space-separated field of account indicates the time (in seconds) for which the resources were reserved.
In this example it is 600 seconds.
wallclock shows the elapsed time, which is 614 seconds in this example.
Since the calculation did not finish within the reserved time, it can be inferred that the job was forcibly terminated.
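The comparison described here can be sketched in Python as follows. The function name exceeded_reserved_time is hypothetical, and the account string in the usage line is made up for illustration: only its seventh value (600) and the wallclock value (614) come from this example.

```python
def exceeded_reserved_time(account, wallclock_sec):
    """Return True if wallclock_sec is greater than the reserved time.

    The reserved time (in seconds) is taken from the seventh
    space-separated field of the account string.
    """
    fields = account.split()
    if len(fields) < 7:
        raise ValueError("account has no seventh field")
    reserved_sec = int(fields[6])
    return wallclock_sec > reserved_sec

# Hypothetical account string; the first six values are placeholders.
print(exceeded_reserved_time("0 0 1 0 0 0 600", 614))  # True
```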
Related URL: About common errors in Linux