Checking the details of an error message printed to the log file

The following message may be printed to the log file in some cases.

/var/spool/uge/hostname/job_scripts/JOB-ID: line XX: Process-ID Killed Program_Name

In this case, run the qacct command to check the details of the job.

$ qacct -j JOB-ID
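If the full output is long, you can narrow it down to the fields discussed below, for example with a grep filter (a minimal example, assuming a standard shell environment):

$ qacct -j JOB-ID | grep -E 'exit_status|account|maxvmem|wallclock'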

The following are output examples of the qacct command. (Excerpts)

==============================================================

1. Example when the memory resource is exceeded

$ qacct -j 4500000

qname        all.q               
hostname     r0i0n0              
group        GSIC          
owner        GSICUSER00            
project      NONE                
department   defaultdepartment   
jobname      SAMPLE.sh
jobnumber    4500000             
taskid       undefined
account      0 0 1 0 0 0 3600 0 0 0 0 0 0
priority     0      
cwd          /path-to-current
submit_host  login0 or login1    
submit_cmd   qsub -A GSICGROUP SAMPLE.sh
qsub_time    %M/%D/%Y %H:%M:%S.%3N
start_time   %M/%D/%Y %H:%M:%S.%3N
end_time     %M/%D/%Y %H:%M:%S.%3N
granted_pe   mpi_q_node          
slots        7                   
failed       0    
deleted_by   NONE
exit_status  137                              
maxvmem      120.000G
maxrss       0.000
maxpss       0.000
arid         undefined
jc_name      NONE

Pay attention to exit_status, account, and maxvmem in this example.
exit_status gives the cause of the error as an exit code. An exit_status of 137 means 128 + 9 (killed by signal 9), but since this status can occur for various problems, it alone may not be enough to determine the cause.
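As a quick check, an exit code above 128 usually means the process was killed by signal number (exit code - 128). In a bash shell you can look up the signal name, for example:

$ kill -l $((137 - 128))
KILL

Signal 9 (SIGKILL) only tells you that the process was forcibly killed, not why, so the other fields are still needed.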

Next, check granted_pe, account, and maxvmem.

The "0 0 1 0 0 0" of account shows which resource type and how much it used.
The space type indicates the resource type of f_node, h_node, q_node, s_core, q_core, s_gpu, and the number indicates the resource amount. In this example, one q_node is used.
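For reference, the following awk one-liner labels these six values (a minimal sketch that assumes the field layout shown in the output above):

$ qacct -j 4500000 | awk '/^account/ {print "f_node="$2, "h_node="$3, "q_node="$4, "s_core="$5, "q_core="$6, "s_gpu="$7}'
f_node=0 h_node=0 q_node=1 s_core=0 q_core=0 s_gpu=0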
maxvmem shows the maximum memory usage.
According to the User's Guide, up to 60 GB of memory is available on a q_node, but in this example it is estimated that about 120 GB of memory was about to be used.
 

On TSUBAME, a job is automatically killed if it uses more memory than it was assigned.
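To check the memory usage quickly, maxvmem can be extracted in the same way (a minimal sketch; the 60 GB figure is the q_node limit from the User's Guide mentioned above):

$ qacct -j 4500000 | awk '/^maxvmem/ {v=$2; sub(/G$/, "", v); print "maxvmem =", $2, (v+0 > 60 ? "(exceeds the 60G available on q_node)" : "")}'
maxvmem = 120.000G (exceeds the 60G available on q_node)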


2. Example when the reserved time is exceeded

$ qacct -j 50000000
qname        all.q               
hostname     r0i0n0              
group        GSIC          
owner        GSICUSER00            
project      NONE                
department   defaultdepartment   
jobname      SAMPLE.sh
jobnumber    50000000             
taskid       undefined
account      0 0 1 0 0 0 600 0 0 0 0 0 0
priority     0      
cwd          /path-to-current
submit_host  login0 or login1    
submit_cmd   qsub -A GSICGROUP SAMPLE.sh
qsub_time    %M/%D/%Y %H:%M:%S.%3N
start_time   %M/%D/%Y %H:%M:%S.%3N
end_time     %M/%D/%Y %H:%M:%S.%3N
granted_pe   mpi_f_node          
slots        7                   
failed       0    
deleted_by   NONE
exit_status  137
wallclock    614.711                              
maxvmem      12.000G
maxrss       0.000
maxpss       0.000
arid         undefined
jc_name      NONE

Pay attention to exit_status and wallclock in this example.
exit_status gives the cause of the error as an exit code. As above, an exit_status of 137 means 128 + 9, but since this status can occur for various problems, it alone may not be enough to determine the cause.

So, focus on account and wallclock.
The seventh space-separated value of account indicates the time (in seconds) for which the resources were reserved.
In this example it is 600 seconds.

wallclock shows the elapsed time, which is about 614 seconds in this example.

Since the calculation did not finish within the reserved time, it can be inferred that the job was forcibly terminated.
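The same comparison can be scripted, for example with awk (a sketch that assumes the output layout shown above, where the seventh value of account is the reserved time in seconds):

$ qacct -j 50000000 | awk '/^account/ {reserved=$8} /^wallclock/ {used=$2} END {if (used+0 > reserved+0) print "wallclock", used, "s exceeds the reserved", reserved, "s"}'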



Related URL: About common errors in Linux