[GE users] Strange behavior with tight integration: no free queue for job

jlopez jlopez at cesga.es
Wed Nov 19 17:23:44 GMT 2008


Hi all,

Today we saw a strange issue with an MPI job that uses tight 
integration. The job started, but it finished after less than 5 seconds.

The qmaster log shows the following messages:

11/19/2008 10:04:57|qmaster|cn142|E|execd at cn099.null reports running job (716631.1/1.cn099) in queue "large_queue at cn099.null" that was not supposed to be there - killing
11/19/2008 10:05:37|qmaster|cn142|E|execd at cn099.null reports running job (716631.1/1.cn099) in queue "large_queue at cn099.null" that was not supposed to be there - killing


Looking at the execd log of the node that was running the "non-existent" 
job, we see:

11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez at cn032.null (localhost = cn099.null)
11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez at cn032.null (localhost = cn099.null)
11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez at cn032.null (localhost = cn099.null)
11/19/2008 10:04:57|execd|cn099|E|can't remove directory "active_jobs/716631.1": ==================== recursive_rmdir() failed
11/19/2008 10:05:37|execd|cn099|E|ja-task "716631.1" is unknown - reporting it to qmaster
11/19/2008 10:06:17|execd|cn099|E|acknowledge for unknown job 716631.1/master
11/19/2008 10:06:17|execd|cn099|E|incorrect config file for job 716631.1
11/19/2008 10:06:17|execd|cn099|E|can't remove directory "active_jobs/716631.1": opendir(active_jobs/716631.1) failed: No such file or directory


Looking at the output of the job, we see that the job started on the 
MASTER node and tried to launch all its slave processes using qrsh. For 
some unknown reason, node cn099 was unable to schedule the qrsh because 
of "no free queue" (I do not understand why), and the job failed. One 
minute later, it seems that the qrsh process started on the SLAVE node 
cn099; the qmaster saw it and decided to kill it because the job had 
already finished.
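
For anyone who wants to reproduce this kind of analysis, here is a minimal Python sketch that merges qmaster and execd messages for one job into a single timeline. It assumes only the pipe-separated layout visible in the log lines quoted above (timestamp|daemon|host|level|message); the function names parse_line and timeline are hypothetical helpers of mine, not part of Grid Engine, and the sample lines are copied from the logs in this message.

```python
from datetime import datetime

def parse_line(line):
    """Split one 'timestamp|daemon|host|level|message' log line into its fields."""
    stamp, daemon, host, level, message = line.strip().split("|", 4)
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return when, daemon, host, level, message

def timeline(lines, job_id):
    """Return the entries that mention job_id, ordered by timestamp."""
    events = []
    for line in lines:
        # Skip lines that do not follow the assumed five-field layout
        # or that do not mention the job we are interested in.
        if line.count("|") < 4 or job_id not in line:
            continue
        events.append(parse_line(line))
    # sorted() is stable, so entries with equal timestamps keep input order.
    return sorted(events, key=lambda event: event[0])

# Sample lines copied from the qmaster and execd logs quoted in this message.
SAMPLE = """
11/19/2008 10:04:57|qmaster|cn142|E|execd at cn099.null reports running job (716631.1/1.cn099) in queue "large_queue at cn099.null" that was not supposed to be there - killing
11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez at cn032.null (localhost = cn099.null)
11/19/2008 10:04:57|execd|cn099|E|can't remove directory "active_jobs/716631.1": ==================== recursive_rmdir() failed
11/19/2008 10:05:37|execd|cn099|E|ja-task "716631.1" is unknown - reporting it to qmaster
""".strip().splitlines()

if __name__ == "__main__":
    for when, daemon, host, level, message in timeline(SAMPLE, "716631"):
        print(when.strftime("%H:%M:%S"), daemon, host, message)
```

In practice the input lines would come from the messages files of the qmaster and of the execd on cn099, concatenated in any order, since the sort puts them back in time order.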

Do you know what could cause a "no free queue for job" error on a 
slave node when the task is submitted via qrsh -inherit?

Thanks in advance,
Javier

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89134
