[GE users] Strange behavior with tight integration: no free queue for job

reuti reuti at staff.uni-marburg.de
Wed Nov 19 19:36:38 GMT 2008


Hi,

On 19.11.2008 at 18:23, jlopez wrote:

> Hi all,
>
> Today we have seen a strange issue with an MPI

Which MPI implementation?

> job that uses tight integration. The job started but after less  
> than 5 seconds it finished.

What mpirun/mpiexec syntax did you use?

- was the machinefile honored?
- what is the value of job_is_first_task in the PE setting?
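
Both PE settings can be read from the output of `qconf -sp <pe_name>`. A minimal Python sketch of such a check follows; it is not a definitive tool. The PE name "mpi" and the sample text are hypothetical, and the notes it prints only restate what the two attributes `control_slaves` and `job_is_first_task` mean for a tightly integrated job.

```python
import subprocess

# Hypothetical sample of `qconf -sp mpi` output (values are made up).
SAMPLE = """pe_name            mpi
slots              999
allocation_rule    $round_robin
control_slaves     FALSE
job_is_first_task  TRUE
"""


def fetch_pe_config(pe_name):
    """Return the raw text printed by `qconf -sp <pe_name>` (needs SGE installed)."""
    result = subprocess.run(["qconf", "-sp", pe_name],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_pe_config(text):
    """Parse "key value" lines, as printed by `qconf -sp`, into a dict."""
    config = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            config[parts[0]] = parts[1].strip()
    return config


def tight_integration_notes(config):
    """Collect remarks about the two settings relevant to qrsh -inherit."""
    notes = []
    if config.get("control_slaves", "").upper() != "TRUE":
        notes.append("control_slaves is not TRUE: "
                     "tight integration via qrsh -inherit needs it")
    if config.get("job_is_first_task", "").upper() == "TRUE":
        notes.append("job_is_first_task is TRUE: "
                     "the job script itself counts as one parallel task")
    return notes


if __name__ == "__main__":
    for note in tight_integration_notes(parse_pe_config(SAMPLE)):
        print(note)
```

On a real cluster, `parse_pe_config(fetch_pe_config("mpi"))` would replace the sample text, with "mpi" swapped for the PE the job actually requested.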

-- Reuti


> The messages in the qmaster are the following:
>
> 11/19/2008 10:04:57|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
> 11/19/2008 10:05:37|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
>
> And looking at the log of the node that is running the "non-existing" job we see:
>
> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
> 11/19/2008 10:04:57|execd|cn099|E|can't remove directory "active_jobs/716631.1": ==================== recursive_rmdir() failed
> 11/19/2008 10:05:37|execd|cn099|E|ja-task "716631.1" is unknown - reporting it to qmaster
> 11/19/2008 10:06:17|execd|cn099|E|acknowledge for unknown job 716631.1/master
> 11/19/2008 10:06:17|execd|cn099|E|incorrect config file for job 716631.1
> 11/19/2008 10:06:17|execd|cn099|E|can't remove directory "active_jobs/716631.1": opendir(active_jobs/716631.1) failed: No such file or directory
>
> Looking at the job's output, we see that the job started on the
> MASTER node and tried to launch all its slave processes using qrsh.
> For some unknown reason node cn099 was unable to schedule the qrsh
> because of "no free queue" (I do not understand why), and the job
> failed. One minute later it seems that the qrsh process started on
> the cn099 SLAVE node; the qmaster saw it and decided to kill it
> because the job had already finished.
>
> Do you know what could cause a "no free queue for job" error on a  
> slave node when the task is submitted via qrsh -inherit?
>
> Thanks in advance,
> Javier
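
To see the order of events across both daemons, the quoted log entries can be merged into a single timeline. The sketch below assumes only what the entries themselves show: fields separated by `|` in the order timestamp, daemon, host, level, message. The three sample entries are copied from the logs quoted above, with the `@` that the list archive renders as " at " restored; the function names are made up for illustration.

```python
from datetime import datetime

# One entry from the qmaster log quoted above.
QMASTER_LOG = """
11/19/2008 10:04:57|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
"""

# Two entries from the execd log of cn099 quoted above.
EXECD_LOG = """
11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
11/19/2008 10:05:37|execd|cn099|E|ja-task "716631.1" is unknown - reporting it to qmaster
"""


def parse_line(line):
    """Split one log entry into its five pipe-separated fields."""
    stamp, daemon, host, level, message = line.split("|", 4)
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "daemon": daemon,
        "host": host,
        "level": level,
        "message": message,
    }


def timeline(*logs):
    """Merge several logs and sort the entries by timestamp."""
    entries = [parse_line(line)
               for log in logs
               for line in log.splitlines()
               if line.strip()]
    return sorted(entries, key=lambda entry: entry["time"])


if __name__ == "__main__":
    events = timeline(QMASTER_LOG, EXECD_LOG)
    start = events[0]["time"]
    for event in events:
        offset = int((event["time"] - start).total_seconds())
        print(f"+{offset:>3}s {event['daemon']:<7} {event['message'][:60]}")
```

Sorted this way, the "no free queue" error on cn099 comes first, and the qmaster's "killing" message follows it rather than preceding it.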

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89143
