[GE users] Strange behavior with tight integration: no free queue for job

reuti reuti at staff.uni-marburg.de
Fri Nov 21 13:48:03 GMT 2008


On 20.11.2008, at 16:29, jlopez wrote:

> reuti wrote:
>>
>> Hi Javier, On 20.11.2008, at 11:13, jlopez wrote:
>>>
>>> Hi Reuti, reuti wrote:
>>>>
>>>> Hi, On 19.11.2008, at 18:23, jlopez wrote:
>>>>>
>>>>> Hi all, today we have seen a strange issue with an MPI
>>>> which MPI implementation?
>>> HP-MPI
>> which version - at least 2.2.5?
>  Yes, we are using: HP MPI 02.02.05.01 Linux IA64
>>>>> job that uses tight integration. The job started, but finished
>>>>> after less than 5 seconds.
>>>> What mpirun/mpiexec syntax did you use?
>>> The mpirun is launched with the following options: mpirun -prot
>>> -hostfile $TMPDIR/machines, and the environment variable MPI_REMSH
>>> is set to "$TMPDIR/rsh" to use the tight integration. The rsh
>>> wrapper prints the qrsh commands it launches, and these were the
>>> qrsh commands executed:
>> Is this just one mpirun below?
> Probably not; in this case mpirun is called indirectly by a Berkeley
> UPC program, so it seems it internally generates several calls to
> mpirun.
>> Usually HP-MPI collects the tasks for every node (even when there
>> are several lines) and makes only one rsh call; the others are
>> created as threads. (At least, this is my observation - we only have
>> executables of our applications with embedded HP-MPI.)
> Yes, I have seen the same behavior when running HP-MPI directly,
> just one rsh per node. In this case it is a bit tricky because the
> mpirun calls are generated internally by UPC.
>> master node = 10.128.1.32, slave nodes = 4*10.128.1.12 /
>> 4*10.128.1.40 / 1*10.128.1.99 - was this the intended allocation
>> with 10 slots?
> The actual job allocation (according to the logs) was 4 nodes with
> num_proc=8: .32, .12, .40 and .99, where .32 was the master and the
> others were slaves (the job requested 4 MPI slots with num_proc=8).
>
> The comment you make about the connections is very interesting. I do
> not know why it makes 4 connections to each node (it could be a
> reconnection attempt), but it is very surprising that .99 is the only
> node where it makes just 1 connection. The only reason I can see is
> the "no free queues" problem on this node. Do you know what happens
> when you get a "no free queues" message in one execd? Does the qrsh
> command hang on the master until it is scheduled?

As many mpirun calls are used in your setup, maybe a previous task
(which should already have left the node) was still active on node
cn099.
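Something you could check on cn099 the next time it happens, e.g. (the
execd spool path is only a guess for your installation):

   ps -ef | egrep 'sge_shepherd|mpid'      # shepherds/mpid left over from an earlier job?
   qstat -f -q large_queue@cn099           # are all slots of this queue instance occupied?
   ls <execd spool of cn099>/active_jobs   # stale active_jobs directories?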

Is this happening all the time or only for certain jobs?

-- Reuti


> Thanks,
> Javier
>> -- Reuti
>>>
>>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122 /opt/hpmpi/bin/mpid 2 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122 /opt/hpmpi/bin/mpid 6 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122 /opt/hpmpi/bin/mpid 10 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40 /opt/hpmpi/bin/mpid 13 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40 /opt/hpmpi/bin/mpid 1 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.99 /opt/hpmpi/bin/mpid 15 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122 /opt/hpmpi/bin/mpid 14 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40 /opt/hpmpi/bin/mpid 9 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>>
>>> It seems the one corresponding to 10.128.1.99 (cn099) did not
>>> respond in time and the job exited. The rest of the qrsh commands
>>> responded, because I can see the corresponding accounting lines in
>>> the log.
>>>>
>>>> - was the machinefile honored?
>>> Yes, it was honored, as shown by the fact that the qrsh commands
>>> were launched on the 4 nodes of the job.
>>>>
>>>> - what is the value of job_is_first_task in the PE setting?
>>> job_is_first_task FALSE
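(For comparison, a tight-integration PE for HP-MPI is usually defined
roughly as below; apart from job_is_first_task, all values here are
assumptions, not the poster's actual configuration:)

   $ qconf -sp hpmpi
   pe_name            hpmpi
   slots              999
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    /bin/true
   stop_proc_args     /bin/true
   allocation_rule    1
   control_slaves     TRUE
   job_is_first_task  FALSE
   urgency_slots      min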
>>>>
>>>> -- Reuti
>>> Thanks, Javier
>>>>
>>>>> The messages in the qmaster are the following:
>>>>>
>>>>> 11/19/2008 10:04:57|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
>>>>> 11/19/2008 10:05:37|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
>>>>>
>>>>> And looking at the log of the node that is running the "non-existing" job we see:
>>>>>
>>>>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>>>>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>>>>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>>>>> 11/19/2008 10:04:57|execd|cn099|E|can't remove directory "active_jobs/716631.1": ==================== recursive_rmdir() failed
>>>>> 11/19/2008 10:05:37|execd|cn099|E|ja-task "716631.1" is unknown - reporting it to qmaster
>>>>> 11/19/2008 10:06:17|execd|cn099|E|acknowledge for unknown job 716631.1/master
>>>>> 11/19/2008 10:06:17|execd|cn099|E|incorrect config file for job 716631.1
>>>>> 11/19/2008 10:06:17|execd|cn099|E|can't remove directory "active_jobs/716631.1": opendir(active_jobs/716631.1) failed: No such file or directory
>>>>>
>>>>> Analysing the situation by looking at the output of the job, we
>>>>> see that the job started on the MASTER node and tried to launch
>>>>> all its slave processes using qrsh. For some unknown reason the
>>>>> node cn099 was unable to schedule the qrsh because of "no free
>>>>> queue" (I do not understand why), and the job failed. One minute
>>>>> later it seems that the qrsh process started on the cn099 SLAVE
>>>>> node; the qmaster saw it and decided to kill it because the job
>>>>> had already finished.
>>>>>
>>>>> Do you know what could be the reason for a "no free queue for
>>>>> job" in a slave node when the task is submitted via qrsh -inherit?
>>>>>
>>>>> Thanks in advance,
>>>>> Javier
