[GE users] Strange behavior with tight integration: no free queue for job

jlopez jlopez at cesga.es
Thu Nov 20 15:29:05 GMT 2008


reuti wrote:
> Hi Javier,
>
>
> On 20.11.2008 at 11:13, jlopez wrote:
>
>   
>> Hi Reuti,
>>
>> reuti wrote:
>>     
>>> Hi,
>>>
>>> On 19.11.2008 at 18:23, jlopez wrote:
>>>
>>>
>>>       
>>>> Hi all,
>>>>
>>>> Today we have seen a strange issue with an MPI
>>>>
>>>>         
>>> which MPI implementation?
>>>
>>>
>>>       
>> HP-MPI
>>     
>
> which version - at least 2.2.5?
>
>   
Yes, we are using HP-MPI 02.02.05.01 on Linux IA64.
>>>> job that uses tight integration. The job started but after less
>>>> than 5 seconds it finished.
>>>>
>>>>         
>>> What mpirun/mpiexec syntax did you use?
>>>
>>>
>>>       
>> The mpirun is launched with the following options:
>> mpirun -prot -hostfile $TMPDIR/machines
>>
>> and the environment variable MPI_REMSH is set to "$TMPDIR/rsh" to use
>> the tight integration.
>>
>> The rsh wrapper prints the qrsh commands it launches, and these were the
>> qrsh commands executed:
>>     
>
> Is this just one mpirun below? 
Probably not. In this case mpirun is called indirectly by a Berkeley
UPC program, so it seems that internally it generates several calls to mpirun.
> Usually HP-MPI collects the tasks for
> every node (even when there are several lines) and calls rsh only
> once per node; the others are created as threads. (At least, this is my
> observation - we have only executables of our applications with
> embedded HP-MPI.)
>
>   
Yes, I have seen the same behavior when running HP-MPI directly: just
one rsh per node. In this case it is a bit tricky because the mpirun
calls are generated internally by UPC.
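
For reference, the $TMPDIR/rsh wrapper follows the usual SGE tight-integration pattern: take the host name HP-MPI passes as the first argument and hand the rest of the command line to qrsh -inherit. A minimal sketch of such a wrapper (the log file name is illustrative, and the real wrapper also has to cope with any rsh options that may precede the host name):

#!/bin/sh
# $TMPDIR/rsh - set as MPI_REMSH so that HP-MPI starts its remote mpid
# daemons under SGE control instead of via plain rsh.
# HP-MPI invokes it as: rsh <host> <command> [args ...]
host=$1
shift
# print the qrsh command before running it (log path is illustrative)
echo "qrsh -inherit -nostdin $host $*" >> "$TMPDIR/rsh_wrapper.log"
exec /opt/cesga/sge61/bin/lx24-ia64/qrsh -inherit -nostdin "$host" "$@"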
> master node=10.128.1.32
> slave nodes=4*10.128.1.12 / 4*10.128.1.40 / 1*10.128.1.99
>
> This was the intended allocation with 10 slots?
>   
The actual job allocation (according to the logs) was 4 nodes with
num_proc 8: .32, .12, .40 and .99, where .32 was the master and the
others were slaves (the job requested 4 MPI slots with num_proc=8).
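
For context, the submission is along these lines (the PE name, the script name and setting OMP_NUM_THREADS on the qsub line are illustrative, not the exact site configuration):

qsub -pe hpmpi 4 -l num_proc=8 -v OMP_NUM_THREADS=8 ./upc_job.sh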

The comment you make about the connections is very interesting. I do
not know why it makes 4 connections to each node (it could be a
reconnection attempt), but it is very surprising that .99 is the only
node where it makes just 1 connection. The only reason I can see is
the "no free queue" problem on this node. Do you know what happens
when an execd reports a "no free queue" message? Does the qrsh command
hang on the master until it can be scheduled?
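
In case it helps to correlate, this is roughly how I am checking it on our side (assuming the default cell name and spool layout; the spool path is illustrative for our installation):

# execd messages on the node that reported "no free queue"
grep 716631 $SGE_ROOT/$SGE_CELL/spool/cn099/messages

# accounting records for the qrsh tasks that did start
qacct -j 716631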

Thanks,
Javier
> -- Reuti
>
>
>   
>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122 /opt/hpmpi/bin/mpid 2 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>
>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122 /opt/hpmpi/bin/mpid 6 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>
>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122 /opt/hpmpi/bin/mpid 10 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>
>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40 /opt/hpmpi/bin/mpid 13 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>
>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40 /opt/hpmpi/bin/mpid 1 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>
>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.99 /opt/hpmpi/bin/mpid 15 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>
>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40 /opt/hpmpi/bin/mpid 5 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>
>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122 /opt/hpmpi/bin/mpid 14 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>
>> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40 /opt/hpmpi/bin/mpid 9 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>>
>> It seems the one corresponding to 10.128.1.99 (cn099) did not respond
>> in time and the job exited. The rest of the qrsh commands responded,
>> because I can see the corresponding accounting lines in the log.
>>     
>>> - was the machinefile honored?
>>>
>>>       
>> Yes, it was honored, as shown by the fact that the qrsh commands were
>> launched on the 4 nodes of the job.
>>     
>>> - what is the value of job_is_first_task in the PE setting?
>>>
>>>       
>> job_is_first_task FALSE
>>
>>     
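
For completeness, the PE used for the tight integration is configured roughly as below (qconf -sp output; the PE name, slot count and allocation_rule are illustrative, the relevant entries being control_slaves TRUE and job_is_first_task FALSE):

qconf -sp hpmpi
pe_name            hpmpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    1
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min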
>>> -- Reuti
>>>
>>>
>>>       
>> Thanks,
>> Javier
>>     
>>>> The messages in the qmaster are the following:
>>>> 11/19/2008 10:04:57|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
>>>> 11/19/2008 10:05:37|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
>>>>
>>>> And looking at the log of the node that is running the "non-existing" job we see:
>>>>
>>>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>>>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>>>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>>>> 11/19/2008 10:04:57|execd|cn099|E|can't remove directory "active_jobs/716631.1": ==================== recursive_rmdir() failed
>>>> 11/19/2008 10:05:37|execd|cn099|E|ja-task "716631.1" is unknown - reporting it to qmaster
>>>> 11/19/2008 10:06:17|execd|cn099|E|acknowledge for unknown job 716631.1/master
>>>> 11/19/2008 10:06:17|execd|cn099|E|incorrect config file for job 716631.1
>>>> 11/19/2008 10:06:17|execd|cn099|E|can't remove directory "active_jobs/716631.1": opendir(active_jobs/716631.1) failed: No such file or directory
>>>> Analysing the situation by looking at the output of the job, we see
>>>> that the job started on the MASTER node and tried to launch all its
>>>> slave processes using qrsh. For some unknown reason the node cn099
>>>> was unable to accept the qrsh because of "no free queue" (I do not
>>>> understand why) and the job failed. One minute later the qrsh
>>>> process apparently started on the cn099 SLAVE node, and the qmaster
>>>> saw it and decided to kill it because the job had already finished.
>>>>
>>>> Do you know what could be the reason for a "no free queue for job"
>>>> error on a slave node when the task is submitted via qrsh -inherit?
>>>>
>>>> Thanks in advance,
>>>> Javier
>>>>
>>>>         
>
>




