[GE users] Strange behavior with tight integration: no free queue for job

reuti reuti at staff.uni-marburg.de
Thu Nov 20 11:32:49 GMT 2008


Hi Javier,


On 20.11.2008, at 11:13, jlopez wrote:

> Hi Reuti,
>
> reuti wrote:
>> Hi,
>>
>> On 19.11.2008, at 18:23, jlopez wrote:
>>
>>
>>> Hi all,
>>>
>>> Today we have seen a strange issue with an MPI
>>>
>>
>> which MPI implementation?
>>
>>
> HP-MPI

Which version - at least 2.2.5?

>>> job that uses tight integration. The job started but after less
>>> than 5 seconds it finished.
>>>
>>
>> What mpirun/mpiexec syntax did you use?
>>
>>
> The mpirun is launched with the following options:
> mpirun -prot -hostfile $TMPDIR/machines
>
> and the environment variable MPI_REMSH is set to "$TMPDIR/rsh" to use
> the tight integration.
>
> The rsh wrapper prints the qrsh commands it launches, and these were
> the qrsh commands executed:
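Aside: such an MPI_REMSH wrapper normally does nothing more than turn
the "rsh <host> <command>" call coming from mpirun into a qrsh
-inherit call. A minimal sketch of what a $TMPDIR/rsh wrapper
typically contains (not your actual file - the real one created by
the PE start script may additionally forward environment variables
with -v or strip rsh options like "-n"):

#!/bin/sh
# Minimal tight-integration rsh wrapper (sketch).
# mpirun calls it as: rsh <host> <command ...>
host=$1
shift
# Log what is about to be run - this produces the qrsh lines shown below.
echo "qrsh -inherit -nostdin $host $*" >&2
# Start the task under control of the execd on the slave node.
exec /opt/cesga/sge61/bin/lx24-ia64/qrsh -inherit -nostdin "$host" "$@"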

Is this all from just one mpirun below? Usually HP-MPI collects the
tasks for each node (even when there are several lines) and calls rsh
only once per node; the remaining tasks are created as threads. (At
least, this is my observation - we only have executables of our
applications with embedded HP-MPI.)

master node=10.128.1.32
slave nodes=4*10.128.1.122 / 4*10.128.1.40 / 1*10.128.1.99

Was this the intended allocation with 10 slots?
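If in doubt, the granted allocation can be recorded at the start of
the job script by dumping $PE_HOSTFILE - a small sketch, assuming the
machines file is built from it the way the standard PE start scripts
do it:

# Show what the scheduler granted: one line per queue instance,
# format: <hostname> <slots> <queue> <processor range>
cat $PE_HOSTFILE

# Build the HP-MPI machines file from it, one entry per granted slot
# (sketch of what a PE start script typically does):
awk '{ for (i = 0; i < $2; i++) print $1 }' $PE_HOSTFILE > $TMPDIR/machines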

-- Reuti


> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v
> LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v
> OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122
> /opt/hpmpi/bin/mpid 2 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>
> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v
> LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v
> OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122
> /opt/hpmpi/bin/mpid 6 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>
> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v
> LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v
> OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122
> /opt/hpmpi/bin/mpid 10 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>
> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v
> LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v
> OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40
> /opt/hpmpi/bin/mpid 13 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>
> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v
> LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v
> OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40
> /opt/hpmpi/bin/mpid 1 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>
> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v
> LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v
> OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.99
> /opt/hpmpi/bin/mpid 15 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>
> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v
> LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v
> OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40
> /opt/hpmpi/bin/mpid 5 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>
> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v
> LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v
> OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.122
> /opt/hpmpi/bin/mpid 14 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>
> /opt/cesga/sge61/bin/lx24-ia64/qrsh -v
> LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi -v
> OMP_NUM_THREADS=8 -inherit -nostdin 10.128.1.40
> /opt/hpmpi/bin/mpid 9 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi
>
> It seems the one corresponding to 10.128.1.99 (cn099) did not respond
> in time and the job exited. The rest of the qrsh commands did respond,
> as I can see the corresponding accounting lines in the log.
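Good - this can also be cross-checked after the job is gone with
qacct: every qrsh -inherit task that actually started gets its own
accounting record (example with your job id):

# One record for the master task plus one per started qrsh -inherit
# task; check the hostname, taskid, failed and exit_status fields.
qacct -j 716631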
>> - was the machinefile honored?
>>
> Yes, it was available, as shown by the fact that the qrsh commands
> were launched on the 4 nodes of the job.
>> - what is the value of job_is_first_task in the PE setting?
>>
> job_is_first_task FALSE
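OK. With job_is_first_task FALSE the job script (i.e. the mpirun call
on the master node) is not counted as one of the parallel tasks, so
SGE accepts as many qrsh -inherit tasks as there are granted slots.
For reference, a PE for this kind of tight integration usually looks
like the following sketch - the name and most values are examples;
the relevant settings are control_slaves TRUE (required for qrsh
-inherit) and job_is_first_task:

$ qconf -sp hpmpi
pe_name            hpmpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min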
>
>> -- Reuti
>>
>>
> Thanks,
> Javier
>>
>>> The messages in the qmaster are the following:
>>>
>>> 11/19/2008 10:04:57|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
>>> 11/19/2008 10:05:37|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
>>>
>>> And looking at the log of the node that is running the "non-
>>> existing" job we see:
>>>
>>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>>> 11/19/2008 10:04:57|execd|cn099|E|can't remove directory "active_jobs/716631.1": ==================== recursive_rmdir() failed
>>> 11/19/2008 10:05:37|execd|cn099|E|ja-task "716631.1" is unknown - reporting it to qmaster
>>> 11/19/2008 10:06:17|execd|cn099|E|acknowledge for unknown job 716631.1/master
>>> 11/19/2008 10:06:17|execd|cn099|E|incorrect config file for job 716631.1
>>> 11/19/2008 10:06:17|execd|cn099|E|can't remove directory "active_jobs/716631.1": opendir(active_jobs/716631.1) failed: No such file or directory
>>> Analysing the situation by looking at the output of the job, we see
>>> that the job started on the MASTER node and tried to launch all its
>>> slave processes using qrsh. For some unknown reason the node cn099
>>> was unable to schedule the qrsh task because of "no free queue" (I
>>> do not understand why) and the job failed. One minute later it seems
>>> that the qrsh task started on the cn099 SLAVE node anyway, and the
>>> qmaster saw it and decided to kill it because the job had already
>>> finished.
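For the next time it happens: while such a job is still running you
can check which master/slave tasks the qmaster thinks it granted on
which queue instance, for example:

# Extended listing: one MASTER entry plus one SLAVE entry per granted
# slot for each queue instance of the parallel job.
qstat -g t -u jlopez

# Scheduling details of the job itself (granted PE, hosts, messages):
qstat -j 716631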
>>>
>>> Do you know what could be the reason for a "no free queue for job"
>>> error on a slave node when the task is submitted via qrsh -inherit?
>>>
>>> Thanks in advance,
>>> Javier
>>>
>>
