[GE users] Strange behavior with tight integration: no free queue for job

jlopez jlopez at cesga.es
Thu Nov 20 10:13:30 GMT 2008


Hi Reuti,
 
Reuti wrote:
> Hi,
>
> Am 19.11.2008 um 18:23 schrieb jlopez:
>
>   
>> Hi all,
>>
>> Today we have seen a strange issue with an MPI
>>     
>
> which MPI implementation?
>
>   
HP-MPI
>> job that uses tight integration. The job started but finished after
>> less than 5 seconds.
>>     
>
> What mpirun/mpiexec syntax did you use?
>
>   
mpirun is launched with the following options:

mpirun -prot -hostfile $TMPDIR/machines

and the environment variable MPI_REMSH is set to "$TMPDIR/rsh" to enable
tight integration.
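For context, the wrapper is the usual rsh-to-qrsh shim. A simplified
sketch follows (not our exact script; argument handling is reduced to
the basics, and qrsh is assumed to be in $PATH):

#!/bin/sh
# Sketch of a $TMPDIR/rsh wrapper for HP-MPI tight integration.
# HP-MPI invokes MPI_REMSH as:  <remsh> <host> <command> [args ...]
host=$1
shift
# Print the command so it ends up in the job output (that is where
# the qrsh lines below come from), then exec qrsh so the mpid slave
# runs under the control of sge_execd on the remote node.
echo "qrsh -v LOADEDMODULES -v OMP_NUM_THREADS -inherit -nostdin $host $*" >&2
exec qrsh -v LOADEDMODULES -v OMP_NUM_THREADS -inherit -nostdin "$host" "$@"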

The rsh wrapper prints the qrsh commands it launches, and these were the
qrsh commands executed:

/opt/cesga/sge61/bin/lx24-ia64/qrsh \
  -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi \
  -v OMP_NUM_THREADS=8 -inherit -nostdin \
  10.128.1.122 /opt/hpmpi/bin/mpid 2 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi

/opt/cesga/sge61/bin/lx24-ia64/qrsh \
  -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi \
  -v OMP_NUM_THREADS=8 -inherit -nostdin \
  10.128.1.122 /opt/hpmpi/bin/mpid 6 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi

/opt/cesga/sge61/bin/lx24-ia64/qrsh \
  -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi \
  -v OMP_NUM_THREADS=8 -inherit -nostdin \
  10.128.1.122 /opt/hpmpi/bin/mpid 10 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi

/opt/cesga/sge61/bin/lx24-ia64/qrsh \
  -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi \
  -v OMP_NUM_THREADS=8 -inherit -nostdin \
  10.128.1.40 /opt/hpmpi/bin/mpid 13 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi

/opt/cesga/sge61/bin/lx24-ia64/qrsh \
  -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi \
  -v OMP_NUM_THREADS=8 -inherit -nostdin \
  10.128.1.40 /opt/hpmpi/bin/mpid 1 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi

/opt/cesga/sge61/bin/lx24-ia64/qrsh \
  -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi \
  -v OMP_NUM_THREADS=8 -inherit -nostdin \
  10.128.1.99 /opt/hpmpi/bin/mpid 15 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi

/opt/cesga/sge61/bin/lx24-ia64/qrsh \
  -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi \
  -v OMP_NUM_THREADS=8 -inherit -nostdin \
  10.128.1.40 /opt/hpmpi/bin/mpid 5 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi

/opt/cesga/sge61/bin/lx24-ia64/qrsh \
  -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi \
  -v OMP_NUM_THREADS=8 -inherit -nostdin \
  10.128.1.122 /opt/hpmpi/bin/mpid 14 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi

/opt/cesga/sge61/bin/lx24-ia64/qrsh \
  -v LOADEDMODULES=icc/9.1.052:ifort/9.1.052:mkl/9.1:intel/9:hp-mpi \
  -v OMP_NUM_THREADS=8 -inherit -nostdin \
  10.128.1.40 /opt/hpmpi/bin/mpid 9 0 33686785 10.128.1.32 60303 26917 /opt/hpmpi

It seems the one corresponding to 10.128.1.99 (cn099) did not respond in
time and the job exited. The rest of the qrsh commands did respond: I can
see the corresponding accounting lines in the log.
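(For the record, checking this amounts to listing the job's accounting
records, e.g.:

qacct -j 716631

which prints one record per task of the job from the accounting file;
every qrsh task except the one on cn099 shows up there.)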
> - was the machinefile honored?
>   
Yes, it was honored, as shown by the fact that the qrsh commands were
launched on the 4 nodes of the job.
> - what is the value of job_is_first_task in the PE setting?
>   
job_is_first_task FALSE
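The rest of the PE is the usual tight-integration setup. Sketched here,
with the PE name, slots and allocation_rule as placeholders:

$ qconf -sp hpmpi                # "hpmpi" is a placeholder name
pe_name            hpmpi
slots              512           # placeholder
allocation_rule    $fill_up      # placeholder
control_slaves     TRUE          # lets sge_execd accept qrsh -inherit
job_is_first_task  FALSE         # mpirun itself is not counted as a task,
                                 # so every mpid is started via qrsh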

> -- Reuti
>
>   
Thanks,
Javier
>   
>> The messages in the qmaster are the following:
>>
>> 11/19/2008 10:04:57|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
>> 11/19/2008 10:05:37|qmaster|cn142|E|execd@cn099.null reports running job (716631.1/1.cn099) in queue "large_queue@cn099.null" that was not supposed to be there - killing
>> And looking at the log of the node that is running the "non-existing" job we see:
>>
>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>> 11/19/2008 10:04:25|execd|cn099|E|no free queue for job 716631 of user jlopez@cn032.null (localhost = cn099.null)
>> 11/19/2008 10:04:57|execd|cn099|E|can't remove directory "active_jobs/716631.1": ==================== recursive_rmdir() failed
>> 11/19/2008 10:05:37|execd|cn099|E|ja-task "716631.1" is unknown - reporting it to qmaster
>> 11/19/2008 10:06:17|execd|cn099|E|acknowledge for unknown job 716631.1/master
>> 11/19/2008 10:06:17|execd|cn099|E|incorrect config file for job 716631.1
>> 11/19/2008 10:06:17|execd|cn099|E|can't remove directory "active_jobs/716631.1": opendir(active_jobs/716631.1) failed: No such file or directory
>> Analysing the situation by looking at the output of the job, we see
>> that the job started on the MASTER node and tried to launch all its
>> slave processes using qrsh. For some unknown reason the qrsh on node
>> cn099 could not be scheduled because of "no free queue" (I do not
>> understand why) and the job failed. One minute later the qrsh process
>> apparently did start on the cn099 SLAVE node, and the qmaster saw it
>> and decided to kill it because the job had already finished.
>>
>> Do you know what could be the reason for a "no free queue for job"
>> error on a slave node when the task is submitted via qrsh -inherit?
>>
>> Thanks in advance,
>> Javier
>>     
>
