[GE users] PVM job thru SGE puts queue instance into E state

Reuti reuti at staff.uni-marburg.de
Thu Apr 17 16:13:30 BST 2008


Hi,

On 17.04.2008 at 08:05, Sangamesh B wrote:

> I have a query about a PVM job submitted through SGE, with both loose
> and tight integration.
>
> PVM is installed with PVM_ROOT=/opt/MPI_LIBS/pvm3 and PVM_ARCH=LINUX64.
>
> I downloaded the pvm tarball from Reuti's Grid Engine Howtos and placed
> it in /opt/gridengine.
>
> The cluster has one master node and one compute node with two dual-core
> Opteron (amd64) processors.
>
> A 4-slot job runs on the compute node the first time it is submitted. If
> the same job is submitted again, SGE schedules it to the compute node,
> and the compute node's queue instance then goes into 'E' state. After
> that, SGE reschedules the job onto the master node, where it runs.
>
> I don't understand why the compute node's queue instance goes into E state.
>
> These are the parallel environments configured for PVM:
>
> # qconf -sp pvmloose
> pe_name           pvmloose
> slots             40
> user_lists        locuzusers
> xuser_lists       NONE
> start_proc_args   /opt/gridengine/pvm/startpvm.sh $pe_hostfile $host \
>                   /opt/MPI_LIBS/pvm3
> stop_proc_args    /opt/gridengine/pvm/stoppvm.sh $pe_hostfile $host
> allocation_rule   4
> control_slaves    FALSE
> job_is_first_task TRUE
> urgency_slots     min
>
>
> # qconf -sp pvmtight
> pe_name           pvmtight
> slots             40
> user_lists        locuzusers
> xuser_lists       NONE
> start_proc_args   /opt/gridengine/pvm/startpvm.sh -catch_rsh $pe_hostfile \
>                   $host /opt/MPI_LIBS/pvm3
> stop_proc_args    /opt/gridengine/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host
> allocation_rule   4
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
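
A side note on the PE setup above: with allocation_rule 4 the scheduler
grants exactly 4 slots per host, so a 4-slot job always lands completely
on one node, while an 8-slot job is split 4/4 across both machines. The
$pe_hostfile passed to startpvm.sh lists the granted hosts in the usual
Grid Engine format (host, slots, queue instance, processor range). For a
4-slot allocation on the compute node it should look roughly like the
sketch below (the exact contents are an assumption based on the hostnames
in this thread, not taken from the actual cluster):

$ cat /opt/gridengine/default/spool/compute-0-0/active_jobs/23.1/pe_hostfile
compute-0-0.local 4 all.q@compute-0-0.local UNDEFINED
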
> The job script is as follows:
>
> $ cat loose.sh
> #!/bin/bash
>
> #$ -S /bin/bash
>
> #$ -N PVMLOOSE
>
> #$ -cwd
>
> #$ -q all.q
>
> #$ -e Err-$JOB_NAME-$JOB_ID
>
> #$ -o Out-$JOB_NAME-$JOB_ID
>
> /home/sangamesh/pvm3/bin/LINUX64/spmd $NSLOTS
>
> exit 0
>
>
> $ qsub -pe pvmloose 4 loose.sh
> Your job 23 ("PVMLOOSE") has been submitted
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4       0.00     lx26-amd64    E
> ----------------------------------------------------------------------------
> all.q@locuzcluster.org         BIP   0/4       0.00     lx26-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>      23 0.55500 PVMLOOSE   sangamesh    qw    04/17/2008 10:23:31     4
>
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4       0.00     lx26-amd64    E
> ----------------------------------------------------------------------------
> all.q@locuzcluster.org         BIP   4/4       0.00     lx26-amd64
>      23 0.55500 PVMLOOSE   sangamesh    r     04/17/2008 10:23:49     4
>
>
> $ cat Out-PVMLOOSE-23
> /opt/gridengine/default/spool/compute-0-0/active_jobs/23.1/pe_hostfile compute-0-0.local /opt/MPI_LIBS/pvm3
> /opt/gridengine/default/spool/compute-0-0/active_jobs/23.1/pe_hostfile compute-0-0.local
> /opt/gridengine/default/spool/locuzcluster/active_jobs/23.1/pe_hostfile locuzcluster.org /opt/MPI_LIBS/pvm3

The job was rescheduled and ran successfully the second time, but on the
headnode. Is there anything in the Err file? Or in the messages file of
the node compute-0-0.local?
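
For reference, a few commands that usually reveal the reason for the E
state (the messages path below is assumed from the spool directory shown
in the job output, and the -explain switch needs a 6.x qstat):

$ qstat -f -explain E                    # prints the stored error reason per queue instance
$ tail -50 /opt/gridengine/default/spool/compute-0-0/messages    # execd log on the node
$ qacct -j 23                            # accounting record: failed / exit_status fields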

-- Reuti

>
> hostfile in TMPDIR /tmp/23.1.all.q/hostfile
> /tmp/pvmtmp013483.0
> start_pvm: enrolled to local pvmd
> start_pvm: got 1 hosts
> Pass a token through the   1 tid ring:
> 262146 -> 262146
> token ring done
> /opt/gridengine/default/spool/locuzcluster/active_jobs/23.1/pe_hostfile locuzcluster.org
>
> An 8-slot job runs as follows:
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   4/4       0.02     lx26-amd64
>      24 0.55500 PVMLOOSE   sangamesh    r     04/17/2008 10:48:57     4
> ----------------------------------------------------------------------------
> all.q@locuzcluster.org         BIP   4/4       0.02     lx26-amd64
>      24 0.55500 PVMLOOSE   sangamesh    r     04/17/2008 10:48:57     4
>
> $ cat Out-PVMLOOSE-24
> /opt/gridengine/default/spool/locuzcluster/active_jobs/24.1/pe_hostfile locuzcluster.org /opt/MPI_LIBS/pvm3
> hostfile in TMPDIR /tmp/24.1.all.q/hostfile
> /tmp/pvmtmp013584.0
> start_pvm: enrolled to local pvmd
> start_pvm: got 2 hosts
> Pass a token through the   1 tid ring:
> 262146 -> 262146
> token ring done
> /opt/gridengine/default/spool/locuzcluster/active_jobs/24.1/pe_hostfile locuzcluster.org
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4       0.10     lx26-amd64
> ----------------------------------------------------------------------------
> all.q@locuzcluster.org         BIP   0/4       0.07     lx26-amd64
>
>
> After this, a job with any number of slots puts the queue instance
> all.q@compute-0-0.local (and only that one) into E state.
>
> Has any one of you faced such a problem?
> How can it be resolved?
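
Once the underlying problem is fixed, the E state does not clear by
itself; it has to be cleared manually with qmod. Something along these
lines should do it (older releases use -c, 6.x also accepts -cq; use
whichever form your version supports):

$ qmod -cq all.q@compute-0-0.local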
>
>
> Thanks in advance,
> Sangamesh





