[GE users] PVM job thru SGE puts queue instance into E state

RRay at semtech.com
Thu Apr 17 16:36:06 BST 2008


When the queue instance is in the E state, does qstat -explain E give you
any more information about what put it there?
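
For example, a minimal sketch (adjust the queue instance name to yours;
-explain combines with -f):

    # show the full queue listing plus the reason for the E (error) state
    qstat -f -explain E -q all.q@compute-0-0.local

Once the underlying cause is fixed, the error state normally has to be
cleared by hand before the instance is scheduled to again, e.g. with
qmod -cq all.q@compute-0-0.local.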



Reuti <reuti at staff.uni-marburg.de> wrote on 04/17/2008 11:13:30 AM:

> Hi,
> 
> On 17.04.2008, at 08:05, Sangamesh B wrote:
> 
> > I have a query about PVM jobs submitted through SGE, with both loose
> > and tight integration.
> >
> > PVM is installed with PVM_ROOT=/opt/MPI_LIBS/pvm3 and 
> > PVM_ARCH=LINUX64.
> >
> > I downloaded the PVM tar file from Reuti's gridengine HOWTOs and
> > placed it in /opt/gridengine.
> >
> > The cluster has one master node and one compute node with two
> > dual-core AMD Opteron (amd64) processors.
> >
> > A job with 4 slots runs on the compute node the first time. If the
> > same job is submitted again, SGE schedules it to the compute node,
> > and the compute node's queue instance then goes into the 'E' state.
> > After this SGE reschedules the same job onto the master node, where
> > it runs.
> >
> > I don't understand why the compute node's queue instance goes into the E state.
> >
> > These are the parallel environments (PEs) for PVM:
> >
> > # qconf -sp pvmloose
> > pe_name           pvmloose
> > slots             40
> > user_lists        locuzusers
> > xuser_lists       NONE
> > start_proc_args   /opt/gridengine/pvm/startpvm.sh $pe_hostfile $host \
> >                   /opt/MPI_LIBS/pvm3
> > stop_proc_args    /opt/gridengine/pvm/stoppvm.sh $pe_hostfile $host
> > allocation_rule   4
> > control_slaves    FALSE
> > job_is_first_task TRUE
> > urgency_slots     min
> >
> >
> > # qconf -sp pvmtight
> > pe_name           pvmtight
> > slots             40
> > user_lists        locuzusers
> > xuser_lists       NONE
> > start_proc_args   /opt/gridengine/pvm/startpvm.sh -catch_rsh \
> >                   $pe_hostfile $host /opt/MPI_LIBS/pvm3
> > stop_proc_args    /opt/gridengine/pvm/stoppvm.sh -catch_rsh \
> >                   $pe_hostfile $host
> > allocation_rule   4
> > control_slaves    TRUE
> > job_is_first_task FALSE
> > urgency_slots     min
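> >
> > (A side note on these settings, as a sketch: allocation_rule 4 makes
> > SGE grant exactly 4 slots per host for this PE, so a 4-slot job fits
> > on one node and an 8-slot job must span two. If SGE should instead
> > pack slots onto as few hosts as possible, the PE could be edited:
> >
> > # qconf -mp pvmloose
> > allocation_rule   $fill_up
> >
> > The same applies to pvmtight.)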
> >
> > The job script is as follows:
> >
> > $ cat loose.sh
> > #!/bin/bash
> >
> > #$ -S /bin/bash
> >
> > #$ -N PVMLOOSE
> >
> > #$ -cwd
> >
> > #$ -q all.q
> >
> > #$ -e Err-$JOB_NAME-$JOB_ID
> >
> > #$ -o Out-$JOB_NAME-$JOB_ID
> >
> > /home/sangamesh/pvm3/bin/LINUX64/spmd $NSLOTS
> >
> > exit 0
> >
> >
> > $ qsub -pe pvmloose 4 loose.sh
> > Your job 23 ("PVMLOOSE") has been submitted
> >
> > $ qstat -f
> > queuename                      qtype used/tot. load_avg arch          states
> > ----------------------------------------------------------------------------
> > all.q@compute-0-0.local        BIP   0/4       0.00     lx26-amd64    E
> > ----------------------------------------------------------------------------
> > all.q@locuzcluster.org         BIP   0/4       0.00     lx26-amd64
> >
> > ############################################################################
> >  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> > ############################################################################
> >      23 0.55500 PVMLOOSE   sangamesh    qw    04/17/2008 10:23:31     4
> >
> >
> > $ qstat -f
> > queuename                      qtype used/tot. load_avg arch          states
> > ----------------------------------------------------------------------------
> > all.q@compute-0-0.local        BIP   0/4       0.00     lx26-amd64    E
> > ----------------------------------------------------------------------------
> > all.q@locuzcluster.org         BIP   4/4       0.00     lx26-amd64
> >      23 0.55500 PVMLOOSE   sangamesh    r     04/17/2008 10:23:49     4
> >
> >
> > $ cat Out-PVMLOOSE-23
> > /opt/gridengine/default/spool/compute-0-0/active_jobs/23.1/pe_hostfile compute-0-0.local /opt/MPI_LIBS/pvm3
> > /opt/gridengine/default/spool/compute-0-0/active_jobs/23.1/pe_hostfile compute-0-0.local
> > /opt/gridengine/default/spool/locuzcluster/active_jobs/23.1/pe_hostfile locuzcluster.org /opt/MPI_LIBS/pvm3
> 
> the job was rescheduled and ran successfully the second time, but on
> the headnode. Is there anything in the Err file? Or in the messages
> file of the node compute-0-0.local?
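>
> A sketch of where to look, assuming the spool layout visible in your
> output (paths may differ on your installation):
>
>    # execd messages file of the node whose queue instance went into E
>    tail -50 /opt/gridengine/default/spool/compute-0-0/messages
>
>    # the job's stderr file in the submit directory
>    cat Err-PVMLOOSE-23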
> 
> -- Reuti
> 
> >
> > hostfile in TMPDIR /tmp/23.1.all.q/hostfile
> > /tmp/pvmtmp013483.0
> > start_pvm: enrolled to local pvmd
> > start_pvm: got 1 hosts
> > Pass a token through the   1 tid ring:
> > 262146 -> 262146
> > token ring done
> > /opt/gridengine/default/spool/locuzcluster/active_jobs/23.1/pe_hostfile locuzcluster.org
> >
> > An 8-slot job runs as follows:
> >
> > $ qstat -f
> > queuename                      qtype used/tot. load_avg arch          states
> > ----------------------------------------------------------------------------
> > all.q@compute-0-0.local        BIP   4/4       0.02     lx26-amd64
> >      24 0.55500 PVMLOOSE   sangamesh    r     04/17/2008 10:48:57     4
> > ----------------------------------------------------------------------------
> > all.q@locuzcluster.org         BIP   4/4       0.02     lx26-amd64
> >      24 0.55500 PVMLOOSE   sangamesh    r     04/17/2008 10:48:57     4
> >
> > $ cat Out-PVMLOOSE-24
> > /opt/gridengine/default/spool/locuzcluster/active_jobs/24.1/pe_hostfile locuzcluster.org /opt/MPI_LIBS/pvm3
> > hostfile in TMPDIR /tmp/24.1.all.q/hostfile
> > /tmp/pvmtmp013584.0
> > start_pvm: enrolled to local pvmd
> > start_pvm: got 2 hosts
> > Pass a token through the   1 tid ring:
> > 262146 -> 262146
> > token ring done
> > /opt/gridengine/default/spool/locuzcluster/active_jobs/24.1/pe_hostfile locuzcluster.org
> >
> > $ qstat -f
> > queuename                      qtype used/tot. load_avg arch          states
> > ----------------------------------------------------------------------------
> > all.q@compute-0-0.local        BIP   0/4       0.10     lx26-amd64
> > ----------------------------------------------------------------------------
> > all.q@locuzcluster.org         BIP   0/4       0.07     lx26-amd64
> >
> >
> > After this, a job with any number of slots makes the queue instance
> > all.q@compute-0-0.local (and only that one) go into the E state.
> >
> > Has anyone of you faced such a problem?
> > How can it be resolved?
> >
> >
> > Thanks in advance,
> > Sangamesh
> 