[GE users] PVM job thru SGE puts queue instance into E state

Sangamesh B forum.san at gmail.com
Thu Apr 17 07:05:58 BST 2008



Hi all,

I have a question about PVM jobs submitted through SGE, with both loose and tight
integration.

PVM is installed with PVM_ROOT=/opt/MPI_LIBS/pvm3 and PVM_ARCH=LINUX64.

I downloaded the PVM tarball from Reuti's gridengine HOWTOs and placed it under
/opt/gridengine.

The cluster has one master node and one compute node with dual dual-core AMD
Opteron (amd64) processors.

A 4-slot job runs on the compute node the first time it is submitted. When the
same job is submitted again, SGE schedules it to the compute node, but the
compute node's queue instance then goes into the 'E' state. SGE then reschedules
the job to the master node, where it runs successfully.

I don't understand why the compute node's queue instance goes into the E state.
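For reference, these are the commands I know of for inspecting and clearing a
queue-instance error state (assuming SGE 6.x; the spool path is the default cell
of my installation — they obviously need to be run on the cluster itself):

```shell
# Ask SGE why it put the queue instance into the E state
qstat -f -explain E

# Check the execution daemon's log on the failing node
tail -50 /opt/gridengine/default/spool/compute-0-0/messages

# Once the cause is fixed, clear the error state on that queue instance
qmod -cq all.q@compute-0-0.local
```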

These are the parallel environment (PE) configurations for PVM:

# qconf -sp pvmloose
pe_name           pvmloose
slots             40
user_lists        locuzusers
xuser_lists       NONE
start_proc_args   /opt/gridengine/pvm/startpvm.sh $pe_hostfile $host  \
                  /opt/MPI_LIBS/pvm3
stop_proc_args    /opt/gridengine/pvm/stoppvm.sh $pe_hostfile $host
allocation_rule   4
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min


# qconf -sp pvmtight
pe_name           pvmtight
slots             40
user_lists        locuzusers
xuser_lists       NONE
start_proc_args   /opt/gridengine/pvm/startpvm.sh -catch_rsh $pe_hostfile \
                  $host /opt/MPI_LIBS/pvm3
stop_proc_args    /opt/gridengine/pvm/stoppvm.sh -catch_rsh $pe_hostfile \
                  $host
allocation_rule   4
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
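Both PEs hand $pe_hostfile to the start/stop scripts. In SGE that file has one
line per host: hostname, number of slots granted there, queue instance, and a
processor range. A small sketch (with a made-up example hostfile, not output
from my cluster) of summing the granted slots, the way startpvm.sh has to when
building the PVM host list:

```shell
# Hypothetical pe_hostfile contents (format: host slots queue processors)
cat > /tmp/pe_hostfile.example <<'EOF'
compute-0-0.local 4 all.q@compute-0-0.local UNDEFINED
locuzcluster.org 4 all.q@locuzcluster.org UNDEFINED
EOF

# Sum the slot counts (column 2) across all hosts
awk '{ total += $2 } END { print total }' /tmp/pe_hostfile.example   # prints 8
```

With allocation_rule set to 4, SGE grants exactly 4 slots per listed host, so a
4-slot job fits on one node and an 8-slot job needs both.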

The job script is as follows:

$ cat loose.sh
#!/bin/bash
#$ -S /bin/bash
#$ -N PVMLOOSE
#$ -cwd
#$ -q all.q
#$ -e Err-$JOB_NAME-$JOB_ID
#$ -o Out-$JOB_NAME-$JOB_ID

/home/sangamesh/pvm3/bin/LINUX64/spmd $NSLOTS

exit 0


$ qsub -pe pvmloose 4 loose.sh
Your job 23 ("PVMLOOSE") has been submitted

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/4       0.00     lx26-amd64    E
----------------------------------------------------------------------------
all.q@locuzcluster.org         BIP   0/4       0.00     lx26-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     23 0.55500 PVMLOOSE   sangamesh    qw    04/17/2008 10:23:31     4


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/4       0.00     lx26-amd64    E
----------------------------------------------------------------------------
all.q@locuzcluster.org         BIP   4/4       0.00     lx26-amd64
     23 0.55500 PVMLOOSE   sangamesh    r     04/17/2008 10:23:49     4


$ cat Out-PVMLOOSE-23
/opt/gridengine/default/spool/compute-0-0/active_jobs/23.1/pe_hostfile
compute-0-0.local /opt/MPI_LIBS/pvm3
/opt/gridengine/default/spool/compute-0-0/active_jobs/23.1/pe_hostfile
compute-0-0.local
/opt/gridengine/default/spool/locuzcluster/active_jobs/23.1/pe_hostfile
locuzcluster.org /opt/MPI_LIBS/pvm3
hostfile in TMPDIR /tmp/23.1.all.q/hostfile
/tmp/pvmtmp013483.0
start_pvm: enrolled to local pvmd
start_pvm: got 1 hosts
Pass a token through the   1 tid ring:
262146 -> 262146
token ring done
/opt/gridengine/default/spool/locuzcluster/active_jobs/23.1/pe_hostfile
locuzcluster.org

An 8-slot job runs as follows:

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   4/4       0.02     lx26-amd64
     24 0.55500 PVMLOOSE   sangamesh    r     04/17/2008 10:48:57     4
----------------------------------------------------------------------------
all.q@locuzcluster.org         BIP   4/4       0.02     lx26-amd64
     24 0.55500 PVMLOOSE   sangamesh    r     04/17/2008 10:48:57     4

$ cat Out-PVMLOOSE-24
/opt/gridengine/default/spool/locuzcluster/active_jobs/24.1/pe_hostfile
locuzcluster.org /opt/MPI_LIBS/pvm3
hostfile in TMPDIR /tmp/24.1.all.q/hostfile
/tmp/pvmtmp013584.0
start_pvm: enrolled to local pvmd
start_pvm: got 2 hosts
Pass a token through the   1 tid ring:
262146 -> 262146
token ring done
/opt/gridengine/default/spool/locuzcluster/active_jobs/24.1/pe_hostfile
locuzcluster.org

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/4       0.10     lx26-amd64
----------------------------------------------------------------------------
all.q@locuzcluster.org         BIP   0/4       0.07     lx26-amd64


After this, a job with any number of slots puts the queue instance
all.q@compute-0-0.local (and only that one) into the E state.
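One thing I intend to check (a guess based on how pvmd behaves, not something I
have confirmed yet): pvmd3 refuses to start when a stale /tmp/pvmd.<uid> socket
file from a previous run is still present, which would make startpvm.sh fail on
the second job on that node and put exactly that queue instance into E. A
cleanup sketch, with the uid looked up at run time:

```shell
# Remove stale pvmd/pvml files left behind by a previous (crashed) PVM run.
# pvmd3 names them /tmp/pvmd.<uid> (socket) and /tmp/pvml.<uid> (log).
uid=$(id -u)
rm -f "/tmp/pvmd.$uid" "/tmp/pvml.$uid"
```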

Has anyone faced such a problem? How can it be resolved?


Thanks in advance,
Sangamesh


