[GE users] core duo systems not accepting jobs

flengyel flengyel at gc.cuny.edu
Mon Jul 13 00:29:58 BST 2009



-----Original Message-----
From: craffi [mailto:dag at sonsorol.org]
Sent: Sun 7/12/2009 7:17 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] core duo systems not accepting jobs

This is pretty strange. At this point I'd stop trying to figure
out why real jobs are not running on your dual-cores and start
submitting some directed jobs that might flush out some better
errors ...

Try sending some test scripts directly at the dual core hosts and
queue a few times:

$ qrsh -q x86_64.q@@coreduos /bin/hostname

[flengyel at nept strange]$ qrsh -q x86_64.q@@coreduos /bin/hostname
m33

[flengyel at nept strange]$ qrsh -q x86_64.q@@coreduos /bin/hostname
m35

$ qsub -cwd -q x86_64.q@@coreduos $SGE_ROOT/examples/jobs/simple.sh

[flengyel at nept strange]$ qsub -cwd -q x86_64.q@@coreduos $SGE_ROOT/examples/jobs/simple.sh
Your job 23528 ("simple.sh") has been submitted
[flengyel at nept strange]$ ls -latrs
total 40
20 drwxr-xr-x 94 flengyel domusers 16384 Jul 12 19:22 ..
 8 -rw-r--r--  1 flengyel domusers    29 Jul 12 19:23 simple.sh.o23528
 4 -rw-r--r--  1 flengyel domusers     0 Jul 12 19:23 simple.sh.e23528
 8 drwxr-xr-x  2 flengyel domusers  4096 Jul 12 19:23 .



And maybe even some 2-way parallel requests just to see what happens:

$ qsub -cwd -pe gauss 2 -q x86_64.q@@coreduos $SGE_ROOT/examples/jobs/simple.sh

[flengyel at nept strange]$ qsub -cwd -pe gauss 2 -q x86_64.q@@coreduos $SGE_ROOT/examples/jobs/simple.sh
Your job 23530 ("simple.sh") has been submitted
[flengyel at nept strange]$ qstat -u flengyel
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  23530 0.25000 simple.sh  flengyel     r     07/12/2009 19:25:01 x86_64.q@m37.gc.cuny.edu              2

This seems to be working!

I wonder if the trouble is with the job submission script, gsub:

[flengyel at nept strange]$ more /usr/local/bin/gsub
#!/bin/bash
if [ $# -lt 1 ]; then
  echo "Usage: gsub gaussianfile [qsub options]"
  exit
fi
ARGS=("$@")
QOPTS=${ARGS[@]:1}
qsub $QOPTS <<__HereDocument__
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N $1
#$ -pe gauss 2
#$ -q x86_64.q

export g03root=/usr/local/gaussian
. /usr/local/gaussian/g03/bsd/g03.profile
export SGE_ROOT=/usr/local/sge
.  /usr/local/sge/default/common/settings.sh
export GAUSS_SCRDIR=/tmp

g03 $1
__HereDocument__
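
Because the here-document delimiter in gsub is unquoted, the shell running gsub expands $1 before qsub ever sees the script. One way to check what is actually submitted is to print the generated job script instead of piping it to qsub. The sketch below is only a dry run under stated assumptions: render_gauss_job is a hypothetical helper that is not part of gsub, water.com is a made-up input file name, and the directives are copied from the script above with the environment setup lines left out.

```shell
#!/bin/bash
# Hypothetical dry-run helper (not part of the original gsub): prints the
# job script that gsub would feed to qsub, so the embedded "#$" directives
# can be inspected before anything is submitted.
render_gauss_job() {
  local input=$1
  # Unquoted delimiter: ${input} is expanded here by the calling shell,
  # the same way $1 is expanded inside gsub's here-document.
  cat <<__HereDocument__
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N ${input}
#$ -pe gauss 2
#$ -q x86_64.q

g03 ${input}
__HereDocument__
}

# Show only the embedded directives for a made-up input file name.
render_gauss_job water.com | grep '^#\$'
```

The output makes it easy to compare the requests gsub embeds (-pe gauss 2 and -q x86_64.q) with the test submissions that worked, which asked for x86_64.q@@coreduos on the command line.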




You've probably already done this, but it's time to move beyond qstat
and qconf output. Do you see anything in your SGE spool logs for the
qmaster host, the scheduler process, or even the execd messages file
for some of the 2-way systems?


-Chris


I have local spool logs in $SGE_ROOT/spool on each execution host.
Not certain where to look for these...nothing in /usr/local/sge/spool/messages
for today on m35, for example...
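
The daemons log to files named "messages" under their spool directories. A minimal sketch for listing the ones that exist and showing their latest lines follows. The directory layout it probes (qmaster/messages, qmaster/schedd/messages and <host>/messages under one spool root) is an assumption based on a common default install, and list_sge_messages is a hypothetical helper. The execd spool location actually in use should appear as the execd_spool_dir entry in the output of "qconf -sconf".

```shell
#!/bin/bash
# Hypothetical helper: print each SGE "messages" file found under a spool root.
# Assumed layout (verify against your own install):
#   <spool>/messages
#   <spool>/qmaster/messages
#   <spool>/qmaster/schedd/messages
#   <spool>/<host>/messages
list_sge_messages() {
  local spool=$1 host=$2 f
  for f in "$spool/messages" \
           "$spool/qmaster/messages" \
           "$spool/qmaster/schedd/messages" \
           "$spool/$host/messages"; do
    # Only report candidates that actually exist as regular files.
    [ -f "$f" ] && echo "$f"
  done
  return 0
}

# Example: show the last 20 lines of every messages file found for host m35.
# /usr/local/sge/spool is the path mentioned above; adjust as needed.
list_sge_messages /usr/local/sge/spool m35 | while read -r f; do
  echo "== $f"
  tail -n 20 "$f"
done
```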

Thanks again.

FL




On Jul 12, 2009, at 6:40 PM, flengyel wrote:

>
>
>
>
> -----Original Message-----
> From: craffi [mailto:dag at sonsorol.org]
> Sent: Sun 7/12/2009 6:39 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] core duo systems not accepting jobs
>
> Things look pretty good, a few queue instances down in 'au' state and
> one of your x86_64 hosts in load alarm state 'a' with some insane load
> average. Your quad.q hosts are almost totally maxed out.
>
> Indeed.
>
> And you do have a bunch of x86_64.q hosts with free job slots that are
> totally idle.
>
> Right
>
> Commenting only now on the "qstat -j" data you posted I'd zero in on
> this report from the scheduler:
>
> >                             cannot run because no access to pe "gauss"
> >                             cannot run in PE "gauss" because it only offers 0 slot
>
>
>
> This brings to mind a few guesses:
>
> - Have you run out of "gauss" PE slots? How many are configured in the
> PE object?
>
> 9999
>
> - Is your user allowed to access that PE or is there a quota or ACL
> list that may be blocking them?
>
> Yes. No quota that I am aware of.
>
>
> - Is your user part of the "Research" group? You have access control
> configured on that queue via the "user_lists" parameter in the queue
> config
>
> Yes.
>
> -Chris
>
>
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206723
