Re(2): [GE users] slot taken no job running

Andreas Haas Andreas.Haas at Sun.COM
Wed Aug 18 11:06:24 BST 2004


Hi Don,

I filed issue #1236 for this. There you'll find instructions on
how to resolve the slot inconsistency in your cluster, and also
on what to do in general to prevent the problem from occurring
in the future. Hope it helps.

Cheers,
Andreas

On Tue, 17 Aug 2004, Don Shesnicky wrote:

>
>
> > Did you have a look at the messages file of the queue master in
> > $SGE_ROOT/default/spool/qmaster/messages?
> >
> > I would suggest shutting down the execd on kane.enqsemi.com, then
> > going to the spool directory of this node,
> > $SGE_ROOT/default/spool/kane.enqsemi.com, looking inside the
> > directories active_jobs, jobs, and job_scripts, and removing whatever
> > is inside. (You can also have a look there before you shut anything
> > down, just to see whether more than one job is mentioned at all;
> > otherwise we have to look somewhere else.) Then restart the execd and
> > we will see whether it's gone.
>
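
The "have a look there before you shut down anything" step suggested above could be sketched roughly as follows. This is only a read-only illustration: the function name `list_spool_leftovers` and the `/opt/sge` fallback path are hypothetical, and the spool directory name is assumed to match the one mentioned in this thread (on a given installation it may differ). It removes nothing; it only lists what is in the three subdirectories.

```python
import os
from pathlib import Path

# Subdirectories named in the advice quoted above.
SPOOL_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def list_spool_leftovers(host_spool_dir):
    """Return {subdir: sorted entry names} for one execd spool directory.

    Read-only: nothing is deleted or modified.
    """
    base = Path(host_spool_dir)
    leftovers = {}
    for name in SPOOL_SUBDIRS:
        sub = base / name
        # A missing subdirectory is reported as empty rather than as an error.
        if sub.is_dir():
            leftovers[name] = sorted(p.name for p in sub.iterdir())
        else:
            leftovers[name] = []
    return leftovers


if __name__ == "__main__":
    # Assumption: default cell, host spool directory named as in this thread.
    sge_root = os.environ.get("SGE_ROOT", "/opt/sge")  # hypothetical fallback
    spool = Path(sge_root) / "default" / "spool" / "kane.enqsemi.com"
    for subdir, entries in list_spool_leftovers(spool).items():
        print(f"{subdir}: {entries if entries else 'empty'}")
```

If any of the three lists is non-empty while no job is running on the host, those entries are the candidates for the cleanup described above.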
> I've looked in the qmaster messages file and only this stands out:
>
> 08/16/2004 09:40:27|qmaster|canter|E|error writing object with key "EXECHOST:kane.enqsemi.com" into berkeley database: (28) No space left on device
> 08/16/2004 09:43:11|qmaster|canter|W|job 8062.1 failed on host kane.enqsemi.com general assumedly before job because: can't create directory active_jobs/8062.1: No space left on device
>
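
For picking such entries out of a long messages file, a small filter along these lines might help. It is a sketch under assumptions: the field layout (timestamp, daemon, host, level, text, separated by `|`) is inferred only from the two lines quoted above, and the names `MessageLine`, `parse_message_line` and `filter_by_level` are made up for illustration. The sample lines are the two from this mail with the mail wrapping undone.

```python
from typing import NamedTuple, Optional


class MessageLine(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str  # "E" and "W" are the levels seen in the excerpt
    text: str


def parse_message_line(line: str) -> Optional[MessageLine]:
    """Split one messages-file line into its five '|'-separated fields."""
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed layout, e.g. a wrapped continuation
    return MessageLine(*parts)


def filter_by_level(lines, levels=("E", "W")):
    """Return the parsed records whose level is one of `levels`."""
    records = []
    for line in lines:
        rec = parse_message_line(line)
        if rec is not None and rec.level in levels:
            records.append(rec)
    return records


SAMPLE = [
    '08/16/2004 09:40:27|qmaster|canter|E|error writing object with key '
    '"EXECHOST:kane.enqsemi.com" into berkeley database: (28) No space left on device',
    "08/16/2004 09:43:11|qmaster|canter|W|job 8062.1 failed on host "
    "kane.enqsemi.com general assumedly before job because: can't create "
    "directory active_jobs/8062.1: No space left on device",
]

if __name__ == "__main__":
    for rec in filter_by_level(SAMPLE):
        print(rec.timestamp, rec.level, rec.text)
```

In practice the lines would come from reading $SGE_ROOT/default/spool/qmaster/messages instead of `SAMPLE`.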
> I take it that job 8062.1 is the problem and is probably stuck in the
> berkeley db.
>
> I've completely removed the kane directory
> ($SGE_ROOT/default/spool/kane) and re-installed the machine as an
> exec host, but that did nothing: it's still showing 1/2 slots full
> while no jobs are running on it.
>
>
> ----------------------------------------------------------------------------
> d.norm@kane.enqsemi.com        BIP   1/2       0.14     lx24-x86      o
> ----------------------------------------------------------------------------
>
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>    8591 0.56000 tc_049_ab_ mllalami     r     08/17/2004 12:31:08 d.norm@dexter.enqsemi.com          1
>    8592 0.56000 tc_049_ab_ mllalami     r     08/17/2004 12:34:19 d.norm@dexter.enqsemi.com          1
>    8589 0.56000 tc_037_ei_ jrusmussen   r     08/17/2004 12:30:02 d.norm@forge.enqsemi.com           1
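
The inconsistency shown in the output above (a queue reporting 1/2 slots used while no listed job runs in it) can be checked mechanically. The following is a rough sketch, assuming the two-part used/total slot column and the nine-column running-job lines seen in this excerpt, and assuming queue instances are written in `queue@host` form (the list archive shows the `@` as " at "). The function names are hypothetical.

```python
def parse_queue_line(line):
    """Return (queue_instance, used_slots, total_slots) from a queue line."""
    fields = line.split()
    # Assumed columns: queuename qtype used/total load_avg arch states
    used, total = fields[2].split("/")
    return fields[0], int(used), int(total)


def parse_job_line(line):
    """Return (queue_instance, slots) from a running-job line."""
    fields = line.split()
    # Assumed columns: job-ID prior name user state date time queue slots
    return fields[7], int(fields[8])


def find_slot_mismatches(queue_lines, job_lines):
    """Map queue -> (slots reported used, slots held by listed jobs) where they differ."""
    running = {}
    for line in job_lines:
        queue, slots = parse_job_line(line)
        running[queue] = running.get(queue, 0) + slots
    mismatches = {}
    for line in queue_lines:
        queue, used, _total = parse_queue_line(line)
        if used != running.get(queue, 0):
            mismatches[queue] = (used, running.get(queue, 0))
    return mismatches


# Data taken from the output quoted above.
QUEUE_LINES = [
    "d.norm@kane.enqsemi.com        BIP   1/2       0.14     lx24-x86      o",
]
JOB_LINES = [
    "8591 0.56000 tc_049_ab_ mllalami   r 08/17/2004 12:31:08 d.norm@dexter.enqsemi.com 1",
    "8592 0.56000 tc_049_ab_ mllalami   r 08/17/2004 12:34:19 d.norm@dexter.enqsemi.com 1",
    "8589 0.56000 tc_037_ei_ jrusmussen r 08/17/2004 12:30:02 d.norm@forge.enqsemi.com 1",
]

if __name__ == "__main__":
    print(find_slot_mismatches(QUEUE_LINES, JOB_LINES))
    # prints {'d.norm@kane.enqsemi.com': (1, 0)}
```

Only the kane queue line is included because it is the only one quoted in this thread. With the full queue listing, every queue instance whose reported usage disagrees with the job list would be flagged.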
>
>
>
>
> Don
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>




