[GE users] Queue dropped because it is full

Andy Schwierskott andy.schwierskott at sun.com
Tue Apr 20 14:49:12 BST 2004


Kirk,

  - what does "qstat -f -l q=common.op240-012" show?
  - what does "qstat -j" show?
  - what happens if the job is submitted with the "-now n" option?
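
A minimal sketch of how the first two checks could be captured in one file
for posting back to the list. The names QUEUE, OUT and collect_diag are
illustrative only; the two qstat invocations are the ones asked for above.

```shell
#!/bin/sh
# Sketch only: QUEUE, OUT and collect_diag are hypothetical names chosen
# for this example; adjust the queue name to the one that is being dropped.
QUEUE="${QUEUE:-common.op240-012}"
OUT="${OUT:-sge-diag.txt}"

collect_diag() {
    # Bail out early if the SGE client commands are not in PATH.
    if ! command -v qstat >/dev/null 2>&1; then
        echo "qstat not found in PATH" >&2
        return 1
    fi
    {
        # Full queue listing restricted to the queue in question.
        echo "== qstat -f -l q=$QUEUE =="
        qstat -f -l "q=$QUEUE"
        # Scheduler's view of why pending jobs are not being dispatched.
        echo "== qstat -j =="
        qstat -j
    } >"$OUT" 2>&1
}

# The third check is a manual resubmission with "-now n", for example:
#   qsub -now n job.sh    (job.sh is a placeholder script name)
```

Calling collect_diag on the master host while the jobs are still pending
writes both outputs to the file named by OUT.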

Andy

> Hello,
>
> I recently experienced jobs hanging with SGE 5.3.p5.
>
> Jobs submitted to the cluster were pending with the message:
> queue "common.op240-012" dropped because it is full...
>
> There were only 4 jobs reported as submitted in the system.
> :>qstat
> job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID
> ---------------------------------------------------------------------------------------------
>     540     0 QLOGIN     justin       r     04/14/2004 11:46:34 common.op2 MASTER
>    3003     0 VOLDEMORT_ jayanto      qw    04/19/2004 17:11:02
>    3006     0 ls         gsmith       qw    04/19/2004 17:43:59
>    3011     0 who        kpatton      qw    04/20/2004 06:21:34
>
> I looked for similar issues in the mailing list and came across a post by
> Ron Koester:
> >Date: Wed, 07 May 2003 09:10:04 -0400
> >From: Ron Koester <koester at carbondesignsystems.com>
> >Content-Type: text/plain; charset=us-ascii
> >Subject: Re: [GE users] queue dropped because it is full
> >
> >
> >Thanks for everyone's ideas -
> >
> >This morning I resorted to stopping/restarting the daemons on
> >the SGE master -- and... -- the problem went away.
> >
> >I did do a 'qstat -f' prior to restarting the master daemons,
> >and the reported load for the two malfunctioning queues was
> >correct.
> >
> >One other clue I got, if anyone cares at this point, is that
> >when I did './rcsge stop' on the SGE master, the schedd wasn't
> >stopped.  Doing it again still didn't kill off the schedd.  So
> >I ended up killing the schedd with 'kill -9'.
> >
> >Ron
>
> This is exactly the behavior I saw on my cluster.  I shut down the
> master node and had to kill -9 sge_schedd, as it was running at 98% CPU.
>
> Restarting the master brought the system back online.
>
> The acting master at the time of the incident was an Opteron-based Linux box
> running a 2.4.24 kernel.
>
> When the daemons came back up, there were 51 messages like the following:
> removing reference to no longer existing job of user "gsmith".
>
> Does anyone know what could have caused SGE to stop processing jobs?
>
> Thanks,
> Kirk
