[GE users] Queue dropped because it is full

Kirk Patton kpatton at transmeta.com
Tue Apr 20 15:12:15 BST 2004


Hello Andy,

qstat -j reported the queues were full.  I do not have the output, but I will be sure to save it should
it happen again.  Since the cluster was restarted successfully, I am not able to try submitting with the
-now option.  The cluster had been up for over a month.
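
A minimal sketch of how that output could be captured before the next restart. The function name capture_sge_diag, its arguments, and the output file layout are illustrative assumptions, not part of SGE; the qstat invocations are the ones asked for in the quoted mail below.

```shell
#!/bin/sh
# Sketch: save scheduler diagnostics before restarting the master daemons.
# capture_sge_diag and the file names below are hypothetical, chosen for
# illustration only.

capture_sge_diag() {
    queue="$1"    # affected queue, e.g. common.op240-012
    outdir="${2:-./sge-diag-$(date +%Y%m%d-%H%M%S)}"
    mkdir -p "$outdir" || return 1

    # State of the affected queue (command taken from the quoted mail).
    qstat -f -l "q=$queue" > "$outdir/qstat-f.txt" 2>&1
    # Scheduler information on pending jobs (command taken from the quoted mail).
    qstat -j > "$outdir/qstat-j.txt" 2>&1
    # Plain job list, for comparison with the two files above.
    qstat > "$outdir/qstat.txt" 2>&1

    # Print the directory so the caller knows where the files went.
    echo "$outdir"
}
```

Calling capture_sge_diag common.op240-012 while the jobs are still stuck would leave three text files that can be attached to a follow-up mail.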

Are you interested in any of the log files?


Kirk
On Tue, Apr 20, 2004 at 03:49:12PM +0200, Andy Schwierskott wrote:
> Kirk,
> 
>   - what does "qstat -f -l q=common.op240-012" show?
>   - what does "qstat -j" show?
>   - what happens if the job is submitted with the "-now n" option?
> 
> Andy
> 
> > Hello,
> >
> > I recently experienced jobs hanging with SGE 5.3.p5.
> >
> > Jobs submitted to the cluster were pending with the message
> > queue "common.op240-012" dropped because it is full...
> >
> > There were only 4 jobs reported as submitted in the system.
> > :>qstat
> > job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID
> > ---------------------------------------------------------------------------------------------
> >     540     0 QLOGIN     justin       r     04/14/2004 11:46:34 common.op2 MASTER
> >    3003     0 VOLDEMORT_ jayanto      qw    04/19/2004 17:11:02
> >    3006     0 ls         gsmith       qw    04/19/2004 17:43:59
> >    3011     0 who        kpatton      qw    04/20/2004 06:21:34
> >
> > I looked for similar issues in the mailing list and came across a post by
> > Ron Koester
> > >Date: Wed, 07 May 2003 09:10:04 -0400
> > >From: Ron Koester <koester at carbondesignsystems.com>
> > >Content-Type: text/plain; charset=us-ascii
> > >Subject: Re: [GE users] queue dropped because it is full
> > >
> > >
> > >Thanks for everyone's ideas -
> > >
> > >This morning I resorted to stopping/restarting the daemons on
> > >the SGE master -- and... -- the problem went away.
> > >
> > >I did do a 'qstat -f' prior to restarting the master daemons,
> > >and the reported load for the two malfunctioning queues was
> > >correct.
> > >
> > >One other clue I got, if anyone cares at this point, is that
> > >when I did './rcsge stop' on the SGE master, the schedd wasn't
> > >stopped.  Doing it again still didn't kill off the schedd.  So
> > >I ended up killing the schedd with 'kill -9'.
> > >
> > >Ron
> >
> > This is exactly the behavior I saw on my cluster.  I shut down the
> > master node and had to kill -9 sge_schedd, as it was running at 98% of the CPU.
> >
> > Restarting the master brought the system back on line.
> >
> > The acting master at the time of the incident was an Opteron-based Linux box running
> > a 2.4.24 kernel.
> >
> > When the daemons came back up, there were 51 messages like the following:
> > removing reference to no longer existing job of user "gsmith".
> >
> > Does anyone know what could have caused SGE to stop processing jobs?
> >
> > Thanks,
> > Kirk
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 

-- 
Kirk Patton
Unix Administrator
Transmeta Inc.
Tel. 408 919-3055
