[GE users] Queue dropped because it is full

Kirk Patton kpatton at transmeta.com
Tue Apr 20 14:45:34 BST 2004


Hello,

I recently experienced jobs hanging with SGE 5.3.p5.

Jobs submitted to the cluster stayed pending with the message:
queue "common.op240-012" dropped because it is full...

There were only 4 jobs reported as submitted in the system.
:>qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID
---------------------------------------------------------------------------------------------
    540     0 QLOGIN     justin       r     04/14/2004 11:46:34 common.op2 MASTER
   3003     0 VOLDEMORT_ jayanto      qw    04/19/2004 17:11:02
   3006     0 ls         gsmith       qw    04/19/2004 17:43:59
   3011     0 who        kpatton      qw    04/20/2004 06:21:34

I looked for similar issues in the mailing list archive and came across a post by
Ron Koester:
>Date: Wed, 07 May 2003 09:10:04 -0400
>From: Ron Koester <koester at carbondesignsystems.com>
>Content-Type: text/plain; charset=us-ascii
>Subject: Re: [GE users] queue dropped because it is full
>
>
>Thanks for everyone's ideas -
>
>This morning I resorted to stopping/restarting the daemons on 
>the SGE master -- and... -- the problem went away.
>
>I did do a 'qstat -f' prior to restarting the master daemons,
>and the reported load for the two malfunctioning queues was
>correct.
>
>One other clue I got, if anyone cares at this point, is that 
>when I did './rcsge stop' on the SGE master, the schedd wasn't
>stopped.  Doing it again still didn't kill off the schedd.  So
>I ended up killing the schedd with 'kill -9'.
>
>Ron

This is exactly the behavior I saw on my cluster.  I shut down the
master node and had to kill -9 sge_schedd, as it was running at 98% CPU.

Restarting the master brought the system back online.
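
A scheduler stuck at high CPU, as described above, can be spotted by
filtering ps output before deciding to restart.  This is only a minimal
sketch, assuming a procps-style ps that accepts "-eo pid,pcpu,comm"; the
function name high_cpu_pids and the 90% threshold are illustrative and not
part of SGE:

```shell
# Print the PIDs of processes with the given command name whose CPU usage
# is at or above the given percentage.  Reads "PID %CPU COMMAND" lines on
# stdin; the header line is skipped because its third field never matches.
# high_cpu_pids is a hypothetical helper name, not an SGE tool.
high_cpu_pids() {
    awk -v name="$1" -v limit="$2" \
        '$3 == name && $2 + 0 >= limit + 0 { print $1 }'
}

# Usage (assumes a ps that supports -eo, as on Linux):
#   ps -eo pid,pcpu,comm | high_cpu_pids sge_schedd 90
```

Any PID it prints is only a candidate; whether to kill that process remains
a manual decision.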

The acting master at the time of the incident was an Opteron-based Linux box running
a 2.4.24 kernel.

When the daemons came back up, there were 51 messages like the following:
removing reference to no longer existing job of user "gsmith".
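
Messages of this kind can be counted, and broken down per user, with grep.
This is a rough sketch, assuming the messages end up in a plain-text log
file; the path in the usage comment is an assumption about a typical
installation, and both function names are made up for illustration:

```shell
# Count lines reporting a removed reference to a job that no longer exists.
# count_stale_refs is a hypothetical helper name.
count_stale_refs() {
    grep -c 'removing reference to no longer existing job' "$1"
}

# Break the same messages down by the user named in each line.
# Relies on grep -o, which GNU and BSD grep both provide.
stale_refs_by_user() {
    grep -o 'no longer existing job of user "[^"]*"' "$1" |
        sed 's/.*of user //' |
        sort | uniq -c
}

# Usage (the log location is an assumption; adjust for your site):
#   count_stale_refs "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"
#   stale_refs_by_user "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"
```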

Does anyone know what could have caused SGE to stop processing jobs?

Thanks,
Kirk





-- 
Kirk Patton
Unix Administrator
Transmeta Inc.
Tel. 408 919-3055
