[GE users] Problem with clearing a suspended status from a queue instance.

reuti reuti at staff.uni-marburg.de
Wed Oct 20 19:42:36 BST 2010


On 20.10.2010 at 19:22, mad wrote:

> On Wed, Oct 20, 2010 at 11:48 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> On 20.10.2010 at 16:11, mad wrote:
> 
> >
> > On Wed, Oct 20, 2010 at 9:24 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > On 20.10.2010 at 14:49, mad wrote:
> >
> > > I have tried to remove the Ss status from the queue instance by using qmon and clicking on force and resume. That does not change the status. I rebooted the host on which the queue instance exists; that did not resume the queue.
> > >
> > > There are no jobs in the queue instance.
> > >
> > > This queue instance in the het queue is part of a subordinate queue het-2hr.
> > >
> > > [root at ted g03]# qmod -usq het at compute-0-31
> > > [root at ted g03]# qmod -usq het at compute-0-31.local
> > > [root at ted g03]# qmod -usq het-2hr at compute-0-31.local
> > > Queue instance "het-2hr at compute-0-31.local" is already in the specified state: unsuspended
> > > [root at ted g03]# qmod -usq het at compute-0-31.local
> > >
> > > The queue instance for het at compute-0-31.local in qmon still shows a status of Ss
> > >
> > > I am running ROCKS 5.0, CentOS 2.6.18-53.1.14.el5, Grid Engine 6.1u4
> >
> > I think an uppercase S means subordinated. Is there anything running in a superordinated queue?
> >
> > -- Reuti
> >
> > het is the queue including compute-0-30 through compute-0-33 which is subordinate to het-24hr and het-2hr.
> 
> Subordinated by slot? There is an issue where the last slot won't get un-suspended again and stays in "S".

Which type of subordination are you using?


> > het-24hr is a queue containing compute-0-30.  There is currently a job taking all slots in this queue.
> >
> > het-2hr is a queue containing compute-0-31 and compute-0-32.  There are no jobs in this queue.
> >
> > compute-0-31 is the instance showing the Ss status.
> >
> > CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
> > -------------------------------------------------------------------------------
> > het                               0.20      0     24     32      0      8
> > het-24hr                          0.78      8      0      8      0      0
> > het-2hr                           0.00      0     16     16      0      0
> 
> This is strange: why does the "het" queue not have 8 slots in "S", but in error state "E"? Does:
> 
> $ qstat -f
> 
> show an error for this queue on certain machines?
> 
> --- Reuti
> 
> --
> 
> qstat -f gives me
> het at compute-0-31.local         BIP   0/8       0.00     lx26-amd64    Ss
> --
> het-2hr at compute-0-31.local     BIP   0/8       0.00     lx26-amd64      is fine.

We still have to investigate where the "S" is coming from. Are you sure it's not in any other queue's subordination list? As long as it's in state "S", it doesn't even seem possible to apply `qmod -sq ...` or the opposite. I think this is a bug which shows up under these circumstances; I'll file an issue for it.
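
A quick way to check the subordination lists might look like the sketch below. It assumes `qconf -sql` (list cluster queues) and `qconf -sq <queue>` (show a queue's configuration, including its `subordinate_list`) can be run by the admin user; the function name `find_superordinates` is only made up for this example.

```shell
#!/bin/sh
# Sketch: print every cluster queue whose subordinate_list mentions a queue.
# find_superordinates is a hypothetical helper name, not part of Grid Engine.
find_superordinates() {
    target=$1
    for q in $(qconf -sql); do
        # qconf -sq wraps long values with a trailing backslash, so join
        # continuation lines before picking out the subordinate_list entry.
        sublist=$(qconf -sq "$q" | awk '
            /\\$/ { sub(/\\$/, ""); buf = buf $0; next }
            { print buf $0; buf = "" }
        ' | awk '$1 == "subordinate_list" { $1 = ""; print }')
        # Match the target as a whole queue name ("het", "het=1"),
        # so that "het" does not match inside "het-2hr".
        if printf '%s\n' "$sublist" |
            grep -Eq "(^|[^A-Za-z0-9_.-])${target}([^A-Za-z0-9_.-]|\$)"; then
            printf '%s: %s\n' "$q" "$(printf '%s' "$sublist" | sed 's/^ *//')"
        fi
    done
}
```

Calling `find_superordinates het` should then list each cluster queue that names `het` as a subordinate, together with the value of its `subordinate_list`, which makes an unexpected extra entry easy to spot.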

-- Reuti


> which is what qmon is showing as well.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=288678
