[GE users] Qs hanging - 'USED' at limit, but no load

hjmangalam harry.mangalam at uci.edu
Fri Jul 31 16:23:33 BST 2009


And finally, closing my own ticket, although with a less-than-satisfying 
technical answer: restarting the qmaster immediately cleared the problem I 
described. We'll see if that problem returns.

If anyone has an idea of why this lockup happened in the first place, I'd be 
happy to hear from them.

Returning to the previously scheduled entertainment.

hjm

On Thursday 30 July 2009 17:01:32 hjmangalam wrote:
> As a partial follow-up, if I use qmon to temporarily increase the number of
> slots/node the cluster queue can have, it resets the 'used' value to zero.
> So now all the 'used' values below are 0, but the jobs are still being
> held.
>
> ie:
> queuename       qtype resv/used/tot. load_avg    arch          states
> quickbat64@bduc BP    0/0/4            0.01     lx24-amd64
>                         ^^
>                         !
>
> The error messages that I previously noted:
> [queue instance "quickbat64.." dropped because it is full.]
> are now gone as well, so besides the jobs not being executed, it /looks/
> better. ;)
>
> But still no job execution..
>
> harry
>
> On Thursday 30 July 2009 16:21:27 hjmangalam wrote:
> > Hi All,
> >
> > The short:
> > How can a slot be used without using any CPU?
> >
> > The long:
> > After being well-behaved (but mostly idle) for quite a while, my cluster
> > has started to misbehave.
> >
> > It's just recently seen a lot of job submissions and lately the jobs just
> > queue without executing.  When queried:
> >
> > % qstat -f -q quickbat64
> >
> > queuename       qtype resv/used/tot. load_avg    arch          states
> > quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
> > quickbat64@bduc BP    0/1/2            0.00     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
> > quickbat64@bduc BP    0/2/2            0.22     lx24-amd64
> >                         ^
> > It looks like jobs are being rejected because the 'used' slots equal
> > the total slots, but the load average indicates that nothing is running,
> > and logging into the nodes shows them to be idle.
> >
> > In addition, the qmon job display gives this reason for the Q state:
> > queue instance "quickbat64.." dropped because it is full.
> >
> > 'qconf -tsm' additionally shows that a number of jobs
> > " cannot run in queue "XXX" because it is not contained in its hard queue
> > list (-q)"
> > (even when they were submitted with -q explicitly)
> >
> > and:
> > queues dropped because they are full: quickbat64@....
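
A rough sketch of the check described in the quoted message: flag queue
instances that report every slot as used while the load average says the node
is idle. The function name find_stuck_queues and the 0.10 load cut-off are
made up for illustration, and the parsing assumes the usual 'qstat -f' layout
in which the queue instance is a single name@host token followed by qtype,
resv/used/tot. and load_avg. It only reports suspects; it does not explain or
fix the lockup.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): reads `qstat -f` style
# output on stdin and prints queue instances whose slots are all marked
# used although the reported load is below a threshold.
find_stuck_queues() {
    max_load="${1:-0.10}"   # assumed cut-off for "idle"; tune for your nodes
    awk -v max_load="$max_load" '
        # Queue-instance lines carry resv/used/tot. in field 3, e.g. "0/2/2".
        # Header, separator and job lines do not match and are skipped.
        $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
            # Skip instances without a numeric load (e.g. unreachable hosts).
            if ($4 !~ /^[0-9.]+$/) next
            split($3, slots, "/")
            used  = slots[2] + 0
            total = slots[3] + 0
            if (total > 0 && used >= total && $4 + 0 < max_load + 0)
                printf "%s used=%d/%d load=%s\n", $1, used, total, $4
        }
    '
}
```

Possible usage, assuming the queue name from this thread:

    qstat -f -q quickbat64 | find_stuck_queues 0.10

An instance listed here is only a candidate: a job that is sleeping or waiting
on I/O also holds a slot without generating load, so it is worth comparing
against the running jobs before concluding that the slot accounting is stale.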



-- 
Harry Mangalam - Research Computing, NACS, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697  949 824-0084(o), 949 285-4487(c)
MSTB=Bldg 415 (G-5 on <http://today.uci.edu/pdf/UCI_09_map_campus.pdf>)
---
pedalogogues: people who insist on riding bicycles to work and expect you to 
also. -karen lyons kalmenson-
And yes, I am one.
