[GE users] Qs hanging - 'USED' at limit, but no load

hjmangalam harry.mangalam at uci.edu
Fri Jul 31 01:01:32 BST 2009


As a partial follow-up: if I use qmon to temporarily increase the number of 
slots per node that the cluster queue can have, the 'used' value resets to 
zero.  So now all the 'used' values below are 0, but the jobs are still being 
held.

ie:
queuename       qtype resv/used/tot. load_avg    arch          states
quickbat64@bduc BP    0/0/4            0.01     lx24-amd64
                        ^^
                        !

The error messages that I previously noted:
[queue instance "quickbat64.." dropped because it is full.]
are now gone as well, so besides the jobs not being executed, it /looks/ 
better. ;)

But still no job execution..

harry




On Thursday 30 July 2009 16:21:27 hjmangalam wrote:
> Hi All,
>
> The short:
> How can a slot be used without using any CPU?
>
> The long:
> After being well-behaved (but mostly idle) for quite a while, my cluster
> has started to misbehave.
>
> It's just recently seen a lot of job submissions and lately the jobs just
> queue without executing.  When queried:
>
> % qstat -f -q quickbat64
>
> queuename       qtype resv/used/tot. load_avg    arch          states
> quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
> quickbat64@bduc BP    0/1/2            0.00     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.00     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
> quickbat64@bduc BP    0/2/2            0.22     lx24-amd64
>                         ^
> It looks like jobs are being rejected because the number of 'used' slots
> equals the total number of slots, but the load average indicates that
> nothing is running, and logging into the nodes shows them to be idle.
>
> In addition, the qmon job display gives this reason for the Q state:
> queue instance "quickbat64.." dropped because it is full.
>
> 'qconf -tsm' additionally shows that a number of jobs
> " cannot run in queue "XXX" because it is not contained in its hard queue
> list (-q)"
> (even when they were submitted with -q explicitly)
>
> and:
> queues dropped because they are full: quickbat64@....
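
For anyone who wants to check for this symptom mechanically rather than by eye, here is a minimal Python sketch. It assumes 'qstat -f' rows of the form "queue@host qtype resv/used/tot. load_avg arch", as in the listing quoted above. The function name suspicious_instances, the idle_load threshold of 0.05 and the SAMPLE text are illustrative choices of mine, not anything Grid Engine provides.

```python
import re

# One queue-instance row of `qstat -f` output, e.g.
#   quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
# The header row has no "@" in its first field, so it never matches.
ROW = re.compile(
    r"^(?P<queue>\S+@\S+)\s+(?P<qtype>\S+)\s+"
    r"(?P<resv>\d+)/(?P<used>\d+)/(?P<total>\d+)\s+"
    r"(?P<load>\d+\.\d+|-NA-)\s+(?P<arch>\S+)"
)


def suspicious_instances(qstat_text, idle_load=0.05):
    """Return (queue, used, total, load) for rows that are full but idle.

    idle_load is an assumed cutoff for "nothing is running"; tune it.
    """
    hits = []
    for line in qstat_text.splitlines():
        m = ROW.match(line.strip())
        # Skip non-matching lines and rows with no load report (-NA-).
        if not m or m.group("load") == "-NA-":
            continue
        used, total = int(m.group("used")), int(m.group("total"))
        load = float(m.group("load"))
        if total > 0 and used >= total and load <= idle_load:
            hits.append((m.group("queue"), used, total, load))
    return hits


# Hypothetical sample: three rows taken from the listing in this thread.
SAMPLE = """\
queuename       qtype resv/used/tot. load_avg    arch          states
quickbat64@bduc BP    0/2/2            0.01     lx24-amd64
quickbat64@bduc BP    0/1/2            0.00     lx24-amd64
quickbat64@bduc BP    0/2/2            0.22     lx24-amd64
"""

if __name__ == "__main__":
    for queue, used, total, load in suspicious_instances(SAMPLE):
        print(f"{queue}: {used}/{total} slots used, load_avg {load:.2f}")
```

To use it on a live cluster, feed it the text printed by 'qstat -f -q quickbat64' instead of SAMPLE. It only flags the mismatch between slot accounting and load; it does not explain why the slots are counted as used.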



-- 
Harry Mangalam - Research Computing, NACS, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697  949 824-0084(o), 949 285-4487(c)
MSTB=Bldg 415 (G-5 on <http://today.uci.edu/pdf/UCI_09_map_campus.pdf>)
---
pedalogogues: people who insist on riding bicycles to work and expect you to 
also. -karen lyons kalmenson-
And yes, I am one.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210338
