[GE users] nodes overloaded: processes placed on already full nodes

reuti reuti at staff.uni-marburg.de
Fri Dec 17 13:16:49 GMT 2010

On 15.12.2010 at 17:23, steve_s wrote:

> On Dec 15 16:28 +0100, reuti wrote:
>> On 15.12.2010 at 16:13, templedf wrote:
>>> This is a known issue.  When scheduling parallel jobs with 6.2 to 6.2u5, 
>>> the scheduler ignores host load.
>> Yep.
>>> This often results in jobs piling up 
>>> on a few nodes while other nodes are idle.
> OK, good to know. We're running 6.2u3 here.
> I'm not sure if I get this right: Even if the load is ignored, doesn't
> SGE keep track of already given-away slots on each node? I always
> thought that this is the way jobs are scheduled in the first place
> (besides policies and all that, but that should have nothing to do with
> load or slots in this context).
> Given that SGE knows e.g. np_load_avg on each node, I thought we could
> circumvent the problem by setting np_load_avg to requestable=YES and
> then something like
>    $ qsub -hard -l 'np_load_avg < 0.3' ...

You can only specify a value; the relation is already defined in the complex definition.
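
To illustrate: the relation lives in the relop column of the complex configuration (shown by qconf -sc, edited with qconf -mc). The line below is a sketch of how the np_load_avg entry might look after setting requestable to YES; the other columns are what I would expect from a stock installation and should be checked against your own qconf -sc output.

```text
#name         shortcut      type    relop  requestable  consumable  default  urgency
np_load_avg   np_load_avg   DOUBLE  >=     YES          NO          0        0
```

With such an entry the request carries only the value, e.g. qsub -hard -l np_load_avg=0.3 ... As I read complex(5), the requested value is compared against the host's value using the relop, so with >= this should match hosts whose np_load_avg is at most 0.3.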

> but this gives me 
>    "Unable to run job: denied: missing value for request "np_load_avg".
>     Exiting."
> whereas using "=" or ">" works. I guess the reason is what is stated in

If > works, that's a bug. I get: Unable to run job: unknown resource "fubar>12". (The same happens for <; maybe it was fixed in 6.2u5.)

> complex(5):
>    ">=, >, <=, < operators can only be overridden, when the new value
>     is more restrictive than the old one."
> So, I cannot use "<". If that is the case, what can we do about it? Do
> we need to define a new complex attribute (say 'np_load_avg_less') along
> with a load_sensor or can we hijack np_load_avg in another way?
>> As far as I understood the problem, the nodes are oversubscribed by getting more than 8 processes scheduled.
> Exactly.

So, now we know what we have to deal with.

>> Did you change the host assignment to certain queues while jobs were still running? Maybe you need to limit the total number of slots per machine to 8, either in an RQS or by setting it in each host's complex_values.
> No, we didn't change the host assignment. 
> Sorry, but what do you mean by RQS? Did not see that in the
> documentation so far. 

man sge_resource_quota

When you have more than one queue on a machine, the slots of every queue might get used, thus oversubscribing the machine. Hence the total number of slots in use across all queues on each machine must be limited. When you have only one queue per machine, this can't happen.
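
A sketch of such a resource quota set, assuming 8 cores per node as in this thread. The rule name max_slots_per_host and the description are made up for illustration; the set could be added with qconf -arqs.

```text
{
   name         max_slots_per_host
   description  "Limit the slots used on each host across all queues"
   enabled      TRUE
   limit        hosts {*} to slots=8
}
```

If the nodes differ in core count, sge_resource_quota(5) also describes dynamic limits such as slots=$num_proc. The per-host alternative would be something along the lines of qconf -mattr exechost complex_values slots=8 <hostname>, repeated for each execution host.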

>> Another reason for virtual oversubscription: processes in state "D" count as running, so despite the high load everything may be in perfect order.
> Oversubscribed nodes do not always run 16 instead of 8 processes, some
> only 14 or so. Nevertheless, the load is always almost exactly 16. As
> far as I can see, processes on these oversubscribed nodes (with > 8
> processes) run with ~50% CPU load each.

What does:

ps -e f

(f without the leading -) show on such a node? Are all the processes bound to an sge_shepherd, or did some jump out of the process tree without being killed?
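
If the tree output is too long to read by eye, the same check can be sketched as a small filter. This is only an illustration, not part of SGE: the function name orphans_outside_shepherd is hypothetical, and it assumes the shepherd's command name appears as sge_shepherd in the ps output.

```shell
# Hypothetical helper: reads "PID PPID COMMAND" lines on stdin and prints
# "PID COMMAND" for every process with no sge_shepherd among its ancestors
# (the shepherd itself counts as being inside the tree).
orphans_outside_shepherd() {
    awk '
        { ppid[$1] = $2; comm[$1] = $3 }
        END {
            for (p in ppid) {
                under = 0
                q = p
                # Walk up the parent chain until we leave the known PIDs;
                # the depth cap guards against malformed input.
                for (depth = 0; depth < 64 && (q in ppid); depth++) {
                    if (comm[q] ~ /sge_shepherd/) { under = 1; break }
                    if (q == ppid[q]) break
                    q = ppid[q]
                }
                if (!under) print p, comm[p]
            }
        }
    ' | sort -n
}

# Usage on a node (my_solver is a placeholder for your application's name):
#   ps -e -o pid= -o ppid= -o comm= | orphans_outside_shepherd | grep my_solver
```

System daemons will naturally show up in the output as well, hence the grep for the job's binary. Any hit would be a process that escaped the shepherd's process tree.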

-- Reuti

