[GE users] nodes overloaded: processes placed on already full nodes

reuti reuti at staff.uni-marburg.de
Tue Dec 21 17:22:13 GMT 2010

On 21.12.2010 at 15:58, steve_s wrote:

> On Dec 17 14:16 +0100, reuti wrote:
>>>   $ qsub -hard -l 'np_load_avg < 0.3' ...
>> You can only specify a value, the relation is defined already in the
>> complex definition.
> [...]
>> When > is working, it's a bug. I get: Unable to run job: unknown
>> resource "fubar>12". (same for <, maybe it was fixed in 6.2u5).
> Yes, you are right. The only thing that works is "=":
>    $ qsub -hard -l 'np_load_avg=0.3' ...
> That is no solution to the original problem, though (but apparently not
> required, either -- see my last post).
> [...] 
>> What does:
>> ps -e f
>> (f w/o -) show on such a node? Are all the processes bound to an
>> sge_shepherd, or did some jump out of the processes tree and weren't
>> killed?
> There are no sge_shepherd processes on the nodes. I did not set up SGE on the
> machine but what I understand from the documentation is that
> sge_shepherd is only used in the case of "tight integration" of PEs.
> In our case, the PE starts the MPI processes.

Well, even with a loose integration, you have to honor the list of machines granted to your job. What exactly do you mean by "the PE starts the MPI processes"? You will need at least an sge_execd on the nodes, so that SGE is aware of their existence and can make a suitable slot allocation for your job. (The sge_execd will then start the shepherd in case of a tight integration.)
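
To illustrate what honoring the granted machines can look like with a loose integration, here is a minimal sketch of a job-script fragment. It assumes SGE sets $PE_HOSTFILE for the parallel job, with lines of the form "host slots queue processor-range". The function name pe_hostfile_to_machinefile and the output path are made up for this example, and the mpirun line is only a comment, because the option names depend on your MPI implementation.

```shell
#!/bin/sh
# Sketch: turn SGE's $PE_HOSTFILE into a machine file with one line per slot.
# Assumed line format of $PE_HOSTFILE: <host> <slots> <queue> <processor range>

# Hypothetical helper (not part of SGE): print each host once per granted slot.
pe_hostfile_to_machinefile() {
    # $1: path to the PE host file
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Inside a job script: only act when SGE has actually provided a host file.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    machines="${TMPDIR:-/tmp}/machines"
    pe_hostfile_to_machinefile "$PE_HOSTFILE" > "$machines"
    # Start the MPI processes only on the granted hosts, for example:
    #   mpirun -np "$NSLOTS" -machinefile "$machines" ./my_mpi_program
    # (program name and mpirun options are placeholders; check your MPI's docs)
fi
```

If the MPI startup ignores this list and uses a fixed set of hosts instead, processes can end up on nodes that SGE already considers full.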

-- Reuti

> best,
> Steve
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307894
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

