[GE users] node oversubscription despite slots=8

gmareels guy.mareels at gmail.com
Mon May 17 10:50:20 BST 2010



Hello,

Indeed, I changed the configuration some weeks ago while the cluster was up and users were running jobs on the nodes.
However, those jobs have long since ended and the problem still persists.

I see it on two nodes for sure, but I am unable to test the other nodes because users are running jobs on them.

How can I get SGE back in sync? Can the scheduler be restarted without losing currently pending and running jobs?

Best regards,
Guy

On Mon, May 17, 2010 at 11:25 AM, reuti <reuti at staff.uni-marburg.de> wrote:
Hi,

On 17.05.2010, at 10:49, gmareels wrote:

> We have 10 machines (nodes) with 8 cores each. I have set up each execution host with "complex_values slots=8".

fine.
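
(As a side note, and purely as a sketch of the usual qconf workflow rather than anything confirmed in this thread, a per-host limit like that is typically set and verified on each execution host along these lines; "node001" is only an example host name:)

    # open the exec host definition in an editor and add/edit the line
    #   complex_values  slots=8
    qconf -me node001

    # or set the attribute non-interactively
    qconf -mattr exechost complex_values slots=8 node001

    # show the resulting exec host configuration
    qconf -se node001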


> In addition, we have two queues (long.q and short.q) which differ in default priority and in the time within which a job must be completed. Both queues have access to all 10 machines.
>
> So to prevent oversubscription of the machines, I set up the slots variable.
>
> When I submit an 8-core job to a single machine in the short queue (short@node001), and afterwards submit another 8-core job to the same machine in long.q, the latter job correctly stays pending.
>
> However, when we submit a 4-core job to short@node001 and then submit an 8-core job in long.q to the same machine (long@node001), the latter job is allowed to start. It should not be allowed to run, as 4 cores are already occupied!

Correct.
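
(For illustration only, the two cases above could be reproduced with something along these lines; the parallel environment name "smp" and the script name are placeholders of mine, not taken from the original report:)

    # case 1: 8 cores in short.q, then 8 more in long.q -> second job stays pending (correct)
    qsub -q short.q@node001 -pe smp 8 job.sh
    qsub -q long.q@node001  -pe smp 8 job.sh

    # case 2: 4 cores in short.q, then 8 in long.q -> second job should stay pending, but starts
    qsub -q short.q@node001 -pe smp 4 job.sh
    qsub -q long.q@node001  -pe smp 8 job.sh

    # watch the host-level slots consumable while doing so
    qstat -F slots -q "*@node001"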


>
> "qstat -F" shows for long at node001
>
>        hc:slots=-4
>
>
> What is going wrong? How can I solve this?

Did you add the complex_value to each exechost while something was running in the system? Sometimes SGE gets out-of-sync then. But it should heal itself once the jobs are drained from the system.
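
One conservative way to let it heal (a sketch of standard commands, not something prescribed here) is to drain the affected node and re-check the consumable. Restarting sge_qmaster or the scheduler is also safe in the sense that running and pending jobs are kept, since the execds keep the jobs running and the qmaster re-reads its spool, but the exact restart procedure depends on the SGE version.

    # keep new jobs off the suspect node until it has drained
    qmod -d short.q@node001,long.q@node001

    # once the running jobs there have finished, check the host consumable again
    qstat -F slots -q "*@node001"
    qconf -se node001

    # re-enable the queue instances
    qmod -e short.q@node001,long.q@node001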

You observe this on all machines in the cluster?

-- Reuti


> Many thanks,
> Guy
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257576

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



