[GE users] node oversubscription despite slots=8

reuti reuti at staff.uni-marburg.de
Mon May 17 14:54:57 BST 2010


On 17.05.2010 at 11:50, gmareels wrote:

> Hello,
>  
> Indeed I changed the configuration some weeks ago while the cluster was up and users were running on the nodes.
> However, those jobs ended long ago and the problem still persists.
>  
> I see it on two nodes for sure, but I am unable to test the other nodes because users are running jobs on them.
>  
> How can I get SGE back in sync? Can the scheduler be restarted without losing currently pending and running jobs?

The sgemaster can be stopped and started at any time w/o interfering with running jobs. They all will reappear in the joblist.

Disclaimer: anyway, use at your own risk.

-- Reuti
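
As a side note on the arithmetic behind the hc:slots=-4 value quoted below, here is a small Python sketch. It is only a toy model under stated assumptions, not SGE code: HostSlots, try_start and force_start are hypothetical names made up for illustration, and the model assumes a single host-level consumable of 8 slots (as set with complex_values slots=8) that both queues draw from.

```python
class HostSlots:
    """Toy model of a host-level consumable (e.g. complex_values slots=8).

    Hypothetical illustration only; this is not how SGE is implemented.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.running = []  # (queue_instance, slots) for each started job

    @property
    def remaining(self):
        # Capacity minus everything booked, regardless of which queue booked it.
        return self.capacity - sum(slots for _, slots in self.running)

    def try_start(self, queue_instance, slots):
        # Expected behaviour: a job only starts if enough slots remain.
        if slots > self.remaining:
            return False  # job stays pending
        self.running.append((queue_instance, slots))
        return True

    def force_start(self, queue_instance, slots):
        # Models the out-of-sync case: the job is booked without the check.
        self.running.append((queue_instance, slots))


# Expected: the 8-core job waits behind the 4-core job.
expected = HostSlots(capacity=8)
expected_first = expected.try_start("short@node001", 4)
expected_second = expected.try_start("long@node001", 8)

# Reported: both jobs run, so the host is oversubscribed by 4 slots.
reported = HostSlots(capacity=8)
reported.force_start("short@node001", 4)
reported.force_start("long@node001", 8)

print(expected_first, expected_second, expected.remaining)
print(reported.remaining)
```

In the second case the model ends up at 8 - 4 - 8 = -4 remaining slots, which matches the value that "qstat -F" shows for the host in the report below.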


> Best regards,
> Guy
> 
> On Mon, May 17, 2010 at 11:25 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
> 
> On 17.05.2010 at 10:49, gmareels wrote:
> 
> > We have 10 machines (nodes), each with 8 cores. I have set up each execution host with "complex_values slots=8".
> 
> fine.
> 
> 
> > In addition, we have two queues (long.q and short.q), which differ in default priority and in the time within which a job must be completed. Both queues have access to all 10 machines.
> >
> > So to prevent oversubscription of the machines, I set up the slots variable.
> >
> > When I submit an 8-core job to a single machine in the short queue (short@node001), and afterwards submit another 8-core job to the same machine in long.q, the latter job is correctly left pending.
> >
> > However, when we submit a 4-core job to short@node001, and then submit an 8-core job in long.q to the same machine (long@node001), the latter job is allowed to start. It should not be allowed to run, as 4 cores are already occupied!
> 
> Correct.
> 
> 
> >
> > "qstat -F" shows for long@node001
> >
> >        hc:slots=-4
> >
> >
> > What is going wrong? How can I solve this?
> 
> Did you add the complex_values entry to each exechost while something was running in the system? Sometimes SGE gets out of sync then. But it should heal itself once the jobs have drained from the system.
> 
> You observe this on all machines in the cluster?
> 
> -- Reuti
> 
> 
> > Many thanks,
> > Guy
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257569
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257576
> 
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257602




More information about the gridengine-users mailing list