[GE users] Does nice always work when determining which waiting job to assign to a node first?

Marconnet, James E Mr /Computer Sciences Corporation james.marconnet at smdc.army.mil
Tue Apr 12 20:13:08 BST 2005


Reuti:

I have two technical groups, each with what we call a primary queue and a
secondary queue. Call the queues G1p, G1s, G2p, G2s. The nodes are split
evenly between the two groups. G1p contains the same nodes as G2s, except
with nice=0 and nice=19 respectively, and likewise for G2p and G1s. The idea
was to allow both groups to use as many available nodes as possible, without
wasting half the nodes when no one in the other group needs to run anything.

To prevent running 2 or more jobs on nodes simultaneously, G1p is
subordinate to G2s, and so forth, per your suggestion in another thread.

We thought that setting nice would affect the scheduling of waiting jobs to
nodes. Turns out it does not. 

So if G1 submits a gazillion jobs to both its primary and secondary
queues, then all the cluster nodes get used very efficiently, but nobody in G2
gets to start even one job until all the G1 jobs have started and at least one
of them has finished.
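
A toy model of the behaviour described above may make it easier to see. This is not Grid Engine's scheduler, only a minimal Python sketch that assumes strict first-in-first-out dispatch with no ticket, urgency or POSIX priority in play, and it ignores which nodes each queue covers. The names Job and fifo_dispatch and the job counts are made up for illustration. Each job carries the nice value of its queue, but the dispatcher never looks at it.

```python
from collections import deque
from typing import NamedTuple


class Job(NamedTuple):
    owner: str   # submitting group, e.g. "G1"
    queue: str   # requested queue, e.g. "G1p"
    nice: int    # nice value of that queue; deliberately unused below


def fifo_dispatch(pending, free_slots):
    """Start jobs strictly in submission order until the slots run out.

    Returns (started, still_waiting). The nice value is never consulted.
    """
    waiting = deque(pending)
    started = []
    while waiting and free_slots > 0:
        started.append(waiting.popleft())
        free_slots -= 1
    return started, list(waiting)


# Hypothetical workload: G1 submits six jobs, then G2 submits one.
pending = [Job("G1", "G1p", 0)] * 3 + [Job("G1", "G1s", 19)] * 3
pending.append(Job("G2", "G2p", 0))

started, waiting = fifo_dispatch(pending, free_slots=4)
print([job.queue for job in started])
print([job.queue for job in waiting])
```

With four free slots, all four starts go to G1 (three in G1p, one in G1s), and the nice=0 G2p job is still waiting behind two nice=19 G1s jobs.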

Yes, we have load thresholds set (since there is some interactive testing
and since some nodes are dual-processor, hyperthreaded - another story!),
but we are not currently suspending jobs. Some users have short jobs, and
some have much longer jobs. So suspending jobs seemed iffy.

Probably clear as mud, but hope it helps someone understand and suggest an
approach.

Jim

-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Tuesday, April 12, 2005 1:36 PM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] Does nice always work when determining which waiting
job to assign to a node first?

Hi Jim,

I'm still not sure about your intended setup. You have two cluster queues -
one with nice=0, the other with nice=19 and any load_thresholds and
subordination?

One user will submit qsub -q nice0que, the other qsub -q nice19que?

Quoting "Marconnet, James E Mr /Computer Sciences Corporation" 
<james.marconnet at smdc.army.mil>:

> Bummer, we thought the nice value would affect the order in which 
> waiting jobs were assigned to the nodes. Apparently not so.
> 
> I searched the Admin Manual on seq_no, and I did not see where that 
> could be used unless we wanted to give up sorting by load level to 
> balance out the load on the nodes instead of filling up the first node 
> completely, then the next one fully, etc. And it's not at all clear 
> how this would be used anyway. Anyone able to clarify it?
> 
> Reading from the Admin manual: 
> Without any administrator influence, the order is first-in-first-out 
> (FIFO).
> 
> The administrator has the following means to control the job order:
> * Ticket-based job priority. ....
> * Urgency-based job priority. ....
> * POSIX priority.....
> 
> Is there an easy way to tie one of these methods to the queue that was 
> specified? I don't want the user to have to specify additional options 
> (that I have to explain and to police) other than the queue if it can be 
> helped.
> 
> And we'd prefer not to suspend jobs, but to let the running jobs 
> complete before starting new jobs. Suspending jobs would wreak havoc 
> on our completion predictions.

But you mentioned subordinated queues - they will be suspended then.

CU - Reuti

> 
> Perhaps I just want too much!?
> 
> Thanks,
> Jim
> 
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Tuesday, April 12, 2005 10:40 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Does nice always work when determining which 
> waiting job to assign to a node first?
> 
> Hi,
> 
> nice is not used for scheduling, but you can use a seq_no for the two 
> queue types, to fill first the nice=0 queue. But suspending a nice=19 
> queue - mhh then this queue could also have just nice of 0, as it's 
> suspended anyway, if the nice=0 queue is filled (if I got you 
> correctly).
> 
> CU - Reuti
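
To make the seq_no suggestion a little more concrete, here is a minimal Python sketch of how candidate queue instances might be ordered under the scheduler's two queue_sort_method settings, seqno and load. It assumes the documented behaviour that seqno sorts by sequence number first and uses load only to break ties, while load does the reverse. The QueueInstance type, the sort_queues helper, the node names, the seq_no values and the load figures are all hypothetical. Note that this ordering decides where a job lands, not which waiting job is picked first.

```python
from typing import NamedTuple


class QueueInstance(NamedTuple):
    name: str     # "queue@host", illustrative only
    seq_no: int   # sequence number configured on the queue
    load: float   # current load figure for the host


def sort_queues(instances, method):
    """Order queue instances the way the two sort methods are assumed to."""
    if method == "seqno":
        # Sequence number first; load only breaks ties.
        return sorted(instances, key=lambda q: (q.seq_no, q.load))
    if method == "load":
        # Load first; sequence number only breaks ties.
        return sorted(instances, key=lambda q: (q.load, q.seq_no))
    raise ValueError(f"unknown sort method: {method!r}")


# Hypothetical cluster state: the nice=0 queue gets the lower seq_no.
instances = [
    QueueInstance("nice19que@node1", seq_no=20, load=0.10),
    QueueInstance("nice0que@node2", seq_no=10, load=0.80),
    QueueInstance("nice0que@node3", seq_no=10, load=0.30),
]

for method in ("seqno", "load"):
    print(method, [q.name for q in sort_queues(instances, method)])
```

Under seqno the two nice0que instances come first (least loaded of the pair leading), so that queue fills before nice19que is touched. Under load the lightly loaded nice19que instance comes first, which is the trade-off between filling by queue and balancing by load raised earlier in the thread.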
> 
> 
> Marconnet, James E Mr /Computer Sciences Corporation wrote:
> > Using 6.0u3. Had some reports yesterday that some waiting jobs from 
> > a queue with nice=0 were still waiting after some waiting jobs from a 
> > different queue with nice=19 started running on nodes previously 
> > running jobs with nice=19. The "wronged" users figuratively went on 
> > the warpath soon afterwards.
> > 
> > We are using queue subordination to prevent too many jobs from 
> > different queues from running on the same nodes at the same time, but 
> > that works on a node-by-node basis, and it seemed to be working OK.
> > 
> > Anything in particular I should know about this or look for 
> > settings-wise?
> > 
> > Thanks!
> > Jim Marconnet
> > 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 


