[GE users] Custom load sensor - correct way to time limit a host?

Reuti reuti at staff.uni-marburg.de
Fri Sep 28 23:00:59 BST 2007

On 28.09.2007 at 22:18, skip at pobox.com wrote:

>     reuti> Different amount of CPUs, i.e. slots, you can setup in the queue
>     reuti> definition for each host/-group separate in one and the same
>     reuti> queue definition.
> I'm confused.  I thought users had to submit their jobs to specific queues.

That is more the way Torque operates: you submit into a queue. In SGE
you submit your job with resource requests, and SGE selects a
suitable queue that fulfils them.

> A queue fronts one or more execution hosts, right?  What am I missing?

As you stated, your machines have different numbers of CPUs.
This can still be covered by one queue:

slots 2,[@quads=4]
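
As a rough sketch of how that line fits together, assuming a host group
named @quads that holds the four-CPU machines: the host names, the queue
name all.q and the use of @allhosts below are placeholders, not taken from
this thread, and the "#" lines are annotations rather than qconf output.

```text
# host group, as shown by: qconf -shgrp @quads (host names are hypothetical)
group_name @quads
hostlist   quad01 quad02

# cluster queue, as shown by: qconf -sq all.q (only the relevant attributes)
qname      all.q
hostlist   @allhosts
slots      2,[@quads=4]
```

Every host in the queue gets 2 slots by default; hosts that are members of
@quads get 4 instead. This assumes the @quads machines are also covered by
the queue's hostlist.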

>     reuti> a) you could define a calendar in SGE to suspend or disable some
>     reuti> hosts (i.e. queue instances) during the day (man calendar_conf)
> host == queue instance?  As I indicated, I just want my (very naive) users
> to not have to worry about which queue to submit to.  Having only one queue
> means they can't make mistakes at submission time.

Exactly - one queue is sufficient in your case.

You could say so: each queue instance of a cluster queue runs on one
host. There may of course be several queue instances on a host, if
more than one queue is defined for that host.

>     reuti> b) define the queue with a "priority 19" (i.e. nice value) on
>     reuti>    these machines, so that any local activity gets more CPU time,
>     reuti>    keeping the running jobs (depends of course on the memory
>     reuti>    requirements and more, whether this is suitable in your case)
> Not an option.  I can't bear the context switch time necessary when input
> comes in.  It has to be processed right now.

Okay. But this is decided in the kernel of the OS. SGE is not in the
game at this level (unless specially configured to renice jobs).

(Maybe you are thinking here about the suspend_threshold in the queue  
definition, which of course would need some time to show its effect.)

>     reuti> In any case I would suggest to give these machine a higher
>     reuti> sequence number and change the schedule setup to order the queue
>     reuti> instances by "seqno", so that these machines are used last.
> That seems inadequate as well.

This was not c), but something I would suggest doing whatever
solution you choose: first fill the nodes which are in operation all
the time, then use those machines which are only available from time
to time (unless they are turned off by the calendar, in which case the
jobs have to wait).
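
A minimal sketch of that ordering, assuming the part-time machines are
collected in a host group called @parttime and the queue is called all.q
(both names are hypothetical, as are the sequence numbers; the "#" lines
are annotations only):

```text
# cluster queue, edited with: qconf -mq all.q
seq_no             0,[@parttime=10]

# scheduler configuration, edited with: qconf -msconf
queue_sort_method  seqno
```

With queue_sort_method set to seqno, queue instances with the lower seq_no
are filled first, so the @parttime hosts are only used once the always-on
nodes are full. On its own this only orders the hosts; it does not keep
jobs off them.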


Another point to decide in the calendar configuration is whether
you want to drain these hosts a few hours before they must be
available, or just suspend any running job there at the time they
must be available (or both, to prevent a job from starting there
and being suspended after only 5 minutes).
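
A sketch of the "both" variant, under the assumption that the hosts are
reserved on weekdays from 8:00 to 18:00 and should stop accepting new jobs
two hours earlier. The calendar name, the hours, the queue name all.q and
the host group @parttime are all hypothetical, the "#" lines are
annotations only, and the exact field syntax should be checked against
man calendar_conf:

```text
# calendar, added with: qconf -acal reserved
calendar_name  reserved
year           NONE
week           mon-fri=6-8=off mon-fri=8-18=suspended

# cluster queue, edited with: qconf -mq all.q
calendar       NONE,[@parttime=reserved]
```

In the "off" window the queue instances are disabled, so running jobs
continue but no new ones start; in the "suspended" window any job still
running there is suspended. Hosts outside @parttime have no calendar and
are unaffected.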

If your applications support checkpointing, a suspend of a job could  
also trigger the migration to another machine.

-- Reuti

> Suppose I have ten hosts and users submit
> 100 jobs.  Isn't it likely that those high sequence number hosts are going
> to come into play?  I simply can't have any compute jobs run on these hosts
> under any circumstances (no matter how low the priority) during the time
> window when they are reserved for other uses.  I potentially have 500
> computers at my disposal if I can guarantee this property.  If not, I have
> maybe 50.
> Skip
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
