[GE users] Jobs for cluster management?

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Tue Jan 3 09:29:38 GMT 2006


Hi John,

just a couple of additional comments.

You can use a calendar to disable a queue at a specific time (see the
calendar sketch after this list). This will result in:

- only jobs that will be finished before the queue is disabled are started

- jobs which run longer than they requested will be killed/rescheduled/...
when the calendar disables the queue

- You could create a maintenance queue which is only open for the
maintenance jobs. Using the available policies, it is possible to put
them in front of all the other jobs (see the queue sketch below).

- All the other queues could be subordinated to the maintenance queue,
which will force a reschedule of the running jobs as soon as the
maintenance job is running.

- Each job can be submitted with a specific start time, ensuring that the
job gets executed when one wants it to run.

- You could use qalter, after the maintenance job was submitted, to change
all the pending jobs to wait for the maintenance job to finish (see the
qsub/qalter example below).
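
For example, the calendar part could look roughly like this (an untested
sketch; the calendar name and the date are made up, see calendar_conf(5)
for the exact syntax):

    # qconf -acal maintenance
    calendar_name    maintenance
    year             3.1.2006=8-18=off
    week             NONE

    # then attach it to the queue via "qconf -mq <queue>":
    calendar         maintenance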
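
The maintenance queue with the subordination could be set up along these
lines (again just a sketch; "maint.q", "all.q" and the access list name
are placeholders):

    # qconf -aq maint.q    (only the relevant attributes shown)
    qname             maint.q
    slots             1
    user_lists        maint_admins    # only maintenance staff may submit here
    subordinate_list  all.q=1         # suspend all.q as soon as one slot
                                      # in maint.q is occupied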
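
The start time and the dependency shuffling are then plain qsub/qalter
calls, e.g. (sketch; the job id 4711 stands for the maintenance job):

    # run the maintenance job no earlier than Jan 3rd 2006, 8:00
    qsub -a 200601030800 -q maint.q maintenance.sh

    # put every pending job on hold until the maintenance job finishes
    for jid in $(qstat -s p | awk 'NR > 2 {print $1}'); do
        qalter -hold_jid 4711 $jid
    done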

I believe what you describe is already possible with some kind of
special setup.
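
And for the monitoring script plus "qmod -d" approach you mention below,
the building blocks are already there (untested sketch; "all.q" and
"node42" are placeholder names):

    # take one queue instance offline, so no new jobs are started there
    qmod -d all.q@node42

    # list all queue instances that are currently in the disabled state
    qselect -qs d

    # re-enable the queue once the maintenance work is done
    qmod -e all.q@node42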

Cheers,
Stephan



Jon Lockley wrote on 12/23/05 18:30:

>On Fri, 23 Dec 2005, Chris Dagdigian wrote:
>
>>The problem is that nobody manages their cluster the same way,
>>especially when it comes to the specifics of system and OS management.
>
>Absolutely, life would be too easy if we did ;-) On the other hand, we all
>do system upgrade/maintenance tasks which have to be scheduled around the
>work of the users - hence my thought that the easiest way to do this might
>actually be with the job scheduler itself.
>
>>For instance, most operators of larger clusters would never concern
>>themselves with the details of a single node - they build their
>>infrastructures so as to allow for a complete unattended bare-metal
>>OS installation over the network.  When these systems exist, all you
>>need to do is touch a TFTP config file and remotely power cycle a
>>node to have it completely wiped and replaced with the newest image.
>>Updating systems becomes a single click operator task that can either
>>be done by hand on a busy system whenever the situation allows or the
>>process can be trivially scripted.  Deep SGE integration is unnecessary.
>
>Sure, it's how we used to do things on our old IBM cluster running XCAT,
>but you still have to do the single click or whatever at the right time to
>fit around people's work. In this example a "management job" running a
>script which simply says "reboot" means that as soon as a user's job
>finishes on a node, it reboots and upgrades from the new image.
>Eventually SGE would have coordinated an upgrade across the cluster
>without you having to drain it, do any monitoring, or go clicking on
>icons in GUIs (remembering which ones have already been done, of course).
>As I said, it's not always the way you want to do things, but sometimes it
>could be useful - depends on what you're doing.
>
>>I guess my take then is that there will never be a suitable
>>one-solution-fits-all way to do this within Grid Engine, and I'd rather
>>have the developers working on scheduling/job-related RFEs and
>>enhancements.
>
>I'm not suggesting that SGE should be integrated with any cluster
>management system - as you say, there are too many. All I'm suggesting is
>the ability to run a script once on every node with priority over user
>jobs. What's in the script depends on your OS and cluster management
>software. And yes, this can be done with a monitoring script and qmod -d,
>but it seems like a common and useful task, so is it worth doing it
>"neatly" in SGE?
>
>>Although ...
>>
>>If we could narrow this down into a targeted RFE then it would
>>certainly be worth doing. For instance, what about an enhancement
>>request that would allow an SGE operator to assign a custom status
>>message associated with disabled state "d" queues? The status message
>>would allow us to discern why nodes are disabled ("broken" vs
>>"needs_bios_update") and we could also use qselect or XML qstat
>>output to programmatically discover the nodes that are in "d" state
>>because they require maintenance.
>
>Hmmm, interesting idea. For example, you could put a note on a node saying
>"this machine will be rebooted and upgraded on insert-date-here".
>
>Cheers
>
>Jon
>
>------------------------------------------------------------------------
>| Dr Jon Lockley, Centre Manager   |                                   |
>| Oxford Supercomputing Centre     | Email jon.lockley at comlab.ox.ac.uk |
>| Oxford University Computing Lab. | Tel +44 (0)1865 283569            |
>| Wolfson Building                 | Fax +44 (0)1865 273839            |
>| Parks Rd.                        | www.osc.ox.ac.uk                  |
>| Oxford, OX1 3QD                  | "Out of Darkness Cometh Light"    |
>| UK                               |                                   |
>------------------------------------------------------------------------


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



