[GE users] Jobs for cluster management?

Reuti reuti at staff.uni-marburg.de
Fri Dec 23 15:18:25 GMT 2005


Hi,

On 23.12.2005, at 16:06, Chris Dagdigian wrote:

>
> The problem is that nobody manages their cluster the same way,  
> especially when it comes to the specifics of system and OS management.
>
> For instance, most operators of larger clusters would never concern
> themselves with the details of a single node - they build their
> infrastructures to allow complete unattended bare-metal OS
> installation over the network. Where such systems exist, all you
> need to do is touch a TFTP config file and remotely power cycle a
> node to have it completely wiped and replaced with the newest
> image. Updating systems becomes a single-click operator task that
> can either be done by hand on a busy system whenever the situation
> allows, or be trivially scripted. Deep SGE integration is
> unnecessary.
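(For illustration, the reimage flow described above boils down to
something like this; the PXE paths, BMC naming scheme and credentials
are all site-specific assumptions:

    node=node042
    # repoint the node's PXE entry from local boot to the installer
    # (pxelinux actually names config files by MAC or hex IP; the
    # hostname is used here only to keep the sketch readable)
    cp /tftpboot/pxelinux.cfg/reinstall "/tftpboot/pxelinux.cfg/$node"
    # power cycle via the BMC so the node PXE-boots into the installer
    ipmitool -H "${node}-bmc" -U admin -P secret chassis power cycle

)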
>
> I guess my take then is that there will never be a suitable
> one-size-fits-all way to do this within Grid Engine, and I'd rather
> have the developers working on scheduling/job-related RFEs and
> enhancements.
>
> Although ...
>
> If we could narrow this down into a targeted RFE then it would
> certainly be worth doing. For instance -- what about an enhancement
> request that would allow an SGE operator to assign a custom status
> message to queues in the disabled state "d"? The status message
> would let us discern why nodes are disabled ("broken" vs
> "needs_bios_update"), and we could also use qselect or XML qstat
> output to programmatically discover the nodes that are in "d" state
> because they require maintenance.
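(Side note: the discovery half already works with plain qselect
today, e.g.:

    qselect -qs d                          # queue instances in state 'd'
    qselect -qs d | cut -d@ -f2 | sort -u  # just the affected hostnames

what's missing is a reason attached to each entry.)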

http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=10239

If you want to combine it with qmod, a short script would do.
Something along these lines, say (a sketch; the reason log and its
path are made up):
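    #!/bin/sh
    # disable_node.sh <node> <reason>
    # disable all queue instances on a node and log why, since SGE
    # itself keeps no reason attached to the 'd' state
    node=$1; reason=$2
    qmod -d "*@$node"
    echo "$(date +%Y-%m-%d) $node $reason" >> /var/spool/sge/disabled_reasons

- Reuti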

> Having a custom state message associated with "d" queues would be
> pretty useful. Other ideas? If we are going to make an RFE targeted
> towards system management, it should probably be very detailed and
> specific as to how it will work.
>
> -Chris
>
>
>
>
>
> On Dec 23, 2005, at 8:10 AM, Jon Lockley wrote:
>
>> Well, kind of.
>>
>> You'd "qmod -d" everything you need to upgrade and make a list of
>> all those nodes. Then frequently run a cron (or similar) job to
>> check when these nodes become empty. After applying the upgrade
>> you then need to remove the node from the list (so that you don't
>> try to upgrade it again). You shouldn't rely on the queue state as
>> a test of upgrades, as there are all sorts of reasons why a queue
>> might be disabled.
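(Roughly, the cron side of this could look like the following sketch;
the node list location and the do_upgrade script are assumptions, and
the qstat -f column layout assumed is the SGE 6.0 "used/total" form:

    #!/bin/sh
    # run from cron: upgrade nodes that have drained, then re-enable them
    NODELIST=/var/tmp/upgrade_nodes
    for node in $(cat "$NODELIST"); do
        # sum the "used" half of the used/total slots column over all
        # queue instances on this node
        used=$(qstat -f -q "*@$node" | \
               awk '$3 ~ /\// { split($3, a, "/"); u += a[1] } END { print u+0 }')
        if [ "$used" -eq 0 ]; then
            /usr/local/sbin/do_upgrade "$node" || continue
            grep -v "^$node\$" "$NODELIST" > "$NODELIST.tmp" && \
                mv "$NODELIST.tmp" "$NODELIST"     # don't upgrade twice
            qmod -e "*@$node"                      # back to normal duty
        fi
    done

)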
>>
>> So yes, it's doable with some scripting, but *if* other folks are
>> doing the same stuff, would it make any sense (and is it worth the
>> hassle!) to make it part of SGE?
>>
>> Thanks,
>>
>> Jon
>>
>> On Fri, 23 Dec 2005, Chris Dagdigian wrote:
>>
>>>
>>> Sounds like using Grid Engine's disable-queue function ("qmod -d
>>> <queue instance>") would get you the same thing:
>>>
>>> - running jobs are untouched in disabled queues
>>> - no user jobs are ever touched, suspended, re-queued or killed
>>> - no new work gets sent to disabled queues (thus draining the
>>> machine)
>>> - you can easily disable every node in the cluster with "qmod -d
>>> '*'", or in manageable groups (see the sketch below)
>>> - you know which nodes still need admin work done because they
>>> are in state 'd'
>>> - a node that is rebooted for admin reasons (update applied, new
>>> kernel, etc.) will still come online in 'disabled' state
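(For the "manageable groups" case, SGE 6's host groups fit naturally,
assuming a group such as @rack1 has been defined:

    qmod -d '*@@rack1'   # disable every queue instance on hosts in @rack1

)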
>>>
>>>
>>> -Chris
>>>
>>>
>>> On Dec 23, 2005, at 7:46 AM, Jon Lockley wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm wondering if the following is already possible (in a
>>>> non-kludgy way) or whether it's something sensible to ask for as
>>>> a new feature.
>>>>
>>>> Traditionally, when we want to upgrade the software on nodes in
>>>> a cluster, we drain work off those nodes by shortening the wall
>>>> clock limit every few hours so that it reaches zero when the
>>>> maintenance work is scheduled. This is a bit of a pain for the
>>>> users, but they prefer it to the alternative: killing all running
>>>> jobs at a scheduled time. Obviously this means the cluster gets
>>>> fairly empty, so I'm wondering if there's a better option.
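(For reference, that stepwise drain can be scripted against the
queue's hard runtime limit; the queue name and step value here are
made up:

    # run every few hours, shrinking h_rt towards the maintenance window
    qconf -mattr queue h_rt 4:00:00 all.q

)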
>>>>
>>>> My idea is to have some form of "management job" in the SGE
>>>> software. Management jobs run once and once only on each node
>>>> selected (usually the whole cluster, I guess) as soon as the
>>>> current (user) job on it finishes. In other words, they jump
>>>> ahead of regular user jobs on nodes which haven't yet run the
>>>> management job. The node in question could then be automatically
>>>> released back to normal duties, and eventually the whole cluster
>>>> will have been upgraded/changed.
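(Stock SGE can approximate this today, though only loosely: one job
per host, pinned there and bumped in POSIX priority so it dispatches
ahead of waiting user jobs. A sketch, with upgrade.sh as a
hypothetical maintenance script:

    #!/bin/sh
    # one "management job" per execution host; -p > 0 needs
    # operator/manager rights, and the job still has to wait for a
    # free slot -- it does not get the whole node to itself
    for node in $(qconf -sel); do
        qsub -l hostname="$node" -p 100 -N "mgmt_$node" upgrade.sh
    done

)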
>>>>
>>>> The advantages of doing things this way are 1) you don't have to
>>>> empty the cluster or kill jobs to do upgrades, 2) you're not
>>>> changing anything while users have jobs running, and 3) SGE keeps
>>>> track of which machines do/don't still need to execute the
>>>> management tasks.
>>>>
>>>> I grant that this wouldn't be appropriate for every upgrade,
>>>> e.g. where post-upgrade nodes can't work with pre-upgrade nodes
>>>> for parallel applications. However, I can see a lot of scenarios
>>>> where it makes sense to couple the job scheduler with cluster
>>>> management tasks to keep the cluster as "alive" as possible at
>>>> all times.
>>>>
>>>> So as I said, I'm curious to know if/how this can be done, or
>>>> alternatively whether other people would find it a useful SGE
>>>> feature.
>>>>
>>>> All the best,
>>>>
>>>> Jon
>>>>
>>>> ---------------------------------------------------------------------------
>>>> | Dr Jon Lockley, Centre Manager   |                                      |
>>>> | Oxford Supercomputing Centre     | Email jon.lockley at comlab.ox.ac.uk |
>>>> | Oxford University Computing Lab. | Tel +44 (0)1865 283569               |
>>>> | Wolfson Building                 | Fax +44 (0)1865 273839               |
>>>> | Parks Rd.                        | www.osc.ox.ac.uk                     |
>>>> | Oxford, OX1 3QD                  | "Out of Darkness Cometh Light"       |
>>>> | UK                               |                                      |
>>>> ---------------------------------------------------------------------------
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> ---------------------------------------------------------------------------
>> | Dr Jon Lockley, Centre Manager   |                                      |
>> | Oxford Supercomputing Centre     | Email jon.lockley at comlab.ox.ac.uk |
>> | Oxford University Computing Lab. | Tel +44 (0)1865 283569               |
>> | Wolfson Building                 | Fax +44 (0)1865 273839               |
>> | Parks Rd.                        | www.osc.ox.ac.uk                     |
>> | Oxford, OX1 3QD                  | "Out of Darkness Cometh Light"       |
>> | UK                               |                                      |
>> ---------------------------------------------------------------------------
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



