[GE users] Advanced reservation for cluster outage?

eddale eddale at cs.unc.edu
Sat Jan 23 15:16:14 GMT 2010


I just wrote this up in a quick blog post at http://wp.me/p86be-5F.
Let me know if you have any questions about it.
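
In case the blog is unreachable, here's a minimal sketch of the approach
described below (not the exact code from the post): a Bourne-shell JSV
built on the jsv_include.sh library that ships with SGE, clamping a
job's h_rt so it ends about 5 minutes before a hard-coded outage time.
The outage timestamp is a placeholder, and h_rt is assumed to be
requested in plain seconds to keep the example short.

#!/bin/sh
# Sketch only: clamp h_rt so jobs finish before a scheduled outage.
# OUTAGE_START is a placeholder; h_rt is assumed to be given in seconds.

jsv_on_start()
{
   return
}

jsv_on_verify()
{
   OUTAGE_START=1264345200        # placeholder: outage start (epoch seconds)
   MARGIN=300                     # leave 5 minutes of head room
   now=`date +%s`
   max_rt=`expr $OUTAGE_START - $MARGIN - $now`

   h_rt=`jsv_sub_get_param l_hard h_rt`
   if [ -n "$h_rt" ] && [ "$h_rt" -gt "$max_rt" ]; then
      jsv_sub_add_param l_hard h_rt $max_rt
      jsv_log_info "h_rt reduced to $max_rt s so the job ends before the outage"
      jsv_correct "h_rt adjusted for scheduled cluster outage"
      return
   fi
   jsv_accept "Job is accepted"
}

. ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
jsv_main

You can hook a script like this in per submission with "qsub -jsv
/path/to/script ..." or cluster-wide via jsv_url in the global
configuration; the blog post has the real version, including the console
notification.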

Cheers,
Edward

s_kreidl said the following on 20.01.10 13:36:
> Hi Edward,
> I'd be absolutely interested in your way of handling this.
>
> Many thanks,
> Sabine
>
> eddale wrote:
>> I went about this a different way.  Our cluster has a JSV set up that
>> adjusts a job's h_rt so that the job ends 5 minutes before a cluster
>> outage starts.  If this adjustment is made, the user is notified with a
>> prominent message on the console.  If you're interested in the code, I
>> could clean it up and post it.
>>
>> Cheers,
>> Edward
>>
>> s_kreidl said the following on 11.08.09 11:01:
>>
>>> I tried to use an advanced reservation to elegantly handle an approaching cluster outage, but failed. If this is generally not the right approach for such a situation, please let me know how it is usually done. My major concern is to still allow "backfilling" with jobs whose h_rt limit would let them finish before the outage.
>>>
>>> We have SGE 6.2u2_1 installed.
>>> We have two queues, all.q and par.q, both with imposed h_rt runtime limits (identical to the scheduler's default_duration).
>>>
>>> I managed to reserve the majority of slots on the cluster with the following command line:
>>> qrsub -a 200908161700.00 -e 200908171700.00 -u test_user -q "*" -pe "openmpi*" 770-
>>>
>>> The resulting AR:
>>> # qrstat -ar 92
>>> id                             92
>>> name                           NetApp
>>> owner                          root
>>> state                          w
>>> start_time                     08/16/2009 17:00:00
>>> end_time                       08/17/2009 17:00:00
>>> duration                       24:00:00
>>> submission_time                08/10/2009 14:45:15
>>> group                          sge
>>> account                        sge
>>> granted_slots_list             par.q@n001=8,par.q@n003=8,...
>>> granted_parallel_environment   openmpi* slots 770-9999999
>>> acl_list                       test_user
>>>
>>> There are two things going wrong with respect to what I'm trying to do:
>>>
>>> 1. I can still submit all.q jobs with runtime limits that are too long onto the reserved nodes. So how do I reserve the whole cluster, rather than just one queue, preferably with a single command line?
>>>
>>> 2. Jobs submitted to par.q don't start, even if their runtime limit is well below the critical limit (I tried with -l h_rt=60). "qstat -g c" shows:
>>> CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
>>> --------------------------------------------------------------------------------
>>> all.q                             0.51     86      0    418    504      0      0
>>> par.q                             1.00    624      0   -616      8      0      0
>>>
>>>
>>> Thanks in advance for your help.
>>> Best,
>>> Sabine
>>>
>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240577

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list