[GE users] Advanced reservation for cluster outage?

s_kreidl sabine.kreidl at uibk.ac.at
Wed Jan 20 12:36:02 GMT 2010


Hi Edward,
I would definitely be interested in your way of handling this.

Many thanks,
Sabine

eddale wrote:
> I went about this a different way.  Our cluster has a JSV setup that 
> adjusts the h_rt of the job so that it will end 5 minutes before a 
> cluster outage starts.  If this adjustment is made, the user is notified 
> with a prominent message to the console.  If you're interested in the 
> code, I could clean it up and post it.
>
> Cheers,
> Edward
>
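
Edward's JSV code is not posted in this thread. As a rough sketch of the adjustment he describes (cap h_rt so that the job ends 5 minutes before the outage), the arithmetic might look like the following. The function name cap_h_rt and its parameters are hypothetical, the sketch does not use the actual JSV interface, and it assumes the job starts at the moment the check runs, which the thread does not state.

```python
from datetime import datetime, timedelta
from typing import Optional

# Safety margin taken from Edward's description: end 5 minutes before the outage.
MARGIN = timedelta(minutes=5)


def cap_h_rt(requested_s: Optional[int],
             now: datetime,
             outage_start: datetime,
             margin: timedelta = MARGIN) -> Optional[int]:
    """Return an h_rt in seconds that ends at least `margin` before the outage.

    Assumption: the job starts at `now`. Returns None when there is no
    time left before the outage, so the caller can reject or hold the job.
    """
    latest_end = outage_start - margin
    room = int((latest_end - now).total_seconds())
    if room <= 0:
        return None
    if requested_s is None or requested_s > room:
        # Request is missing or too long: shorten it (and notify the user).
        return room
    return requested_s


if __name__ == "__main__":
    now = datetime(2009, 8, 16, 12, 0, 0)
    outage = datetime(2009, 8, 16, 17, 0, 0)
    print(cap_h_rt(86400, now, outage))  # 17700 (4 h 55 min)
```

A real JSV would read the submitted h_rt, apply a check like this, write back the corrected value and print the notice to the user; those steps depend on the JSV language chosen and are left out here.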
> s_kreidl said the following on 11.08.09 11:01:
>   
>> I tried to use an advance reservation (AR) to handle an approaching cluster outage gracefully, but failed. If this is not the right approach for such a situation, please let me know how it is usually done. My main concern is to allow "backfilling" with jobs whose h_rt limit lets them finish before the outage.
>>
>> We have SGE 6.2u2_1 installed.
>> We have two queues, all.q and par.q, both with imposed h_rt runtime limits (identical to the scheduler's default_duration).
>>
>> I managed to reserve the majority of slots on the cluster with the following command line:
>> qrsub -a 200908161700.00 -e 200908171700.00 -u test_user -q "*" -pe "openmpi*" 770-
>>
>> The resulting AR:
>> # qrstat -ar 92
>> id                             92
>> name                           NetApp
>> owner                          root
>> state                          w
>> start_time                     08/16/2009 17:00:00
>> end_time                       08/17/2009 17:00:00
>> duration                       24:00:00
>> submission_time                08/10/2009 14:45:15
>> group                          sge
>> account                        sge
>> granted_slots_list             par.q@n001=8,par.q@n003=8,...
>> granted_parallel_environment   openmpi* slots 770-9999999
>> acl_list                       test_user
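
As a quick cross-check of the -a and -e values against the qrstat output above, the two timestamps can be parsed and subtracted. This is a minimal sketch that handles only the full CCYYMMDDhhmm.SS form used in the command line; the helper names parse_qrsub_time and as_hms are hypothetical.

```python
from datetime import datetime, timedelta

# Matches the full form used above, e.g. 200908161700.00
QRSUB_FMT = "%Y%m%d%H%M.%S"


def parse_qrsub_time(stamp: str) -> datetime:
    """Parse a CCYYMMDDhhmm.SS timestamp as passed to qrsub -a / -e."""
    return datetime.strptime(stamp, QRSUB_FMT)


def as_hms(delta: timedelta) -> str:
    """Format a duration as HH:MM:SS, letting hours exceed 23."""
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    start = parse_qrsub_time("200908161700.00")
    end = parse_qrsub_time("200908171700.00")
    print(start, end, as_hms(end - start))  # duration: 24:00:00
```

The computed duration agrees with the "duration 24:00:00" line reported by qrstat.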
>>
>> There are two things going wrong with respect to what I'm trying to do:
>>
>> 1. I can still submit all.q jobs to the reserved nodes, even when their runtime limits are too long to finish before the outage. So, how do I reserve the whole cluster, rather than a queue, preferably with a single command line?
>>
>> 2. Jobs submitted to par.q don't start, even if their runtime limit is well below the critical limit (I tried with -l h_rt=60). "qstat -g c" shows:
>> CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
>> --------------------------------------------------------------------------------
>> all.q                             0.51     86      0    418    504      0      0
>> par.q                             1.00    624      0   -616      8      0      0
>>
>>
>> Thanks in advance for your help.
>> Best,
>> Sabine
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211798
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>     
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239804

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239932