[GE users] Advanced reservation for cluster outage?

eddale eddale at cs.unc.edu
Tue Jan 19 21:33:54 GMT 2010


I went about this a different way.  Our cluster has a JSV (job 
submission verifier) that adjusts a job's h_rt so that the job ends 
5 minutes before a cluster outage starts.  If this adjustment is made, 
the user is notified with a prominent message on the console.  If you're 
interested in the code, I could clean it up and post it.
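
For illustration only, here is a minimal sketch of the arithmetic such a JSV might do. It is written in Python and is not the actual JSV code from our cluster. The function names, the way the outage start time is passed in, and the choice to raise an error when no time is left are all assumptions. It also measures from the time of the call, so it assumes the job starts right away; a job that waits in the queue would need the limit recomputed.

```python
from datetime import datetime, timedelta

# Assumption: jobs should finish 5 minutes before the outage begins.
SAFETY_MARGIN = timedelta(minutes=5)


def clamp_h_rt(requested_h_rt, now, outage_start, margin=SAFETY_MARGIN):
    """Return (h_rt_seconds, adjusted) for a job submitted at `now`.

    requested_h_rt is the user's limit in seconds, or None if unset.
    Raises ValueError if the outage (minus the margin) has been reached.
    """
    available = int((outage_start - margin - now).total_seconds())
    if available <= 0:
        raise ValueError("no run time left before the outage")
    if requested_h_rt is None or requested_h_rt > available:
        return available, True   # limit shortened; the user should be told
    return requested_h_rt, False  # request already fits before the outage


def format_h_rt(seconds):
    """Format a number of seconds as h:mm:ss, the form accepted by -l h_rt."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Hypothetical example: a 24 hour request, five hours before the outage.
    now = datetime(2009, 8, 16, 12, 0)
    outage = datetime(2009, 8, 16, 17, 0)
    h_rt, adjusted = clamp_h_rt(24 * 3600, now, outage)
    if adjusted:
        print(f"NOTE: h_rt reduced to {format_h_rt(h_rt)} due to the outage")
```

A real JSV would read the job's requested h_rt through the JSV interface, write the corrected value back, and send the notice to the user from there.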

Cheers,
Edward

s_kreidl said the following on 11.08.09 11:01:
> I tried to use an advance reservation to elegantly handle an approaching cluster outage, but failed. If this is not the right approach for such a situation, please let me know how this is usually done. My major concern is to allow "backfilling" with jobs whose h_rt limit would let them finish before the outage.
>
> We have SGE 6.2u2_1 installed.
> We have two queues, all.q and par.q, both with imposed h_rt runtime limits (identical with the scheduler's default_duration).
>
> I managed to reserve the majority of slots on the cluster with the following command line:
> qrsub -a 200908161700.00 -e 200908171700.00 -u test_user -q "*" -pe "openmpi*" 770-
>
> The resulting AR:
> # qrstat -ar 92
> id                             92
> name                           NetApp
> owner                          root
> state                          w
> start_time                     08/16/2009 17:00:00
> end_time                       08/17/2009 17:00:00
> duration                       24:00:00
> submission_time                08/10/2009 14:45:15
> group                          sge
> account                        sge
> granted_slots_list             par.q@n001=8,par.q@n003=8,...
> granted_parallel_environment   openmpi* slots 770-9999999
> acl_list                       test_user
>
> There are two things going wrong with respect to what I'm trying to do:
>
> 1. I can still submit all.q jobs whose runtime limits are too long to the reserved nodes. So, how do I reserve the whole cluster, rather than a single queue, preferably with one command line?
>
> 2. Jobs submitted to par.q don't start, even if their runtime limit is well below the critical limit (I tried with -l h_rt=60). "qstat -g c" shows:
> CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
> --------------------------------------------------------------------------------
> all.q                             0.51     86      0    418    504      0      0
> par.q                             1.00    624      0   -616      8      0      0
>
>
> Thanks in advance for your help.
> Best,
> Sabine
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211798
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239804



