[GE users] Advanced reservation for cluster outage?

s_kreidl sabine.kreidl at uibk.ac.at
Tue Jan 26 16:50:42 GMT 2010


It's GE 6.2u3. The default_duration is respected for jobs submitted
without any h_rt or s_rt limits to par.q.
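For reference, this setting lives in the scheduler configuration and can be checked with qconf (a minimal sketch; the 8-hour value is only an example):

```shell
# default_duration is the runtime the scheduler assumes for jobs that
# request no h_rt/s_rt; it decides whether such a job would still finish
# before an advance reservation starts.
qconf -ssconf | grep default_duration

# To change it, edit the scheduler configuration, e.g. to 8 hours:
#   default_duration 8:00:00
qconf -msconf
```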

reuti wrote:
> On 26.01.2010 at 07:47, s_kreidl wrote:
>
>
>> No, but I have the default_duration set.
>>
>
> IIRC these were not honored in earlier versions when the AR feature
> was new (but I'm not sure about this). Which version are you using?
>
> -- Reuti
>
>
>
>> reuti wrote:
>>
>>> On 25.01.2010 at 10:07, s_kreidl wrote:
>>>
>>>
>>>
>>>> I submitted the AR with exactly the qrsub command line below (still
>>>> ongoing, with no changes). There was no special reason for the
>>>> -q "*@*"; I just wanted to get all of the available slots reserved,
>>>> independent of the two existing queues - which obviously didn't work
>>>> as intended.
>>>>
>>>> The number of slots per host is limited to 8 for every execution
>>>> host (complex_values slots=8,...).
>>>> In addition, it is limited in both queue configurations:
>>>> all.q: slots 0,[n001.uibk.ac.at=8],[n002.uibk.ac.at=8],...
>>>> par.q: slots 8
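Both limits quoted above can be inspected with qconf (a sketch; host and queue names as in the thread):

```shell
# Per-host limit, set as a consumable complex on each execution host:
qconf -se n001.uibk.ac.at | grep complex_values

# Per-queue limits (the all.q value uses the host-specific bracket form):
qconf -sq all.q | grep slots
qconf -sq par.q | grep slots
```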
>>>>
>>>> Regards, Sabine
>>>>
>>>> reuti wrote:
>>>>
>>>>
>>>>> On 22.01.2010 at 16:01, s_kreidl wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Sorry to open this issue up again, but the AR suddenly stopped
>>>>>> working the way it did before.
>>>>>>
>>>>>> I can now submit jobs to the non-reserved all.q with runtime limits
>>>>>> well exceeding the AR start time (and I'm pretty sure I tested this
>>>>>> thoroughly before - I got the "cannot run at host [...] due to a
>>>>>> reservation" messages then).
>>>>>>
>>>>>>
>>> Do you request h_rt explicitly in the qsub command?
>>>
>>> -- Reuti
>>>
>>>
>>>
>>>
>>>>>> Parallel jobs on the other hand still don't get scheduled if
>>>>>> they'd
>>>>>> interfere with the AR - as intended.
>>>>>>
>>>>>> Any hints on what's going wrong here, or how I can use the AR to
>>>>>> get a consistent reservation for all existing cluster queues?
>>>>>>
>>>>>>
>>>>>>
>>>>> How did you submit the AR? One big AR which requests all slots from
>>>>> the cluster, like in your original post? (BTW: any reason why you
>>>>> also specified -q "*@*"? I think it should work without it.) Is the
>>>>> number of slots per node limited even when there are multiple queues
>>>>> per node?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Thanks again in advance,
>>>>>> Sabine
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Good to hear that handling cluster outages is an intended use of
>>>>>>> AR. And thanks for the hint on the existing RFE. I will consider
>>>>>>> adding to it as soon as I am clear about what I'd actually want
>>>>>>> from "qstat -j". However, I can absolutely confirm the unhelpful
>>>>>>> "PE offers only 0 slots" messages in my situation.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Sabine
>>>>>>>
>>>>>>> reuti wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 19.01.2010 at 17:21, s_kreidl wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Reuti,
>>>>>>>>>
>>>>>>>>> thanks for the quick reply. Yes, of course, qrstat is indeed
>>>>>>>>> the standard way of getting information about ARs.
>>>>>>>>>
>>>>>>>>> However, I find it a rather long way to go for a user to look
>>>>>>>>> for ongoing advance reservations because of a pending job, when
>>>>>>>>> there are no hints in the "qstat -j" messages and none from any
>>>>>>>>> other qstat request. (And to be honest, I'm rather reluctant to
>>>>>>>>> write another piece of documentation for the rare occasions of
>>>>>>>>> cluster outages for which we (mis-?)use the AR feature ;-) ).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> No, it's an intended use IMO.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Don't you think some kind of RFE would be appropriate?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> There is already an RFE which you could extend:
>>>>>>>>
>>>>>>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=224
>>>>>>>>
>>>>>>>> It's also the case that sometimes you see only that the PE
>>>>>>>> offers 0 slots - but it's not always easy to find the cause of
>>>>>>>> this. A redesign of qstat (or better: of its scheduler output)
>>>>>>>> would be an improvement.
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sabine
>>>>>>>>>
>>>>>>>>> reuti wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 19.01.2010 at 16:57, s_kreidl wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I somehow got the AR working as expected with SGE 6.2u3
>>>>>>>>>>> (qrsub -a
>>>>>>>>>>> 01291200 -e 01291800 -pe "openmpi-8perhost" 1008 -q "*@*" -u
>>>>>>>>>>> my_user)
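A reservation for a full-cluster outage along these lines could also be submitted without the -q request, then verified with qrstat (a sketch; dates, PE name and slot count are the examples from the thread, and <ar_id> is a placeholder):

```shell
# Reserve all slots for the maintenance window (times as MMDDhhmm):
qrsub -a 01291200 -e 01291800 -pe openmpi-8perhost 1008 -u my_user

# Verify what was actually granted:
qrstat -u "*"          # list all ARs
qrstat -ar <ar_id>     # detailed view of one AR, incl. granted slots
```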
>>>>>>>>>>>
>>>>>>>>>>> The problem I encounter now is that users have a hard time
>>>>>>>>>>> finding out anything about the existing AR:
>>>>>>>>>>>
>>>>>>>>>>> 1. "qhost -q" shows the reserved slots for one of our two
>>>>>>>>>>> queues (par.q), but shows nothing for the other queue (all.q -
>>>>>>>>>>> historical reasons), for which the reservation obviously has
>>>>>>>>>>> the desired effect too.
>>>>>>>>>>>
>>>>>>>>>>> 2. "qstat -j" gives no hint of any ongoing reservation for
>>>>>>>>>>> pending parallel jobs (only jobs explicitly sent to the
>>>>>>>>>>> "non-reserved" queue all.q show "cannot run at host [...] due
>>>>>>>>>>> to a reservation" messages)
>>>>>>>>>>>
>>>>>>>>>>> 3. "qstat -f" shows no reservation in the triple slot
>>>>>>>>>>> display of
>>>>>>>>>>> any queue instance
>>>>>>>>>>>
>>>>>>>>>>> 4. "qstat -g c" shows no reservation at all
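As a stopgap until qstat itself reports ARs, a site could hand users a tiny wrapper that shows active reservations next to the pending-job diagnostics. A minimal sketch (the "qwhy" name and its wording are invented, not part of SGE; it merely concatenates the output of the two commands):

```shell
#!/bin/sh
# Hypothetical wrapper "qwhy": show pending-job diagnostics together
# with any active advance reservations, so users need not know qrstat.

# Pure formatting helper: merge the two command outputs.
merge_report() {
    qstat_out=$1
    qrstat_out=$2
    printf '%s\n' "$qstat_out"
    # Only mention ARs if qrstat actually reported any.
    if [ -n "$(printf '%s' "$qrstat_out" | tr -d '[:space:]')" ]; then
        printf '\nActive advance reservations (may be why the job is pending):\n'
        printf '%s\n' "$qrstat_out"
    fi
}

# Usage on a live cluster:  qwhy <job_id>
if [ -n "${1:-}" ]; then
    merge_report "$(qstat -j "$1")" "$(qrstat -u '*')"
fi
```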
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> does:
>>>>>>>>>>
>>>>>>>>>> $ qrstat -u "*"
>>>>>>>>>>
>>>>>>>>>> (note the r in qstat) help?
>>>>>>>>>>
>>>>>>>>>> -- Reuti
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I do have two questions/concerns now:
>>>>>>>>>>>
>>>>>>>>>>> 1. Am I missing some standard procedure for making ARs
>>>>>>>>>>> visible to the user as a reason for their pending jobs - or is
>>>>>>>>>>> an update to 6.2u5 necessary?
>>>>>>>>>>>
>>>>>>>>>>> 2. If not, I'd like to file an RFE of some kind, but as I
>>>>>>>>>>> understand too little about the internal workings of SGE and
>>>>>>>>>>> AR, I'd like to put this up for discussion.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts would be much appreciated.
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sabine
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------
>>>>>>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>>>>>>>>> dsForumId=38&dsMessageId=239747
>>>>>>>>>>>
>>>>>>>>>>> To unsubscribe from this discussion, e-mail: [users-
>>>>>>>>>>> unsubscribe at gridengine.sunsource.net].
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241132

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list