[GE users] Advanced reservation for cluster outage?

reuti reuti at staff.uni-marburg.de
Tue Jan 26 08:25:25 GMT 2010


Am 26.01.2010 um 07:47 schrieb s_kreidl:

> No, but I have the default_duration set.

IIRC, the default_duration was not honored in earlier versions, when
the AR feature was new (but I'm not sure about this). Which version
are you using?
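For reference, both things can be checked quickly on the qmaster; a small sketch (exact output of course varies per installation):

```shell
# Show the scheduler configuration; default_duration is the run time
# the scheduler assumes for jobs that request no h_rt.
qconf -ssconf | grep default_duration

# The first line of the help output reports the installed version,
# e.g. "GE 6.2u5".
qstat -help 2>&1 | head -1
```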

-- Reuti


> reuti schrieb:
>> Am 25.01.2010 um 10:07 schrieb s_kreidl:
>>
>>
>>> I submitted the AR with exactly the qrsub command line below (still
>>> ongoing, with no changes). There was no special reason for the
>>> -q "*@*"; I just wanted to get all of the available slots reserved,
>>> independent of the two existing queues - which obviously didn't
>>> work as intended.
>>>
>>> The number of slots per host is limited to 8 for every execution  
>>> host
>>> (complex_values slots=8,...).
>>> In addition, it is limited in both queue configurations:
>>> all.q: slots 0,[n001.uibk.ac.at=8],[n002.uibk.ac.at=8],...
>>> par.q: slots 8
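The limits quoted above can be double-checked on the live system; a sketch using the host and queue names from the thread:

```shell
# Per-host slot limit, set via complex_values on each exec host:
qconf -se n001.uibk.ac.at | grep complex_values

# Per-queue slot settings of the two cluster queues:
qconf -sq all.q | grep slots
qconf -sq par.q | grep slots
```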
>>>
>>> Regards, Sabine
>>>
>>> reuti schrieb:
>>>
>>>> Am 22.01.2010 um 16:01 schrieb s_kreidl:
>>>>
>>>>
>>>>
>>>>> Sorry to open this issue up again, but the AR is all of a sudden
>>>>> not working anymore as it did before.
>>>>>
>>>>> I can now submit jobs to the non-reserved all.q with runtime  
>>>>> limits
>>>>> well exceeding the AR start time (and I'm pretty sure I tested  
>>>>> this
>>>>> thoroughly before - got the "cannot run at host [...] due to a
>>>>> reservation" messages then).
>>>>>
>>
>> Do you request h_rt explicitly in the qsub command?
>>
>> -- Reuti
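That is, whether the jobs are submitted along these lines (script name hypothetical):

```shell
# With an explicit h_rt the scheduler knows the job's run time and can
# check whether the job would finish before the AR starts; without it,
# the configured default_duration is assumed instead.
qsub -l h_rt=2:00:00 my_job.sh
```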
>>
>>
>>
>>>>> Parallel jobs on the other hand still don't get scheduled if  
>>>>> they'd
>>>>> interfere with the AR - as intended.
>>>>>
>>>>> Any hints on what's going wrong here, or how I can use the AR to
>>>>> get a consistent reservation for all existing cluster queues?
>>>>>
>>>>>
>>>> How did you submit the AR? One big AR which requests all slots
>>>> from the cluster, like in your original post? (BTW: any reason why
>>>> you also specified -q "*@*"? I think it should work without it.)
>>>> Is the number of slots per node limited even when there are
>>>> multiple queues per node?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>
>>>>
>>>>> Thanks again in advance,
>>>>> Sabine
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Good to hear that handling cluster outages is an intended use of
>>>>>> AR.
>>>>>> And thanks for the hint on the existing RFE. I will consider
>>>>>> adding to
>>>>>> that one, as soon as I am clear about what I'd actually want from
>>>>>> "qstat
>>>>>> -j". However, I can absolutely confirm the unhelpful "PE offers
>>>>>> only 0
>>>>>> slots" messages in my situation.
>>>>>>
>>>>>> Regards,
>>>>>> Sabine
>>>>>>
>>>>>> reuti schrieb:
>>>>>>
>>>>>>
>>>>>>> Am 19.01.2010 um 17:21 schrieb s_kreidl:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi Reuti,
>>>>>>>>
>>>>>>>> thanks for the quick reply. Yes, of course, qrstat is indeed  
>>>>>>>> the
>>>>>>>> standard way of getting information about ARs.
>>>>>>>>
>>>>>>>> However, I find it a rather long way for a user to go, looking
>>>>>>>> for ongoing advanced reservations as the cause of a pending
>>>>>>>> job, when there are no hints in the "qstat -j" messages and
>>>>>>>> none from any other qstat request. (And to be honest, I'm
>>>>>>>> rather reluctant to write another piece of documentation for
>>>>>>>> the rare occasions of cluster outages, for which we (mis-?)use
>>>>>>>> the AR feature ;-) ).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> No, it's an intended use IMO.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Don't you think some kind of RFE would be appropriate?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> There is already an RFE which you could extend:
>>>>>>>
>>>>>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=224
>>>>>>>
>>>>>>> It's also the case that sometimes you see only that the PE
>>>>>>> offers only 0 slots - but it's often not easy to get at the
>>>>>>> cause of this. A redesign of qstat (or better: of the
>>>>>>> scheduler's output) would be an improvement.
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sabine
>>>>>>>>
>>>>>>>> reuti schrieb:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Am 19.01.2010 um 16:57 schrieb s_kreidl:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I somehow got the AR working as expected with SGE 6.2u3
>>>>>>>>>> (qrsub -a
>>>>>>>>>> 01291200 -e 01291800 -pe "openmpi-8perhost" 1008 -q "*@*" -u
>>>>>>>>>> my_user)
>>>>>>>>>>
>>>>>>>>>> The problem I encounter now is that users have a hard time
>>>>>>>>>> finding out anything about the existing AR:
>>>>>>>>>>
>>>>>>>>>> 1. "qhost -q" shows the reserved slots for one of the two
>>>>>>>>>> queues
>>>>>>>>>> (par.q) we have, but shows nothing for the other queue  
>>>>>>>>>> (all.q -
>>>>>>>>>> historic reasons), for which the reservation obviously does
>>>>>>>>>> have
>>>>>>>>>> the desired consequences too.
>>>>>>>>>>
>>>>>>>>>> 2. "qstat -j" gives no hint on any ongoing reservation for
>>>>>>>>>> parallel
>>>>>>>>>> pending jobs (only jobs explicitly sent to the "non-reserved"
>>>>>>>>>> queue
>>>>>>>>>> all.q do show "cannot run at host [...] due to a reservation"
>>>>>>>>>> messages)
>>>>>>>>>>
>>>>>>>>>> 3. "qstat -f" shows no reservation in the triple slot
>>>>>>>>>> display of
>>>>>>>>>> any queue instance
>>>>>>>>>>
>>>>>>>>>> 4. "qstat -g c" shows no reservation at all
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> does:
>>>>>>>>>
>>>>>>>>> $ qrstat -u "*"
>>>>>>>>>
>>>>>>>>> (note the additional r: qrstat, not qstat) help?
>>>>>>>>>
>>>>>>>>> -- Reuti
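For completeness, once an AR id is known from that listing, its details can be shown as well; a sketch with a placeholder id:

```shell
# List the advanced reservations of all users:
qrstat -u "*"

# Show the full details of a single AR (time window, granted slots,
# allowed users); 42 stands in for a real AR id from the list above.
qrstat -ar 42
```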
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I do have two questions/concerns now:
>>>>>>>>>>
>>>>>>>>>> 1. Am I missing some standard procedure that makes ARs
>>>>>>>>>> visible to the user as a reason for their pending jobs - or
>>>>>>>>>> is an update to 6.2u5 necessary?
>>>>>>>>>>
>>>>>>>>>> 2. If not, I'd like to file an RFE of some kind, but as I
>>>>>>>>>> understand too little about the internal workings of SGE
>>>>>>>>>> and AR, I'd like to put this up for discussion.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Any thoughts would be much appreciated.
>>>>>>>>>> Thanks,
>>>>>>>>>> Sabine
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------
>>>>>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>>>>>>>> dsForumId=38&dsMessageId=239747
>>>>>>>>>>
>>>>>>>>>> To unsubscribe from this discussion, e-mail: [users-
>>>>>>>>>> unsubscribe at gridengine.sunsource.net].
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> ------------------------------------------------------
>>>>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>>>>>>> dsForumId=38&dsMessageId=239748
>>>>>>>>>
>>>>>>>>> To unsubscribe from this discussion, e-mail: [users-
>>>>>>>>> unsubscribe at gridengine.sunsource.net].
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> ------------------------------------------------------
>>>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>>>>>> dsForumId=38&dsMessageId=239754
>>>>>>>>
>>>>>>>> To unsubscribe from this discussion, e-mail: [users-
>>>>>>>> unsubscribe at gridengine.sunsource.net].
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> ------------------------------------------------------
>>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>>>>> dsForumId=38&dsMessageId=239798
>>>>>>>
>>>>>>> To unsubscribe from this discussion, e-mail: [users-
>>>>>>> unsubscribe at gridengine.sunsource.net].
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>>> dsForumId=38&dsMessageId=240390
>>>>>
>>>>> To unsubscribe from this discussion, e-mail: [users-
>>>>> unsubscribe at gridengine.sunsource.net].
>>>>>
>>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>> dsForumId=38&dsMessageId=240399
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-
>>>> unsubscribe at gridengine.sunsource.net].
>>>>
>>>>
>>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>> dsForumId=38&dsMessageId=240843
>>>
>>> To unsubscribe from this discussion, e-mail: [users-
>>> unsubscribe at gridengine.sunsource.net].
>>>
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do? 
>> dsForumId=38&dsMessageId=240991
>>
>> To unsubscribe from this discussion, e-mail: [users- 
>> unsubscribe at gridengine.sunsource.net].
>>
>>
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do? 
> dsForumId=38&dsMessageId=241047
>
> To unsubscribe from this discussion, e-mail: [users- 
> unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241060

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
