[GE users] Advanced reservation for cluster outage?

reuti reuti at staff.uni-marburg.de
Fri Jan 22 15:27:26 GMT 2010


Am 22.01.2010 um 16:01 schrieb s_kreidl:

> Sorry to open this issue up again, but the AR has suddenly stopped
> working the way it did before.
>
> I can now submit jobs to the non-reserved all.q with runtime limits
> extending well past the AR start time (and I'm pretty sure I tested
> this thoroughly before - I got the "cannot run at host [...] due to a
> reservation" messages then).
>
> Parallel jobs, on the other hand, still don't get scheduled if they'd
> interfere with the AR - as intended.
>
> Any hints on what's going wrong here, or on how I can use an AR to
> get a consistent reservation across all existing cluster queues?

How did you submit the AR? As one big AR requesting all slots in the
cluster, as in your original post? (BTW: any reason why you also
specified -q "*@*"? I think it should work without it.) And is the
number of slots per node limited even when there are multiple queues
per node?
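If one big AR keeps slipping past all.q, you could also try one AR per
queue instead. A sketch (untested; the times, slot count, and PE are
simply taken over from your original qrsub, and the echo makes it a
dry run - pipe the output to sh, or drop the echo, to actually submit):

```shell
# submit_outage_ars: print one qrsub per queue for the outage window.
# The values (times in MMDDhhmm format, 1008 slots, the
# openmpi-8perhost PE, user my_user) are copied from the original
# post - adjust them to your cluster before submitting for real.
submit_outage_ars() {
    start=01291200
    end=01291800
    for q in all.q par.q; do
        echo qrsub -a "$start" -e "$end" \
            -pe openmpi-8perhost 1008 -q "$q" -u my_user
    done
}
```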

-- Reuti


> Thanks again in advance,
> Sabine
>
>
>
>> Good to hear that handling cluster outages is an intended use of AR.
>> And thanks for the hint on the existing RFE. I will consider adding
>> to that one as soon as I am clear about what I'd actually want from
>> "qstat -j". However, I can absolutely confirm the unhelpful "PE
>> offers only 0 slots" messages in my situation.
>>
>> Regards,
>> Sabine
>>
>> reuti schrieb:
>>> Am 19.01.2010 um 17:21 schrieb s_kreidl:
>>>
>>>
>>>> Hi Reuti,
>>>>
>>>> thanks for the quick reply. Yes, of course, qrstat is indeed the
>>>> standard way of getting information about ARs.
>>>>
>>>> However, it is a rather long way for a user to go looking for
>>>> ongoing advance reservations as the cause of a pending job, when
>>>> there are no hints in the "qstat -j" messages and none from any
>>>> other qstat request. (And to be honest, I'm rather reluctant to
>>>> write another piece of documentation for the rare occasions of
>>>> cluster outages for which we (mis-?)use the AR feature ;-) ).
>>>>
>>>
>>> No, it's an intended use IMO.
>>>
>>>
>>>
>>>> Don't you think some kind of RFE would be appropriate?
>>>>
>>>
>>> There is already an RFE which you could extend:
>>>
>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=224
>>>
>>> Sometimes you also see only that the PE offers 0 slots - but it's
>>> not always easy to find the cause of this. A redesign of qstat (or
>>> better: of its scheduler output) would be an improvement.
>>>
>>> -- Reuti
>>>
>>>
>>>
>>>> Best,
>>>> Sabine
>>>>
>>>> reuti schrieb:
>>>>
>>>>> Hi,
>>>>>
>>>>> Am 19.01.2010 um 16:57 schrieb s_kreidl:
>>>>>
>>>>>
>>>>>
>>>>>> I somehow got the AR working as expected with SGE 6.2u3 (qrsub -a
>>>>>> 01291200 -e 01291800 -pe "openmpi-8perhost" 1008 -q "*@*" -u
>>>>>> my_user)
>>>>>>
>>>>>> The problem I encounter now is that users have a hard time
>>>>>> finding out anything about the existing AR:
>>>>>>
>>>>>> 1. "qhost -q" shows the reserved slots for one of our two
>>>>>> queues (par.q), but shows nothing for the other queue (all.q -
>>>>>> historic reasons), for which the reservation obviously has the
>>>>>> desired effect too.
>>>>>>
>>>>>> 2. "qstat -j" gives no hint of any ongoing reservation for
>>>>>> pending parallel jobs (only jobs explicitly sent to the
>>>>>> "non-reserved" queue all.q show "cannot run at host [...] due
>>>>>> to a reservation" messages)
>>>>>>
>>>>>> 3. "qstat -f" shows no reservation in the triple slot display of
>>>>>> any queue instance
>>>>>>
>>>>>> 4. "qstat -g c" shows no reservation at all
>>>>>>
>>>>>>
>>>>> does:
>>>>>
>>>>> $ qrstat -u "*"
>>>>>
>>>>> (note the r in qstat) help?
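A small wrapper could give users a one-shot overview of existing ARs
(a sketch only; `qrstat -u` and `qrstat -ar` are the standard 6.2
calls, but the function name and the DRY_RUN switch here are
illustrative):

```shell
# ar_overview: show all ARs, then the detail (granted queues/slots)
# for each AR id passed as an argument. With DRY_RUN=1 the qrstat
# commands are only printed, so the function can be tried off-cluster.
ar_overview() {
    if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; else run=""; fi
    $run qrstat -u '*'           # overview of every user's ARs
    for ar_id in "$@"; do
        $run qrstat -ar "$ar_id" # per-AR detail
    done
}
```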
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>
>>>>>> I do have two questions/concerns now:
>>>>>>
>>>>>> 1. Am I missing some standard procedure for making ARs visible
>>>>>> to users as the reason for their pending jobs - or is an update
>>>>>> to 6.2u5 necessary?
>>>>>>
>>>>>> 2. If not, I'd like to file an RFE of some kind, but as I
>>>>>> understand too little about the internal workings of SGE and
>>>>>> ARs, I'd like to put this up for discussion.
>>>>>>
>>>>>>
>>>>>> Any thoughts would be much appreciated.
>>>>>> Thanks,
>>>>>> Sabine
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240399

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list