[GE users] Advanced reservation for cluster outage?

reuti reuti at staff.uni-marburg.de
Tue Jan 26 00:15:06 GMT 2010


On 25.01.2010 at 10:07, s_kreidl wrote:

> I submitted the AR with exactly the qrsub command line below (still
> going on with no changes). There was no special reason for the
> -q "*@*"; I just wanted to get all of the available slots reserved,
> independent of the two existing queues - which obviously didn't work
> as intended.
>
> The number of slots per host is limited to 8 for every execution host
> (complex_values slots=8,...).
> In addition, it is limited in both queue configurations:
> all.q: slots  0,[n001.uibk.ac.at=8],[n002.uibk.ac.at=8], \...
> par.q: slots  8
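
Side note: both limits can be double-checked directly with qconf; a
minimal sketch, with the hostname taken from the example above:

$ qconf -se n001.uibk.ac.at     # execution host config, incl. complex_values slots=8
$ qconf -sq all.q | grep slots  # per-queue slot limit
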
>
> Regards, Sabine
>
> reuti wrote:
>> On 22.01.2010 at 16:01, s_kreidl wrote:
>>
>>
>>> Sorry to open this issue up again, but the AR all of a sudden is
>>> no longer working as it did before.
>>>
>>> I can now submit jobs to the non-reserved all.q with runtime limits
>>> well exceeding the AR start time (and I'm pretty sure I tested this
>>> thoroughly before - got the "cannot run at host [...] due to a
>>> reservation" messages then).

Do you request h_rt explicitly in the qsub command?
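
If not, I think the scheduler falls back to the default_duration value
from the scheduler configuration when weighing the job against the AR
window. A minimal sketch of an explicit request (job.sh is a
placeholder):

$ qsub -l h_rt=2:00:00 job.sh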

-- Reuti


>>>
>>> Parallel jobs on the other hand still don't get scheduled if they'd
>>> interfere with the AR - as intended.
>>>
>>> Any hints on what's going wrong here, or how I can use the AR to
>>> get a consistent reservation for all existing cluster queues?
>>>
>>
>> How did you submit the AR? One big AR which requests all slots from
>> the cluster, like in your original post? (BTW: any reason why you
>> also specified -q "*@*"? I think it should work without.) And is the
>> number of slots per node limited even when there are multiple queues
>> per node?
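
For reference, a sketch of the same reservation without the -q request
(times, PE and slot count copied from the original post):

$ qrsub -a 01291200 -e 01291800 -pe openmpi-8perhost 1008 -u my_user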
>>
>> -- Reuti
>>
>>
>>
>>> Thanks again in advance,
>>> Sabine
>>>
>>>
>>>
>>>
>>>> Good to hear that handling cluster outages is an intended use of AR.
>>>> And thanks for the hint about the existing RFE. I will consider
>>>> adding to that one as soon as I am clear about what I'd actually
>>>> want from "qstat -j". However, I can absolutely confirm the
>>>> unhelpful "PE offers only 0 slots" messages in my situation.
>>>>
>>>> Regards,
>>>> Sabine
>>>>
>>>> reuti wrote:
>>>>
>>>>> On 19.01.2010 at 17:21, s_kreidl wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi Reuti,
>>>>>>
>>>>>> thanks for the quick reply. Yes, of course, qrstat is indeed the
>>>>>> standard way of getting information about ARs.
>>>>>>
>>>>>> However, it is a rather long way for a user to go, looking for
>>>>>> ongoing advance reservations as the cause of a pending job, when
>>>>>> there are no hints in the "qstat -j" messages and none from any
>>>>>> other qstat request. (And to be honest, I'm rather reluctant to
>>>>>> write another piece of documentation for the rare occasions of
>>>>>> cluster outages for which we (mis-?)use the AR feature ;-) ).
>>>>>>
>>>>>>
>>>>> No, it's an intended use IMO.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Don't you think some kind of RFE would be appropriate?
>>>>>>
>>>>>>
>>>>> There is already an RFE which you could extend:
>>>>>
>>>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=224
>>>>>
>>>>> It's also the case that sometimes you see only that the PE offers
>>>>> 0 slots, but it's not always easy to get at the cause of this.
>>>>> A redesign of qstat (or better: of its scheduler output) would be
>>>>> an improvement.
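
Until such a redesign, switching on the scheduler info at least makes
the messages show up in "qstat -j"; a minimal sketch (4711 is a
placeholder job id):

$ qconf -msconf    # set schedd_job_info to true
$ qstat -j 4711    # then includes a "scheduling info:" section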
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Best,
>>>>>> Sabine
>>>>>>
>>>>>> reuti wrote:
>>>>>>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 19.01.2010 at 16:57, s_kreidl wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I somehow got the AR working as expected with SGE 6.2u3 (qrsub
>>>>>>>> -a 01291200 -e 01291800 -pe "openmpi-8perhost" 1008 -q "*@*"
>>>>>>>> -u my_user).
>>>>>>>>
>>>>>>>> The problem I encounter now is that users have a hard time
>>>>>>>> finding out anything about the existing AR:
>>>>>>>>
>>>>>>>> 1. "qhost -q" shows the reserved slots for one of the two
>>>>>>>> queues we have (par.q), but shows nothing for the other queue
>>>>>>>> (all.q - historic reasons), for which the reservation obviously
>>>>>>>> does have the desired consequences too.
>>>>>>>>
>>>>>>>> 2. "qstat -j" gives no hint of any ongoing reservation for
>>>>>>>> pending parallel jobs (only jobs explicitly sent to the
>>>>>>>> "non-reserved" queue all.q show "cannot run at host [...] due
>>>>>>>> to a reservation" messages).
>>>>>>>>
>>>>>>>> 3. "qstat -f" shows no reservation in the triple slot display
>>>>>>>> of any queue instance.
>>>>>>>>
>>>>>>>> 4. "qstat -g c" shows no reservation at all.
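
For comparison, since SGE 6.2 the slots triple in "qstat -f" reads
reserved/used/total, so an AR should show up in the first figure; a
rough sketch of what one would expect for a fully reserved 8-slot host
(hostname and arch are illustrative):

$ qstat -f
queuename             qtype resv/used/tot. load_avg arch       states
----------------------------------------------------------------------
par.q@n001.uibk.ac.at BIP   8/0/8          0.01     lx24-amd64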
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Does:
>>>>>>>
>>>>>>> $ qrstat -u "*"
>>>>>>>
>>>>>>> (note the r in qrstat) help?
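
And for the full details of a single reservation (the id is a
placeholder for the one returned by qrsub):

$ qrstat -ar 42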
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I do have two questions/concerns now:
>>>>>>>>
>>>>>>>> 1. Am I missing some standard procedure that makes ARs visible
>>>>>>>> to the user as a reason for their pending jobs - is an update
>>>>>>>> to 6.2u5 necessary?
>>>>>>>>
>>>>>>>> 2. If not, I'd like to file an RFE of some kind, but as I
>>>>>>>> understand too little about the internal workings of SGE and
>>>>>>>> AR, I'd like to put this up for discussion.
>>>>>>>>
>>>>>>>>
>>>>>>>> Any thoughts would be much appreciated.
>>>>>>>> Thanks,
>>>>>>>> Sabine
>>>>>>>>