[GE users] Advanced reservation for cluster outage?

s_kreidl sabine.kreidl at uibk.ac.at
Tue Jan 26 06:47:50 GMT 2010


No, but I have the default_duration set.
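
For reference, this is roughly how the two settings relate on our side
(the runtime value and job script name below are only placeholders):

$ qconf -ssconf | grep default_duration   # runtime the scheduler assumes
                                          # for jobs without an explicit h_rt
$ qsub -l h_rt=2:00:00 myjob.sh           # explicit per-job runtime request

So far I rely on default_duration only, rather than asking users to
request h_rt on every qsub.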

reuti wrote:
> On 25.01.2010 at 10:07, s_kreidl wrote:
>
>   
>> I submitted the AR with exactly the qrsub command line below (still
>> ongoing, with no changes). There was no special reason for the
>> -q "*@*"; I just wanted to get all of the available slots reserved,
>> independent of the two existing queues, which obviously didn't work
>> as intended.
>>
>> The number of slots per host is limited to 8 for every execution host
>> (complex_values slots=8,...).
>> In addition, it is limited in both queue configurations:
>> all.q: slots  0,[n001.uibk.ac.at=8],[n002.uibk.ac.at=8], ...
>> par.q: slots  8
>>
>> Regards, Sabine
>>
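In case it helps, this is how I check those limits here (hostname and
queue names as above; the grep is just for brevity):

$ qconf -se n001.uibk.ac.at | grep complex_values   # per-host slot limit
$ qconf -sq all.q | grep slots                      # slots entry of all.q
$ qconf -sq par.q | grep slots                      # slots entry of par.q
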
>> reuti wrote:
>>     
>>> On 22.01.2010 at 16:01, s_kreidl wrote:
>>>
>>>
>>>       
>>>> Sorry to open this issue up again, but the AR has suddenly stopped
>>>> working the way it did before.
>>>>
>>>> I can now submit jobs to the non-reserved all.q with runtime limits
>>>> well exceeding the AR start time (and I'm pretty sure I tested this
>>>> thoroughly before - got the "cannot run at host [...] due to a
>>>> reservation" messages then).
>>>>         
>
> Do you request h_rt explicitly in the qsub command?
>
> -- Reuti
>
>
>   
>>>> Parallel jobs on the other hand still don't get scheduled if they'd
>>>> interfere with the AR - as intended.
>>>>
>>>> Any hints on what's going wrong here, or how I can use the AR to
>>>> get a consistent reservation for all existing cluster queues?
>>>>
>>>>         
>>> How did you submit the AR? One big AR requesting all slots of the
>>> cluster, like in your original post (BTW: any reason why you also
>>> specified -q "*@*"? I think it should work without it)? Is the
>>> number of slots per node limited even when there are multiple
>>> queues per node?
>>>
>>> -- Reuti
>>>
>>>
>>>
>>>       
>>>> Thanks again in advance,
>>>> Sabine
>>>>
>>>>
>>>>
>>>>
>>>>         
>>>>> Good to hear that handling cluster outages is an intended use of
>>>>> AR. And thanks for the hint on the existing RFE. I will consider
>>>>> adding to that one as soon as I am clear about what I'd actually
>>>>> want from "qstat -j". However, I can absolutely confirm the
>>>>> unhelpful "PE offers only 0 slots" messages in my situation.
>>>>>
>>>>> Regards,
>>>>> Sabine
>>>>>
>>>>> reuti wrote:
>>>>>
>>>>>           
>>>>>> On 19.01.2010 at 17:21, s_kreidl wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> Hi Reuti,
>>>>>>>
>>>>>>> thanks for the quick reply. Yes, of course, qrstat is indeed the
>>>>>>> standard way of getting information about ARs.
>>>>>>>
>>>>>>> However, I find it a rather long way for a user to go to look
>>>>>>> for ongoing advance reservations because of a pending job, when
>>>>>>> there are no hints in the "qstat -j" messages and none from any
>>>>>>> other qstat request. (And to be honest, I'm rather reluctant to
>>>>>>> write another piece of documentation for the rare occasions of
>>>>>>> cluster outages for which we (mis-?)use the AR feature ;-) ).
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> No, it's an intended use IMO.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> Don't you think some kind of RFE would be appropriate?
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> There is already an RFE which you could extend:
>>>>>>
>>>>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=224
>>>>>>
>>>>>> It's also the case that sometimes you see only that the PE offers
>>>>>> 0 slots, but it is not always easy to find the cause of that. A
>>>>>> redesign of qstat (or better: of its scheduler output) would be
>>>>>> an improvement.
>>>>>>
>>>>>> -- Reuti
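
As a side note for anyone hitting this: those "PE offers only 0 slots"
messages show up in the "scheduling info" section of the pending job
(the job id below is only a placeholder):

$ qstat -j 4711    # the "scheduling info" lines at the end show the
                   # scheduler's reasons, if schedd_job_info is enabled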
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> Best,
>>>>>>> Sabine
>>>>>>>
>>>>>>> reuti wrote:
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 19.01.2010 at 16:57, s_kreidl wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> I somehow got the AR working as expected with SGE 6.2u3
>>>>>>>>> (qrsub -a 01291200 -e 01291800 -pe "openmpi-8perhost" 1008
>>>>>>>>> -q "*@*" -u my_user).
>>>>>>>>>
>>>>>>>>> The problem I encounter now is that users have a hard time
>>>>>>>>> finding out anything about the existing AR:
>>>>>>>>>
>>>>>>>>> 1. "qhost -q" shows the reserved slots for one of the two  
>>>>>>>>> queues
>>>>>>>>> (par.q) we have, but shows nothing for the other queue (all.q -
>>>>>>>>> historic reasons), for which the reservation obviously does  
>>>>>>>>> have
>>>>>>>>> the desired consequences too.
>>>>>>>>>
>>>>>>>>> 2. "qstat -j" gives no hint on any ongoing reservation for
>>>>>>>>> parallel
>>>>>>>>> pending jobs (only jobs explicitly sent to the "non-reserved"
>>>>>>>>> queue
>>>>>>>>> all.q do show "cannot run at host [...] due to a reservation"
>>>>>>>>> messages)
>>>>>>>>>
>>>>>>>>> 3. "qstat -f" shows no reservation in the triple slot  
>>>>>>>>> display of
>>>>>>>>> any queue instance
>>>>>>>>>
>>>>>>>>> 4. "qstat -g c" shows no reservation at all
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> does:
>>>>>>>>
>>>>>>>> $ qrstat -u "*"
>>>>>>>>
>>>>>>>> (note the extra r compared to qstat) help?
>>>>>>>>
>>>>>>>> -- Reuti
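
For anyone following along: besides listing all reservations, a single
AR can also be queried by its id (the id below is just a placeholder):

$ qrstat -u "*"    # list the advance reservations of all users
$ qrstat -ar 42    # show the details of AR 42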
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> I do have two questions/concerns now:
>>>>>>>>>
>>>>>>>>> 1. Am I missing some standard procedure that makes ARs visible
>>>>>>>>> to the user as a reason for their pending jobs, or is an update
>>>>>>>>> to 6.2u5 necessary?
>>>>>>>>>
>>>>>>>>> 2. If not, I'd like to file an RFE of some kind, but as I
>>>>>>>>> understand too little about the internal workings of SGE and
>>>>>>>>> AR, I'd like to put this up for discussion.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Any thoughts would be much appreciated.
>>>>>>>>> Thanks,
>>>>>>>>> Sabine
>>>>>>>>>
