[GE users] Advanced reservation for cluster outage?

s_kreidl sabine.kreidl at uibk.ac.at
Mon Jan 25 09:07:27 GMT 2010


I submitted the AR with exactly the qrsub command line below (it is still 
running, unchanged). There was no special reason for the -q "*@*"; I just 
wanted to reserve all of the available slots, independent of the two 
existing queues, which obviously didn't work as intended.

The number of slots per host is limited to 8 for every execution host 
(complex_values slots=8,...).
In addition, it is limited in both queue configurations:
all.q:  slots    0,[n001.uibk.ac.at=8],[n002.uibk.ac.at=8], ...
par.q:  slots    8
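
To summarize the setup as commands (the qrsub line is the one from my 
original post; the qconf calls are just the standard ones for inspecting 
these settings, with our hostnames as examples):

```shell
# Advance reservation as submitted (from my original post):
qrsub -a 01291200 -e 01291800 -pe "openmpi-8perhost" 1008 -q "*@*" -u my_user

# Per-host slot limit, set in each execution host's configuration:
qconf -se n001.uibk.ac.at        # shows: complex_values  slots=8,...

# Per-queue slot limits:
qconf -sq all.q | grep slots     # 0,[n001.uibk.ac.at=8],[n002.uibk.ac.at=8], ...
qconf -sq par.q | grep slots     # 8
```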

Regards, Sabine

reuti schrieb:
> Am 22.01.2010 um 16:01 schrieb s_kreidl:
>
>   
>> Sorry to open this issue up again, but the AR suddenly no longer  
>> works as it did before.
>>
>> I can now submit jobs to the non-reserved all.q with runtime limits  
>> well exceeding the AR start time (and I'm pretty sure I tested this  
>> thoroughly before - got the "cannot run at host [...] due to a  
>> reservation" messages then).
>>
>> Parallel jobs on the other hand still don't get scheduled if they'd  
>> interfere with the AR - as intended.
>>
>> Any hints on what's going wrong here, or how I can use the AR to  
>> get a consistent reservation for all existing cluster queues?
>>     
>
> How did you submit the AR? As one big AR requesting all slots in the  
> cluster, like in your original post? (BTW: any reason why you also  
> specified -q "*@*"? I think it should work without it.) And is the  
> number of slots per node limited even when there are multiple queues  
> per node?
>
> -- Reuti
>
>
>   
>> Thanks again in advance,
>> Sabine
>>
>>
>>
>>     
>>> Good to hear that handling cluster outages is an intended use of AR.
>>> And thanks for the hint on the existing RFE. I will consider  
>>> adding to
>>> that one, as soon as I am clear about what I'd actually want from  
>>> "qstat
>>> -j". However, I can absolutely confirm the unhelpful "PE offers  
>>> only 0
>>> slots" messages in my situation.
>>>
>>> Regards,
>>> Sabine
>>>
>>> reuti schrieb:
>>>       
>>>> Am 19.01.2010 um 17:21 schrieb s_kreidl:
>>>>
>>>>
>>>>         
>>>>> Hi Reuti,
>>>>>
>>>>> thanks for the quick reply. Yes, of course, qrstat is indeed the
>>>>> standard way of getting information about ARs.
>>>>>
>>>>> However, it is a rather long way to go for a user to check for  
>>>>> ongoing advance reservations as the cause of a pending job, when  
>>>>> there are no hints in the "qstat -j" messages and none from any  
>>>>> other qstat request. (And to be honest, I'm rather reluctant to  
>>>>> write another piece of documentation for the rare cluster outages  
>>>>> for which we (mis-?)use the AR feature  ;-) ).
>>>>>
>>>>>           
>>>> No, it's an intended use IMO.
>>>>
>>>>
>>>>
>>>>         
>>>>> Don't you think some kind of RFE would be appropriate?
>>>>>
>>>>>           
>>>> There is already an RFE which you could extend:
>>>>
>>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=224
>>>>
>>>> It's also the case that sometimes you only see that the PE offers
>>>> 0 slots - and it's often not easy to find the cause of this.
>>>> A redesign of qstat (or better: of its scheduler output) would be
>>>> an improvement.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>
>>>>         
>>>>> Best,
>>>>> Sabine
>>>>>
>>>>> reuti schrieb:
>>>>>
>>>>>           
>>>>>> Hi,
>>>>>>
>>>>>> Am 19.01.2010 um 16:57 schrieb s_kreidl:
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> I somehow got the AR working as expected with SGE 6.2u3 (qrsub -a
>>>>>>> 01291200 -e 01291800 -pe "openmpi-8perhost" 1008 -q "*@*" -u
>>>>>>> my_user)
>>>>>>>
>>>>>>> The problem I encounter now is that users have a hard time  
>>>>>>> finding out anything about the existing AR:
>>>>>>>
>>>>>>> 1. "qhost -q" shows the reserved slots for one of our two queues
>>>>>>> (par.q), but shows nothing for the other queue (all.q - kept for
>>>>>>> historic reasons), even though the reservation obviously has
>>>>>>> the desired effect there too.
>>>>>>>
>>>>>>> 2. "qstat -j" gives no hint on any ongoing reservation for  
>>>>>>> parallel
>>>>>>> pending jobs (only jobs explicitly sent to the "non-reserved"  
>>>>>>> queue
>>>>>>> all.q do show "cannot run at host [...] due to a reservation"
>>>>>>> messages)
>>>>>>>
>>>>>>> 3. "qstat -f" shows no reservation in the triple slot display of
>>>>>>> any queue instance
>>>>>>>
>>>>>>> 4. "qstat -g c" shows no reservation at all
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> does:
>>>>>>
>>>>>> $ qrstat -u "*"
>>>>>>
>>>>>> (note the r in qstat) help?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> I do have two questions/concerns now:
>>>>>>>
>>>>>>> 1. Am I missing some standard procedure making ARs visible to the
>>>>>>> user as a reason for their pending jobs - is an update to 6.2u5
>>>>>>> necessary?
>>>>>>>
>>>>>>> 2. If not, I'd like to make an RFE of some kind, but as I
>>>>>>> understand too little about the internal workings of SGE and AR,
>>>>>>> I'd like to put this to discussion.
>>>>>>>
>>>>>>>
>>>>>>> Any thoughts would be much appreciated.
>>>>>>> Thanks,
>>>>>>> Sabine
>>>>>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240843

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list