[GE users] Advanced reservation for cluster outage?

s_kreidl sabine.kreidl at uibk.ac.at
Fri Jan 22 15:01:00 GMT 2010


Sorry to reopen this issue, but the AR has suddenly stopped working the way it did before.

I can now submit jobs to the non-reserved all.q with runtime limits that extend well past the AR start time (and I'm pretty sure I tested this thoroughly before, and got the "cannot run at host [...] due to a reservation" messages then).

Parallel jobs, on the other hand, still don't get scheduled if they would interfere with the AR, as intended.
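
To make the expected behaviour concrete: what I would expect the scheduler to apply to every queue is a plain interval-overlap test between the job's runtime window and the AR window. Below is a minimal sketch of that test only, not of SGE's actual scheduling logic. The function name job_hits_ar and its epoch-second arguments are hypothetical and not part of any SGE tool.

```shell
#!/bin/sh
# Hypothetical helper (not an SGE command): succeeds (exit status 0) if a job
# would overlap an advance reservation.
#   $1 = AR start, $2 = AR end, $3 = job start (all in seconds since the epoch)
#   $4 = the job's hard runtime limit (h_rt) in seconds
job_hits_ar() {
    ar_start=$1
    ar_end=$2
    job_start=$3
    h_rt=$4
    job_end=$((job_start + h_rt))
    # Two intervals overlap when each one starts before the other one ends.
    [ "$job_start" -lt "$ar_end" ] && [ "$job_end" -gt "$ar_start" ]
}

# Example with made-up numbers: AR window 1000..2000, job starts at 500
# with h_rt=600, so it would end at 1100 and run into the reservation.
if job_hits_ar 1000 2000 500 600; then
    echo "job would run into the reservation"
else
    echo "job finishes before the reservation starts"
fi
```

With the same AR window, a job starting at 500 with h_rt=400 ends at 900 and should be allowed to run, which is the behaviour I saw earlier for all.q and now only see for parallel jobs.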

Any hints on what's going wrong here, or how I can use the AR to get a consistent reservation for all existing cluster queues?

Thanks again in advance,
Sabine



> Good to hear that handling cluster outages is an intended use of AR.
> And thanks for the hint on the existing RFE. I will consider adding to 
> that one, as soon as I am clear about what I'd actually want from "qstat 
> -j". However, I can absolutely confirm the unhelpful "PE offers only 0 
> slots" messages in my situation.
> 
> Regards,
> Sabine
> 
> reuti wrote:
> > On 19.01.2010 at 17:21, s_kreidl wrote:
> >
> >   
> >> Hi Reuti,
> >>
> >> thanks for the quick reply. Yes, of course, qrstat is indeed the
> >> standard way of getting information about ARs.
> >>
> >> However, I find it a rather long way to go for a user, to look for
> >> ongoing advanced reservations because of a pending job, when there are
> >> no hints in the "qstat -j" messages and also no hints from any other
> >> qstat request. (And to be honest, I'm rather reluctant to write another
> >> piece of documentation for the rare occasions of cluster outages for
> >> which we (mis-?)use the AR feature  ;-) ).
> >>     
> >
> > No, it's an intended use IMO.
> >
> >
> >   
> >> Don't you think some kind of RFE would be appropriate?
> >>     
> >
> > There is already an RFE which you could extend:
> >
> > http://gridengine.sunsource.net/issues/show_bug.cgi?id=224
> >
> > Sometimes all you see is that the PE offers only 0 slots, and it's
> > not easy to find the cause. A redesign of qstat (or rather of its
> > scheduler output) would be an improvement.
> >
> > -- Reuti
> >
> >
> >   
> >> Best,
> >> Sabine
> >>
> >> reuti wrote:
> >>     
> >>> Hi,
> >>>
> >>> On 19.01.2010 at 16:57, s_kreidl wrote:
> >>>
> >>>
> >>>       
> >>>> I somehow got the AR working as expected with SGE 6.2u3 (qrsub -a
> >>>> 01291200 -e 01291800 -pe "openmpi-8perhost" 1008 -q "*@*" -u my_user)
> >>>>
> >>>> The problem I encounter now is that users have a hard time finding
> >>>> out anything about the existing AR:
> >>>>
> >>>> 1. "qhost -q" shows the reserved slots for one of the two queues
> >>>> (par.q) we have, but shows nothing for the other queue (all.q -
> >>>> historic reasons), for which the reservation obviously does have
> >>>> the desired consequences too.
> >>>>
> >>>> 2. "qstat -j" gives no hint on any ongoing reservation for parallel
> >>>> pending jobs (only jobs explicitly sent to the "non-reserved" queue
> >>>> all.q do show "cannot run at host [...] due to a reservation"
> >>>> messages)
> >>>>
> >>>> 3. "qstat -f" shows no reservation in the triple slot display of
> >>>> any queue instance
> >>>>
> >>>> 4. "qstat -g c" shows no reservation at all
> >>>>
> >>>>         
> >>> Does
> >>>
> >>> $ qrstat -u "*"
> >>>
> >>> help? (Note the "r": qrstat, not qstat.)
> >>>
> >>> -- Reuti
> >>>
> >>>
> >>>       
> >>>> I do have two questions/concerns now:
> >>>>
> >>>> 1. Am I missing some standard procedure making ARs visible to the
> >>>> user as a reason for their pending jobs - is an update to 6.2u5
> >>>> necessary?
> >>>>
> >>>> 2. If not, I'd like to make an RFE of some kind, but as I
> >>>> understand too little about the internal workings of SGE and AR,
> >>>> I'd like to put this to discussion.
> >>>>
> >>>>
> >>>> Any thoughts would be much appreciated.
> >>>> Thanks,
> >>>> Sabine
> >>>>
> >>>> ------------------------------------------------------
> >>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239747
> >>>>
> >>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240390

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


