[GE users] how do you use -dl with qrsh? (qrsh will flag an error for the value of the flag, but if the value is valid it disowns the flag)

reuti reuti at staff.uni-marburg.de
Wed Dec 9 19:18:06 GMT 2009


On 09.12.2009 at 19:22, bdbaddog wrote:

> Reuti,
>
> On Wed, Dec 9, 2009 at 3:11 AM, reuti <reuti at staff.uni-marburg.de>  
> wrote:
>> On 09.12.2009 at 03:37, bdbaddog wrote:
>>
>>> Reuti,
>>>
>>> On Tue, Dec 8, 2009 at 6:10 PM, reuti <reuti at staff.uni-marburg.de>
>>> wrote:
>>>> Hi Bill,
>>>>
>>>> On 09.12.2009 at 00:47, bdbaddog wrote:
>>>>
>>>>> I'm running 6.2U4
>>>>>
>>>>> I believe the argument to -dl here is valid, but qrsh then fails,
>>>>> saying that -dl is not known:
>>>>>
>>>>> qrsh -dl 12081339 -P lp -l s_rt=1200 -l h_rt=1380 -l coupons=2 \
>>>>>   -now n -nostdin -cwd -N blah /bin/sh \
>>>>>   /home/bdeegan/work/main-plain/output/test/blah.sh
>>>>> error: Unknown option -dl
>>>>>
>>>>>
>>>>> Here I expect it's invalid, but it admits there's a -dl flag.
>>>>>
>>>>> qrsh -dl 1339 -P lp -l s_rt=1200 -l h_rt=1380 -l coupons=2 -now n
>>>>> -nostdin -cwd -N blah /bin/sh
>>>>> /home/bdeegan/work/main-plain/output/test/blah.sh
>>>>>
>>>>> Invalid format of date/hour-minute field.
>>>>> error: ERROR! Wrong date/time format "1339" specified to -dl option
>>>>>
>>>>>
>>>>> Any idea what's going on here?
>>>>
>>>> It seems that SGE first checks the format of the supplied value.
>>>> Only when the format is fine does it discover that -dl is not a
>>>> valid option for qrsh at all. There is a bug either in the
>>>> documentation or in the behavior of qrsh. I suggest filing a bug.
>>>>
>>>> I'm not sure whether a deadline is the best choice for an
>>>> interactive job while you are waiting for the results in front of
>>>> the terminal: only the priority will rise up to the given date, and
>>>> there is no guarantee that the job will actually start at that
>>>> time. When you want something to run for sure, it's best to submit
>>>> an advance reservation first; when it's granted, the slots are
>>>> reserved for your job. Then you can submit the actual (interactive)
>>>> job into this advance reservation.
>>>>
>>>> In SGE, an interactive job is by default handled more like an
>>>> immediate job.
>>>>
>>>> $ qrsh ... => will run in an interactive queue
>>>>
>>>> $ qsub ... => will run in a batch queue
>>>>
>>>> But:
>>>>
>>>> $ qrsh -now no ... => will run in a batch queue
>>>>
>>>> $ qsub -now yes ... => will run in an interactive queue
>>>>
>>>> So, to keep your qrsh job waiting until the advance reservation
>>>> starts, you need:
>>>>
>>>> $ qrsh -now no -ar 1234 ...
>>>
>>> For historical reasons, the scripts I'm working on use qrsh in order
>>> to wait for the completion of what is effectively a batch job.
>>> They do use -now no.
>>>
>>> The jobs will be asking for some consumable resources (the maximum
>>> for a given type of host, as we're not yet running 6.2U3 or above)
>>> and can't use exclusive access. The idea was to use -dl to have the
>>> priority bump up and ensure that at some point the job would be able
>>> to get exclusive access to a node.
>>
>> Then I would suggest using an urgency policy with an attached
>> complex for this follow-up job, so that this job is more important
>> than others. I assume you already use -hold_jid, and as long as the
>> predecessor isn't finished, it won't reserve anything. As soon as the
>> first job finishes, the follow-up job will be at the top of the
>> waiting list.
>
>
> We trickle up to 70 jobs per run per user via qrsh at a time (limited
> by our scripting), no "-hold_jid", but "-now no" is on the command
> line.
> We're trying to set up a mechanism to run benchmarks, ensuring the
> same node type and no other jobs on the machine.  Because we're running
> 6.2u1, and don't have a maintenance window in the near term to upgrade
> to 6.2u4, I'm looking for a way to enable this.
>
> We currently have a consumable resource on each node, "coupons", which
> represents the number of GB of RAM each machine has (I did see the
> recent thread on a better way to implement this, and we'll be using
> that for our next rev of cluster config), with each job requesting
> its expected memory footprint, to prevent oversubscribing memory.
>
> To ensure the benchmark jobs get a machine to themselves the plan is
> to request the max # of coupons for the node type the benchmark will
> run on.  I was hoping to use -dl to ensure the test job doesn't get
> resource starved and never get dispatched, and not have to change the
> waiting time weight.
>
> Any guidance you have on how to achieve these goals would be most  
> helpful.
> I'm hoping we have a window in late January to update to the latest
> SGE at that point.

Aha, I misunderstood you when reading "wait for the completion".

When the jobs aren't waiting on the completion of another job, it
sounds more like you need resource reservation switched on for the
jobs (-R y) and enabled in the scheduler (entry max_reservation in
the scheduler configuration).
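
As a rough illustration of the scheduler side, the sketch below reads
scheduler configuration text of the kind that `qconf -ssconf` prints and
reports the max_reservation value; a value of 0 means the scheduler makes
no reservations. The function name max_reservation_value is hypothetical,
and it assumes one "name value" pair per line. On a live cluster the value
itself is changed with `qconf -msconf`.

```shell
# Hypothetical helper: read scheduler configuration text on stdin
# (format assumed: one "name value" pair per line, as in `qconf -ssconf`)
# and print the max_reservation value, or 0 when the entry is absent.
max_reservation_value() {
    awk '$1 == "max_reservation" { v = $2 }
         END { print (v == "" ? 0 : v) }'
}

# On a live cluster (assumes qconf is on PATH):
#   qconf -ssconf | max_reservation_value
```

If this prints 0, jobs submitted with -R y will not get a reservation
until the entry is raised.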

It would be best if you can supply an estimated runtime for the jobs
(-l h_rt=...), which will of course kill a job when it runs longer.
This would then enable backfilling.
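
A minimal sketch of what such a submission could look like, reusing the
h_rt and coupons values from the command earlier in this thread. The
function name reserved_qrsh_cmd and the script name blah.sh are
placeholders, and the function only prints the command line instead of
running it, so it can be tried without touching the cluster.

```shell
# Hypothetical helper: print (not run) a qrsh command line that asks
# for a reservation and declares a hard runtime limit.
# Arguments: h_rt in seconds, number of coupons, then the command to run.
reserved_qrsh_cmd() {
    h_rt=$1
    coupons=$2
    shift 2
    # -now no : wait in the pending list instead of failing immediately
    # -R y    : ask the scheduler to reserve resources for this job
    echo qrsh -now no -R y -l "h_rt=$h_rt" -l "coupons=$coupons" "$@"
}

reserved_qrsh_cmd 1380 2 /bin/sh blah.sh
```

The reservation only takes effect when the scheduler side is enabled as
well, and the closer h_rt is to the real runtime, the more useful it is
for backfilling.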

-- Reuti


>
> Thanks,
> Bill
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232489

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


