[GE users] Can I stop backfilling?

Kevin Doman kdoman07 at gmail.com
Tue May 20 17:26:11 BST 2008


Responding to Dan's suggestion:

First off, I'm using version 6.0u8.

There were four parallel jobs in the 'qw' state, each with h_rt set to
3600:00:00. In the sched_conf, I set default_duration to 3700:00:00.
Following your suggestion, I submitted a sleep job from the command
line:

    qsub -pe mpi 20 -b y -R y sleep 360

Very strange: as soon as I submitted this job, I could see an immediate
effect. All the 10-20 minute jobs in the wait list stayed put, and the
parallel jobs started to run as soon as enough processors became
available!

I will have to look further into this issue. But thanks so much for
your quick response!

k.
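
The backfilling rule discussed in the quoted replies below can be
illustrated with a small sketch. This is a simplified model, not Grid
Engine's scheduler code: the names Job, estimated_runtime and
can_backfill are hypothetical, times are plain seconds, and the model
assumes the reserved parallel job needs every free slot. It shows only
the idea that a waiting job is backfilled when its estimated run time
(h_rt if requested, otherwise default_duration) ends before the
reservation starts.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the scheduler's default_duration, in seconds
# (10 minutes, the figure mentioned in the thread).
DEFAULT_DURATION = 600


@dataclass
class Job:
    name: str
    slots: int
    h_rt: Optional[int] = None  # requested run time limit in seconds, if any


def estimated_runtime(job: Job, default_duration: int = DEFAULT_DURATION) -> int:
    """Run time assumed for planning: h_rt if requested, else default_duration."""
    return job.h_rt if job.h_rt is not None else default_duration


def can_backfill(job: Job, now: int, free_slots: int, reservation_start: int,
                 default_duration: int = DEFAULT_DURATION) -> bool:
    """Simplified rule: the job must fit into the free slots and be
    expected to finish no later than the reserved job's start time."""
    if job.slots > free_slots:
        return False
    return now + estimated_runtime(job, default_duration) <= reservation_start


if __name__ == "__main__":
    short = Job("short", slots=1, h_rt=900)   # 15-minute job with h_rt set
    no_limit = Job("no_limit", slots=1)       # no h_rt requested
    # Reservation starts in 20 minutes: the 15-minute job fits in the gap.
    print(can_backfill(short, now=0, free_slots=4, reservation_start=1200))
    # With a very large default_duration, a job without h_rt never fits.
    print(can_backfill(no_limit, now=0, free_slots=4, reservation_start=1200,
                       default_duration=3700 * 3600))
```

Under this model, raising default_duration above the parallel jobs'
h_rt means a job without its own h_rt is never expected to finish
before a reservation, which would be consistent with the short jobs
staying put. Whether that is what actually happened on the cluster is
an assumption.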



On Tue, May 20, 2008 at 10:44 AM, Daniel Templeton
<Dan.Templeton at sun.com> wrote:
> Reuti wrote:
>>
>> Am 20.05.2008 um 17:15 schrieb Daniel Templeton:
>>
>>> If your jobs are starving, what you're seeing is not backfilling. :)
>>>  What version of SGE are you using?  There was (is?) a bug where the first
>>> RR job was ignored.  Submitting a second identical RR job, in that case,
>>> would then cause the scheduler to take notice and actually do the RR
>>> properly.
>>>
>>> By definition, backfilling cannot cause starvation, unless a backfilled
>>> job runs forever.
>>
>> Good point. What h_rt is requested by these short jobs? Otherwise the
>> default_duration will be taken (but not enforced) and this might lead to a
>> roll-over from one extending job (running longer than the estimated default
>> 10 minutes) to the next one and so on, I fear.
>
> That will happen once, but not repeatedly.  After the first job overruns
> its estimated run time, the scheduler will treat the overrunning job as
> ending "immediately" and won't do any more backfilling.
>
> Daniel
>
>>
>>>  BTW, when you say your jobs run 15-20 minutes, are they setting soft or
>>> hard run time limits?  If not, what is your default_duration?
>>>
>>> Daniel
>>>
>>> Kevin Doman wrote:
>>>>
>>>> We have a very busy cluster that always has thousands of short jobs
>>>> (15-20 minutes) in the queue. Occasionally, a user comes in and submits
>>>> a 20-processor parallel job with h_rt=100 hours. While reservation is
>>>> enabled (-R y) and priority is set to 1024, we continue to experience job
>>
>> Is max_reservation also set to a sensible value?
>>
>> -- Reuti
>>
>>
>>>> backfills, which result in the same 'parallel job starvation' issue.
>>>>
>>>> Is it possible for me to stop backfilling altogether and let the
>>>> parallel jobs go first?
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>
>>>>
