[GE users] (another) slotwise preemption question

cjf001 john.foley at motorola.com
Mon Aug 30 16:35:06 BST 2010


OK, just finished testing this, and I can verify that my
original issue *is* fixed in 6.2u6. However, it looks
like there's still a bug there, which will probably
dissuade me from using it.

The setup is 6.2u6 running master and exec on the same machine
(for testing), which is a 4-CPU RHEL 5.2 box. There are two cluster
queues, "primary" and "secondary", with "secondary" subordinate to
"primary". The test job is just a 120-second sleep followed by
an environment dump.
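
(For anyone wanting to reproduce this: the slot-wise subordination
goes on the primary queue, roughly like the sketch below. The sequence
number and action shown are just illustrative, not pasted from my
actual config.)

cjf001 at lxint05# qconf -sq primary | grep -E 'slots|subordinate'
slots                 4
subordinate_list      slots=4(secondary:1:sr)

and the test job itself is just:

#!/bin/sh
# test job: sleep two minutes, then dump the environment
sleep 120
env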

Submit 10 jobs (an array job) into the secondary queue :
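(the submission was something along these lines - "testjob.sh" here is
just a placeholder name for the sleep/env script sketched above)

cjf001 at lxint05# qsub -t 1-10 -q secondary ./testjob.sh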

cjf001 at lxint05# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 1
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 2
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 3
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 4
       8 100.19531 Dir-cjf001 cjf001       qw    08/30/2010 09:54:49                                    1 5-10:1

So far, so good. Now submit a single job into the primary
queue:
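(again roughly this, with the same placeholder script name)

cjf001 at lxint05# qsub -q primary ./testjob.sh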

cjf001 at lxint05# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 1
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 2
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 3
       8 100.19531 Dir-cjf001 cjf001       S     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 4
       9 93.35938 Dir-cjf001 cjf001       r     08/30/2010 09:55:36 primary at lxzon43.srl.css.mot.co     1
       8 100.19531 Dir-cjf001 cjf001       qw    08/30/2010 09:54:49                                    1 5-10:1

Still looks good - one secondary job (8.4) got suspended,
and the primary job is running. This much seems to work in u5,
too (for me at least - others had reported a bug with
this...)

Now I delete that suspended job with "qdel 8.4", and I get :

cjf001 at lxint05# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 1
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 2
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:54:51 secondary at lxzon43.srl.css.mot.     1 3
       9 93.35938 Dir-cjf001 cjf001       r     08/30/2010 09:55:36 primary at lxzon43.srl.css.mot.co     1
       8 100.19531 Dir-cjf001 cjf001       qw    08/30/2010 09:54:49                                    1 5-10:1

Still good ! This is where u5 fell down, as it would allow one of the
waiting secondary jobs to begin running, and then suspend it immediately.
So that bug is fixed in u6.

However..... when the secondary jobs that are still running finish
(they finish at pretty much the same time, since they were all started
at the same time), I get :

cjf001 at lxint05# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
       9 93.35938 Dir-cjf001 cjf001       r     08/30/2010 09:55:36 primary at lxzon43.srl.css.mot.co     1
       8 100.19531 Dir-cjf001 cjf001       qw    08/30/2010 09:54:49                                    1 5-10:1

which is OK, but then :

cjf001 at lxint05# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:57:06 secondary at lxzon43.srl.css.mot.     1 5
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:57:06 secondary at lxzon43.srl.css.mot.     1 6
       8 100.19531 Dir-cjf001 cjf001       r     08/30/2010 09:57:06 secondary at lxzon43.srl.css.mot.     1 7
       8 100.19531 Dir-cjf001 cjf001       S     08/30/2010 09:57:06 secondary at lxzon43.srl.css.mot.     1 8
       9 93.35938 Dir-cjf001 cjf001       r     08/30/2010 09:55:36 primary at lxzon43.srl.css.mot.co     1
       8 100.19531 Dir-cjf001 cjf001       qw    08/30/2010 09:54:49                                    1 9,10


Oh no !  It scheduled *four* secondary jobs to run, instead of the
*three* that it should have scheduled, and then immediately suspended
one. This is not as expected. My co-scheduler would then delete and
resubmit job 8.8 unnecessarily. (When I did that in a subsequent test,
not shown here, SGE did not schedule and immediately suspend yet
another secondary job, which is good.) So there's no "black hole" as
Reuti described it, like there was in u5, but it's still not quite
right.
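
(For context, the delete-on-suspend behaviour mentioned above and in
my earlier mail is just a custom suspend_method on the secondary
queue. A minimal sketch of the idea - not our actual script, and the
path and logging here are made up:)

#!/bin/sh
# qdel_on_suspend.sh - sketch only, not the real script
# Intended to be hooked in as the queue's suspend_method, e.g.
#   suspend_method   /path/to/qdel_on_suspend.sh $job_id
# so that a job which would otherwise sit suspended on the exec host
# (holding memory) is deleted instead; the co-scheduler resubmits it
# later. $1 is expected to be the job (or job.task) to delete, e.g.
# "8.4" - how exactly the task id gets passed in is glossed over here.
logger -t slotwise "deleting suspended job $1 for later resubmission"
exec qdel "$1"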

So, bottom line for me: it's not worth switching to u6, given this
bug, the uncertainty over the licensing, and the fact that the u5
qstat won't run with the u6 scheduler. (I modify the qstat output
slightly to give me all the info I need in a single run, to lighten
the load on the qmaster a bit... and I don't have the u6 qstat source
to do the same there.)

Bummer. Mondays always get me down   :(

BTW, I never saw the "P" state show up, even though I ran this
test several times. Maybe I wasn't quick enough.
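
(If anyone else wants to look for it: since it's a queue instance
state, the "states" column of "qstat -f" is presumably where it would
show up, so polling something like this during the test ought to catch
it if it sticks around long enough:)

cjf001 at lxint05# watch -n 1 'qstat -f -q secondary'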

     John


John Foley wrote:
> Thanks Daniel - I saw your post at
>
> http://blogs.sun.com/templedf/entry/better_preemption
>
> where I think you talk about this issue at the bottom.
> So, it all comes back to getting the u6 binaries !
>
> I did notice in the u6 release notes several "known
> issues" with the slotwise suspension, but at first
> glance they don't look like show-stoppers.
>
> So, quick question - is the "P" state a new state that
> you see in something like the qstat output ? I don't
> see it on my "cheatsheet" of states.
>
> Thanks,
>
> John
>
>
> dagru wrote:
>> With 6.2 update 6 an enhancement (suspension prevention)
>> was introduced which fixes your issue.
>> When a queue instance is "full" (in the sense that the next
>> job dispatched to it would be suspended), it goes into the
>> preempted state (P). This means the qinstance is no longer
>> considered by the scheduler when dispatching further jobs.
>> The scheduler instead looks for a qinstance where it can let
>> the job run immediately. If none is found, or other resource
>> requests do not match, the job stays in qw.
>>
>> Daniel
>>
>>
>> On Thursday, 26.08.2010, at 16:22 -0500, cjf001 wrote:
>>> Hi guys - here's a non-licensing question for you for a change :)
>>>
>>> I'm back into the depths of slotwise preemption, running
>>> SGEv6.2u5 here on RHEL 5.2. I have 1 four-cpu (four slot)
>>> machine I'm using for testing. I have 2 cluster queues -
>>> "primary" and "secondary". "secondary" is subordinate to
>>> "primary". My test job just sleeps for 4 minutes and then
>>> dumps its environment.
>>>
>>> When I load up the machine with, say, 8 jobs in the secondary
>>> queue, all is well - 4 jobs running, and 4 jobs waiting. Then
>>> when I add *one* job into the primary queue, it suspends one
>>> of the secondary jobs, as expected with slotwise preemption.
>>> Now we have 4 jobs running, one suspended, and 4 waiting.
>>>
>>> If I use the "standard" suspension operation (no custom script),
>>> the state of the jobs sits just like this until the primary
>>> job completes - then the suspended job resumes - again, as
>>> expected.
>>>
>>> However, we use a custom suspension script here that actually
>>> qdel's the suspended job, because we don't like them lying around
>>> on the execute hosts using up memory (we'll resubmit them
>>> later). When I use this suspension method, it gets a little
>>> weird.....
>>>
>>> What happens is that the suspended job disappears (from the qstat
>>> output), as expected, since we killed it. So now we have 4 jobs
>>> running (3 secondary and 1 primary), and 4 jobs waiting (all
>>> secondary). But, for some reason, SGE isn't happy with that - it
>>> tries to run one of the waiting jobs, even though all 4 slots are
>>> full, and it's immediately suspended - so now we're back to 4 jobs
>>> running and one suspended, with just 3 waiting now. We kill the
>>> suspended job, and the same thing happens. Not what we were
>>> expecting....
>>>
>>> So, question is, why is SGE trying to push a 5th job onto
>>> a machine that has only 4 slots, and all 4 are "busy" ? And, is
>>> there a way around this ?
>>>
>>> Thanks,
>>>
>>> John
>>>
>>>
>>
>
>
>



--
###########################################################################
# John Foley                          # Location:  IL93-E1-21S            #
# IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
# LV Simulation Cluster Support       #    Email: john.foley at motorola.com #
# Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
# 600 North US Highway 45             #      Fax: (847) 523-5767          #
# Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
###########################################################################
               (this email sent using SeaMonkey on Windows)



