[GE users] (another) slotwise preemption question

cjf001 john.foley at motorola.com
Fri Aug 27 16:16:47 BST 2010


Thanks Daniel - I saw your post at

http://blogs.sun.com/templedf/entry/better_preemption

where I think you talk about this issue at the bottom.
So it all comes back to getting the u6 binaries!

I did notice several "known issues" with slotwise
suspension in the u6 release notes, but at first glance
they don't look like show-stoppers.

So, quick question - is the "P" state a new state that
shows up in the qstat output? I don't see it on my
"cheatsheet" of states.
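
I'm guessing it would show up in the "states" column of
"qstat -f", something like the sketch below (I'm making up
the layout and host name here, so take this as a rough
guess rather than real output):

   queuename                  qtype resv/used/tot. load_avg arch        states
   ----------------------------------------------------------------------------
   secondary@node01           BIP   0/4/4          4.02     lx24-amd64  P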

    Thanks,

       John


dagru wrote:
> With 6.2 update 6, an enhancement (suspension
> prevention) was introduced which fixes your issue.
> When a queue instance is "full" (in the sense that the
> next job dispatched to it would be suspended), it goes
> into the preempted state (P). This means the scheduler
> no longer considers that qinstance when dispatching
> further jobs; instead it looks for a qinstance where it
> can let the job run immediately. If none is found, or
> other resource requests do not match, the job stays in qw.
>
> Daniel
>
>
> On Thursday, 26.08.2010 at 16:22 -0500, cjf001 wrote:
>> Hi guys - here's a non-licensing question for you for a change :)
>>
>> I'm back into the depths of slotwise preemption, running
>> SGE v6.2u5 here on RHEL 5.2. I have one four-CPU (four-slot)
>> machine I'm using for testing, and 2 cluster queues -
>> "primary" and "secondary" - with "secondary" subordinate
>> to "primary". My test job just sleeps for 4 minutes and
>> then dumps its environment.
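>>
>> For reference, the relevant part of the "primary" queue
>> config is the slotwise subordinate_list - roughly the
>> following (quoting from memory, so the exact threshold
>> syntax may be slightly off):
>>
>>    $ qconf -sq primary | grep subordinate
>>    subordinate_list      slots=4(secondary:0:sr)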
>>
>> When I load up the machine with, say, 8 jobs in the secondary
>> queue, all is well - 4 jobs running, and 4 jobs waiting. Then
>> when I add *one* job into the primary queue, it suspends one
>> of the secondary jobs, as expected with slotwise preemption.
>> Now we have 4 jobs running, one suspended, and 4 waiting.
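>>
>> (The submissions themselves are nothing fancy - just
>> something like this, with a hypothetical wrapper script
>> name:
>>
>>    for i in 1 2 3 4 5 6 7 8 ; do qsub -q secondary sleep_test.sh ; done
>>    qsub -q primary sleep_test.sh
>>
>> ...where sleep_test.sh is the 4-minute sleep job.)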
>>
>> If I use the "standard" suspension operation (no custom script),
>> the state of the jobs sits just like this until the primary
>> job completes - then the suspended job resumes - again, as
>> expected.
>>
>> However, we use a custom suspension script here that actually
>> qdel's the suspended job, because we don't want suspended jobs
>> sitting around on the execute hosts using up memory (we'll
>> resubmit them later). When I use this suspension method, things
>> get a little weird...
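>>
>> (For context, the custom method is wired in through the
>> queue's suspend_method and is basically just a qdel
>> wrapper - simplified, and with a made-up path, it amounts
>> to this:
>>
>>    #!/bin/sh
>>    # /opt/sge/scripts/suspend_kill.sh - invoked as the queue's
>>    # suspend_method with the job id as its argument; instead of
>>    # SIGSTOPping the job we remove it so it frees its memory,
>>    # and we resubmit it ourselves later.
>>    qdel "$1"
>>
>> ...and suspend_method in the queue config pointing at that
>> script with $job_id as the argument.)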
>>
>> What happens is that the suspended job disappears (from the qstat
>> output), as expected, since we killed it. So now we have 4 jobs
>> running (3 secondary and 1 primary), and 4 jobs waiting (all
>> secondary). But, for some reason, SGE isn't happy with that - it
>> tries to run one of the waiting jobs, even though all 4 slots are
>> full, and it's immediately suspended - so we're back to 4 jobs
>> running and one suspended, with just 3 waiting. We kill the
>> suspended job, and the same thing happens. Not what we were expecting.
>>
>> So, the question is: why is SGE trying to push a 5th job onto
>> a machine that has only 4 slots, all of which are "busy"? And is
>> there a way around this?
>>
>>      Thanks,
>>
>>       John
>>
>>
>



-- 
###########################################################################
# John Foley                          # Location:  IL93-E1-21S            #
# IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
# LV Simulation Cluster Support       #    Email: john.foley at motorola.com #
# Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
# 600 North US Highway 45             #      Fax: (847) 523-5767          #
# Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
###########################################################################
               (this email sent using SeaMonkey on Windows)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277457

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


