[GE users] (another) slotwise preemption question

cjf001 john.foley at motorola.com
Fri Aug 27 17:25:32 BST 2010


Thanks - too many Dans and Daniels, I guess :)

    John


dagru wrote:
> Hi John,
>
> see inline
>
> On Friday, 27.08.2010, at 10:16 -0500, cjf001 wrote:
>> Thanks Daniel - I saw your post at
>>
>> http://blogs.sun.com/templedf/entry/better_preemption
>
> This is not my blog, I don't have any ;)
>
>> where I think you talk about this issue at the bottom.
>> So, it all comes back to getting the u6 binaries !
>>
>> I did notice in the u6 release notes several "known
>> issues" with the slotwise suspension, but at first
>> glance they don't look like show-stoppers.
>>
>> So, quick question - is the "P" state a new state that
>> you see in something like the qstat output ? I don't
>> see it on my "cheatsheet" of states.
>
> Yes, it is a new state in u6 and you can see it in the
> qstat output. It is very similar to the disabled state
> (D) but is only set in conjunction with slotwise
> preemption.
>
> Daniel
>
>
>>
>>      Thanks,
>>
>>         John
>>
>>
>> dagru wrote:
>>> 6.2 update 6 introduced an enhancement (suspension
>>> prevention) which fixes your issue.
>>> When a queue instance is "full" (in the sense that the next
>>> job would be suspended), it goes into the preempted
>>> state (P). This means the scheduler no longer considers
>>> the qinstance when dispatching further jobs.
>>> Instead it searches for a qinstance where the
>>> job can run immediately. If none is found, or other resource
>>> requests do not match, the job stays in qw.
>>>
>>> Daniel
>>>
>>>
>>> On Thursday, 26.08.2010, at 16:22 -0500, cjf001 wrote:
>>>> Hi guys - here's a non-licensing question for you for a change :)
>>>>
>>>> I'm back into the depths of slotwise preemption, running
>>>> SGEv6.2u5 here on RHEL 5.2. I have 1 four-cpu (four slot)
>>>> machine I'm using for testing. I have 2 cluster queues -
>>>> "primary" and "secondary". "secondary" is subordinate to
>>>> "primary". My test job just sleeps for 4 minutes and then
>>>> dumps its environment.
>>>>
>>>> When I load up the machine with, say, 8 jobs in the secondary
>>>> queue, all is well - 4 jobs running, and 4 jobs waiting. Then
>>>> when I add *one* job into the primary queue, it suspends one
>>>> of the secondary jobs, as expected with slotwise preemption.
>>>> Now we have 4 jobs running, one suspended, and 4 waiting.
>>>>
>>>> If I use the "standard" suspension operation (no custom script),
>>>> the state of the jobs sits just like this until the primary
>>>> job completes - then the suspended job resumes - again, as
>>>> expected.
>>>>
>>>> However, we use a custom suspension script here that actually
>>>> qdel's the suspended job, because we don't like them lying around
>>>> on the execute hosts using up memory (we'll resubmit them
>>>> later). When I use this suspension method, it gets a little
>>>> weird.....
>>>>
>>>> What happens is that the suspended job disappears (from the qstat
>>>> output), as expected, since we killed it. So now we have 4 jobs
>>>> running (3 secondary and 1 primary), and 4 jobs waiting (all
>>>> secondary). But, for some reason, SGE isn't happy with that - it
>>>> tries to run one of the waiting jobs, even though all 4 slots are
>>>> full, and it's immediately suspended - so now we're back to 4 jobs
>>>> running and one suspended, with just 3 waiting now. We kill the
>>>> suspended job, and the same thing happens. Not what we were expecting....
>>>>
>>>> So, question is, why is SGE trying to push a 5th job onto
>>>> a machine that has only 4 slots, and all 4 are "busy" ? And, is
>>>> there a way around this ?
>>>>
>>>>       Thanks,
>>>>
>>>>        John
>>>>
>>>>
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277373
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
>>
>>
>> --
>> ###########################################################################
>> # John Foley                          # Location:  IL93-E1-21S            #
>> # IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
>> # LV Simulation Cluster Support       #    Email: john.foley at motorola.com #
>> # Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
>> # 600 North US Highway 45             #      Fax: (847) 523-5767          #
>> # Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
>> ###########################################################################
>>                 (this email sent using SeaMonkey on Windows)
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277457
>>
>
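
The behaviour discussed in this thread can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Grid Engine code: the `Host` class, `enforce_slotwise` and `simulate` are hypothetical names made up for illustration, the victim is assumed to be the most recently started secondary job, the custom suspend method is assumed to delete the job outright, and the primary job is assumed never to finish during the simulation. It compares the u5 behaviour John describes with the suspension prevention Daniel describes for u6.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy model of one execution host shared by two queue instances."""
    slots: int                                     # slotwise threshold
    primary: list = field(default_factory=list)    # running primary jobs
    secondary: list = field(default_factory=list)  # running secondary jobs

    def busy(self):
        return len(self.primary) + len(self.secondary)

    def preempted(self):
        # "Full" in the sense used above: the next job would be suspended.
        return self.busy() >= self.slots


def enforce_slotwise(host, killed):
    # Assumption: the newest secondary job is the one suspended, and the
    # custom suspend method deletes it instead of leaving it suspended.
    while host.busy() > host.slots and host.secondary:
        killed.append(host.secondary.pop())


def simulate(suspension_prevention, slots=4, cycles=10):
    host = Host(slots=slots)
    host.secondary = [f"s{i}" for i in range(1, slots + 1)]
    waiting = [f"s{i}" for i in range(slots + 1, 2 * slots + 1)]
    killed = []

    # One primary job arrives and pushes the host over its threshold.
    host.primary.append("p1")
    enforce_slotwise(host, killed)

    for _ in range(cycles):  # scheduler runs
        if not waiting:
            break
        if suspension_prevention and host.preempted():
            continue  # qinstance is in state P: skip it, jobs stay in qw
        if len(host.secondary) < host.slots:
            host.secondary.append(waiting.pop(0))
            enforce_slotwise(host, killed)
    return killed, waiting


if __name__ == "__main__":
    print(simulate(suspension_prevention=False))
    # (['s4', 's5', 's6', 's7', 's8'], [])
    print(simulate(suspension_prevention=True))
    # (['s4'], ['s5', 's6', 's7', 's8'])
```

Without suspension prevention, every waiting secondary job is dispatched into the freed slot and deleted in turn. With it, only the first victim is lost and the remaining jobs stay waiting.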




------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277484
