[GE users] (another) slotwise preemption question

dagru daniel.x.gruber at oracle.com
Fri Aug 27 10:44:03 BST 2010

6.2 update 6 introduced an enhancement ("suspension
prevention") which fixes your issue.
When a queue instance is "full" (in the sense that the
next job dispatched to it would be suspended), it goes
into the preempted state (P). This means the scheduler
no longer considers that qinstance when dispatching
further jobs; instead it searches for a qinstance where
the job can run immediately. If none is found, or other
resource requests do not match, the job stays in qw.
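You can watch this from the command line; a quick check (a sketch, assuming a subordinate queue named "secondary" as in John's setup), since a preempted qinstance shows P in the state column of the full listing:

```shell
# Show the state column per queue instance; on 6.2u6+ a "full"
# qinstance is flagged P (preempted) and skipped by the scheduler
# until a slot frees up.
qstat -f -q secondary
```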


On Thursday, 26.08.2010 at 16:22 -0500, cjf001 wrote:
> Hi guys - here's a non-licensing question for you for a change :)
> I'm back into the depths of slotwise preemption, running
> SGEv6.2u5 here on RHEL 5.2. I have 1 four-cpu (four slot)
> machine I'm using for testing. I have 2 cluster queues -
> "primary" and "secondary". "secondary" is subordinate to
> "primary". My test job just sleeps for 4 minutes and then
> dumps its environment.
> When I load up the machine with, say, 8 jobs in the secondary
> queue, all is well - 4 jobs running, and 4 jobs waiting. Then
> when I add *one* job into the primary queue, it suspends one
> of the secondary jobs, as expected with slotwise preemption.
> Now we have 4 jobs running, one suspended, and 4 waiting.
> If I use the "standard" suspension operation (no custom script),
> the state of the jobs sits just like this until the primary
> job completes - then the suspended job resumes - again, as
> expected.
> However, we use a custom suspension script here that actually
> qdel's the suspended job, because we don't like them lying around
> on the execute hosts using up memory (we'll resubmit them
> later). When I use this suspension method, it gets a little
> weird.....
> What happens is that the suspended job disappears (from the qstat
> output), as expected, since we killed it. So now we have 4 jobs
> running (3 secondary and 1 primary), and 4 jobs waiting (all
> secondary). But, for some reason, SGE isn't happy with that - it
> tries to run one of the waiting jobs, even though all 4 slots are
> full, and it's immediately suspended - so now we're back to 4 jobs
> running and one suspended, with just 3 waiting now. We kill the
> suspended job, and the same thing happens. Not what we were expecting....
> So, question is, why is SGE trying to push a 5th job onto
> a machine that has only 4 slots, and all 4 are "busy" ? And, is
> there a way around this ?
>     Thanks,
>      John
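For reference, the kind of custom suspension John describes could be wired up roughly like this (a sketch, not his actual script; the script path is an assumption, and $job_id is the queue_conf pseudo-variable substituted when the method is invoked). The queue's suspend_method is pointed at a script that qdel's the job instead of SIGSTOPping it:

```shell
#!/bin/sh
# Hypothetical suspend method (sketch). Configure on the subordinate
# queue, e.g. via qconf -mq secondary:
#   suspend_method  /path/to/qdel_on_suspend.sh $job_id
# SGE substitutes $job_id when calling this script, so rather than
# leaving a stopped job holding memory on the exec host, delete it
# (to be resubmitted later).
qdel "$1"
```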

