[GE users] (another) slotwise preemption question

reuti reuti at staff.uni-marburg.de
Thu Aug 26 22:58:08 BST 2010


Hi,

On 26.08.2010, at 23:22, cjf001 wrote:

> Hi guys - here's a non-licensing question for you for a change :)
> 
> I'm back into the depths of slotwise preemption, running
> SGEv6.2u5 here on RHEL 5.2. I have 1 four-cpu (four slot)
> machine I'm using for testing. I have 2 cluster queues -
> "primary" and "secondary". "secondary" is subordinate to
> "primary". My test job just sleeps for 4 minutes and then
> dumps its environment.
> 
> When I load up the machine with, say, 8 jobs in the secondary
> queue, all is well - 4 jobs running, and 4 jobs waiting. Then
> when I add *one* job into the primary queue, it suspends one
> of the secondary jobs, as expected with slotwise preemption.
> Now we have 4 jobs running, one suspended, and 4 waiting.

ok.

> If I use the "standard" suspension operation (no custom script),
> the state of the jobs sits just like this until the primary
> job completes - then the suspended job resumes - again, as
> expected.

I don't see this here, and it's a known bug:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=3233

It does recover after some time, though, and eventually reaches the state you described.


> However, we use a custom suspension script here that actually
> qdel's the suspended job, because we don't like them lying around
> on the execute hosts using up memory (we'll resubmit them
> later). When I use this suspension method, it gets a little
> weird.....
> 
> What happens is that the suspended job disappears (from the qstat
> output), as expected, since we killed it. So now we have 4 jobs
> running (3 secondary and 1 primary), and 4 jobs waiting (all
> secondary). But, for some reason, SGE isn't happy with that - it
> tries to run one of the waiting jobs, even though all 4 slots are
> full, and it's immediately suspended - so now we're back to 4 jobs
> running and one suspended, with just 3 waiting now. We kill the
> suspended job, and the same thing happens. Not what we were expecting....

Whoa - a black hole in the cluster.

Please file an issue, I can confirm this.


> So, question is, why is SGE trying to push a 5th job onto
> a machine that has only 4 slots, and all 4 are "busy" ? And, is
> there a way around this ?

What about using a checkpointing environment for the jobs in the secondary queue, so that suspending a job kills and requeues it? (check-transparent will already do.) That way you wouldn't need a special suspension script like the one you are using right now.

Although the black hole is gone this way, with a checkpointing environment one job keeps oscillating between the "SR" and "Rq" states (with a period of schedule_interval).
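
To make the two behaviours easier to compare, here is a toy model in Python. It is only a sketch and not how SGE is implemented. It rests on two assumptions that this thread does not confirm: that the scheduler looks at free slots per queue instance (so 3 running jobs in "secondary" look like one free slot, although the primary job already occupies the fourth CPU), and that a requeued job goes back to the head of the pending list. The function `simulate` and all of its names are made up for illustration.

```python
from collections import deque


def simulate(cycles, slots=4, pending=4, requeue=False):
    """Toy model of one host with a 'primary' and a subordinate 'secondary' queue.

    Assumption (not confirmed in the thread): free slots are counted per
    queue instance, while slotwise subordination looks at the host total.
    """
    primary_running = 1
    # State right after the first suspended job has been removed.
    secondary_running = slots - 1
    waiting = deque(f"job{i}" for i in range(pending))
    lost = []      # jobs removed by a qdel-style suspend script
    history = []   # jobs that were started and suspended, in order

    for _ in range(cycles):
        if not waiting:
            break
        # Scheduler pass: 'secondary' appears to have a free slot of its own.
        if secondary_running < slots:
            job = waiting.popleft()
            secondary_running += 1
            # Slotwise subordination: host total exceeds the threshold,
            # so the job that was just started is suspended right away.
            if primary_running + secondary_running > slots:
                secondary_running -= 1
                history.append(job)
                if requeue:
                    # Checkpoint-style: kill and requeue.
                    # Assumption: the job returns to the head of the list.
                    waiting.appendleft(job)
                else:
                    # Custom suspend script: the job is deleted.
                    lost.append(job)
    return list(waiting), lost, history


if __name__ == "__main__":
    print("qdel on suspend:   ", simulate(10))
    print("requeue on suspend:", simulate(6, requeue=True))
```

In this model the qdel variant drains the pending list one job per scheduler pass (the black hole), while the requeue variant loses nothing but keeps starting and suspending the same job on every pass, which would match the oscillation seen once per schedule_interval.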

-- Reuti


>    Thanks,
> 
>     John
> 
> 
> -- 
> ###########################################################################
> # John Foley                          # Location:  IL93-E1-21S            #
> # IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
> # LV Simulation Cluster Support       #    Email: john.foley at motorola.com #
> # Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
> # 600 North US Highway 45             #      Fax: (847) 523-5767          #
> # Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
> ###########################################################################
>               (this email sent using SeaMonkey on Windows)
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277226
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277231