[GE users] (another) slotwise preemption question

dagru daniel.x.gruber at oracle.com
Fri Aug 27 17:09:43 BST 2010


Hi John,

See my comments inline.

On Friday, 27.08.2010, at 10:16 -0500, cjf001 wrote:
> Thanks Daniel - I saw your post at
> 
> http://blogs.sun.com/templedf/entry/better_preemption

That is not my blog; I don't have one ;)

> where I think you talk about this issue at the bottom.
> So, it all comes back to getting the u6 binaries !
> 
> I did notice in the u6 release notes several "known
> issues" with the slotwise suspension, but at first
> glance they don't look like show-stoppers.
> 
> So, quick question - is the "P" state a new state that
> you see in something like the qstat output ? I don't
> see it on my "cheatsheet" of states.

Yes, it is a new state in u6, and you can see it in the
qstat output. It is very similar to the disabled state
(D), but it is only set in conjunction with slotwise
preemption.
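
To illustrate the difference, here is a toy model (a sketch only, not
Grid Engine code; the names Host, would_suspend_next and
dispatch_secondary are made up for this example). It assumes a host
whose slotwise suspend threshold equals its slot count, and a suspend
method that deletes the suspended job, as in John's setup. Without
suspension prevention, the waiting secondary jobs are started and
killed one after another; with it, the queue instance counts as
preempted (P) and the jobs stay in qw.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical model of one execution host with two queue instances."""
    slots: int                  # slotwise suspend threshold (assumed = slot count)
    primary_running: int = 0    # jobs running in the "primary" queue instance
    secondary_running: int = 0  # jobs running in the "secondary" queue instance
    secondary_slots: int = 4    # slots configured on the secondary queue instance

    def would_suspend_next(self) -> bool:
        """True if starting one more secondary job would exceed the threshold."""
        return self.primary_running + self.secondary_running + 1 > self.slots


def dispatch_secondary(host: Host, waiting: int, suspension_prevention: bool):
    """Try to start waiting secondary jobs.

    Returns (jobs still waiting, jobs killed by the qdel-on-suspend script).
    """
    killed = 0
    while waiting > 0 and host.secondary_running < host.secondary_slots:
        if suspension_prevention and host.would_suspend_next():
            # Modelled u6 behaviour: the queue instance is in state P,
            # so the scheduler skips it and the job stays in qw.
            break
        # Modelled u5 behaviour: the queue instance still has a free slot,
        # so the job is started there.
        waiting -= 1
        host.secondary_running += 1
        if host.primary_running + host.secondary_running > host.slots:
            # Slotwise preemption suspends the new job at once, and the
            # custom suspend method deletes it.
            host.secondary_running -= 1
            killed += 1
    return waiting, killed


if __name__ == "__main__":
    # State after the first suspended job was deleted:
    # 1 primary + 3 secondary jobs running, 4 secondary jobs waiting.
    print(dispatch_secondary(Host(4, 1, 3), waiting=4, suspension_prevention=False))
    print(dispatch_secondary(Host(4, 1, 3), waiting=4, suspension_prevention=True))
```

In this model the first call drains the whole waiting list through the
suspend-and-delete cycle, while the second call leaves all four jobs
waiting until the primary job finishes.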

Daniel


> 
>     Thanks,
> 
>        John
> 
> 
> dagru wrote:
> > 6.2 update 6 introduced an enhancement (suspension
> > prevention) that fixes your issue.
> > When a queue instance is "full" (meaning the next
> > job would be suspended), it goes into the preempted
> > state (P). The scheduler then no longer considers that
> > qinstance when dispatching further jobs. Instead, it
> > searches for a qinstance where the job can run
> > immediately. If none is found, or other resource
> > requests do not match, the job stays in qw.
> >
> > Daniel
> >
> >
> > On Thursday, 26.08.2010, at 16:22 -0500, cjf001 wrote:
> >> Hi guys - here's a non-licensing question for you for a change :)
> >>
> >> I'm back into the depths of slotwise preemption, running
> >> SGEv6.2u5 here on RHEL 5.2. I have 1 four-cpu (four slot)
> >> machine I'm using for testing. I have 2 cluster queues -
> >> "primary" and "secondary". "secondary" is subordinate to
> >> "primary". My test job just sleeps for 4 minutes and then
> >> dumps its environment.
> >>
> >> When I load up the machine with, say, 8 jobs in the secondary
> >> queue, all is well - 4 jobs running, and 4 jobs waiting. Then
> >> when I add *one* job into the primary queue, it suspends one
> >> of the secondary jobs, as expected with slotwise preemption.
> >> Now we have 4 jobs running, one suspended, and 4 waiting.
> >>
> >> If I use the "standard" suspension operation (no custom script),
> >> the state of the jobs sits just like this until the primary
> >> job completes - then the suspended job resumes - again, as
> >> expected.
> >>
> >> However, we use a custom suspension script here that actually
> >> qdel's the suspended job, because we don't like them lying around
> >> on the execute hosts using up memory (we'll resubmit them
> >> later). When I use this suspension method, it gets a little
> >> weird.....
> >>
> >> What happens is that the suspended job disappears (from the qstat
> >> output), as expected, since we killed it. So now we have 4 jobs
> >> running (3 secondary and 1 primary), and 4 jobs waiting (all
> >> secondary). But, for some reason, SGE isn't happy with that - it
> >> tries to run one of the waiting jobs, even though all 4 slots are
> >> full, and it's immediately suspended - so now we're back to 4 jobs
> >> running and one suspended, with just 3 waiting now. We kill the
> >> suspended job, and the same thing happens. Not what we were expecting....
> >>
> >> So, question is, why is SGE trying to push a 5th job onto
> >> a machine that has only 4 slots, and all 4 are "busy" ? And, is
> >> there a way around this ?
> >>
> >>      Thanks,
> >>
> >>       John
> >>
> >>
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277373
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> 
> 
> 
> -- 
> ###########################################################################
> # John Foley                          # Location:  IL93-E1-21S            #
> # IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
> # LV Simulation Cluster Support       #    Email: john.foley at motorola.com #
> # Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
> # 600 North US Highway 45             #      Fax: (847) 523-5767          #
> # Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
> ###########################################################################
>                (this email sent using SeaMonkey on Windows)
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277457
> 

-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Daniel Gruber | Software Engineer
Phone: +49 (0)941 3075-128  (x60128)
ORACLE Grid Engine Engineering
ORACLE Deutschland B.V. & Co. KG | Dr.-Leo-Ritter-Str. 7 | D-93049 Regensburg

ORACLE Deutschland B.V. & Co. KG
Head office: Riesstr. 25, D-80992 München
Court of registration: Amtsgericht München, HRA 95603

General partner: ORACLE Deutschland Verwaltung B.V.
Rijnzathe 6, 3454PV De Meern, Netherlands
Commercial register of the Chamber of Commerce Midden-Niederlande, No. 30143697
Managing directors: Jürgen Kunz, Marcel van de Molen, Alexander van der Ven

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277481
