[GE users] Preemption vs dedicating nodes by group?

Reuti reuti at staff.uni-marburg.de
Thu Apr 28 23:35:20 BST 2005

Hi Jim,

Quoting "Marconnet, James E Mr /Computer Sciences Corporation" 
<james.marconnet at smdc.army.mil>:

> Justus:
> Thanks for sharing your approach. Sounds promising. Hope it works for you
> and others.
> I like the idea of having the secondary queue open to all users instead of
> having to maintain specific secondary queues for each and every group. Or did
> I understand that correctly?
> "may be looking at" suggests that you have it somewhat or all set up, but
> not necessarily fully tested yet. Does this work as described?
> You also mentioned possible file corruption... Is this just a worry, or
> have you run into it?
> You sound like you are using 5.3p6. I don't know if there is any
> substantial difference between that and 6.0u3 in these matters. Anyone?
> Sounds like it is the subordination of the secondary queue instance to the
> primary queue that triggers your checkpointing of specific jobs running on
> specific nodes. From my reading so far, it's not clear to me that secondary
> queue subordination by the primary queue would result in specific job
> suspension, then leading to the checkpointing. Subordination seems to only
> affect assignment of additional jobs to a node using a queue, not affecting
> currently running jobs. Perhaps if you allow node oversubscription and then
> have a suspend-limit set, it might occur. Can you clarify how you actually
> trigger the checkpointing? Is it somehow triggered by the -ckpt in the
> qsub?

If a checkpointing interface (set up properly with "when x") was requested
in qsub, any suspension of the job (whether by hand or by any type of
threshold) will not suspend the job, but will trigger requeuing instead.
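
To make the "when x" part concrete, here is a minimal sketch of such a
checkpointing environment. The name "restart_on_suspend" and the file
location are made up for the example, and the field list is the 6.0 layout
as I remember it (5.3 has an additional queue_list line), so please compare
it with the output of "qconf -sckpt" for an existing environment before
relying on it.

```shell
#!/bin/sh
# Sketch only: the environment name "restart_on_suspend" and the file
# location are assumptions made for this example.
CKPT_FILE="${TMPDIR:-/tmp}/restart_on_suspend.ckpt"

# "when x" asks for requeuing whenever the job would be suspended.
# With all commands set to "none" nothing is saved: the job is simply
# killed and started again later.
cat > "$CKPT_FILE" <<'EOF'
ckpt_name          restart_on_suspend
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               x
EOF

# Register the environment only on a machine that has the SGE tools.
if command -v qconf >/dev/null 2>&1; then
    qconf -Ackpt "$CKPT_FILE"
else
    echo "qconf not found, wrote $CKPT_FILE only"
fi
```

A job would then request it with something like
"qsub -ckpt restart_on_suspend job.sh", where job.sh stands for the user's
own script.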

> What happens if the user does not include that option?

Their jobs will be suspended as usual.
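
A job that does request the checkpointing environment, on the other hand,
is started again from the top after requeuing, so its script should
tolerate that. A rough sketch of one way to do it, assuming the job writes
a single result file: the file and directory names are invented, and the
RESTARTED variable is what I recall SGE setting to 1 for a restarted job.

```shell
#!/bin/sh
# Sketch of a restart-tolerant job script. Directory and file names are
# hypothetical. Outside SGE the RESTARTED variable is unset, so it is
# defaulted to 0 here.
RESTARTED="${RESTARTED:-0}"
WORK="${TMPDIR:-/tmp}/sge_restart_demo"
mkdir -p "$WORK" && cd "$WORK" || exit 1

OUT="result.dat"
TMP="$OUT.partial.$$"

if [ "$RESTARTED" = "1" ]; then
    # A previous run was killed: discard whatever it left half-written.
    rm -f "$OUT".partial.*
fi

# Stand-in for the real computation; write to the temporary file only.
echo "computed value" > "$TMP"

# Publish the result with a rename within the same directory, so that
# $OUT is either absent or complete, never half-written.
mv "$TMP" "$OUT"
```

The point of the rename is that a kill at any moment leaves only a
".partial" file behind, which the next run removes.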

> I've read about checkpointing, and have partially set this scheme up on
> several test queues using qmon, but I don't have it far enough along to test.
> There is a place in qmon to enter checkpoint, migration, restart, and clean
> commands. Are these commands something that SGE already has and uses unless
> you put something different here, or are these for scripts that the user
> needs to write and to store somewhere (on the individual nodes)? Are these
> scripts specific to the jobs, or generic? Sorry, this is a little unclear
> to me in the documentation.

There is a Howto:


Cheers - Reuti

> I'll bet explaining this scheme, how it works, and why to add that extra
> -ckpt option to the users was interesting.
> Hopefully by the time you see and can reply to this, tired won't be an
> issue!
> Anyone doing something like this or different to accomplish the same
> objectives have anything to share?
> Thanks!
> Jim Marconnet
> -----Original Message-----
> From: Justus Loerke [mailto:loerke at molgen.mpg.de] 
> Sent: Thursday, April 28, 2005 11:44 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Preemption vs dedicating nodes by group?
> Hm, I think I may be looking at some similar setup to balance load between
> two user groups... I just configured primary/secondary queues like this:
> primary queues open only to specified user lists via user access in the
> queue configuration (giving some kind of dedicated queues), secondary
> queues open to all users, but I added checkpointing to kill & restart jobs
> on secondary queues instead of suspending or disabling them. Jobs are
> submitted without specification of the queue, I let GE worry about sending
> the jobs to the right nodes. (Note: I have one primary and secondary queue
> for every single machine, is it possible to configure multi-host queues
> with 5.3p6?)
> Sec. queues are configured subordinated to primary, so once the primary
> queue is empty, jobs will be started on secondary queues, but will be
> terminated immediately (and eventually restarted) once a job in the
> dedicated primary queue is started.
> The setup is something like this:
> - configure primary and secondary queues, secondary subordinated.
> - add userdefined checkpointing scheme, add the restart on suspend option
>   and add all secondary queues for this checkpointing scheme
> - add "-ckpt scheme" to the qsub command
> This way I don't even need a load sensor to check queue contents (what I
> looked at first), like Reuti suggested previously. In effect, this kills
> secondary jobs once a primary one comes along to take the dedicated queue.
> This may be troublesome if you're worried about file corruption by killed
> jobs (like I am, since so far we haven't really been using checkpointing on
> any level at all), but maybe this can be solved on the scripting level (but
> I haven't looked into that yet).
> I hope I'm making some sense, been a long day ;)
> Cheers,
> Justus
> --
> -------------------------------------------
> Dipl. Phys. Justus Loerke
> - UltraStrukturNetzwerk -
> Max Planck Institute for Molecular Genetics
> Ihnestr. 63-73
> D-14195 Berlin
> Tel.:   +49-30-8413-1644
> Fax:    +49-30-8413-1385
> E-mail: loerke at molgen.mpg.de  
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> ---------------------------------------------------------------------
