[GE users] Reservation of resources for the duration of checkpointing so that no other job can use those resources.

reuti reuti at staff.uni-marburg.de
Wed Aug 18 10:55:50 BST 2010


Am 18.08.2010 um 10:18 schrieb sgerns:

> <snip>
> this is not exactly what I meant. You can have SGE trigger a checkpointing mechanism on its own when your application already supports it. So: you didn't define any checkpointing environment in SGE, but are doing it "by hand" from SGE's point of view, i.e. outside of SGE (`man sge_ckpt` and `man checkpoint`).
> Yes, our application supports checkpointing, and we are doing application-level checkpointing for all applications. There is one checkpointing environment that we have created for every application.

aha - great. How is the migration triggered? You defined your checkpointing script under the entry for "migr_command"?

> > which will decide which job to checkpoint based on the priority, & then we will send the checkpointing signal to those jobs automatically): nothing is manual, we are not doing it by hand.
> > If these automation scripts (wrappers of the job scripts) decide which jobs to checkpoint, as shown below, e.g. job 102 & job 100
> >
> > > 4. Suppose I have selected job IDs 100 & 102 for checkpointing & sent the checkpointing signal to those jobs, & it has taken 2 hours to checkpoint these jobs.
> > >    I do not want any other job to run on these resources during these 2 hours,
> > >    because there is a possibility that small jobs can get the resources & start running.
> >
> > You mean, you e.g. suspend these jobs, which will be checkpointed as a result? As SGE thinks the queue is free again, it might be used by other jobs. Why not suspend the normal.q@exehostXY
> I mentioned "suspend" here because, instead of being suspended, the job will get a signal when it runs in a checkpointing environment with the proper settings.

Setting "when x" in the checkpointing environment should then suspend the job (it will show state "s" in `qstat`), but instead of being suspended by a "sigstop", the defined migration script will be invoked.
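
A minimal sketch of such a checkpointing environment, in case it helps. The name `app_ckpt`, the script path and the `ckpt_dir` are placeholders I made up, not values from this thread; the field names are the ones from `man checkpoint`. The definition is written to a file and registered with `qconf -Ackpt` only where `qconf` is available:

```shell
#!/bin/sh
# Sketch only: environment name, script path and ckpt_dir are placeholders.
ckpt_file="${TMPDIR:-/tmp}/app_ckpt.conf"

# "when x" asks SGE to start the migration command when the job is suspended.
cat > "$ckpt_file" <<'EOF'
ckpt_name          app_ckpt
interface          application-level
ckpt_command       none
migr_command       /path/to/migrate.sh
restart_command    none
clean_command      none
ckpt_dir           /path/to/ckpt_dir
signal             none
when               x
EOF

# Register the environment only on a machine that really has SGE installed.
if command -v qconf >/dev/null 2>&1; then
    qconf -Ackpt "$ckpt_file"
else
    echo "qconf not found; definition written to $ckpt_file"
fi
```

A job would then have to be submitted with `qsub -ckpt app_ckpt ...` for the "when x" setting to apply to it.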

> <snip>
> ******************************************************************************************************************************
> Yes, I understand now: when we "suspend" from qmon, a checkpointing signal will be sent if a checkpointing environment has been created in SGE.
> Now, as you suggested, suspend the queue instances on which the normal jobs are running:
> 1. qmod -d normal.q@exehostXY


> 2. I have a doubt whether we will still be able to do checkpointing once we have disabled the normal queue instances.
>     qmod -d normal.q@exehost
>         then "suspend" for checkpoint (if a checkpointing environment is set)
>        Is this possible?

Why not? `qmod -sj <jobid>` after disabling the queue. Disabled queues just don't accept new jobs. Take care: this will start only the migrate script, not the checkpoint script. The state diagram in http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf shows the actual implementation, contrary to `man ckpt`.
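
As a rough sketch of that order (disable first, then suspend), using the queue instance and job IDs from your example as placeholders. The `run` wrapper and the `DRY_RUN` switch are my own additions so the script only prints the commands unless you set `DRY_RUN=0`:

```shell
#!/bin/sh
# Sketch only: queue instance and job IDs are the examples from this thread.
# DRY_RUN=1 (the default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Disable the queue instance so no new job can start there, then suspend
# each job so that the migration command of its checkpointing environment
# is started.
checkpoint_on_instance() {
    queue_instance="$1"
    shift
    run qmod -d "$queue_instance" || return 1
    for job_id in "$@"; do
        run qmod -sj "$job_id" || return 1
    done
}

checkpoint_on_instance "normal.q@exehostXY" 100 102
```

Re-enabling the queue instance afterwards with `qmod -e normal.q@exehostXY` is my assumption about what you would do once the checkpoints are written.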

> 3. Or the other way round: if we checkpoint the job first (i.e. send the "suspend" signal for the checkpointing environment), can we then suspend the queue instances on which the job is running?
>        "suspend" for checkpoint (if a checkpointing environment is set)
>         qmod -d normal.q@exehost
>       Or is this possible?

As long as the job is running (i.e. it shows state "s" in `qstat` while in reality it's performing its migr_command), it's a matter of taste which order you prefer. The one I suggest is safer, as there can't be any race condition.
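
To check the state from a script, something like the following could do. It is only a sketch and assumes the default plain-text `qstat` listing, with the job ID in the first column and the state in the fifth; the helper name `job_state` is made up:

```shell
#!/bin/sh
# Sketch only: assumes the default plain `qstat` column layout
# (job-ID, prior, name, user, state, ...).
job_state() {
    # $1: job ID; the qstat listing is read from stdin.
    awk -v id="$1" '$1 == id { print $5; exit }'
}

# Usage on a cluster (job ID 100 is the example from this thread):
#   qstat | job_state 100
```

While the migration script is still working, this should keep printing "s" for the job.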

-- Reuti

> *********************************************************************************************************************************************
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=274953
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=275131