[GE users] job environment and checkpointing ?

reuti reuti at staff.uni-marburg.de
Fri Feb 5 18:10:39 GMT 2010


Hi,

On 05.02.2010 at 18:45, cjf001 wrote:

> Guys -
>
> I have a question about the job environment....  SGEv6.2u2.
>
> (background....)
> I have created a system (using SGE's suspend mechanism) that allows
> a job to be killed, removed from the SGE system, and then resubmitted
> as a completely new job if it is running in a queue that's set up
> for suspensions (by being a subordinate queue). This is used for
> jobs that can't be suspended by simply SIGSTOP'ing them and then
> SIGCONT'ing them - usually MPI jobs.

How are you doing this? If you use e.g. qresub on the suspended job
(before you kill it), all settings should be the same.


> Anyway, the jobs are resubmitted as the original user from the original
> working directory, using the same submission (qsub) command. Unfortunately,
> I'm finding that some users have set up some environmental variables that
> the jobs need, and that I have not restored - therefore, the resubmitted
> jobs fail.
>
> Now, a guy here that knows something about SGE (ie, not your typical
> clueless user ;) ) says that using a checkpoint will help get around
> this. I, unfortunately, know practically nothing about checkpointing in
> SGE - I mean, I understand the concept, but have never used it, so I
> don't know the details.
>
> (question....)
> So, my question is, is there some magic in SGE checkpointing that saves
> and restores a job's environment ? If so, where would I find info on
> this ?

There is a Howto:

http://gridengine.sunsource.net/howto/checkpointing.html

and some state diagrams in:

http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf

When you set "when x" in the checkpointing environment, the job will
automatically be rescheduled when it gets suspended.
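
As a rough sketch of such a checkpointing environment (the name
"restart_on_suspend" and the ckpt_dir are made up for illustration, and
all the command entries are left at "none" because nothing is really
checkpointed here), the definition could look like this:

```
ckpt_name          restart_on_suspend
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               x
```

You would add something like this with "qconf -ackpt restart_on_suspend"
and then list it in the ckpt_list attribute of the subordinate queue.
Please check checkpoint(5) for the exact meaning of each field in your
version before relying on it.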

SGE does not provide any checkpointing facility of its own; it only
integrates with a checkpointing mechanism which is already working
outside of SGE. Even if you always want to restart the jobs from the
beginning, you can still use this feature for your purpose.
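
To illustrate the "start from the beginning" case, here is a minimal job
script sketch. It assumes a checkpointing environment with the made-up
name "restart_on_suspend" exists and is attached to the queue; the
function name describe_run is also just for illustration. The -V option
exports the environment of the submitting shell into the job, and SGE
sets RESTARTED to 1 when a checkpointing job is started again. Whether
this covers every variable your users rely on is something you would
have to test.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -V
#$ -ckpt restart_on_suspend

# Report whether this is the first start or a restart after a reschedule.
# RESTARTED is set by SGE; default to 0 when run outside of SGE.
describe_run() {
    if [ "${RESTARTED:-0}" = "1" ]; then
        echo "restarted: starting again from the beginning"
    else
        echo "first run"
    fi
}

describe_run

# The real program would be started here (left out of this sketch).
```

The job stays the same job in SGE across the reschedule, so there is no
second qsub involved where settings could get lost.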

-- Reuti


> I'm just trying to keep from re-inventing the wheel in my quest to
> restore the user's environment for my job resubmission, if I can, so I
> thought I'd ask about this first.
>
>     Thanks !
>
>       John
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243533
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243535
