[GE users] job environment and checkpointing ?

cjf001 john.foley at motorola.com
Fri Feb 5 22:19:43 GMT 2010

OK, thanks for the info.  Looks like I may have already
re-invented parts of the wheel, but that's OK - I know
how my wheel works :)

    Thanks again -


reuti wrote:
> Hi,
> On 05.02.2010 at 18:45, cjf001 wrote:
>> Guys -
>> I have a question about the job environment....  SGEv6.2u2.
>> (background....)
>> I have created a system (using SGE's suspend mechanism) that allows
>> a job to be killed, removed from the SGE system, and then resubmitted
>> as a completely new job if it is running in a queue that's set up
>> for suspensions (by being a subordinate queue). This is used for
>> jobs that can't be suspended by simply SIGSTOP'ing them and then
>> SIGCONT'ing them - usually MPI jobs.
> How are you doing this? If you use, e.g., qresub on the suspended job
> (before you kill it), all settings should be the same.
>> Anyway, the jobs are resubmitted as the original user from the original
>> working directory, using the same submission (qsub) command. Unfortunately,
>> I'm finding that some users have set up some environment variables that
>> the jobs need, and that I have not restored - therefore, the resubmitted
>> jobs fail.
>> Now, a guy here that knows something about SGE (i.e., not your typical
>> clueless user ;) ) says that using a checkpoint will help get around this.
>> I, unfortunately, know practically nothing about checkpointing in SGE - I
>> mean, I understand the concept, but have never used it, so I don't know
>> the details.
>> (question....)
>> So, my question is, is there some magic in SGE checkpointing that saves
>> and restores a job's environment? If so, where would I find info on this?
> There is a Howto:
> http://gridengine.sunsource.net/howto/checkpointing.html and some
> state diagrams in:
> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
> When you set "when x" in the checkpointing environment, the job will
> be rescheduled automatically when it is suspended.
> SGE does not provide any checkpointing facility of its own, but
> supports an existing checkpointing mechanism that already works
> outside of SGE. Even if you always want to restart the jobs from the
> beginning, you can still use this feature for your purpose.
> -- Reuti
>> I'm just trying to keep from re-inventing the wheel in my quest to
>> restore the user's environment for my job resubmission, if I can, so I
>> thought I'd ask about this first.
>>      Thanks !
>>        John
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243533
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243535
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
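
For the "restore the user's environment on resubmission" part of the thread, here is a minimal sketch of one way it could be done, not the approach actually used by anyone in this discussion. It assumes the original job (or its submit wrapper) saves its environment to a JSON file, and that the resubmission tool then generates a small sh wrapper that re-exports those variables before starting the real command. The function names, the snapshot file, and the list of skipped variables are all hypothetical; the skip list in particular is only a guess at which variables SGE sets per job and which should therefore not be replayed. Where the resubmission happens from the user's own shell, the documented "qsub -V" option (export the current environment to the job) may already be enough.

```python
import json
import os
import re
import shlex

# Only names that are valid sh identifiers can be exported safely.
VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Assumption: per-job variables the scheduler sets itself; do not replay them.
SKIP = {"JOB_ID", "JOB_NAME", "TMPDIR", "NSLOTS", "PE_HOSTFILE"}


def snapshot_env(path, environ=None):
    """Save an environment (default: this process's) to a JSON file."""
    env = dict(os.environ if environ is None else environ)
    with open(path, "w") as fh:
        json.dump(env, fh)
    return env


def load_env(path):
    """Read back an environment saved by snapshot_env()."""
    with open(path) as fh:
        return json.load(fh)


def render_wrapper(env, command):
    """Return sh script text that re-exports env and then execs command."""
    lines = ["#!/bin/sh"]
    for name in sorted(env):
        if name in SKIP or name.startswith("SGE_"):
            continue
        if not VALID_NAME.match(name):
            continue
        # shlex.quote keeps spaces and quotes in values intact.
        lines.append("export {}={}".format(name, shlex.quote(env[name])))
    lines.append("exec " + " ".join(shlex.quote(arg) for arg in command))
    return "\n".join(lines) + "\n"
```

The generated script text would be written to a file and handed to the same qsub command line as before, in place of the original job script. Values are quoted rather than passed through "qsub -v name=value,...", since that list is comma-separated and values containing commas would need extra care.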

