[GE users] job environment and checkpointing ?

cjf001 john.foley at motorola.com
Fri Feb 5 22:19:43 GMT 2010


OK, thanks for the info.  Looks like I may have already
re-invented parts of the wheel, but that's OK - I know
how my wheel works :)

    Thanks again -

       John


reuti wrote:
> Hi,
>
> On 05.02.2010 at 18:45, cjf001 wrote:
>
>> Guys -
>>
>> I have a question about the job environment (SGE 6.2u2).
>>
>> (background....)
>> I have created a system (using SGE's suspend mechanism) that allows
>> a job to be killed, removed from the SGE system, and then resubmitted
>> as a completely new job if it is running in a queue that's set up
>> for suspensions (by being a subordinate queue). This is used for
>> jobs that can't be suspended by simply SIGSTOP'ing them and then
>> SIGCONT'ing them - usually MPI jobs.
>
> How are you doing this? If you use e.g. qresub on the suspended job
> (before you kill it), all settings should be the same.
>
>
>> Anyway, the jobs are resubmitted as the original user from the
>> original working directory, using the same submission (qsub) command.
>> Unfortunately, I'm finding that some users have set up environment
>> variables that the jobs need and that I have not restored, so the
>> resubmitted jobs fail.
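
One way to approach the environment problem described above, as a minimal
Python sketch under stated assumptions: the resubmission wrapper records the
user's environment when the job is first submitted, then runs the same qsub
command with that environment restored and with qsub's -V option, so the
variables are exported to the job. The JSON file and the function names
(save_environment, load_environment, resubmit_plan, resubmit) are hypothetical
and are not part of SGE; only qsub and its -V flag are.

```python
import json
import os
import subprocess


def save_environment(path, environ=None):
    """Record an environment (default: the current process's) as JSON."""
    env = dict(os.environ if environ is None else environ)
    with open(path, "w") as handle:
        json.dump(env, handle)
    return env


def load_environment(path):
    """Read back an environment written by save_environment."""
    with open(path) as handle:
        return json.load(handle)


def resubmit_plan(qsub_args, env_path):
    """Return (command, environment) for resubmitting a job.

    qsub_args is the user's original qsub argument list. -V is added if it
    is missing so that qsub exports the restored environment to the job.
    """
    command = ["qsub"]
    if "-V" not in qsub_args:
        command.append("-V")
    command.extend(qsub_args)
    return command, load_environment(env_path)


def resubmit(qsub_args, env_path, workdir):
    """Run qsub from the original working directory with the saved environment."""
    command, env = resubmit_plan(qsub_args, env_path)
    return subprocess.run(command, env=env, cwd=workdir, check=True)
```

This only helps if the environment is captured at original submission time
(for example by a qsub wrapper); it cannot recover variables from a job that
was submitted without such a record.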
>>
>> Now, a guy here who knows something about SGE (i.e., not your typical
>> clueless user ;) ) says that using a checkpoint will help get around
>> this. Unfortunately, I know practically nothing about checkpointing in
>> SGE. I understand the concept, but have never used it, so I don't know
>> the details.
>>
>> (question....)
>> So, my question is: is there some magic in SGE checkpointing that saves
>> and restores a job's environment? If so, where would I find info on
>> this?
>
> There is a Howto:
>
> http://gridengine.sunsource.net/howto/checkpointing.html and some
> state diagrams in:
>
> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
>
> When you set "when x" in the checkpointing environment, the job will
> be rescheduled automatically when it is suspended.
>
> SGE does not provide any checkpointing facility of its own; it only
> supports an existing checkpointing mechanism that already works
> outside of SGE. If you always want to restart the jobs from the
> beginning, you can still use this feature for your purpose.
>
> -- Reuti
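
The "when x" setting mentioned above lives in a checkpointing environment
definition. As a rough sketch, assuming the field layout described in the
checkpoint(5) man page, the Python below renders such a definition as text.
The environment name restart_on_suspend, the /tmp checkpoint directory, and
the choice of the application-level interface with all commands set to none
are illustrative assumptions, not a recommended configuration.

```python
def render_ckpt_environment(name, when="x", interface="application-level",
                            ckpt_dir="/tmp"):
    """Render a checkpointing environment definition as text.

    'when' is a combination of the letters s, m, x and r; x asks for the
    job to be aborted and rescheduled as soon as it is suspended.
    """
    if not when or set(when) - set("smxr"):
        raise ValueError("when must be a combination of s, m, x and r")
    fields = [
        ("ckpt_name", name),
        ("interface", interface),
        ("ckpt_command", "none"),     # no real checkpoint is written
        ("migr_command", "none"),
        ("restart_command", "none"),
        ("clean_command", "none"),
        ("ckpt_dir", ckpt_dir),
        ("signal", "none"),
        ("when", when),
    ]
    return "".join(f"{key:<18}{value}\n" for key, value in fields)


if __name__ == "__main__":
    # Hypothetical environment name, chosen for this example only.
    print(render_ckpt_environment("restart_on_suspend"), end="")
```

The rendered text could be saved to a file and loaded with qconf -Ackpt,
the environment added to the queue's ckpt_list, and jobs submitted with
qsub -ckpt restart_on_suspend. Whether this restores everything a given job
needs would have to be checked on the actual cluster.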
>
>
>> I'm just trying to keep from re-inventing the wheel in my quest to
>> restore the user's environment for my job resubmission, if I can, so I
>> thought I'd ask about this first.
>>
>>      Thanks !
>>
>>        John
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243533
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
