[GE users] job environment and checkpointing ?

cjf001 john.foley at motorola.com
Fri Feb 5 17:45:57 GMT 2010

Guys -

I have a question about the job environment. I'm running SGE 6.2u2.

I have created a system (using SGE's suspend mechanism) that allows
a job to be killed, removed from the SGE system, and then resubmitted
as a completely new job if it is running in a queue that's set up
for suspensions (by being a subordinate queue). This is used for
jobs that can't be suspended by simply SIGSTOP'ing them and then
SIGCONT'ing them - usually MPI jobs.

Anyway, the jobs are resubmitted as the original user from the original
working directory, using the same submission (qsub) command. Unfortunately,
I'm finding that some users have set environment variables that the jobs
need and that my resubmission does not restore, so the resubmitted
jobs fail.
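
To make the problem concrete, here is a minimal sketch of one way the
environment could be carried across a resubmission without any help from
checkpointing. It assumes the user's environment can be snapshotted
somewhere their variables are visible (for example in a submit wrapper);
the function names and the JSON snapshot file are hypothetical and are not
part of SGE. The only SGE feature relied on is qsub's -V option, which
exports the environment of the qsub process to the job.

```python
import json
import os


def save_environment(path, environ=None):
    """Snapshot an environment to a JSON file (hypothetical helper).

    Meant to be called where the user's variables are visible,
    e.g. in a wrapper around the original qsub call.
    """
    snapshot = dict(os.environ if environ is None else environ)
    with open(path, "w") as handle:
        json.dump(snapshot, handle)


def load_environment(path):
    """Read a snapshot written by save_environment()."""
    with open(path) as handle:
        return json.load(handle)


def build_resubmit(saved_env, original_qsub_args):
    """Return (argv, env) for resubmitting a job.

    -V tells qsub to export its own environment to the job, so
    running qsub with env=saved_env restores the user's variables.
    """
    argv = ["qsub", "-V"] + list(original_qsub_args)
    return argv, dict(saved_env)
```

The resubmission step would then run something like
subprocess.run(argv, env=env, cwd=original_workdir) as the original user.
Passing the whole snapshot through -V avoids having to quote individual
values, which a NAME=value list given to -v would need.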

Now, a guy here who knows something about SGE (i.e., not your typical clueless
user ;) ) says that using a checkpoint will help get around this.
Unfortunately, I know practically nothing about checkpointing in SGE. I
understand the concept, but I have never used it, so I don't know
the details.

So, my question is: is there some magic in SGE checkpointing that saves
and restores a job's environment? If so, where would I find info on this?
I'm just trying to keep from re-inventing the wheel in my quest to
restore the user's environment for my job resubmission, if I can, so I
thought I'd ask about this first.

    Thanks!


