[GE users] Rescheduled job joins output with original job
reuti at staff.uni-marburg.de
Fri Aug 14 22:17:34 BST 2009
On 14.08.2009 at 23:07, luisico wrote:
> Hi all,
> I am new to SGE. We are currently testing its potential use in our
> clusters. I came across a problem with rescheduling jobs when execd
> is killed on the compute node. This is the scenario:
> I submit a job, which starts at node A. Output from the job gets to
> its default filename in an NFS mount home dir.
> After some time I manually kill sge_execd on node A.
Why did you do so?
> However the program started within the job is still running on node
> A and the output still gets to the job's stdout.
> After some more time, qmaster changes the state of the queue
> instance at node A to 'au' and moves the job to the pending list.
> Output still gets to the job's stdout.
From SGE's point of view the node has crashed, since the execd is gone.
> Now the job has been rescheduled and is running again on node B.
> Now I get output from both jobs (original job on node A and
> rescheduled job on node B) to the same output file.
> I now restart sge_execd on node A and the program left over from
> the original job gets automatically killed. Output to the job's
> stdout is now only from the rescheduled job.
> Is this the expected behavior? Shouldn't the job and its children
> be completely removed before rescheduling the job?
> As a workaround I guess I could redirect the output from my
> program within the script to avoid the mixing.
> By the way, is there an option to not append the output of a job to
> a previously existing file, i.e. each time a job is submitted with the
> same output filename, it should replace the file and not append to it?
You can empty/remove the files in a queue prolog with lines like:
: > $SGE_STDERR_PATH
: > $SGE_STDOUT_PATH
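
A slightly more defensive prolog script built around those two lines might look like the sketch below. It assumes SGE_STDOUT_PATH and SGE_STDERR_PATH are set in the prolog's environment, as the lines above rely on. The function name truncate_job_output is made up for illustration, and unlike the bare one-liners it only empties files that already exist, so it will not create new ones or touch something like /dev/null.

```shell
#!/bin/sh
# Sketch of a queue prolog that empties a job's output files before the
# job starts. Assumption: SGE_STDOUT_PATH and SGE_STDERR_PATH are set by
# SGE for the prolog. truncate_job_output is a hypothetical helper name.

truncate_job_output() {
    for f in "${SGE_STDOUT_PATH:-}" "${SGE_STDERR_PATH:-}"; do
        # Skip paths that are unset or empty.
        [ -n "$f" ] || continue
        # Only truncate regular files that already exist.
        if [ -f "$f" ]; then
            : > "$f"
        fi
    done
}

truncate_job_output
```

Note that this only gives "replace instead of append" behavior at job start. It does not stop a leftover process from the original run from writing into the same file afterwards.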