[GE users] Rescheduled job joins output with original job
lug2002 at med.cornell.edu
Fri Aug 14 22:49:10 BST 2009
> On 14.08.2009 at 23:07, luisico wrote:
> > Hi all,
> > I am new to SGE. We are currently testing its potential use in our
> > clusters. I came across a problem with rescheduling jobs when execd
> > is killed on the compute node. This is the scenario:
> > I submit a job, which starts at node A. Output from the job goes to
> > its default filename in an NFS-mounted home dir.
> > After some time I manually kill sge_execd on node A.
> Why did you do so?
Just playing, testing points of failure. We'll most probably install a few SGE
instances in our cluster and would like to prepare for possible failures. Expect
a lot more questions in the near future ;-)
> > However the program started within the job is still running on node
> > A and the output still gets to the job's stdout.
> > After some more time, qmaster changes the state of the queue
> > instance at node A to 'au' and moves the job to the pending list.
> > Output still gets to the job's stdout.
> For SGE the node crashed, as the execd is gone.
I understand execd is gone and cannot control the job anymore, but I thought that
since qmaster knows about it, it could do something about it as well. Guess not.
> > Now the job has been rescheduled and is running again on node B.
> > Now I get output from both jobs (original job on node A and
> > rescheduled job on node B) to the same output file.
> > I now restart sge_execd on node A and the program left over from
> > the original job gets automatically killed. Output to the job's
> > stdout is now only from the rescheduled job.
> > Is this the expected behavior? Shouldn't the job and its children
> > be completely removed before rescheduling the job?
> > As a workaround I guess I could redirect the output from my
> > program within the script to avoid the mixing.
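For reference, the redirection workaround could look roughly like the sketch below. The function names and the log-name pattern are made up for illustration, and the `echo` stands in for the real program. `JOB_ID` is set by SGE inside a job. The idea is only that the original run and the rescheduled run end up writing to different files, since the host name and shell PID differ between them.

```shell
#!/bin/sh
# Hypothetical job script fragment (not an SGE feature): give every
# execution of the job its own log file.

# Build a log name from the job id, the execution host and the shell PID.
# JOB_ID is set by SGE inside a job; the fallback is only for running by hand.
private_log_name() {
    printf '%s.%s.%s.log\n' "${JOB_ID:-nojob}" "$(uname -n)" "$$"
}

# Run the given command with stdout and stderr sent to the private log.
run_with_private_log() {
    log=$(private_log_name)
    "$@" > "$log" 2>&1
}

# Placeholder command; replace it with the real program.
run_with_private_log echo "hello from $(uname -n)"
```

The log lands in the current working directory here. A leftover process on node A would keep writing to its own file and leave the rescheduled run's file alone.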
> > By the way, is there an option to not append the output of a job to
> > a previously existing file, i.e. each time a job is submitted with the
> > same output filename, it should replace the file rather than append to it?
> You can remove or empty the files in a queue prolog with lines like:
> rm $SGE_STDERR_PATH
> rm $SGE_STDOUT_PATH
> : > $SGE_STDERR_PATH
> : > $SGE_STDOUT_PATH
> -- Reuti
Why didn't I think about that? Thanks
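
For the archives, a minimal sketch of such a prolog, using the truncation variant. The function name is made up. `SGE_STDOUT_PATH` and `SGE_STDERR_PATH` are the variables from the lines quoted above, and the script would be set as the `prolog` attribute of the queue configuration. The checks for empty values and non-regular files are my own precaution, so that something like /dev/null is left alone.

```shell
#!/bin/sh
# Hypothetical queue prolog: empty a job's output files before the job starts.
# SGE_STDOUT_PATH and SGE_STDERR_PATH are provided by SGE in the prolog environment.

truncate_job_output() {
    for f in "${SGE_STDOUT_PATH:-}" "${SGE_STDERR_PATH:-}"; do
        # Skip unset paths and anything that is not a regular file
        # (for example /dev/null or a directory).
        [ -n "$f" ] && [ -f "$f" ] || continue
        # Truncate in place instead of removing the file.
        : > "$f"
    done
    return 0
}

truncate_job_output
```

This only covers the append question. A leftover process from the original run that still has the file open can keep writing to it, so it does not by itself prevent the mixing described above.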