[GE users] Rescheduled job joins output with original job

luisico lug2002 at med.cornell.edu
Fri Aug 14 22:07:06 BST 2009

Hi all,

I am new to SGE. We are currently testing its potential use in our clusters. I came across a problem with job rescheduling when sge_execd is killed on the compute node. This is the scenario:

I submit a job, which starts on node A. Output from the job goes to its default filename in an NFS-mounted home directory.

After some time I manually kill sge_execd on node A. However, the program started within the job is still running on node A and its output still goes to the job's stdout.

After some more time, qmaster changes the state of the queue instance on node A to 'au' and moves the job to the pending list. Output still goes to the job's stdout.

Now the job has been rescheduled and is running again on node B. At this point I get output from both jobs (the original job on node A and the rescheduled job on node B) in the same output file.

I now restart sge_execd on node A and the program left over from the original job gets killed automatically. Output to the job's stdout now comes only from the rescheduled job.

Is this the expected behavior? Shouldn't the job and its children be completely removed before rescheduling the job?

As a workaround, I guess I could redirect the output of my program within the job script to avoid the mixing.
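A minimal sketch of that workaround, assuming the job script is a small Python wrapper around the real program. The filename pattern `myprog.<job id>.<host>.out` and both function names are made up for illustration; `JOB_ID` is the environment variable SGE sets inside a running job. Because the execution host is part of the filename, the leftover run on node A and the rescheduled run on node B would write to different files.

```python
import os
import socket
import subprocess


def attempt_output_path(directory, environ=None, host=None):
    """Build an output filename unique to this job and execution host."""
    environ = os.environ if environ is None else environ
    host = socket.gethostname() if host is None else host
    # JOB_ID is set by SGE inside a job; "nojob" is a fallback for runs outside SGE.
    job_id = environ.get("JOB_ID", "nojob")
    # Hypothetical naming scheme: myprog.<job id>.<host>.out
    return os.path.join(directory, "myprog.{0}.{1}.out".format(job_id, host))


def run_redirected(command, path):
    """Run command with stdout and stderr written to path, replacing old content."""
    with open(path, "w") as out:
        return subprocess.call(command, stdout=out, stderr=subprocess.STDOUT)


# Hypothetical usage inside the job script:
#   run_redirected(["./my_program", "arg1"], attempt_output_path(os.getcwd()))
```

Opening the file with mode "w" also means a rerun on the same host replaces the earlier file instead of appending to it.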

By the way, is there an option to not append the output of a job to a previously existing file? That is, each time a job is submitted with the same output filename, it should replace the file rather than append to it.
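Failing such an option, one thing I could try is to empty the file from inside the job before the program starts. This is only a sketch under assumptions: it relies on the `SGE_STDOUT_PATH` environment variable that SGE sets for a job, the function name is hypothetical, and it only behaves cleanly if the job's stdout was opened in append mode, which I have not verified. It would also discard the output of the first attempt when a job is rescheduled.

```python
import os


def truncate_job_stdout(environ=None):
    """Empty the file named by SGE_STDOUT_PATH; return True if it was truncated."""
    environ = os.environ if environ is None else environ
    path = environ.get("SGE_STDOUT_PATH")
    # Skip when the variable is unset or does not name a regular file
    # (for example /dev/null).
    if not path or not os.path.isfile(path):
        return False
    # Opening in "w" mode truncates the file to zero length.
    with open(path, "w"):
        pass
    return True
```

Calling `truncate_job_stdout()` as the first step of the job script would leave only the current run's output in the file.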



