[GE users] Rescheduled job joins output with original job

reuti reuti at staff.uni-marburg.de
Fri Aug 14 22:17:34 BST 2009

On 14.08.2009 at 23:07, luisico wrote:

> Hi all,
> I am new to SGE. We are currently testing its potential use in our  
> clusters. I came across a problem with rescheduling jobs when execd
> is killed on the compute node. This is the scenario:
> I submit a job, which starts at node A. Output from the job gets to  
> its default filename in an NFS mount home dir.
> After some time I manually kill sge_execd on node A.

Why did you do so?

> However, the program started within the job is still running on node
> A and the output still goes to the job's stdout.
> After some more time, qmaster changes the state of the queue  
> instance at node A to 'au' and moves the job to the pending list.  
> Output still gets to the job's stdout.

From SGE's point of view the node has crashed, as the execd is gone.

> Now the job has been rescheduled and is running again on node B.  
> Now I get output from both jobs (original job on node A and  
> rescheduled job on node B) to the same output file.
> I now restart sge_execd on node A and the program left over from
> the original job gets automatically killed. Output to the job's
> stdout is now only from the rescheduled job.
> Is this the expected behavior? Shouldn't the job and its children  
> be completely removed before rescheduling the job?
> As a workaround I guess I could redirect the output from my
> program within the script to avoid the mixing.
> By the way, is there an option to not append the output of a job to
> a previously existing file, i.e. each time a job is submitted with the
> same output filename, it should replace the file and not append to it?

You can empty/remove the files in a queue prolog with lines like:
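
A minimal sketch of what such a prolog could look like, assuming the prolog runs as the job owner and that SGE exports SGE_STDOUT_PATH and SGE_STDERR_PATH into its environment (as described in the qsub man page). The function name truncate_job_output is only illustrative, and the script should be checked on a test queue before it is configured as the queue's prolog:

```sh
#!/bin/sh
# Hypothetical prolog sketch: empty the job's output files before the job starts.
# Assumes SGE sets SGE_STDOUT_PATH and SGE_STDERR_PATH for the prolog.

truncate_job_output() {
    for f in "$SGE_STDOUT_PATH" "$SGE_STDERR_PATH"; do
        # Skip unset variables and anything that is not a regular file
        if [ -n "$f" ] && [ -f "$f" ]; then
            : > "$f"    # truncate to zero length, keeping the file itself
        fi
    done
}

truncate_job_output
```

Truncating instead of removing keeps the file in place, so the rescheduled job starts with an empty file. It does not stop a leftover process on the old node from writing to the same file.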




-- Reuti

