[GE users] Rescheduled job joins output with original job

luisico lug2002 at med.cornell.edu
Fri Aug 14 22:49:10 BST 2009


> Am 14.08.2009 um 23:07 schrieb luisico:
> 
> > Hi all,
> >
> > I am new to SGE. We are currently testing its potential use in our  
> > clusters. I came across a problem with rescheduling jobs when execd  
> > is killed on the compute node. This is the scenario:
> >
> > I submit a job, which starts on node A. Output from the job goes to  
> > its default filename in an NFS-mounted home dir.
> >
> > After some time I manually kill sge_execd on node A.
> 
> Why did you do so?

Just playing, testing points of failure. We'll most probably install a few SGE
instances in our cluster and would like to prepare for possible failures. Expect
a lot more questions in the near future ;-)

> 
> > However the program started within the job is still running on node  
> > A and the output still gets to the job's stdout.
> >
> > After some more time, qmaster changes the state of the queue  
> > instance at node A to 'au' and moves the job to the pending list.  
> > Output still gets to the job's stdout.
> 
> For SGE the node crashed, as the execd is gone.

I understand execd is gone and cannot control the job anymore, but I thought that
since qmaster knows about it, it could do something about it as well. Guess not.

> >
> > Now the job has been rescheduled and is running again on node B.  
> > Now I get output from both jobs (original job on node A and  
> > rescheduled job on node B) to the same output file.
> >
> > I now restart sge_execd on node A and the program left over from  
> > the original job gets automatically killed. Output to the job's  
> > stdout is now only from the rescheduled job.
> >
> > Is this the expected behavior? Shouldn't the job and its children  
> > be completely removed before rescheduling the job?
> >
> > As a workaround I guess I could redirect the output from my  
> > program within the script to avoid the mixing.
> >
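
A minimal sketch of that workaround, assuming the job script is a plain sh
script. JOB_ID is set by SGE inside a job; the helper run_log_name, the file
name pattern and ./my_program are placeholders made up for illustration. A
rescheduled job keeps its job id, so the host name is included to keep the run
on node A apart from the run on node B:

```shell
#!/bin/sh
# Build a per-run log file name from a directory, a job id and a host name.
# run_log_name and the "myprog.*.log" pattern are illustrative, not part of SGE.
run_log_name() {
    # $1 = directory, $2 = job id, $3 = host name
    printf '%s/myprog.%s.%s.log\n' "$1" "$2" "$3"
}

# Only start the program inside an SGE job, where JOB_ID is set.
if [ -n "${JOB_ID:-}" ]; then
    LOG=$(run_log_name "$HOME" "$JOB_ID" "$(hostname)")
    # ./my_program is a placeholder for the real program; its stdout and
    # stderr go to the per-run file instead of the job's shared stdout.
    ./my_program > "$LOG" 2>&1
fi
```

This only separates the output. The leftover program on node A keeps running
until execd comes back there.
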
> > By the way, is there an option to not append the output of a job to  
> > a previously existing file, i.e. each time a job is submitted with the  
> > same output filename, it should replace it and not append?
> 
> you can empty/remove the files in a queue prolog with lines like:
> 
> rm -f "$SGE_STDERR_PATH"
> rm -f "$SGE_STDOUT_PATH"
> 
> or
> 
> : > "$SGE_STDERR_PATH"
> : > "$SGE_STDOUT_PATH"

> -- Reuti

Why didn't I think about that? Thanks
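
For the archive, a minimal sketch of such a prolog, assuming it is installed
through the queue's prolog attribute and sees SGE_STDOUT_PATH and
SGE_STDERR_PATH in its environment as in the lines quoted above. The helper
name truncate_if_set is made up for illustration:

```shell
#!/bin/sh
# Prolog sketch: empty a job's stdout/stderr files before the job starts,
# so a new run with the same output file name does not append to old output.
# truncate_if_set is an illustrative helper, not something SGE provides.
truncate_if_set() {
    # Skip unset or empty paths so the redirection never gets an empty name.
    [ -n "$1" ] || return 0
    # ":" does nothing; the redirection truncates (or creates) the file.
    : > "$1"
}

truncate_if_set "${SGE_STDOUT_PATH:-}"
truncate_if_set "${SGE_STDERR_PATH:-}"
```

Truncating keeps the file in place, while rm removes it. Note that this would
also empty the file when a job is rescheduled, so the output of the first
attempt is lost.
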

Luis

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=212306
