[GE users] Rescheduled job joins output with original job

templedf dan.templeton at sun.com
Fri Aug 14 22:17:52 BST 2009


That is indeed expected behavior.  If the execd is down, the master has 
no way to do anything on that node.  This is why the rerunnable 
attribute is something that the job or queue has to request.  It may not 
be appropriate for all jobs.

Do you expect that your execds will be dying often?  That's not exactly 
a common occurrence, at least not without losing the entire execution 
node.  You could always set up a watchdog process to make sure that if 
the execd goes down, it comes right back up again.  Once the execd comes 
back up, the master will resolve the issue with duplicate jobs.
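
A minimal sketch of such a watchdog, meant to be run periodically (for 
example from cron) on each execution node.  The function names are made 
up for illustration, and the start command shown in the usage comment is 
an assumption about a standard install; adjust it to your site.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: restart sge_execd if it is not running.

execd_running() {
    # pgrep -x matches the exact process name; succeeds if one is found.
    pgrep -x sge_execd >/dev/null 2>&1
}

restart_if_down() {
    # $1: command that succeeds when the daemon is alive
    # remaining arguments: command used to start the daemon
    check_cmd=$1
    shift
    if "$check_cmd"; then
        echo "running"
    else
        echo "restarting"
        "$@"
    fi
}

# Example call (the start script path is an assumption, not verified here):
#   restart_if_down execd_running "$SGE_ROOT/$SGE_CELL/common/sgeexecd" start
```

Keeping the check and the start command as arguments makes it easy to 
try the logic with harmless stand-ins before pointing it at the real 
daemon.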

The over-engineered solution is to use a custom starter method or prolog 
to change the job's name before it runs.  That way, if it's rescheduled, 
it will start writing to a different file.
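
A lighter variant of the same idea is to do the redirection inside the 
job script itself, so that each attempt writes to its own file.  This is 
only a sketch: attempt_output_path is a made-up helper, and the naming 
scheme (job name, job id, host, start time) is an assumption, not 
anything SGE does on its own.

```shell
#!/bin/sh
# Hypothetical helper: build an output path that is unique per attempt.

attempt_output_path() {
    # $1: directory, $2: job name, $3: job id, $4: host, $5: start stamp
    printf '%s/%s.o%s.%s.%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Possible use near the top of a job script (JOB_NAME and JOB_ID are set
# by SGE in the job environment):
#   out=$(attempt_output_path "$HOME" "$JOB_NAME" "$JOB_ID" \
#         "$(hostname)" "$(date +%Y%m%dT%H%M%S)")
#   exec >"$out" 2>&1
```

Because the host and start time are part of the name, the original run 
on one node and the rescheduled run on another end up in different 
files.  The plain ">" redirection also truncates the file rather than 
appending to it.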

Daniel

luisico wrote:
> Hi all,
>
> I am new to SGE. We are currently testing its potential use in our clusters. I came across a problem with rescheduling jobs when execd is killed on the compute node. This is the scenario:
>
> I submit a job, which starts on node A. Output from the job goes to its default file in an NFS-mounted home dir.
>
> After some time I manually kill sge_execd on node A. However, the program started within the job is still running on node A and the output still goes to the job's stdout.
>
> After some more time, qmaster changes the state of the queue instance at node A to 'au' and moves the job to the pending list. Output still goes to the job's stdout.
>
> Now the job has been rescheduled and is running again on node B. Now I get output from both jobs (original job on node A and rescheduled job on node B) to the same output file.
>
> I now restart sge_execd on node A and the program left over from the original job gets automatically killed. Output to the job's stdout is now only from the rescheduled job.
>
> Is this the expected behavior? Shouldn't the job and its children be completely removed before rescheduling the job?
>
> As a workaround I guess I could redirect the output from my program within the script to avoid the mixing.
>
> By the way, is there an option to not append the output of a job to a previously existing file, i.e. each time a job is submitted with the same output filename, it should replace the file and not append to it?
>
> Thanks
>
> Luis
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=212301
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=212302