[GE users] Rescheduled job joins output with original job

luisico lug2002 at med.cornell.edu
Fri Aug 14 22:21:38 BST 2009


Thanks for the quick answer. We don't expect the execds to die often; I was trying to understand how SGE reacts to different failures so that we can prepare for them. The watchdog is probably the easiest solution.

Thanks again

Luis

> That is indeed expected behavior.  If the execd is down, the master has 
> no way to do anything on that node.  This is why the rerunnable 
> attribute is something that the job or queue has to request.  It may not 
> be appropriate for all jobs.
> 
> Do you expect that your execds will be dying often?  That's not exactly 
> a common occurrence, at least not without losing the entire execution 
> node.  You could always set up a watchdog process to make sure that if 
> the execd goes down, it comes right back up again.  Once the execd comes 
> back up, the master will resolve the issue with duplicate jobs.
> 
> The over-engineered solution is to use a custom starter method or prolog 
> to change the job's name before it runs.  That way, if it's rescheduled, 
> it will start writing to a different file.
> 
> Daniel
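
For reference, the watchdog idea from the reply above could look roughly like the sketch below. It is only an illustration, not something from the thread: it assumes the daemon shows up as a process named sge_execd, that pgrep is available on the execution host, and that the site restarts the daemon with the sgeexecd script under $SGE_ROOT/$SGE_CELL/common. The function names are hypothetical.

```python
import subprocess
import time
from typing import Callable, Optional


def is_running(process_name: str) -> bool:
    """Return True if pgrep finds a process with exactly this name."""
    # pgrep -x exits with status 0 only when an exact name match exists.
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watchdog_step(check: Callable[[], bool], restart: Callable[[], None]) -> bool:
    """Run one check and call restart() if the daemon is down.

    Returns True when a restart was attempted.
    """
    if check():
        return False
    restart()
    return True


def watch(
    check: Callable[[], bool],
    restart: Callable[[], None],
    interval_seconds: float = 60.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until max_checks is reached (forever if None).

    Returns the number of restart attempts.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if watchdog_step(check, restart):
            restarts += 1
        checks += 1
        sleep(interval_seconds)
    return restarts


# Hypothetical usage on an execution host (the script path is an assumption,
# adjust it to the local installation):
#
#   import os
#   script = os.path.join(os.environ["SGE_ROOT"],
#                         os.environ.get("SGE_CELL", "default"),
#                         "common", "sgeexecd")
#   watch(lambda: is_running("sge_execd"),
#         lambda: subprocess.run([script, "start"], check=False))
```

The check and restart actions are passed in as callables so the polling loop can be exercised without a real execd. In practice a cron job or the init system's own respawn facility may do the same job with less code.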

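The starter-method suggestion could be approximated as follows. Note that this sketch swaps in a simpler technique: instead of renaming the job, it redirects the job's output to a file whose name includes a timestamp, so a rescheduled run writes to a new file. It assumes the starter method receives the job command as its arguments and that JOB_NAME and JOB_ID are set in the job environment; the helper names and the file naming scheme are hypothetical.

```python
import subprocess
import time
from typing import Mapping, Optional, Sequence


def unique_output_path(env: Mapping[str, str], now: Optional[float] = None) -> str:
    """Build an output file name that differs for each start of the job."""
    job_name = env.get("JOB_NAME", "job")
    job_id = env.get("JOB_ID", "0")
    # UTC timestamp with one-second resolution; now=None means "current time".
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    return f"{job_name}.o{job_id}.{stamp}"


def run_with_fresh_output(
    command: Sequence[str],
    env: Mapping[str, str],
    now: Optional[float] = None,
) -> int:
    """Run the job command with stdout and stderr sent to a new file.

    Returns the exit status of the command.
    """
    path = unique_output_path(env, now)
    with open(path, "w") as out:
        completed = subprocess.run(command, stdout=out, stderr=subprocess.STDOUT)
    return completed.returncode


# Hypothetical use as a starter method script:
#
#   import os, sys
#   sys.exit(run_with_fresh_output(sys.argv[1:], os.environ))
```

Two starts of the same job within the same second would still share a file, so a site that cares about that case would need a finer-grained suffix.
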
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=212304
