[GE users] Don't want bad jobs to restart

templedf dan.templeton at sun.com
Thu May 28 19:21:34 BST 2009


Hmm... Interesting issue. The queue is being set into an error state 
because the job can't complete: the disk is full. For that reason, you 
can't do anything with an epilog, because it too would fail for the 
same reason. How about doing this:

1) Set up a prolog that:
a) Checks whether the job context contains a flag, like tried=1. If so, 
the job failed in its previous run, so exit with 100 (Grid Engine treats 
a prolog exit code of 100 as a job error).
b) Adds a tried=1 to the job context.
2) Set up an epilog that removes tried=1 from the job context.

That way, any job that fails in a catastrophic way, preventing the 
epilog from running, will be flagged for failure next time it runs.
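The two steps above can be sketched as a prolog/epilog pair. Everything Grid Engine-specific here ($JOB_ID, "qstat -j", "qalter -ac"/"-dc") is an assumption about the 6.x command set and is shown only in comments; the one piece of live code is a small helper that checks a context string for tried=1, so the sketch can be tried without a cluster.

```shell
#!/bin/sh
# Sketch of the prolog side of the tried=1 scheme. Assumptions: SGE
# exports $JOB_ID to the prolog/epilog, "qstat -j" prints a "context:"
# line, and "qalter -ac"/"-dc" add/delete job context entries.

# Return 0 if the comma-separated context string contains tried=1.
context_has_tried() {
    case ",$1," in
        *,tried=1,*) return 0 ;;
        *)           return 1 ;;
    esac
}

# In a real prolog, something like:
#
#   ctx=$(qstat -j "$JOB_ID" | awk '/^context:/ {print $2}')
#   if context_has_tried "$ctx"; then
#       exit 100          # second attempt: flag the job as failed
#   fi
#   qalter -ac tried=1 "$JOB_ID"
#
# and the matching epilog, which only runs when the job ends cleanly:
#
#   qalter -dc tried "$JOB_ID"
```

The delimiter handling in context_has_tried (wrapping the string in commas before matching) is just so that tried=1 is matched as a whole entry and not as a substring of some other context value.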

Daniel

mad wrote:
> If a user job fails because it fills up a disk, the compute node is  
> taken off the queue list with an error.  That's good, so that I can  
> see the problem.  The bad part is that the "errant" job is not killed,  
> but rather seems to reschedule itself.
>
> How do I get the system to kill these jobs when they use up resources  
> on a compute node?
>
> Grid Engine  6.1u4
>
> Thanks
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=199503
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=199507


