[GE users] qmaster and tightly integrated tasks

Reuti reuti at staff.uni-marburg.de
Fri Jan 20 21:37:30 GMT 2006


Hi,

On 20.01.2006 at 16:13, Thomas Neumann wrote:

> Hello !
>
> While analysing a problem with one of my jobscripts I came across the
> following behaviour of the qmaster when running tightly integrated
> tasks:
>
> If a tightly integrated task fails, the qmaster writes something like
> this to the messages file:
> 'tightly integrated parallel task 20416.0 task 988.cmp1 fails -
> killing job'
> A few seconds later the job is killed.
>
> In some cases this behaviour is a problem for my scripts. For
> example, the jobscript starts several processes that have a
> predefined timeout. When the timeout expires, all subtasks are killed
> with SIGKILL or SIGTERM, but the script is designed to react to the
> timeout condition and continue execution (e.g. cleanup routines for
> shared memory, or other tasks still to run).
> Unfortunately the qmaster registers the killed subtasks and
> terminates the job before the script can finish.
>
> Is there a way to prevent the qmaster from killing a job because of
> tightly integrated subtask failures?
>

I see three options:

a) use a custom terminate_method for this queue

b) submit with the -notify option to qsub; the warning signal sent
ahead of the kill can be trapped in your jobscript

c) use a soft limit for the timeout; the signal sent when it is
exceeded can also be trapped in your jobscript
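
For b) and c), a minimal sketch of what trapping could look like,
written as a Python jobscript. It assumes the usual Grid Engine signal
mapping (SIGUSR1 for an exceeded s_rt soft limit or, with -notify,
ahead of a suspend; SIGUSR2 ahead of a kill); check queue_conf(5) and
qsub(1) for your version. The names on_warning, install_handlers and
cleanup are made up for illustration, and the subtask handling itself
is left out.

```python
#!/usr/bin/env python3
"""Hypothetical jobscript skeleton: trap warning signals, then clean up."""
import signal
import sys

# Assumption: SIGUSR1 arrives for an exceeded soft run-time limit (s_rt)
# or, with `qsub -notify`, ahead of a suspend; SIGUSR2 ahead of a kill.
WARNING_SIGNALS = (signal.SIGUSR1, signal.SIGUSR2)

received = []  # signal numbers seen so far, in arrival order


def on_warning(signum, frame):
    """Only record the signal; the main flow decides what to do with it."""
    received.append(signum)


def install_handlers():
    """Replace the default action (terminate) with our own handler."""
    for sig in WARNING_SIGNALS:
        signal.signal(sig, on_warning)


def cleanup():
    """Placeholder for the script's own cleanup (shared memory, temp files)."""
    return "cleanup done"


def main():
    install_handlers()
    # ... start and supervise the subtasks here (omitted) ...
    if received:
        print("warning signal(s) received:", received, file=sys.stderr)
    print(cleanup())


if __name__ == "__main__":
    main()
```

The handler only records the signal, so the cleanup runs in the normal
control flow rather than inside the handler. With -notify, the time
available between the warning and the actual kill is set by the
queue's notify parameter.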

-- Reuti


> Thanks,
>    Thomas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
