[GE users] Users defined exit values for job requeue (issue #1053)

Bogdan Costescu bogdan.costescu at iwr.uni-heidelberg.de
Mon May 17 17:21:14 BST 2004


On Mon, 17 May 2004, Ron Chen wrote:

> And Kirk would like the cluster admin. to be able to define a list
> of exit values, and when your jobs exit with one of those, SGE would
> requeue them as well.

While I don't oppose the idea, I don't think it's a very good one
either, as it is error-prone. Most of the programs whose source code
I've seen (or written :-)) have a highly variable policy for exit
codes, or none at all. In most cases exit code 0 means success, but
anything else is arbitrary. Furthermore, there is often no distinction
in exit code between a permanent error and a transient error.

So, in most cases, it's safer to do something like:

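#!/bin/csh
run_my_program
set rc = $status	# csh keeps the exit code in $status; save it at once
if ($rc == 45) then
	exit 99		# known transient error: ask SGE to reschedule
endif
exit $rc		# pass every other exit code through unchanged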

If and only if I know that the specific exit code 45 from this program 
means a transient error, I translate it into SGE exit code 99 which 
means rescheduling.
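
The same translation can be written for several known transient codes
at once. This is only a rough sketch in plain sh, not anything SGE
provides: the names TRANSIENT_CODES and map_exit_code are made up for
illustration, and the list must hold only codes that you know to be
transient for this particular program.

```shell
#!/bin/sh
# Sketch only: TRANSIENT_CODES and map_exit_code are made-up names.
# 99 is the SGE exit code that asks for rescheduling, as described above.
TRANSIENT_CODES="45"    # space-separated; assumed known for this program

# Print the exit code the job script should hand to SGE when the
# program exited with code $1.
map_exit_code() {
    for code in $TRANSIENT_CODES; do
        if [ "$1" -eq "$code" ]; then
            echo 99     # transient error: request rescheduling
            return 0
        fi
    done
    echo "$1"           # anything else is passed through unchanged
}
```

In a job script this would be used right after the program has run:
save its exit code with rc=$? and then call exit "$(map_exit_code "$rc")".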

> Is defining a list in the queue config like OK with
> you:
>   reschedule_exit_status 1,2,99

This would mean that _any_ job run through that queue whose exit code
equals one of those listed will trigger rescheduling. If the admin
has control over what jobs are executed and knows what exit codes to
expect, this works fine. If not, bad things can happen:

#!/bin/csh
grep bogdan /etc/passwd		# hoping that 'bogdan' doesn't exist 
				# on your computer and grep returns 1
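
To make the problem concrete, here is a small sh sketch that mimics
the proposed setting. The helper would_reschedule and the variable
RESCHEDULE_LIST are made up for illustration and only imitate the
proposal quoted above; they are not part of SGE. The input line is a
stand-in for /etc/passwd so that the result doesn't depend on your
machine.

```shell
#!/bin/sh
# Sketch only: would_reschedule is a made-up helper that imitates the
# *proposed* queue setting "reschedule_exit_status 1,2,99".
RESCHEDULE_LIST="1,2,99"

# Succeeds (returns 0) if exit code $1 appears in RESCHEDULE_LIST.
would_reschedule() {
    case ",$RESCHEDULE_LIST," in
        *",$1,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# grep exits with 1 when it simply finds no match; nothing went wrong.
printf 'root:x:0:0:root:/root:/bin/sh\n' | grep bogdan
rc=$?
if would_reschedule "$rc"; then
    echo "exit code $rc: job would be rescheduled"
else
    echo "exit code $rc: job would finish normally"
fi
```

Since the input never changes, grep would return 1 on every run, so
such a job would be rescheduled again and again.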

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De

