[GE users] job exit codes and tasks pending via hold_jid arguments

reuti reuti at staff.uni-marburg.de
Tue Dec 22 20:03:40 GMT 2009


On 22.12.2009 at 19:33, craffi wrote:

> Given this page:
> http://wikis.sun.com/display/GridEngine/Troubleshooting+and+Error+Messages
> ... it seems to show tables that say that any exit code other than 0,
> 100 or 99 from a job will indicate a successful job execution to SGE.
> Exit 100 seems nice but from memory I recall that it will put the job
> into Eqw state or something similar that requires human interaction to
> manually remove.
> The reason this came up is due to a simple workflow that uses job
> dependencies, there are times where the first job encounters a specific
> error case and exits with a special workflow-meaningful code of 255 --
> it looks like SGE does not see this as an actual failure and thus allows
> the dependent jobs to go on for dispatch and execution.

Correct. The job dependency only checks whether the job has left the
system. If the exit code is 100 or 99, the job stays in the system, so
the dependent jobs will of course keep waiting. But with any other
exit code the follow-up jobs will start.
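
To make the rule concrete, here is a minimal sketch of it in Python. It only encodes the behaviour described above; the function name is made up for illustration and is not part of SGE.

```python
def dependents_released(exit_code: int) -> bool:
    """Return True if jobs held via -hold_jid would be released.

    Hypothetical helper, only encoding the rule from this thread:
    with exit code 99 or 100 the job stays in the system, so the
    dependent jobs keep waiting; with any other code the job leaves
    the system and the follow-up jobs become eligible to start.
    """
    return exit_code not in (99, 100)


# The workflow-specific code 255 does not hold back the follow-up jobs.
for code in (0, 99, 100, 255):
    print(code, dependents_released(code))
```

So from the point of view of -hold_jid, an exit code of 255 is handled just like 0.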

> Looking for the proper way to exit on error in a way that does not make
> the job linger and also does not allow any jobs with -hold_jid set to
> execute when the upstream task leaves the system.
> Epilog script? Qalter or qdel from within the first job? Something else?

Well, this looks like a use case for an idea I put on the list a couple
of days ago:


You could just use an intermediate job. I mean, the follow-up job of
the real job is just a fake one that only makes decisions (it could even
run in a special queue which is always available); in my pseudo-code
this is the qdecide. It can then use qdel, maybe by name or so (with its
own name as a prefix). As this fake job won't use resources, it can
also run on the headnode, which avoids the need to make all exechosts
submit hosts as well (or to use rsh/ssh to the headnode).
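
A rough sketch of what such a decision job could do, written in Python. Several things here are assumptions and not from this thread: the helper names, the convention that follow-up jobs are named "<decision job name>.<step>", and that the exit code of the real job is made available to the decision job in some way (e.g. a status file written by the workflow). It relies only on qdel accepting a job name pattern.

```python
import subprocess


def qdel_command(exit_code, decider_name):
    """Return the qdel argv that cancels the follow-up jobs, or None.

    Assumption: follow-up jobs are named "<decider_name>.<step>", so the
    pattern "<decider_name>.*" matches them but not the decision job itself.
    """
    if exit_code == 0:
        return None  # real job succeeded, let the follow-up jobs run
    return ["qdel", decider_name + ".*"]


def decide(exit_code, decider_name, runner=subprocess.run):
    """Run qdel if needed; `runner` can be replaced for a dry run."""
    cmd = qdel_command(exit_code, decider_name)
    if cmd is not None:
        runner(cmd)
    return cmd


# Dry run: print the command instead of calling qdel.
decide(255, "wf1", runner=print)
```

The follow-up jobs would then be submitted with -hold_jid on the decision job instead of on the real job, so they only start once the decision has been made.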

I don't know when I will have time to implement this in a complete
way. For now I'm not sure about the syntax for such a workflow-in-job- 

You can also check:


but AFAIR it will submit the follow-up job (or not) only once the
first one has finished. And the controller has to run all the time.
Unfortunately the project seems to be dead, as I didn't get any reply.

-- Reuti

> Regards,
> Chris
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=234638
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


