[GE users] Random job failures

bdo brendon.oliver at gmail.com
Mon Dec 7 21:42:11 GMT 2009


On Monday 07 December 2009 22:56:52 craffi wrote:
> Before the suggestions below I'd offer that the usual reason behind
> "jobs work on head node but fail oddly when run under SGE" can sometimes
> be traced to shell, path or environment variables that differ between
> how you run interactively versus under SGE. It's not enough just to run
> and rerun the application - you need to pretty carefully examine the
> total job environment to see "what is different"

Yes, I realise that.  When I said I haven't been able to induce the job to 
fail on demand, I meant when it is run via SGE.
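
For anyone wanting to do the "what is different" comparison mechanically, here 
is a minimal sketch. It assumes the environment has been dumped with `env` once 
from an interactive shell and once from inside a job script, into two files 
whose names are up to you. The function names are made up for illustration, and 
values that span several lines are not handled.

```python
import sys


def parse_env(text):
    """Parse `env`-style output (one NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(interactive, batch):
    """Return (only_interactive, only_batch, changed) as sorted name lists."""
    only_interactive = sorted(set(interactive) - set(batch))
    only_batch = sorted(set(batch) - set(interactive))
    changed = sorted(
        name
        for name in set(interactive) & set(batch)
        if interactive[name] != batch[name]
    )
    return only_interactive, only_batch, changed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (file names are placeholders): python envdiff.py login.env job.env
    with open(sys.argv[1]) as first, open(sys.argv[2]) as second:
        result = diff_env(parse_env(first.read()), parse_env(second.read()))
    for label, names in zip(("only interactive", "only batch", "changed"), result):
        print(label + ": " + ", ".join(names))
```

Variables such as PATH or LD_LIBRARY_PATH showing up under "changed" are the 
first things worth checking.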

> With that said ...
> 
> Two suggestions:
> 
> (1) Look into the keep_active=true parameter, more info in the sge_conf
> manpage:

Great, I think this will help me immensely to track down the problem.
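
As I understand it, with keep_active set the execution daemon leaves the 
job's spool directory in place after the job finishes, so there is something 
left to look at after a failure. A rough sketch for skimming such a directory 
follows. The directory location is an assumption you must supply yourself, 
since it depends on the execd spool setup, and the helper name is hypothetical.

```python
import os


def tail_files(directory, max_lines=5):
    """Map each regular file in `directory` to its last `max_lines` lines."""
    summary = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-regular entries
        # errors="replace" keeps odd bytes from aborting the scan
        with open(path, "r", errors="replace") as handle:
            lines = handle.read().splitlines()
        summary[name] = lines[-max_lines:]
    return summary


def print_summary(directory, max_lines=5):
    """Print the tail of every file so a failed job can be skimmed quickly."""
    for name, lines in tail_files(directory, max_lines).items():
        print("== " + name)
        for line in lines:
            print("   " + line)
```

Calling print_summary() on the retained directory of a failed job and of a 
successful one gives two listings that can be compared side by side.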
 
> (2) The second suggestion is that you could consider writing an epilog
> script that would run globally and automatically detect these errors as
> they happen. 

Hmm, we already do something like this (i.e. harvesting the job's stderr log), 
although not via an epilog script. Thanks for the suggestion, though; it's 
something to consider.
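
For the archives, the kind of stderr check meant here can be sketched as 
below, whether it is called from an epilog or from a separate harvesting step. 
The patterns are placeholders rather than the errors we actually see, and the 
path to the stderr file is assumed to be passed in by whatever calls the script.

```python
import re
import sys

# Hypothetical patterns; replace with whatever the failing jobs really print.
ERROR_PATTERNS = [
    r"Segmentation fault",
    r"No such file or directory",
    r"Permission denied",
]


def find_errors(lines, patterns=ERROR_PATTERNS):
    """Return (line_number, text) for every line matching any pattern."""
    compiled = [re.compile(pattern) for pattern in patterns]
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(regex.search(line) for regex in compiled):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (path is a placeholder): python scan_stderr.py /path/to/job.stderr
    with open(sys.argv[1], "r", errors="replace") as handle:
        for number, text in find_errors(handle):
            print(str(number) + ": " + text)
```

What to do with a hit (mail someone, log it, hold the queue) is a site 
decision and is left out of the sketch.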


Regards,

- Brendon

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232100
