[GE users] Random job failures

bdo brendon.oliver at gmail.com
Mon Dec 7 21:42:11 GMT 2009

On Monday 07 December 2009 22:56:52 craffi wrote:
> Before the suggestions below I'd offer that the usual reason behind
> "jobs work on head node but fail oddly when run under SGE" can sometimes
> be traced to shell, path or environment variables that differ between
> how you run interactively versus under SGE. It's not enough just to run
> and rerun the application - you need to pretty carefully examine the
> total job environment to see "what is different"

Yes, I realise that.  When I said I haven't been able to induce the job to 
fail on demand, I meant when it is run via SGE.
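
A minimal sketch of the "what is different" comparison described above,
assuming the environment has been dumped twice with `env | sort`: once from an
interactive shell and once from inside a job script submitted through SGE. The
file names and the helper names (parse_env, diff_env) are hypothetical, and
multi-line values such as exported shell functions are not handled.

```python
#!/usr/bin/env python3
"""Compare two environment dumps (one NAME=value entry per line)."""
import sys


def parse_env(text):
    """Parse `env` output into a dict; lines without '=' are skipped."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(interactive, batch):
    """Return (only_interactive, only_batch, changed) for two env dicts."""
    only_interactive = {k: interactive[k] for k in interactive.keys() - batch.keys()}
    only_batch = {k: batch[k] for k in batch.keys() - interactive.keys()}
    changed = {
        k: (interactive[k], batch[k])
        for k in interactive.keys() & batch.keys()
        if interactive[k] != batch[k]
    }
    return only_interactive, only_batch, changed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: env_diff.py interactive.env sge.env
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        only_i, only_b, changed = diff_env(parse_env(f1.read()), parse_env(f2.read()))
    for name in sorted(only_i):
        print(f"only interactive: {name}={only_i[name]}")
    for name in sorted(only_b):
        print(f"only under SGE:   {name}={only_b[name]}")
    for name in sorted(changed):
        before, after = changed[name]
        print(f"differs:          {name}: {before!r} -> {after!r}")
```

Variables such as PATH, SHELL and LD_LIBRARY_PATH showing up under "differs"
or "only interactive" are the usual first things to check.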

> With that said ...
> Two suggestions:
> (1) Look into the keep_active=true parameter, more info in the sge_conf
> manpage:

Great, I think this will help me immensely to track down the problem.
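
With keep_active enabled, the per-job spool directory should still be there
after a failure, so it can be inspected. A rough sketch for dumping the tail
of every file in such a directory follows. The directory path is passed in by
hand, and nothing here assumes particular file names; the helper names
(tail_lines, summarise_job_dir) are hypothetical.

```python
#!/usr/bin/env python3
"""Print the last few lines of every file in a kept job spool directory."""
import sys
from pathlib import Path


def tail_lines(path, count=20):
    """Return the last `count` lines of a text file, tolerating odd bytes."""
    text = Path(path).read_text(errors="replace")
    return text.splitlines()[-count:]


def summarise_job_dir(job_dir, count=20):
    """Map each regular file in job_dir to its last `count` lines."""
    summary = {}
    for entry in sorted(Path(job_dir).iterdir()):
        if entry.is_file():  # subdirectories are skipped
            summary[entry.name] = tail_lines(entry, count)
    return summary


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: show_job_dir.py /path/to/kept/job/directory
    for name, lines in summarise_job_dir(sys.argv[1]).items():
        print(f"==> {name} <==")
        for line in lines:
            print(line)
```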

> (2) The second suggestion is that you could consider writing an epilog
> script that would run globally and automatically detect these errors as
> they happen. 

Hmm, we already do something like this (i.e. harvesting the job's stderr log), 
although not via an epilog script.  Thanks for the suggestion, though; it's 
something to consider.
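
For reference, an epilog along the lines suggested might look roughly like the
sketch below. It assumes the epilog sees the job's SGE_STDERR_PATH and JOB_ID
environment variables; the patterns are placeholders, not the errors actually
seen here, and where the report ends up depends on the local setup. It always
finishes normally so that the check itself cannot change the job or queue
state.

```python
#!/usr/bin/env python3
"""Epilog sketch: scan the finished job's stderr file for known error text."""
import os
import re
import sys

# Hypothetical patterns; replace with whatever the failing jobs actually print.
ERROR_PATTERNS = [
    r"Segmentation fault",
    r"No such file or directory",
    r"command not found",
]


def find_error_lines(text, patterns=ERROR_PATTERNS):
    """Return (line_number, line) pairs for lines matching any pattern."""
    regex = re.compile("|".join(patterns))
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


def main():
    stderr_path = os.environ.get("SGE_STDERR_PATH")
    job_id = os.environ.get("JOB_ID", "unknown")
    # Nothing to scan if the variable is unset or the file does not exist.
    if not stderr_path or not os.path.isfile(stderr_path):
        return 0
    with open(stderr_path, errors="replace") as handle:
        hits = find_error_lines(handle.read())
    for number, line in hits:
        print(f"job {job_id}: stderr line {number}: {line}", file=sys.stderr)
    return 0


if __name__ == "__main__":
    main()
```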


- Brendon

