[GE users] Disappearing hosts/queues with PE's

Andreas Haas Andreas.Haas at Sun.COM
Fri Feb 25 16:29:51 GMT 2005


On Fri, 25 Feb 2005 jeroen.m.kleijer at philips.com wrote:

> Hi Stephan,
>
> It took me a while to compile SGE cleanly (especially the bdb stuff which
> wouldn't work because it's run from an NFS directory.... stupid me,
> should've read the docs) but, the maintrunk has this problem fixed as far
> as I can tell. Hosts no longer disappear.
>
> I've used the same setup with the sge-lam perl script (without the open
> filedescriptor in qrsh_local) and even though I still get a race condition
> in qrsh, it no longer makes my execution host disappear. The qrsh process
> spawns lamboot (which fails) and after three minutes dies.
>
> The job is then returned to a pending state, waiting to be rescheduled and
> leaves the queue in which it was previously started in an error state.
>
> This leaves me with another question:
>
> If starting a parallel environment leaves a queue in an error state, the
> offending job is placed back in a pending state. When resources become
> available it will go to another queue/host, but most likely this will
> fail as well, causing a loop that leaves every queue/host in an error
> state.
>
> Is there any way to prevent this?

Yes.

If your parallel environment startup procedure can detect that the
problems arose from the job itself, it can trigger the job error state by
exiting with status 100. Similarly, the job script can trigger the job
error state by returning 100. Please see 'FORBID_APPERROR' in sge_conf(5).
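
As a rough illustration of the exit-100 convention, here is a minimal
Python sketch of a wrapper that a PE startup procedure could call. The
function names and the JOB_FAULT_MARKERS strings are hypothetical and not
part of Grid Engine; how you decide that a failure is the job's fault
depends entirely on your own startup procedure. Only the meaning of exit
status 100 comes from the explanation above.

```python
import subprocess
import sys

# Exit status that, per the explanation above, puts the job (rather than
# the queue) into error state.
JOB_ERROR_EXIT = 100

# Hypothetical stderr fragments that this site treats as "the job itself
# is at fault". These are assumptions for illustration only.
JOB_FAULT_MARKERS = ("invalid hostfile", "permission denied")


def choose_exit_status(returncode, stderr_text):
    """Map the startup command's result to the wrapper's exit status."""
    if returncode == 0:
        return 0
    lowered = stderr_text.lower()
    if any(marker in lowered for marker in JOB_FAULT_MARKERS):
        # Failure attributed to the job: request job error state.
        return JOB_ERROR_EXIT
    # Any other failure: plain nonzero exit, so the usual handling applies.
    return 1


def run_startup(command):
    """Run the real startup command (a list of arguments) and classify it."""
    result = subprocess.run(command, capture_output=True, text=True)
    return choose_exit_status(result.returncode, result.stderr)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: wrapper.py <startup command> [args...]
    sys.exit(run_startup(sys.argv[1:]))
```

With this approach a job whose startup fails for job-specific reasons
stays in error state instead of being rescheduled onto the next queue.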

Andreas
