[GE users] node failures

Rayson Ho rayrayson at gmail.com
Wed Mar 21 01:58:54 GMT 2007



It all depends on your setup :)

Refer to sge_conf(5), specifically the max_unheard and reschedule_unknown parameters.

online version:
http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman/htmlman5/sge_conf.html
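
As a rough sketch of what those two settings look like in the global
cluster configuration (as printed by qconf -sconf), assuming purely
illustrative values rather than recommendations:

```
# illustrative values only; both use the hh:mm:ss format
max_unheard          00:05:00
reschedule_unknown   00:10:00
```

With values along these lines, a host whose execd has not been heard
from for max_unheard would have its queues marked unknown, and a
non-zero reschedule_unknown lets qmaster reschedule jobs from hosts in
that state. A reschedule_unknown of 00:00:00 turns automatic
rescheduling off. As far as I recall from the man page, only jobs
flagged as rerunnable (for example submitted with qsub -r y) or
suitably configured checkpointing jobs are eligible, so check
sge_conf(5) for the exact rules on your version.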

Rayson



On 3/20/07, Nicholas Senedzuk <nicholas.senedzuk at gmail.com> wrote:
> I have been trying to look up some info on node failures in the grid. When I
> say node failures I mean what happens if the network cable gets pulled from
> a system or the system loses power.
>
> I figure if the system loses its network that the job will continue to run
> but will not be able to let the qmaster know that the job is done if the
> network does not come back before the processing finishes. What happens to
> this job? Does it just get put into an error state on the queue? Does the
> qmaster resubmit the job to another node? Does the error file get written
> to? How do I tell this job never completed?
>
> Now if the system loses power I know that job will never complete. What
> happens to the job? Does it just get put into an error state on the queue?
> Does the qmaster resubmit the job to another node? Does the error file get
> written to? How do I tell this job never completed?
>
>
> Any help that you can provide would be great. Even if you tell me to RTFM,
> at least tell me which one to read, because I am reading the wrong one.
>
>
> Nick
>
