[GE users] node failures

Nicholas Senedzuk nicholas.senedzuk at gmail.com
Wed Mar 21 00:30:53 GMT 2007



I have been trying to look up some info on node failures in the grid. By
node failures I mean what happens if the network cable gets pulled from a
system or the system loses power.

I figure that if the system loses its network connection, the job will
continue to run but will not be able to let the qmaster know that it is done
if the network does not come back before processing finishes. What happens
to this job? Does it just get put into an error state on the queue? Does the
qmaster resubmit the job to another node? Does the error file get written
to? How do I tell that this job never completed?

Now if the system loses power, I know that the job will never complete. What
happens to the job? Does it just get put into an error state on the queue?
Does the qmaster resubmit the job to another node? Does the error file get
written to? How do I tell that this job never completed?


Any help that you can provide would be great. Even if you tell me to RTFM,
at least tell me which one to read, because I am reading the wrong one.


Nick


