[GE users] sge jobs when a node crashes

Nicolas Joly njoly at pasteur.fr
Mon May 1 21:09:40 BST 2006


On Mon, May 01, 2006 at 12:49:24PM -0700, Jinal Jhaveri wrote:
> Hi All,
> 
> Recently I have been seeing a situation where a node on which a job is 
> running crashes, but the job is still shown in "r" state in qstat.

Same here. We have a problem with an lx26-amd64 node crashing about every
3 weeks; each time, we get jobs hanging forever ...

> Surprising thing is that when I issue a "qhost" command, SGE correctly 
> thinks that the node is not available.
[...]
> Has anybody seen this? I am using version 6u6. Any suggestions on how to 
> avoid this? We have users who rely on qstat output to check whether a 
> job has finished, and in such cases they have to wait a long time 
> before they learn that something is wrong.

We are running SGE 6.0u7 on 14 dual-CPU Opteron Linux machines.

> I have been seeing this a lot lately and any help on this would be 
> really appreciated.

I haven't had enough time to track this one down yet ... sorry.

-- 
Nicolas Joly

Biological Software and Databanks.
Institut Pasteur, Paris.
