[GE users] sge jobs when a node crashes

Jinal Jhaveri JAJhaveri at lbl.gov
Mon May 1 20:49:24 BST 2006

Hi All,

Recently I have been seeing a situation where a node crashes while a job is 
running on it, but the job is still shown in the "r" state in qstat.

The surprising thing is that when I issue a "qhost" command, SGE correctly 
reports the node as unavailable.

Here is an example from qstat:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
2101069 0.55500 meso_BNC1_ xzhao        r     05/01/2006 12:15:38 all.q@node64t-01.jgi-psf.org       1
2048906 0.55500 submit_sge htu          r     04/25/2006 06:36:36 assem-euk-high@node64t-69.jgi-     1

output of qhost:


node64t-68              lx24-amd64      2  0.02    3.9G  117.8M   21.3G    7.8M
node64t-69              lx24-amd64      2     -    3.9G       -   21.3G       -
node64t-70              lx24-amd64      2  0.02    3.9G   99.8M    5.3G     0.0

Also I have manually checked that the node is indeed down:

[jjhaveri@node64t-00 ~]$ ssh node64t-69
ssh: connect to host node64t-69 port 22: No route to host

Has anybody seen this? I am using version 6u6. Any suggestions on how to 
avoid it? We have users who rely on qstat output to check whether a job 
has finished, and in cases like this they wait a very long time before 
they realize that something is wrong.
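
For illustration, the cross-check I currently do by hand could be sketched 
roughly as below. This is only a sketch under assumptions, not an SGE 
feature: the function names are made up, it assumes the plain-text qstat and 
qhost layouts shown above (one job or host per line), it treats a "-" in the 
qhost LOAD column as "host not reporting", and it compares short host names 
because qstat truncates the queue column.

```python
def down_hosts(qhost_text):
    """Return short names of hosts whose LOAD column in qhost output is '-'."""
    down = set()
    for line in qhost_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, the 'global' pseudo-host
        # and any line too short to carry a LOAD column.
        if len(fields) < 4 or fields[0] in ("HOSTNAME", "global") or line.startswith("-"):
            continue
        if fields[3] == "-":  # assumption: '-' means the host is not reporting load
            down.add(fields[0].split(".")[0])
    return down


def running_jobs(qstat_text):
    """Return (job_id, user, short_host) for jobs whose state contains 'r'."""
    jobs = []
    for line in qstat_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header, separator or blank line
        # The queue column looks like queue@host; pending jobs have none.
        queue = next((f for f in fields if "@" in f), None)
        if queue is None or "r" not in fields[4]:
            continue
        # qstat may truncate the host name, so keep only the part before the first dot.
        host = queue.split("@", 1)[1].split(".")[0]
        jobs.append((fields[0], fields[3], host))
    return jobs


def stale_jobs(qstat_text, qhost_text):
    """Jobs that qstat shows as running on hosts that qhost shows as down."""
    down = down_hosts(qhost_text)
    return [job for job in running_jobs(qstat_text) if job[2] in down]
```

Fed the captured text of the two commands above, stale_jobs() would list job 
2048906 on node64t-69. It only reports the mismatch; it does not fix the 
job state in SGE.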

I have been seeing this a lot lately and any help on this would be 
really appreciated.

Thank you.
