[GE users] sge jobs when a node crashes

Iwona Sakrejda isakrejda at lbl.gov
Mon May 1 21:07:42 BST 2006



Jinal Jhaveri wrote:
> Hi All,
> 
> Recently I have been seeing a situation where a node on which a job is 
> running crashes, but the job is still shown in "r" state in qstat.
> 

I think this is even documented somewhere, and it's a feature.
Until the qmaster can contact the node and confirm the job's status,
for all it knows the job is still 'running'.
What's more, if you try to qdel the job, it will be stuck in a "dr"
state.

When you reboot the node, the problem job goes away, but sometimes
the node develops hardware problems and you cannot bring it back
right away. In that case I do qdel -f for the jobs associated with it,
so that users know the job is gone.
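
A minimal sketch of how the stuck jobs could be found, assuming you save the
text output of qhost and qstat and feed it to the two helpers below. The
function names are hypothetical, and the parsing assumes the eight-column
qhost layout quoted further down in this message (a "-" in the load column
for a host that is not reporting) and a queue instance printed as queue@host.
It only lists candidates; it does not delete anything.

```python
def down_hosts(qhost_text):
    """Return short host names whose load column is '-' in qhost output.

    Assumed columns: HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS.
    """
    hosts = set()
    for line in qhost_text.splitlines():
        fields = line.split()
        # Skip headers, separators, and the 'global' pseudo-host.
        if len(fields) != 8 or fields[0] == "global":
            continue
        if fields[3] == "-":
            hosts.add(fields[0].split(".")[0])
    return hosts


def jobs_on_hosts(qstat_text, hosts):
    """Return (job_id, queue_instance) pairs for jobs placed on `hosts`.

    qstat may truncate long host names, so only the short name (the part
    before the first dot) is compared.
    """
    jobs = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job lines start with a numeric job ID; skip everything else.
        if not fields or not fields[0].isdigit():
            continue
        for field in fields:
            if "@" in field:
                short_host = field.split("@", 1)[1].split(".")[0]
                if short_host in hosts:
                    jobs.append((fields[0], field))
                break
    return jobs
```

The job IDs it returns can then be reviewed by hand and passed to qdel -f.
If qstat cuts a host name off before the first dot, the match will miss that
job, so treat the result as a starting point rather than a complete list.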

Iwona

> Surprising thing is that when I issue a "qhost" command, SGE correctly 
> thinks that the node is not available.
> 
> 
> Here is an example:
> 
> 
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
> 2101069 0.55500 meso_BNC1_ xzhao        r     05/01/2006 12:15:38 all.q@node64t-01.jgi-psf.org       1
> 2048906 0.55500 submit_sge htu          r     04/25/2006 06:36:36 assem-euk-high@node64t-69.jgi-     1
> 
> 
> output of qhost:
> 
> ..........
> ..........
> 
> 
> 
> node64t-68              lx24-amd64      2  0.02    3.9G  117.8M   21.3G    7.8M
> node64t-69              lx24-amd64      2     -    3.9G       -   21.3G       -
> node64t-70              lx24-amd64      2  0.02    3.9G   99.8M    5.3G     0.0
> 
> 
> Also I have manually checked that the node is indeed down:
> 
> [jjhaveri@node64t-00 ~]$ ssh node64t-69
> ssh: connect to host node64t-69 port 22: No route to host
> 
> Has anybody seen this? I am using version 6u6. Any suggestions on how to 
> avoid this? We have users who rely on qstat output to check whether a 
> job has finished, and in such cases they have to wait a long time 
> before they learn that something is wrong.
> 
> 
> I have been seeing this a lot lately and any help on this would be 
> really appreciated.
> 
> 
> 
> thank you
> --Jinal
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 




