[GE users] sge jobs when a node crashes

Reuti reuti at staff.uni-marburg.de
Mon May 1 21:22:12 BST 2006


Hi,

On 01.05.2006 at 21:49, Jinal Jhaveri wrote:

> Hi All,
>
> Recently I have been seeing a situation where a node on which a job
> is running crashes, but the job is still shown in "r" state in qstat.
>
> The surprising thing is that when I issue a "qhost" command, SGE
> correctly reports that the node is not available.
>
>
> Here is an example:
>
>
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
> 2101069 0.55500 meso_BNC1_ xzhao        r     05/01/2006 12:15:38 all.q@node64t-01.jgi-psf.org       1
> 2048906 0.55500 submit_sge htu          r     04/25/2006 06:36:36 assem-euk-high@node64t-69.jgi-     1

Can you try setting "reschedule_unknown" in the cluster configuration
and submitting the jobs with "-r y"? Please have a look at "man sge_conf".
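
For reference, "reschedule_unknown" is a time interval (HH:MM:SS) in the
cluster configuration, which "qconf -sconf" prints and "qconf -mconf"
edits; as far as I know, the default of 00:00:00 means jobs are not
rescheduled. Below is a minimal Python sketch, assuming the usual
"name value" layout of the "qconf -sconf" output, that checks whether
the setting is enabled and builds a rerunnable submission. The helper
names and the script name "myjob.sh" are made up for illustration.

```python
import subprocess


def reschedule_unknown_value(sconf_text):
    """Return the reschedule_unknown setting from `qconf -sconf` text, or None."""
    for line in sconf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "reschedule_unknown":
            return fields[1]
    return None


def rescheduling_enabled(sconf_text):
    """True if reschedule_unknown is set to a non-zero HH:MM:SS interval."""
    value = reschedule_unknown_value(sconf_text)
    if value is None:
        return False
    return any(int(part) != 0 for part in value.split(":"))


def rerunnable_qsub_command(script):
    """Command line that submits `script` as a rerunnable job (-r y)."""
    return ["qsub", "-r", "y", script]


def check_and_submit(script="myjob.sh"):
    """Hypothetical usage; needs qconf and qsub in PATH on a submit host."""
    sconf = subprocess.run(["qconf", "-sconf"], capture_output=True,
                           text=True, check=True).stdout
    if not rescheduling_enabled(sconf):
        print("reschedule_unknown is 00:00:00 or unset; see man sge_conf")
    subprocess.run(rerunnable_qsub_command(script), check=True)
```

Note that check_and_submit() is only defined here, not called; the
parsing helpers work on plain text and can be tried without a cluster.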

The qstat output will not change, because it may be only a temporary
problem (e.g. a network outage).
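
Since qstat keeps showing "r", users who poll it could cross-check
against qhost, which prints "-" as the load of a host it cannot reach.
A rough Python sketch of that idea follows. It assumes the default
column layouts of qhost and qstat and queue instances written as
queue@host; the function names are hypothetical, and truncated host
names are compared by their short name only.

```python
def unreachable_hosts(qhost_text):
    """Short names of hosts whose load column in qhost output is '-'."""
    down = set()
    for line in qhost_text.splitlines():
        fields = line.split()
        # Skip the header, the separator line and the 'global' pseudo host.
        if len(fields) < 4 or fields[0] in ("HOSTNAME", "global"):
            continue
        if fields[3] == "-":
            down.add(fields[0].split(".")[0])
    return down


def suspect_jobs(qstat_text, down_hosts):
    """(job id, user, host) for running jobs whose host is in down_hosts."""
    suspects = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job lines start with a numeric job id; pending jobs have no queue.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        state, queue = fields[4], fields[7]
        if "r" not in state or "@" not in queue:
            continue
        host = queue.split("@", 1)[1].split(".")[0]
        if host in down_hosts:
            suspects.append((fields[0], fields[3], host))
    return suspects
```

A job reported this way is not necessarily lost, since the host may
come back; it is only a hint that the "r" state should not be trusted.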

-- Reuti

>
> output of qhost:
>
> ..........
> ..........
>
>
>
> node64t-68              lx24-amd64      2  0.02    3.9G  117.8M   21.3G    7.8M
> node64t-69              lx24-amd64      2     -    3.9G       -   21.3G       -
> node64t-70              lx24-amd64      2  0.02    3.9G   99.8M    5.3G     0.0
>
>
> Also I have manually checked that the node is indeed down:
>
> [jjhaveri@node64t-00 ~]$ ssh node64t-69
> ssh: connect to host node64t-69 port 22: No route to host
>
> Has anybody seen this? I am using version 6u6. Any suggestions on
> how to avoid this? We have users who rely on qstat output to check
> whether a job has finished, and in such cases they have to wait
> a long time before they find out that something is wrong.
>
>
> I have been seeing this a lot lately, and any help would be
> really appreciated.
>
>
>
> Thank you
> --Jinal
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
