[GE users] Re: unavailable nodes and loadleveling

Reuti reuti at staff.uni-marburg.de
Mon May 2 15:50:49 BST 2005


Hi,

Jiann-Ming Su wrote:
> On 5/2/05, Jiann-Ming Su <sujiannming at gmail.com> wrote:
> 
>>Are there other ways to get a node in the E state?
>>
> 
> 
> Here's what one of my nodes is doing to get in to E(rror):
> 
>   05/02/2005 09:56:09|qmaster|sge_admin_host|W|job 413.1 failed on
> host node10.mydomain.bogus general assumedly before job because: can't
> create directory active_jobs/413.1: No such file or directory
>   05/02/2005 09:56:09|qmaster|sge_admin_host|W|rescheduling job 413.1
>   05/02/2005 09:56:09|qmaster|sge_admin_host|E|queue all.q marked
> QERROR as result of job 413's failure at host node10.mydomain.bogus
>   05/02/2005 09:56:10|qmaster|sge_admin_host|W|scheduler tries to
> change tickets of a non running job 413 task 1
> 
> I'm guessing right now that may be an NFS issue which is probably why it
> "can't create directory..."

That may well be. An easy solution would be to use a local spool
directory on all nodes. Otherwise SGE transmits the job to the nodes,
and the nodes then write the job directory back to the file server (if
that is the same machine as the qmaster). I always put it in
/var/spool/sge (and /var/spool/sge/qmaster), set through the
corresponding entries in the SGE configuration. - Reuti
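
To see which queues the qmaster has put into the error state, log lines like the ones quoted above can be scanned with a short script. This is only a minimal sketch: it assumes each entry sits on one line with five pipe-separated fields, and the field names (timestamp, daemon, host, level, message) are my reading of the sample, not documented names. The helpers parse_line and queues_in_error are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    # Field names are assumptions inferred from the quoted sample lines.
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into five fields; return None if it does not fit."""
    parts = line.strip().split("|", 4)  # the message itself may contain '|'
    if len(parts) != 5:
        return None
    return LogEntry(*parts)


def queues_in_error(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (timestamp, queue name) for every 'marked QERROR' entry."""
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.level != "E":
            continue
        words = entry.message.split()
        # Expected shape, as in the sample: "queue <name> marked QERROR ..."
        if len(words) >= 4 and words[0] == "queue" and words[2:4] == ["marked", "QERROR"]:
            found.append((entry.timestamp, words[1]))
    return found
```

The log excerpt in the mail above is wrapped by the mail client, so wrapped entries would have to be rejoined into single lines before they are passed to queues_in_error.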

