[GE users] error: commlib error: access denied

henk h.a.slim at durham.ac.uk
Mon Jan 11 16:08:29 GMT 2010


Hi Reuti

The problem seems to be sporadic and occurs in particular with OpenMPI 1.2.3
on InfiniBand, for example:


Fri Jan  8 17:03:10 GMT 2010
error: commlib error: access denied (server host resolves destination host "node259 " as "(HOST_NOT_RESOLVABLE)")
error: executing task of job 345487 failed: failed sending task to execd at node259: can't find connection
[node244:25302] ERROR: A daemon on node node259 failed to start as expected.
[node244:25302] ERROR: There may be more information available from
[node244:25302] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node244:25302] ERROR: If the problem persists, please restart the
[node244:25302] ERROR: Grid Engine PE job
[node244:25302] ERROR: The daemon exited unexpectedly with status 1.

and so on for each of the 16 nodes (4 slots each). In the cases that appear
to work, there are trailing messages:

[node247:12738] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[node247:12738] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at line 826
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.

--------------------------------------------------------------------------

Is there any other information to look at?
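
Since the error text shows the host name with a trailing space ("node259 ")
and reports it as not resolvable, one check that might help narrow this down
is to test name resolution for the affected nodes. The following is only a
rough sketch and has not been run on the cluster described here; the function
name check_hosts and the idea of feeding it the host names from the job's
$PE_HOSTFILE are illustrative assumptions. It only tests resolution on the
machine where it runs, so it would need to be run on the qmaster host as well
as on the execution nodes.

```python
import socket
import sys


def check_hosts(names, resolve=socket.gethostbyname):
    """Return (name, problem) pairs for host names that look suspicious.

    `resolve` is injectable so the check can be tried without a real
    resolver; by default the system resolver is used.
    """
    problems = []
    for raw in names:
        name = raw.strip()
        # A name such as "node259 " (trailing space) is flagged separately,
        # because the stripped name may still resolve fine.
        if name != raw:
            problems.append((raw, "surrounding whitespace"))
        try:
            resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems.append((raw, f"not resolvable: {exc}"))
    return problems


if __name__ == "__main__":
    # Hypothetical usage: pass the host names on the command line.
    for raw, problem in check_hosts(sys.argv[1:]):
        print(f"{raw!r}: {problem}")
```

If this reports nothing on any of the hosts while the error still appears
sporadically, that would point away from a static misconfiguration and
towards something transient in name resolution.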

Thanks

Henk



 
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 11 January 2010 15:04
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] error: commlib error: access denied
> 
> Hi,
> 
> Am 09.01.2010 um 20:29 schrieb henk:
> 
> > Sometimes the following error occurs (sge 6.1u2) on our system:
> >
> >
> > "error: commlib error: access denied (server host resolves destination host "node259 " as "(HOST_NOT_RESOLVABLE)")
> > error: executing task of job 345487 failed: failed sending task to execd at node259: can't find connection"
> >
> > I found this related issue #1400 for version 6.0u2:
> >
> > http://gridengine.sunsource.net/issues/show_bug.cgi?id=1400&historysort=new
> >
> > Is this issue still a known problem or is the error caused by
> > something else?
> >
> under what circumstances does this message appear in your case?
> 
> -- Reuti
> 
> > Any advice is greatly appreciated.
> >
> > Thanks
> >
> > Henk
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238126

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].