[GE users] Execd not starting up

Ron Chen ron_chen_123 at yahoo.com
Thu Oct 16 07:32:49 BST 2008


This looks network-setup related. How does the setup of this host differ from that of the other, working hosts?
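
One way to compare this host with a working one is to check that forward and reverse name resolution agree on each of them. Below is a minimal Python sketch of such a check. It is not part of Grid Engine; the helper names (is_loopback, short_name, check_host) are made up for illustration, and treating a loopback result as a problem is an assumption about what is worth ruling out, not something confirmed in this thread.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the address (e.g. '127.0.0.2') is in a loopback range."""
    return ipaddress.ip_address(address).is_loopback


def short_name(name):
    """Return the lower-cased host part of a name, without the domain."""
    return name.lower().split(".")[0]


def check_host(name=None):
    """Forward- and reverse-resolve a name (default: this host).

    Returns a dict with the addresses found and a list of problems.
    Hypothetical helper for illustration only.
    """
    name = name or socket.gethostname()
    report = {"name": name, "addresses": [], "problems": []}
    try:
        # IPv4 forward lookup: (canonical name, aliases, addresses)
        _, _, addresses = socket.gethostbyname_ex(name)
    except OSError as err:
        report["problems"].append(f"forward lookup failed: {err}")
        return report
    report["addresses"] = addresses
    for address in addresses:
        if is_loopback(address):
            # Assumption: other hosts cannot reach a loopback address.
            report["problems"].append(
                f"{name} resolves to loopback address {address}")
            continue
        try:
            reverse, aliases, _ = socket.gethostbyaddr(address)
        except OSError as err:
            report["problems"].append(
                f"reverse lookup of {address} failed: {err}")
            continue
        known = {short_name(n) for n in [reverse, *aliases]}
        if short_name(name) not in known:
            report["problems"].append(
                f"{address} maps back to {reverse}, not {name}")
    return report
```

Running print(check_host()) on the rebooted node and on a working node, and comparing the two reports, would show whether name resolution is where the two hosts differ.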

 -Ron

--- On Thu, 10/16/08, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> > On 15.10.2008 at 20:03, Sean Davis wrote:
> >
> >> I have a node that was rebooted and now does not start execd.  The
> >> cluster is running SGE6.2 and is openSUSE-linux-based.  With some
> >> debugging output turned on, I get:
> >>
> >> /import/cluster/sge/bin/lx24-amd64/sge_execd
> >> cl_com_read_alias_file() [cl_communication.c/2510] main
> >> => host alias file is not existing
> >
> > Are you using a host_aliases file in $SGE_ROOT/default/common? Is this
> > location shared between the nodes and the master?
> 
> Thanks, Reuti.  The answers are "no" and "yes" respectively.
> 
> Sean
> 
> >>
> >> cl_commlib_get_endpoint_status() [cl_commlib.c/6132] main
> >>  => waiting for SIRM with id 1
> >>
> >> cl_commlib_get_endpoint_status() [cl_commlib.c/6201] main
> >>  => no SRIM for SIM with id 1
> >>
> >> cl_commlib_get_endpoint_status() [cl_commlib.c/6201] main
> >>  => no SRIM for SIM with id 1
> >>
> >> cl_commlib_get_endpoint_status() [cl_commlib.c/6201] main
> >>  => no SRIM for SIM with id 1
> >>
> >> cl_commlib_get_endpoint_status() [cl_commlib.c/6183] main
> >>  => got SIRM for SIM with id: 1
> >>
> >> cl_com_tcp_open_connection_request_handler() [cl_tcp_framework.c/1771]
> >> execd_read          => select interrupted (errno=EINTR)
> >>
> >> cl_com_handle_read_thread() [cl_commlib.c/6935] execd_read          =>
> >> got select interrupt
> >>
> >> cl_com_tcp_open_connection_request_handler() [cl_tcp_framework.c/1771]
> >> execd_read          => select interrupted (errno=EINTR)
> >>
> >> cl_com_handle_read_thread() [cl_commlib.c/6935] execd_read          =>
> >> got select interrupt
> >>
> >> cl_com_tcp_open_connection_request_handler() [cl_tcp_framework.c/1771]
> >> execd_read          => select interrupted (errno=EINTR)
> >>
> >> Ad infinitum, pretty much.
> >>
> >> Any suggestions on where to look next?  There is no firewall on the
> >> machine and the hostname looks up correctly.  The qmaster is running
> >> and responding for all the other machines in the cluster.
> >>
> >> Thanks,
> >> Sean
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>


      




