[GE users] 6.2 sgeexecd fails to keep running: can't connect to service

Harry Mangalam harry.mangalam at uci.edu
Thu Nov 20 18:22:30 GMT 2008


thanks Reuti,

There was a DNS problem (actually several), which we have now resolved, 
I think.  Take-away messages:

1 - Make sure your DNS system is working and that all the nodes can do 
forward and reverse lookups of each other and of your qmaster. 
2 - Make sure the /etc/hosts files on all the exec nodes are configured 
identically.
3 - The problems multiply if the person administering the DNS system is 
not the same person administering the SGE system :(
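
For point 1, here is a minimal sketch of a forward-then-reverse lookup 
check that could be run on each node against the qmaster and the other 
nodes. It assumes Python is available on the nodes. The name check_host 
and its injectable resolve/reverse parameters are invented for this 
illustration and are not part of SGE. It uses the system resolver, so 
/etc/hosts entries are consulted as well as DNS.

```python
import socket


def check_host(name, resolve=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Forward-resolve `name`, reverse-resolve the result, and compare.

    Returns (ok, detail). `resolve` and `reverse` default to the system
    resolver; they are parameters only so the check can be exercised
    without a live DNS (an assumption of this sketch, not an SGE API).
    """
    try:
        addr = resolve(name)
    except OSError as exc:  # socket.gaierror / socket.herror are OSError subclasses
        return False, f"forward lookup of {name} failed: {exc}"
    try:
        rname, aliases, _ = reverse(addr)
    except OSError as exc:
        return False, f"reverse lookup of {addr} failed: {exc}"

    # Accept a match on either the full name or the short (unqualified) name.
    known = {rname.lower(), *(a.lower() for a in aliases)}
    short = {n.split(".")[0] for n in known}
    wanted = name.lower()
    if wanted in known or wanted.split(".")[0] in short:
        return True, f"{name} -> {addr} -> {rname}"
    return False, f"{name} -> {addr} -> {rname} (name mismatch)"
```

Calling check_host() on every exec node for the qmaster's name and for a 
few peer nodes, and looking at any result whose first element is False, 
should show which lookups are broken. Matching on the short name is a 
deliberately lenient choice here; a stricter comparison may be closer to 
what SGE expects.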


Thanks again.

Harry



On Tuesday 18 November 2008, reuti wrote:
> Hi,
>
> On 18.11.2008 at 21:06, Harry Mangalam wrote:
> > I have 2 subclusters (different archs) running under 6.2.  When I
> > try to start sgeexecd on subcluster bduc-i32, sgeexecd starts and
> > then fails after a minute or so.  The only message I can see is
> > in /tmp/execd_messages.nnnnn:
> >
> > 11/18/2008 11:51:32|  main|bduc-i32-16|E|can't connect to service
> > 11/18/2008 11:51:32|  main|bduc-i32-16|E|can't get configuration
> > from qmaster -- backgrounding
> >
> > the bduc-amd64 subcluster (oddly, the one 'further away' on a
> > public IP net) works fine and the output of qhost shows:
> >
> > HOSTNAME      ARCH       NCPU  LOAD MEMTOT  MEMUSE SWAPTO SWAPUS
> > ----------------------------------------------------------------
> > global        -             -     -      -       -      -      -
> > bduc-amd64-1  lx24-amd64    2  0.00   3.9G  152.0M   1.0G    0.0
> > bduc-amd64-10 lx24-amd64    2  0.00   2.0G  148.1M   1.0G    0.0
> > bduc-amd64-11 lx24-amd64    2  0.00   3.9G  149.3M   1.0G    0.0
> > bduc-amd64-12 lx24-amd64    2  0.00   3.9G  148.6M   1.0G    0.0
> > bduc-amd64-13 lx24-amd64    2  0.00   3.9G  148.4M   1.0G    0.0
> >  ...
> > bduc-i32-10   lx24-x86      2     -   2.0G       -   3.9G      -
> > bduc-i32-11   lx24-x86      2     -   4.0G       -   3.9G      -
> > bduc-i32-12   lx24-x86      2     -   4.0G       -   3.9G      -
> > bduc-i32-13   lx24-x86      2     -   4.0G       -   3.9G      -
> > bduc-i32-14   lx24-x86      2     -   4.0G       -   3.9G      -
> >
> > indicating the failure of sgeexecd to run on the i32 nodes.
>
> The internal nodes also have to contact the qmaster under its
> external name. Maybe for now they can't find the qmaster - you will
> have to set up a route from the internal nodes to the qmaster.
>
> I.e. a "ping <external_name_of_the_qmaster>" should work on the
> internal nodes.
>
> -- Reuti
>
> > Does this sound like a name resolution problem?  Or something else?
> >  No firewalls are involved AFAIK.
> >

-- 
Harry Mangalam - Research Computing, NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
---
Good judgment comes from experience; 
Experience comes from bad judgment. [F. Brooks.]

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89254
