[GE users] More startup oddness

Chris Dagdigian dag at sonsorol.org
Mon Sep 1 16:11:06 BST 2008


James,

First off, remove any hostnames you have associated with the loopback
address 127.0.0.1 in /etc/hosts. In your case that means taking linux6
off the 127.0.0.1 line. Grid Engine absolutely hates those entries. All
hostnames have to be associated with non-loopback IP addresses. That may
be your main issue.
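
A quick way to spot the offending entries is sketched below. This is only an illustration, not part of Grid Engine: check_loopback_names is a made-up helper name, and the list of "harmless" loopback aliases is an assumption you may need to adjust for your distribution.

```shell
#!/bin/sh
# check_loopback_names (hypothetical helper): print every hostname in a
# hosts file that is bound to a loopback address, except the usual
# localhost aliases (that alias list is an assumption).
check_loopback_names() {
    hosts_file="${1:-/etc/hosts}"
    awk '
        { sub(/#.*/, "") }                 # strip comments; fields are re-split
        $1 ~ /^127\./ || $1 == "::1" {     # loopback addresses only
            for (i = 2; i <= NF; i++)
                if ($i !~ /^(localhost|localhost\.localdomain|ip6-localhost|ip6-loopback)$/)
                    print $1, $i
        }
    ' "$hosts_file"
}

# Demo on a sample file that reproduces the line quoted in this thread.
sample=$(mktemp)
printf '127.0.0.1 localhost linux6\n' > "$sample"
check_loopback_names "$sample"    # prints: 127.0.0.1 linux6
rm -f "$sample"
```

Running check_loopback_names with no argument checks /etc/hosts itself. Any name it prints should be moved to a line carrying the machine's real, non-loopback address.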

The second issue: don't start sge_qmaster directly.

Find the "sgemaster" script in your $SGE_ROOT/$SGE_CELL/common/
directory and try running that manually as root. Before you do this, run
"ps ax | grep sge" to make sure you don't have any leftover processes
lying around.
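
Roughly, the sequence could be wrapped up like this. Treat it as a sketch under assumptions: filter_sge_procs and start_qmaster are made-up names, the daemon list is the usual 6.x set rather than something taken from your install, and "default" is only the conventional fallback cell name.

```shell
#!/bin/sh
# filter_sge_procs (hypothetical helper): read "ps ax" style output on
# stdin and print only the lines that mention a Grid Engine daemon.
# The daemon names listed here are an assumption; adjust as needed.
filter_sge_procs() {
    grep -E 'sge_(qmaster|schedd|execd|shepherd)'
}

# start_qmaster (hypothetical wrapper): run as root. Refuses to start
# while old daemons are still around, then calls the sgemaster script.
start_qmaster() {
    if [ -z "$SGE_ROOT" ]; then
        echo "SGE_ROOT is not set" >&2
        return 1
    fi
    cell="${SGE_CELL:-default}"    # "default" is an assumed cell name
    if ps ax | filter_sge_procs; then
        echo "Grid Engine processes still running; stop them first." >&2
        return 1
    fi
    "$SGE_ROOT/$cell/common/sgemaster" start
}
```

Source the file and call start_qmaster as root. If it lists any processes, deal with those before trying again.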

-Chris






On Sep 1, 2008, at 11:02 AM, James Gibbon wrote:

>
> Hi,
>
> I'm getting an unusual error on starting the grid engine service
> on the qmaster:
>
> root@linux6:~# /etc/init.d/sgemaster start
>
> sge_qmaster didn't start!
> This is not a qmaster host!
> Please, check your act_qmaster file!
>
>
> .. so to get a bit more information,
>
> root@linux6:/home/pipelines/SunGE# export SGE_ND=1
> root@linux6:/home/pipelines/SunGE# ./bin/lx24-amd64/sge_qmaster
> Reading in complex attributes.
> Reading in execution hosts.
> Reading in administrative hosts.
> Reading in submit hosts.
> Reading in host group entries:
>         Host group entries for group "@allhosts".
> Reading in usersets:
>         Userset "defaultdepartment".
>         Userset "deadlineusers".
> Reading in queues:
>         Queue "all.q".
> error: cannot recreate queue all.q from disk because of unknown host linux6
> read job database with 0 entries in 0 seconds
> Reading in users:
>         User "dan".
> qmaster hard descriptor limit is set to 8192
> qmaster soft descriptor limit is set to 8192
> qmaster will use max. 8172 file descriptors for communication
> qmaster will accept max. 99 dynamic event clients
> starting up 6.0u8
> error: commlib error: local host name error (remote destination host name "linux6" is not equal to local resolved host name "localhost")
> error: can't create job sequence number file "jobseqnum": Permission denied - delaying until next job
>
> .. it hangs at this point. The host is known as 'linux6', and this is
> what's returned by the 'hostname' command. The act_qmaster file
> contains simply 'linux6'.
>
> Why is the startup resolving the hostname as 'localhost'?
> First line of /etc/hosts is:
>
> 127.0.0.1 localhost linux6
>
> .. any suggestions?
>
> Thanks,
> James
>
>
> -- 
> System Administrator
> ------------------------
> Brain and Body Centre
> University of Nottingham
> Nottingham NG7 2RD
> +44 115 846 8255
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

