[GE users] sge_execd says it starts but it doesn't start

futurity neil at futurity.co.uk
Tue Apr 27 16:56:31 BST 2010


Hi again,

Looking through the strace in more detail, it does eventually find the shared libraries locally in the "/lib" directory.

For example: /lib/libdl.so.2

It continues to try loading the libraries from the $SGE_ROOT tree and then finding them in "/lib" until it gets to:

open("/rmt/sge62/locale/en/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)

It can't find this file and doesn't try to look for it anywhere else.

The same happens with:

stat64("/rmt/sge62/default/common/sgeCA/rand.seed", 0xbfcfffd0) = -1 ENOENT (No such file or directory)

Again, it doesn't try to find this file anywhere else.

Then I get a long list of "Bad file descriptor" errors, for example:
close(916)                              = -1 EBADF (Bad file descriptor)
close(917)                              = -1 EBADF (Bad file descriptor)
close(918)                              = -1 EBADF (Bad file descriptor)
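
A run of close() calls failing with EBADF is what one would expect from a process that closes every possible descriptor number while turning itself into a daemon, whether or not each number is actually open. The following is a minimal Python sketch of that pattern, for illustration only. It is not sge_execd's code, and the helper name close_range_blindly is made up here.

```python
import errno
import os


def close_range_blindly(first, last):
    """Try to close every descriptor number in [first, last].

    Returns the numbers for which close() failed with EBADF,
    i.e. the numbers that were not open in the first place.
    """
    not_open = []
    for fd in range(first, last + 1):
        try:
            os.close(fd)
        except OSError as exc:
            # Anything other than "bad file descriptor" is a real error.
            if exc.errno != errno.EBADF:
                raise
            not_open.append(fd)
    return not_open
```

Calling close_range_blindly(916, 918) in a process that has none of those descriptors open would produce three EBADF failures like the ones in the trace, and they would be harmless.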

Then comes the list of missing locale files that I included in my second email, and then it exits.

Is this strace useful, or a red herring?
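
One way to judge that is to tally the failing calls by type and look at the final exit status. The sketch below is a rough Python helper for doing so; the function name summarize and the regular expressions are assumptions written for the line format shown in this thread, not a general strace parser.

```python
import re
from collections import Counter

# Matches lines such as:  close(916)   = -1 EBADF (Bad file descriptor)
LINE_RE = re.compile(
    r'^(?P<call>\w+)\((?P<args>.*)\)\s*=\s*(?P<ret>-?\d+)'
    r'(?:\s+(?P<errno>E[A-Z]+))?'
)
# Matches the final line:  exit_group(0)
EXIT_RE = re.compile(r'^exit_group\((?P<code>\d+)\)')


def summarize(trace_lines):
    """Count failed calls per (syscall, errno) and return the exit code."""
    errors = Counter()
    exit_code = None
    for line in trace_lines:
        line = line.strip()
        exit_match = EXIT_RE.match(line)
        if exit_match:
            exit_code = int(exit_match.group("code"))
            continue
        match = LINE_RE.match(line)
        if match and match.group("errno"):
            errors[(match.group("call"), match.group("errno"))] += 1
    return errors, exit_code
```

If the summary shows only ENOENT and EBADF failures followed by exit_group(0), the traced process itself exited cleanly. One possible reading is that this was a parent process exiting after a fork, in which case the interesting activity would be in a child, which strace only follows when run with -f.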

Kind Regards

Neil

On 27 April 2010 16:42, Neil Baker <neil at futurity.co.uk> wrote:
Hi,

Further to my last posting, I've run a strace and have seen the following output:

access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/tls/i686/sse2/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/tls/i686/sse2", 0xbfd002e8) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/tls/i686/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/tls/i686", 0xbfd002e8) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/tls/sse2/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/tls/sse2", 0xbfd002e8) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/tls/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/tls", 0xbfd002e8) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/i686/sse2/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/i686/sse2", 0xbfd002e8) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/i686/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/i686", 0xbfd002e8) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/sse2/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/sse2", 0xbfd002e8) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)

It appears that the tls, i686 and sse2 directories are missing from $SGE_ROOT/lib/lx24-x86, as is the libdl.so.2 shared library.  Have I missed something from the install, or are these optional?
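
The failed opens above follow a regular pattern: every combination of the tls, i686 and sse2 subdirectories is tried under one library directory, most specific first, ending with the directory itself. That is consistent with the dynamic loader probing optional hardware-capability subdirectories before moving on to the next search location, so the missing directories may well be normal. The Python sketch below reproduces the order seen in the trace. It is an illustration of that order, not the loader's actual implementation, and the function name loader_candidates is hypothetical.

```python
from itertools import product


def loader_candidates(base_dir, hwcaps, soname):
    """List the paths probed for `soname` under `base_dir`.

    Every subset of `hwcaps` is tried, most specific first, keeping the
    original order of the names inside each subdirectory path.
    """
    paths = []
    # product((True, False), ...) counts down from "all on" to "all off".
    for mask in product((True, False), repeat=len(hwcaps)):
        parts = [cap for cap, keep in zip(hwcaps, mask) if keep]
        paths.append("/".join([base_dir, *parts, soname]))
    return paths


if __name__ == "__main__":
    base = "/rmt/sge62/bin/lx24-x86/../../lib/lx24-x86"
    for path in loader_candidates(base, ["tls", "i686", "sse2"], "libdl.so.2"):
        print(path)
```

With three capability names this yields eight candidate paths, matching the eight open() calls in the trace.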

On the execution host, "uname -a" gives:

Linux stg-zoom1 2.6.22.19-0.4-bigsmp #1 SMP 2009-08-14 02:09:16 +0200 i686 i686 i386 GNU/Linux

The strace also reports problems opening locale files.  sge_execd exits after these messages are reported:

open("/usr/share/locale-langpack/en_GB.UTF-8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/locale/en_GB.UTF-8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-bundle/en_GB.UTF-8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_GB.utf8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/locale/en_GB.utf8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-bundle/en_GB.utf8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_GB/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/locale/en_GB/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-bundle/en_GB/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en.UTF-8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/locale/en.UTF-8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-bundle/en.UTF-8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en.utf8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/locale/en.utf8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-bundle/en.utf8/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/rmt/sge62/locale/en/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-bundle/en/LC_MESSAGES/lx24-x86/gridengine.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
futex(0x81e3768, FUTEX_WAKE, 2147483647) = 0
close(3)                                = 0
exit_group(0)
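
The eighteen failed opens above look like an exhaustive fallback search for a message catalog: each locale name, from en_GB.UTF-8 down to plain en, is tried in three directories. A missing gettext catalog normally just means the untranslated built-in messages are used, so these failures alone need not be the reason for the exit. The Python sketch below regenerates the same list. The directory names and the "lx24-x86/gridengine" domain are copied from the trace, while the function names are hypothetical and the fallback rule is inferred from the observed order rather than taken from any library source.

```python
def locale_variants(locale_name):
    """Expand e.g. 'en_GB.UTF-8' into progressively less specific names."""
    lang_terr, _, codeset = locale_name.partition(".")
    language = lang_terr.split("_")[0]
    codesets = []
    if codeset:
        codesets.append(codeset)
        normalized = codeset.replace("-", "").lower()
        if normalized != codeset:
            codesets.append(normalized)
    variants = []
    # dict.fromkeys removes the duplicate when there is no territory part.
    for base in dict.fromkeys([lang_terr, language]):
        variants.extend(f"{base}.{cs}" for cs in codesets)
        variants.append(base)
    return variants


def catalog_paths(locale_name, dirs, domain):
    """List the .mo files probed, in the order seen in the trace."""
    return [
        f"{directory}/{variant}/LC_MESSAGES/{domain}.mo"
        for variant in locale_variants(locale_name)
        for directory in dirs
    ]
```

For the locale en_GB.UTF-8 and the three directories from the trace, catalog_paths returns eighteen paths in the same order as the open() calls above.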

Kind Regards

Neil


On 27 April 2010 16:29, Neil Baker <neil at futurity.co.uk> wrote:
Hi,

I'm in the process of installing a new grid with the aim of migrating machines from our 6.1 grid to 6.2u5.

Unfortunately the sge_execd process doesn't seem to start on our execution host machines.

The qmaster installed without any problems (on openSUSE 10.3, 32-bit) and when started using "/etc/init.d/sgemaster.p6444 start" the process works fine.  qstat, qhost etc. all work fine.

The sge_execd installed without any problems (again on openSUSE 10.3, 32-bit) and when started using "/etc/init.d/sgeexecd.p6444 start" it says it started, but the process just isn't running.  qhost lists the new execution host, but shows dashes against it instead of the expected details.

I've even tried running "/rmt/sge62/bin/lx24-x86/sge_execd" as user sgeadmin62 (with the correct environment) and no errors are reported, but again the process isn't running.

The only non-default value used during the sge_execd install was the spool directory, for which I entered "/local".  I had previously created the directory "/local" on the local disk and set its permissions to 777 (still owned by root).  Again the installer said this was fine, but sge_execd didn't actually create any subdirectories or log any messages to files within it (either during the install stage or while being run).

Any idea what could be going on?  Is there a way to turn on debugging for sge_execd so I can see what it's doing?

Kind Regards

Neil




