[GE users] Headnode's execd timeout when submitting jobs to it

rozenblit rozenblit at gmail.com
Wed Aug 12 13:59:46 BST 2009


I've installed a small 6-node cluster using Rocks 5.2 (with Service
Pack 5.2.1) and ran ./install_execd on the headnode so that it can run
jobs as well. I accepted all the defaults, and I can see both
sge_qmaster and sge_execd running:

# ps aux | grep sge
root     17423  0.0  0.0  61172   668 pts/2    S+   09:41   0:00 grep sge
sge      19673  0.0  0.0 222696  7244 ?        Sl   Aug11   0:23
sge      22592  0.0  0.0 131152  2064 ?        Sl   Aug11   0:00

As soon as I restart execd, the headnode shows up in qstat -f:

# qstat -f
queuename                      qtype resv/used/tot. load_avg arch
all.q@compute-0-0.local        BIP   0/8/8          8.05     lx26-amd64
all.q@compute-0-1.local        BIP   0/8/8          8.03     lx26-amd64
all.q@compute-0-2.local        BIP   0/8/8          8.05     lx26-amd64
all.q@compute-0-3.local        BIP   0/8/8          8.05     lx26-amd64
all.q@compute-0-4.local        BIP   0/8/8          8.02     lx26-amd64
all.q@jambo.local              BIP   0/0/8          0.01     lx26-amd64

But then, as soon as a job is submitted to "jambo" (the headnode),
qmaster/messages shows this:

08/12/2009 09:45:24|worker|jambo|E|got max. unheard timeout for target "execd" on host "jambo.local", can't deliver job "844"
08/12/2009 09:45:24|worker|jambo|W|rescheduling job 844.1
08/12/2009 09:45:24|worker|jambo|E|failed delivering job 844.1
08/12/2009 09:45:24|worker|jambo|W|Skipping remaining 7 orders
08/12/2009 09:45:24|schedu|jambo|E|failed delivering job 844.1

and qstat -explain a -f shows:

all.q@jambo.local              BIP   0/0/8          -NA-     lx26-amd64    au
        error: no value for "np_load_avg" because execd is in unknown state
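
To pick out only the queue instances whose execd is not reporting, the
state column of qstat -f can be filtered. This is a minimal sketch, not
part of SGE: it assumes the usual layout with queue@host in the first
field and the state flags in the sixth (where "u" stands for unknown),
and unknown_queues is a made-up helper name.

```shell
# Print queue instances whose state column contains "u" (unknown).
# Reads `qstat -f` output on stdin; header, separator and job lines
# are skipped because their first field has no "@" in it.
unknown_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /u/ { print $1 }'
}

# Usage on a live cluster (needs the SGE binaries in PATH):
#   qstat -f | unknown_queues
```

Queue instances with an empty state column have only five fields, so
they never match.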

The other nodes are working flawlessly; only the headnode has problems
with execd. I'm running out of ideas about what this could be, as there
are almost no error messages. Only rarely do some commlib errors show
up, for example:

08/11/2009 21:34:09|  main|jambo|E|commlib error: endpoint is not
unique error (endpoint "jambo.xxx.xxx.yy/execd/1" is already

08/11/2009 21:34:14|listen|jambo|E|commlib error: endpoint is not
unique error (endpoint "jambo.xxx.xxx.yy/execd/1" is already

where "jambo.xxx.xxx.yy" is the external hostname.
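
Because these errors are rare, filtering the messages file may be
easier than scanning it by eye. The sketch below assumes the
pipe-separated layout seen in the excerpts above
(timestamp|thread|host|level|text). The function name sge_errors is
hypothetical, and the path in the usage comment is only the
conventional spool location, which may differ on a given install.

```shell
# Print error-level ("E") entries, optionally restricted to those that
# contain a plain substring given as the first argument.
# Reads an SGE messages file on stdin.
sge_errors() {
    awk -F'|' -v pat="${1:-}" \
        '$4 == "E" && (pat == "" || index($0, pat) > 0)'
}

# Usage (path is an assumption, adjust to the local spool directory):
#   sge_errors commlib < "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"
```

The substring match uses index() rather than a regular expression, so
characters such as "." or "/" in a hostname are taken literally.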

Sorry about the long mail, but I felt it was necessary to properly
explain this situation.

I'd really appreciate any help, as I don't know much about SGE's inner
workings. My hypothesis is that ./install_execd is somehow picking up
the external hostname during installation, but I don't know if that's
correct.
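
One way to test that hypothesis is to compare the name SGE resolves for
the local machine with the name the queue instance uses. This is a
sketch under assumptions: it relies on the gethostname helper that SGE
ships under $SGE_ROOT/utilbin/<arch>/ and on its -aname option,
names_match is a made-up helper, and jambo.local is simply the name
taken from the qstat output.

```shell
# Compare two host names case-insensitively; succeeds only if they match.
names_match() {
    a=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    b=$(printf '%s' "$2" | tr 'A-Z' 'a-z')
    [ "$a" = "$b" ]
}

expected="jambo.local"   # name qstat shows for the headnode queue instance

# The check only runs where SGE is installed; otherwise it is skipped.
if [ -n "${SGE_ROOT:-}" ] && [ -x "$SGE_ROOT/util/arch" ]; then
    arch=$("$SGE_ROOT/util/arch")
    resolved=$("$SGE_ROOT/utilbin/$arch/gethostname" -aname)
    if names_match "$resolved" "$expected"; then
        echo "SGE resolves this host as $resolved (matches)"
    else
        echo "SGE resolves this host as $resolved, expected $expected"
    fi
fi
```

If the two names differ, that would be consistent with the hypothesis,
though it would not prove it on its own.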


