[GE users] Headnode's execd timeout when submitting jobs to it

rozenblit rozenblit at gmail.com
Wed Aug 12 13:59:46 BST 2009


Hello,

I've installed a small 6-node cluster using Rocks 5.2 (with
servicepack 5.2.1), and ran ./install_execd on the headnode to allow
it to run jobs. I selected all the defaults, and I can see both
sge_qmaster and sge_execd running:


 # ps aux | grep sge
root     17423  0.0  0.0  61172   668 pts/2    S+   09:41   0:00 grep sge
sge      19673  0.0  0.0 222696  7244 ?        Sl   Aug11   0:23 /opt/gridengine/bin/lx26-amd64/sge_qmaster
sge      22592  0.0  0.0 131152  2064 ?        Sl   Aug11   0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
#


As soon as I restart execd, the headnode shows up in qstat -f:


# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/8/8          8.05     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/8/8          8.03     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/8/8          8.05     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/8/8          8.05     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/8/8          8.02     lx26-amd64
---------------------------------------------------------------------------------
all.q@jambo.local              BIP   0/0/8          0.01     lx26-amd64
#


but then, as soon as a job is submitted to "jambo" (headnode),
qmaster/messages shows this:

08/12/2009 09:45:24|worker|jambo|E|got max. unheard timeout for target "execd" on host "jambo.local", can't deliver job "844"
08/12/2009 09:45:24|worker|jambo|W|rescheduling job 844.1
08/12/2009 09:45:24|worker|jambo|E|failed delivering job 844.1
08/12/2009 09:45:24|worker|jambo|W|Skipping remaining 7 orders
08/12/2009 09:45:24|schedu|jambo|E|failed delivering job 844.1
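
In case it helps anyone reproduce this: the entries in that file look pipe-separated (timestamp, thread, host, level, message), so the error-level lines can be pulled out with a small filter. This is only a sketch; `sge_errors` is a helper name I made up, and the file argument is whatever path your qmaster messages file lives at.

```shell
# Print only error-level ("E") entries from an SGE messages file.
# Assumes the pipe-separated layout seen in the excerpt above:
#   timestamp|thread|host|level|message
# sge_errors is a hypothetical helper name, not part of SGE.
sge_errors() {
    awk -F'|' '$4 == "E"' "$1"
}
```

Called as `sge_errors qmaster/messages`, it prints the matching lines unchanged, so they can still be compared against the execd's own messages file.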

and qstat -explain a -f shows:

[...]
all.q@jambo.local              BIP   0/0/8          -NA-     lx26-amd64    au
        error: no value for "np_load_avg" because execd is in unknown state


The other nodes are working flawlessly; only the headnode has problems
with execd. I'm running out of ideas about what this could be, as there
are almost no error messages. Only rarely does a commlib error show up,
for example:

jambo/messages:
08/11/2009 21:34:09|  main|jambo|E|commlib error: endpoint is not unique error (endpoint "jambo.xxx.xxx.yy/execd/1" is already connected)

qmaster/messages:
08/11/2009 21:34:14|listen|jambo|E|commlib error: endpoint is not unique error (endpoint "jambo.xxx.xxx.yy/execd/1" is already connected)

where "jambo.xxx.xxx.yy" is the external hostname.


Sorry about the long mail, but I felt it was necessary to properly
explain this situation.

I'd really appreciate any help, as I don't know much about SGE's inner
workings. My hypothesis is that ./install_execd is somehow picking up the
external hostname during installation, but I don't know if that's correct.
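
One check I can think of for this hypothesis is to see how many ways the headnode's name is listed in the hosts file. The sketch below assumes the names come from /etc/hosts (they may come from DNS instead), and `hosts_entries_for` is just a name I made up for illustration.

```shell
# List every entry in a hosts-format file that names the given host,
# either as the bare short name or as short.domain.
# hosts_entries_for is a hypothetical helper name; defaulting to
# /etc/hosts is an assumption about where the names are resolved.
hosts_entries_for() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        /^[[:space:]]*#/ { next }        # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break     # stop at a trailing comment
                if ($i == n || index($i, n ".") == 1) { print; break }
            }
        }' "$file"
}
```

Running `hosts_entries_for jambo` and seeing both the internal "jambo.local" and the external name on different addresses would be consistent with the hypothesis, though it would not prove that install_execd is the part picking the wrong one.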

Thanks,
Fernando

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211990