[GE users] shadow master problems

marble martin.siegert at mdc-berlin.de
Fri Jul 23 13:58:50 BST 2010


Dear Rayson,

(I am Andreas's co-admin; he is out of the office for a few days...)

> As a test, can you restart the execd on the login2 and
> see if it can pick up the new qmaster??
Sorry, but sge_execd runs only on the compute nodes (e.g. node001), not on the master nodes "Login2" (and Login1) themselves, so I cannot restart it there. And as it is not configured there, I doubt that launching it on the master node would be helpful.

> And does qstat work at all?
After a fail-over, "qstat" talks to the shadow master (Login1) and knows about the running and pending jobs carried over from the previous qmaster (Login2). But as the qmaster on Login1 does not hear from the compute nodes about their load, no pending jobs are started. (Newly submitted jobs are accepted, though, and are launched by Login2 once we hand the qmaster back to that machine.)

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
standard@node001               BP    0/5/8          -NA-     lx24-amd64    au
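
For illustration only, here is a minimal Python sketch of how such queue instances could be picked out of "qstat -f" style output. It assumes the six-column layout shown above and the usual queue@host naming; a "u" in the states column means the qmaster has no load report from that execd. The function name and the second sample line (node002) are made up for the example.

```python
def unknown_queues(qstat_text):
    """Return queue instance names whose states column contains 'u'."""
    result = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # A queue line with a non-empty states column has six fields:
        # queuename, qtype, resv/used/tot., load_avg, arch, states.
        # Header and separator lines have no '@' in the first field.
        if len(fields) != 6 or "@" not in fields[0]:
            continue
        if "u" in fields[5]:
            result.append(fields[0])
    return result


# Sample text: the node001 line is from the output above,
# the node002 line is hypothetical.
sample = """\
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
standard@node001               BP    0/5/8          -NA-     lx24-amd64    au
standard@node002               BP    0/0/8          0.01     lx24-amd64
"""

print(unknown_queues(sample))
```

With the sample above, only standard@node001 is reported, since the node002 line has an empty states column.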


> Just wondering if it is a name resolution problem...
As quoted below: the compute nodes keep trying to report to the failed qmaster (Login2), even though the act_qmaster file contains "Login1" and both LoginX nodes are listed in /etc/hosts and NIS and are, of course, pingable.

In /opt/sge/default/spool/node001/messages
>>>> main|node001|W|can't register at "qmaster": unable to contact qmaster using port 6444 on host "login2"
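
The mismatch can be checked mechanically. The following Python sketch extracts the host from such a warning line and compares it with the content of act_qmaster. The helper names are hypothetical, the act_qmaster content is hard-coded here rather than read from the file, and comparing host names case-insensitively without the domain part is a simplification.

```python
import re


def qmaster_in_message(line):
    """Extract the host name from a 'can't register at "qmaster"' warning."""
    m = re.search(r'on host "([^"]+)"', line)
    return m.group(1) if m else None


def same_host(a, b):
    """Compare host names case-insensitively, ignoring any domain part."""
    def short(name):
        return name.strip().lower().split(".")[0]
    return short(a) == short(b)


# Warning line as found in the execd messages file on node001.
log_line = ('main|node001|W|can\'t register at "qmaster": unable to contact '
            'qmaster using port 6444 on host "login2"')

# Assumed content of the act_qmaster file after the fail-over.
act_qmaster = "Login1\n"

target = qmaster_in_message(log_line)
print(target, same_host(target, act_qmaster))
```

A result of False here means the execd is still addressing a host other than the one named in act_qmaster, which is the situation described above.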

Thank you for any further ideas.
Martin

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=269931
