[GE users] problem with install a new SGE execution host

Chunyan Wang wangch at cpsc.ucalgary.ca
Thu Feb 17 00:14:05 GMT 2005


The "execd" was running on the sge-a host yesterday after installation, but it
is not running now. The ps output looks like this:

(wangc)$ ps -eaf |grep sge_execd
   wangc   972   415  0 20:04:57 pts/2    0:00 grep sge_execd

[ sge-a:/opt/n1ge6/utilbin/sol-sparc64 ]
(wangc)$
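
As the output above shows, a plain "ps | grep" matches its own grep process, which makes the result easy to misread. A minimal sketch of a filter that counts only real sge_execd processes follows; count_execd is a hypothetical helper name, not part of Grid Engine, and it assumes "ps -ef"-style output on stdin.

```shell
# Hypothetical helper: count sge_execd processes in ps output read from stdin.
# Lines that mention grep (such as the "grep sge_execd" process itself) are skipped.
count_execd() {
    awk '/sge_execd/ && !/grep/ { n++ } END { print n + 0 }'
}
```

Usage would be something like "ps -ef | count_execd"; a result of 0 means no execd daemon is running on the host.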

I used "qping -info sge-a 536 execd 1" on the master host to check, and
got this result:

coe01:/export/data/web/moby/cgi-bin 195 % qping -info sge-a 536 execd 1
endpoint sge-a/execd/1 at port 536: can't find connection

I also tried "telnet master 536" from sge-a, with this result:

[ sge-a:/export/home/wangc/load-sensors ]
(wangc)$ telnet coe01.ucalgary.ca 536
Trying 136.159.169.6...
Connected to coe01.
Escape character is '^]'.

So port 536 on the master is reachable. But I don't know why execd on sge-a
cannot connect to the master host. Could anyone tell me what I should check
next?
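
One thing that may be worth checking: the execd_messages log quoted below names port 537, while the qping and telnet tests above used port 536. A possibility the thread does not confirm is that execd cannot bind its own port on sge-a, for example because a leftover process still holds it, or because it was not started as root (ports below 1024 need root on Unix). A minimal sketch for testing whether something is already listening on a port follows. port_listening is a hypothetical helper name; it assumes "netstat -an" output on stdin and a POSIX awk (nawk on Solaris).

```shell
# Hypothetical helper: succeed if the given port shows up as a LISTEN socket
# in `netstat -an` output read from stdin.
# $1 = port number
port_listening() {
    awk -v port="$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++) {
                # The local address ends in ".PORT" on Solaris and ":PORT" on Linux.
                if ($i ~ ("[.:]" port "$")) { found = 1; break }
            }
        }
        END { exit(found ? 0 : 1) }
    '
}
```

On sge-a this could be run as "netstat -an | port_listening 537 && echo in use". If the port is in use while no sge_execd appears in ps, some other process is holding it. Repeating the qping test with port 537 instead of 536 may also be informative.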

Thanks,

Joyce

Tim Harsch wrote:

>Find and kill all sge_execd processes on that host, rerun
>$SGE_ROOT/default/common/sgeexecd as root, and verify that it starts by grepping ps.
>
>----- Original Message ----- 
>From: "Chunyan Wang" <wangch at cpsc.ucalgary.ca>
>To: <users at gridengine.sunsource.net>
>Sent: Wednesday, February 16, 2005 12:19 PM
>Subject: [GE users] problem with install a new SGE execution host
>
>>Hi all,
>>I have sge6.3 running. I want to add another execution host, sge-a. I ran
>>the install_execd script on sge-a. $SGE_ROOT is shared across all hosts.
>>I created a queue for sge-a, and the queue is in the "au" state, which
>>means no report information has arrived from sge-a. I checked, and execd
>>is not running on sge-a. I found these error messages on sge-a:
>>[ sge-a:/tmp ]
>>(wangc)$ ls
>>execd_messages.300  execd_messages.571  execd_messages.637  execd_messages.699
>>
>>[ sge-a:/tmp ]
>>(wangc)$ cat execd_messages.637
>>02/15/2005 19:52:49|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>02/15/2005 19:52:50|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>02/15/2005 19:52:51|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>02/15/2005 19:52:52|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>02/15/2005 19:52:53|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>
>>Ports 536 and 537 are open, and I have root access on sge-a.
>>I searched the discussion list and found a suggestion to use a local spool
>>directory for the new execution host.
>>Any suggestions about this problem?
>>
>>Thanks a lot!
>>
>>Joyce
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>
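
The restart sequence suggested in the quoted reply (stop every sge_execd, rerun the startup script as root, confirm via ps) could be sketched roughly as below. execd_script and restart_execd are hypothetical helper names. The fallback to a cell named "default" is an assumption taken from the path in the reply, and pkill/pgrep are assumed to be available on the host.

```shell
# Hypothetical helper: print the path of the execd startup script.
# The cell name falls back to "default" when SGE_CELL is unset or empty.
execd_script() {
    printf '%s/%s/common/sgeexecd\n' "${SGE_ROOT:?SGE_ROOT must be set}" "${SGE_CELL:-default}"
}

# Hypothetical helper: run as root on the execution host (sge-a in this thread).
restart_execd() {
    pkill -x sge_execd               # stop leftover daemons; non-zero exit if none matched
    sleep 2                          # give them a moment to release the port
    "$(execd_script)" || return 1    # rerun the startup script
    sleep 2
    pgrep -x sge_execd >/dev/null    # exit status 0 only if a daemon is now running
}
```

Calling "restart_execd && echo execd is up" as root would then report whether the daemon stayed up. If it does not, the newest execd_messages file in /tmp is the place to look.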



