[GE users] problem with install a new SGE execution host

Chunyan Wang wangch at cpsc.ucalgary.ca
Thu Feb 17 18:54:21 GMT 2005


The new execd is still not running. This is the result on the execution host:

(wangc)$ ps -f | grep execd
   wangc  1024   415  0 14:43:26 pts/2    0:00 grep execd
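
Note that "ps -f" without "-e" lists only processes that belong to the invoking user on the current terminal, so an execd started by root would not show up in that output. A minimal sketch of a check across all users follows; the helper name proc_running is made up for illustration and is not part of SGE:

```shell
# Hypothetical helper (not an SGE tool): succeed if any process on the
# host, owned by any user, has a command name equal to $1.
proc_running() {
    # -e selects every process; -o comm= prints only the command name
    # without a header. Some systems print a full path, so strip it.
    ps -e -o comm= | sed 's|.*/||' | grep -Fxq -- "$1"
}

if proc_running sge_execd; then
    echo "sge_execd is running"
else
    echo "sge_execd is not running"
fi
```

Matching the exact command name also avoids the usual false hit on the grep process itself.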

I also ran "prstat" on the host and don't see execd running there either,
but I can submit jobs to the other execution hosts. The queue on this host
is in the "au" state. I know what the "au" state means, but I don't know
how to fix it. I searched the discussion list, and someone suggested using
a local spool directory configuration. Are there any other ways to fix
this problem? Could anyone help me with this?
Thanks,
Joyce
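
For reference, the reply quoted below notes that the two port numbers only need to avoid clashing with existing services, for example in /etc/services. A minimal sketch of how one might look up what a services-format file assigns is shown here. The helper names service_port and port_users and the sample file are assumptions made for illustration; the values 536 and 537 are the ones used in this thread.

```shell
# Hypothetical helper: print the TCP port that a services-format file
# (default /etc/services) assigns to the service named $1.
service_port() {
    awk -v svc="$1" '
        $1 == svc && $2 ~ /\/tcp$/ { sub(/\/tcp$/, "", $2); print $2; exit }
    ' "${2:-/etc/services}"
}

# Hypothetical helper: list every service name bound to TCP port $1,
# which makes a clash with an existing service visible.
port_users() {
    awk -v port="$1" '$2 == (port "/tcp") { print $1 }' "${2:-/etc/services}"
}

# Demonstration on a sample file instead of the real /etc/services.
sample=$(mktemp)
printf 'sge_qmaster 536/tcp\nsge_execd 537/tcp\n' > "$sample"
for svc in sge_qmaster sge_execd; do
    echo "$svc: $(service_port "$svc" "$sample")"
done
rm -f "$sample"
```

Called without a second argument, both helpers read /etc/services on the local host. Comparing the port printed for sge_execd with the port passed to qping is one way to confirm that the right daemon is being probed.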

Charu Chaubal wrote:

>The numbers used for these two ports can be anything, as long as they
>don't conflict with any existing service (both in /etc/services and
>possibly in other places, such as NIS).
>
>Even the "defaults" 535/536/537 are not official, since they are not
>registered.
>
>Regards,
>	Charu
>
>
>Chunyan Wang wrote:
>
>>In the chapter "Installing the Grid Engine Software Interactively",
>>the SGE 6.0 documentation says to use 536/tcp for sge_qmaster and
>>537/tcp for sge_execd.
>>
>>Joyce
>>
>>Tim Harsch wrote:
>>
>>>Is there a reason why you've chosen 536 for your master and 537 for
>>>your execd?  The defaults are 535 and 536, respectively...
>>> 
>>>----- Original Message -----
>>>
>>>    From: Chunyan Wang <wangch at cpsc.ucalgary.ca>
>>>    To: users at gridengine.sunsource.net
>>>    Sent: Wednesday, February 16, 2005 4:14 PM
>>>    Subject: Re: [GE users] problem with install a new SGE execution host
>>>
>>>    The execd was running on host sge-a yesterday after the installation,
>>>    but now it is not. The result looks like this:
>>>
>>>    (wangc)$ ps -eaf |grep sge_execd
>>>       wangc   972   415  0 20:04:57 pts/2    0:00 grep sge_execd
>>>
>>>    [ sge-a:/opt/n1ge6/utilbin/sol-sparc64 ]
>>>    (wangc)$
>>>
>>>    I ran "qping -info sge-a 536 execd 1" on the master host to check,
>>>    and got this result:
>>>
>>>    coe01:/export/data/web/moby/cgi-bin 195 % qping -info sge-a 536 execd 1
>>>    endpoint sge-a/execd/1 at port 536: can't find connection
>>>
>>>    I also tried "telnet master 536", with this result:
>>>
>>>    [ sge-a:/export/home/wangc/load-sensors ]
>>>    (wangc)$ telnet coe01.ucalgary.ca 536
>>>    Trying 136.159.169.6...
>>>    Connected to coe01.
>>>    Escape character is '^]'.
>>>
>>>    So, port 536 is open, but I don't know why execd on sge-a cannot
>>>    connect to the master host. Could anyone tell me what I need to
>>>    check next?
>>>
>>>    Thanks,
>>>
>>>    Joyce
>>>
>>>    Tim Harsch wrote:
>>>
>>>>Find and kill all sge_execd processes on that host, rerun
>>>>$SGE_ROOT/default/common/sgeexecd as root, and verify that it starts by
>>>>grepping ps.
>>>>
>>>>----- Original Message ----- 
>>>>From: "Chunyan Wang" <wangch at cpsc.ucalgary.ca>
>>>>To: <users at gridengine.sunsource.net>
>>>>Sent: Wednesday, February 16, 2005 12:19 PM
>>>>Subject: [GE users] problem with install a new SGE execution host
>>>>
>>>>
>>>>>Hi all,
>>>>>I have sge6.3 running and want to add host sge-a as another execution
>>>>>host. I ran the install_execd script on sge-a; $SGE_ROOT is shared
>>>>>across all hosts. I created a queue for sge-a, but the queue is in the
>>>>>"au" state, which means no load reports are arriving from sge-a. I
>>>>>checked, and execd is not running on sge-a. I found error messages on
>>>>>sge-a:
>>>>>[ sge-a:/tmp ]
>>>>>(wangc)$ ls
>>>>>execd_messages.300  execd_messages.571  execd_messages.637  execd_messages.699
>>>>>
>>>>>[ sge-a:/tmp ]
>>>>>(wangc)$ cat execd_messages.637
>>>>>02/15/2005 19:52:49|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>>>>02/15/2005 19:52:50|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>>>>02/15/2005 19:52:51|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>>>>02/15/2005 19:52:52|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>>>>02/15/2005 19:52:53|execd|sge-a|C|can't create execd handle for "execd" with id 1, using port 537
>>>>>
>>>>>Ports 536 and 537 are open, and I have root access on sge-a.
>>>>>I searched the discussion list and found that someone suggested using a
>>>>>local spool directory for the new execution host.
>>>>>Any suggestions about this problem?
>>>>>
>>>>>Thanks a lot!
>>>>>
>>>>>Joyce
>>>>>
>>>>>
>>>>>---------------------------------------------------------------------
>>>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>For additional commands, e-mail: users-help at gridengine.sunsource.net