[GE users] Getting mvapich tight integration working

Christian.Reissmann at Sun.COM Christian.Reissmann at Sun.COM
Wed Aug 2 12:02:22 BST 2006


Hi Daryl,

I looked at the commlib output, and it seems that all commlib-specific problems are resolved now.

From the output I can see that something is going wrong with rcmd (rcmd: socket: Permission
denied), which might be a problem in the rsh code. For further debugging you might use the
dl.csh / dl 1 or 2 options.
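
One possible reading of the rcmd message, offered as an assumption rather than something the output proves: rcmd(3) asks for a reserved source port (below 1024), and on Linux an unprivileged process is normally refused that bind, so the rsh client usually has to be installed setuid root. The Python sketch below checks both points. The rsh path is a guess modelled on the utilbin directory used elsewhere in this thread, and the function names are made up for illustration.

```python
import errno
import os
import socket
import stat

# Hypothetical path: assumes rsh sits next to the gethostbyaddr binary
# used elsewhere in this thread. Adjust for your installation.
RSH_PATH = "/opt/gridengine/utilbin/lx26-amd64/rsh"


def is_setuid_root(path):
    """Return True if path is owned by root and has the setuid bit set."""
    info = os.stat(path)
    return info.st_uid == 0 and bool(info.st_mode & stat.S_ISUID)


def reserved_port_status(first=1023, last=512):
    """Try to bind a TCP socket to a reserved port, scanning downward.

    Returns "ok" if a bind succeeded, "denied" if the kernel refused it
    for lack of privilege, or "busy" if no port in the range could be bound.
    """
    for port in range(first, last - 1, -1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("", port))
            return "ok"
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EPERM):
                return "denied"
            # Any other error (typically EADDRINUSE): try the next port.
        finally:
            sock.close()
    return "busy"


if __name__ == "__main__":
    print("reserved port bind as this user:", reserved_port_status())
    if os.path.exists(RSH_PATH):
        print(RSH_PATH, "setuid root:", is_setuid_root(RSH_PATH))
    else:
        print(RSH_PATH, "not found; edit RSH_PATH")
```

If the bind is denied for a normal user and the rsh binary is not setuid root (or sits on a filesystem mounted nosuid), that would be consistent with the error above, though other causes are possible.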

Regards,

Christian


cl_commlib_check_for_ack() [cl_commlib.c/4920] application thread  => no threads enabled
cl_com_connection_complete_request() [cl_communication.c/4472] application thread  => connection state: CL_COM_SEND_READ_GMSH
cl_com_tcp_read_GMSH() [cl_tcp_framework.c/979] application thread  => uncomplete read: couldn't read all data

rcmd: socket: Permission denied
rcmd: socket: Permission denied
rcmd: socket: Permission denied
rcmd: socket: Permission denied

cl_commlib_check_for_ack() [cl_commlib.c/4897] application thread  => message is not acknowledged: 1
cl_commlib_check_for_ack() [cl_commlib.c/4920] application thread  => no threads enabled
cl_com_connection_complete_request() [cl_communication.c/4472] application thread  => connection state: CL_COM_SEND_READ_GMSH
cl_com_tcp_read_GMSH() [cl_tcp_framework.c/979] application thread  => uncomplete read: couldn't read all data
cl_commlib_check_for_ack() [cl_commlib.c/4897] application thread  => message is not acknowledged: 1
cl_commlib_check_for_ack() [cl_commlib.c/4920] application thread  => no threads enabled



Daryl Herzmann wrote On 08/01/06 15:38,:
> Greetings,
> 
> Thanks for your help!
> 
> On Tue, 1 Aug 2006, Christian.Reissmann at Sun.COM wrote:
> 
>> It seems that the host "compute-0-19.local" can't resolve the IP
>> address of host "compute-0-22.local". Can you please check the IP
>> resolution on both hosts, so that each host can resolve the IP
>> address of the other.
> 
> 
> [akrherz at compute-0-19 ~]$ 
> /opt/gridengine/utilbin/lx26-amd64/gethostbyaddr -all 192.168.0.232
> error resolving ip "192.168.0.232": can't resolve ip address (h_errno = 
> HOST_NOT_FOUND)
> 
> 
> [akrherz at compute-0-22 ~]$ 
> /opt/gridengine/utilbin/lx26-amd64/gethostbyaddr -all 192.168.0.235
> error resolving ip "192.168.0.235": can't resolve ip address (h_errno = 
> HOST_NOT_FOUND)
> 
> 
> So I tried the /opt/gridengine/default/common/host_aliases file again 
> and restarted sgemaster and sgeexecd everywhere
> 
> [akrherz at compute-0-19 common]$ grep  192.168.0.232 host_aliases
> compute-0-22.local 192.168.0.232
> compute-0-22 192.168.0.232
> c0-22 192.168.0.232
> 
> [akrherz at compute-0-19 common]$ 
> /opt/gridengine/utilbin/lx26-amd64/gethostbyaddr -all 192.168.0.232
> error resolving ip "192.168.0.232": can't resolve ip address (h_errno = 
> HOST_NOT_FOUND)
> 
> So then I tried adding all of the compute nodes to /etc/hosts on all the 
> cluster nodes and it will resolve
> 
> [root at compute-0-19 ~]# /opt/gridengine/utilbin/lx26-amd64/gethostbyaddr 
> -all 192.168.0.232
> Hostname: compute-0-22.local
> SGE name: compute-0-22.local
> Aliases:  compute-0-22
> Host Address(es): 192.168.0.232
> 
> So I tried my test code again and got fewer errors, but still not working.
> 
> http://mesonet.agron.iastate.edu/pickup/mvapich_debug2.txt.gz
> 
> The firewall is off on the cluster nodes.
> 
> Ideas?  thanks!
>    daryl
> 
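
The reverse lookups in the quoted message can be repeated from a script on each node. This is a minimal Python sketch, not part of Grid Engine: it calls the standard library's socket.gethostbyaddr, so it reflects the system resolver (/etc/hosts, DNS) and may not match what the SGE gethostbyaddr utility reports. The names reverse_lookup, report, and NODE_IPS are illustrative. As far as I know, host_aliases only maps alternative host names to a primary name, which would fit the observation that adding IP addresses there did not change the result.

```python
import socket

# Addresses taken from the thread; replace with your own nodes.
NODE_IPS = ["192.168.0.232", "192.168.0.235"]


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Resolve ip to a name; return a dict, or None if the lookup fails.

    resolver is injectable so the function can be exercised without
    a network; by default it is the system resolver.
    """
    try:
        hostname, aliases, addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return None
    return {
        "hostname": hostname,
        "aliases": list(aliases),
        "addresses": list(addresses),
    }


def report(ips, resolver=socket.gethostbyaddr):
    """Return one human-readable line per address."""
    lines = []
    for ip in ips:
        result = reverse_lookup(ip, resolver)
        if result is None:
            lines.append(f"{ip} -> cannot resolve")
        else:
            lines.append(f"{ip} -> {result['hostname']}")
    return lines
```

Running `print("\n".join(report(NODE_IPS)))` on every node should show whether each host can resolve the others; any "cannot resolve" line points at a missing /etc/hosts or DNS entry on that node.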

-- 
Christian Reissmann    Tel: +49 (0)941 3075 112  mailto:crei at sun.com
Software Engineer      Fax: +49 (0)941 3075 222  http://www.sun.com/gridengine
Sun Microsystems GmbH, Dr.-Leo-Ritter-Str. 7,
D-93049 Regensburg,    Tel: +49 (0)941 3075 0
