[GE users] MPICH ( HP ) tight integration and qrsh looping
dev_hyd2001 at yahoo.com
Wed Jan 30 17:40:54 GMT 2008
I'm trying to force the use of the SGE rsh wrapper around qrsh to launch an HP-MPI program by setting up a normal tightly integrated MPICH parallel environment. Things start up OK on the master node of the MPI job, but when it starts to spawn the mpids on the slave nodes using qrsh instead of the normal rsh, things go wrong:
1) qrsh sits at 50-60% CPU usage for some time and then finally returns an error, so HP-MPI complains about a remote connection problem.
2) While the qrsh call to spawn the mpids was in progress, I manually executed the same qrsh call that the MPI application was issuing, in another shell on the same node (after setting $JOB_ID, $ARC and $SGE_TASK_ID=1 as required by qrsh -V -inherit), and found that the call actually worked. In fact, the application then continued starting up with all MPI processes launched correctly!
3) To dig a bit further, I turned on SGE debugging with "dl 1" and found that the qrsh call from the master node of the MPI job attempted the following:
requesting global and node01.cluster
to->comp_host,to->comp_name,to->comp_id,<hostname of master>/qmaster/1
error:cl_commlib_get_endpoint_status failed:"can't find connection"
Then it says:
getting configuration: unable to connect to qmaster using port 6444 on host <hostname of qmaster>
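Since the last message points at the connection to the qmaster, one rough check is whether a plain TCP connection to that port succeeds from inside the job environment as well as from an interactive shell. The following is only a minimal sketch: the function name `can_connect` is made up for illustration, port 6444 is taken from the error message above, and reading `SGE_QMASTER_PORT` from the environment assumes that variable is set on the node. A successful TCP connect does not prove that the SGE communication layer itself is healthy; it only rules out basic reachability problems.

```python
import os
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened, else False."""
    try:
        # create_connection resolves the name and applies the timeout for us
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name resolution failures
        return False


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python check_qmaster.py <qmaster host>
    qmaster_host = sys.argv[1]
    # Assumption: fall back to 6444, the port shown in the error message
    qmaster_port = int(os.environ.get("SGE_QMASTER_PORT", "6444"))
    ok = can_connect(qmaster_host, qmaster_port)
    print(f"{qmaster_host}:{qmaster_port} reachable: {ok}")
```

Running it once from the job script and once from a login shell on the same node would show whether the two environments differ in what they can reach.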
Am I missing something very simple?
On the other hand, using normal MPICH and my own MPI test program, I am able to run it tightly integrated without problems.
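For reference, the manual test from point 2 can be sketched roughly as below. This is an illustration under assumptions, not the exact call HP-MPI issues: the helper name `build_qrsh_call`, the job ID 12345, the architecture string and the remote command `hostname` are placeholders, and only the variables named above ($JOB_ID, $ARC, $SGE_TASK_ID) and the `qrsh -V -inherit` form come from the description. The sketch only builds and prints the command; it does not run qrsh.

```python
import os
import shlex


def build_qrsh_call(job_id, task_id, arc, host, remote_cmd):
    """Return (env, argv) for a manual `qrsh -V -inherit` test."""
    env = dict(os.environ)            # start from the current shell environment
    env["JOB_ID"] = str(job_id)       # ID of the already running parallel job
    env["SGE_TASK_ID"] = str(task_id)
    env["ARC"] = arc                  # architecture string, as set in point 2
    argv = ["qrsh", "-V", "-inherit", host] + list(remote_cmd)
    return env, argv


if __name__ == "__main__":
    # All values below are placeholders for illustration only
    env, argv = build_qrsh_call(12345, 1, "lx24-amd64", "node01.cluster", ["hostname"])
    print(shlex.join(argv))
```

Passing the returned `env` and `argv` to `subprocess.run(argv, env=env)` on the master node would repeat the manual test; comparing that environment with the one the rsh wrapper sees inside the job may show what differs between the working and the failing call.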
Any ideas? Anyone?