[GE users] MPICH ( HP ) tight integration and qrsh looping

Reuti reuti at staff.uni-marburg.de
Wed Jan 30 19:05:36 GMT 2008


On 30.01.2008 at 18:40, Dev wrote:

>     I'm trying to force the use of the SGE rsh wrapper over qrsh to
> launch an HP-MPI program by setting up a normal tightly integrated
> mpich parallel environment. Things start up OK on the master node
> of the MPI job, but when it starts to spawn the mpid's on the slave
> nodes using qrsh instead of the normal rsh, things start going crazy!
> 1) qrsh goes to 50-60% CPU usage for some time and then finally
> returns an error, so HP-MPI complains that there is some remote
> connection problem.
> 2) While the qrsh call to spawn the mpid's was in progress, I
> manually executed the same qrsh call that the MPI application was
> issuing, in another shell on the same node (after setting $JOB_ID,
> $ARC and $SGE_TASK_ID=1 as required by qrsh -V -inherit), and found
> that the call actually worked. In fact, the application then
> continued starting up with all MPI processes launched correctly!
> 3) To dig a bit further, I turned on SGE debugging with "dl 1" and
> found that the qrsh call from the master node of the MPI process
> attempted the following:
> requesting global and node01.cluster
> to->comp_host,to->comp_name,to->comp_id,<hostname of master>/qmaster/1
> error:cl_commlib_get_endpoint_status failed:"can't find connection"
> then it says
> getting configuration: unable to connect to qmaster using port 6444  
> on host <hostname of qmaster>
> Is there something very simple that I am missing?

Which version of HP-MPI do you use?

What does your mpirun command line look like in your job script?

Which environment variables are you setting in your jobscript?
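
For the last question, something along these lines near the top of the job script would show what the job actually sees. This is only a sketch: the function name dump_sge_env is made up for illustration, and matching on the MPI_ prefix is an assumption about how the HP-MPI settings are named in your setup. JOB_ID, ARC and SGE_TASK_ID are the variables mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: list the variables that "qrsh -inherit" relies on
# (JOB_ID, ARC, SGE_TASK_ID) together with any other SGE_* and MPI_*
# settings, sorted so that two dumps can be compared with diff.
dump_sge_env() {
    # awk is used instead of grep so that an empty result is not an error.
    env | awk -F= '$1 ~ /^(JOB_ID|ARC|SGE_.*|MPI_.*)$/' | sort
}

dump_sge_env
```

Running the same function in the interactive shell where the manual qrsh call succeeded, and comparing the two outputs, should show whether the environments differ.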

-- Reuti
