[GE users] MPICH ( HP ) tight integration and qrsh looping

Dev dev_hyd2001 at yahoo.com
Thu Jan 31 10:13:35 GMT 2008



Hi Reuti,

The version of HP-MPI is 02.02.00.02, running on Linux AMD64.

The application first invokes mpirun as follows. This is run on node 10.0.1.1:

...../lnamd64/hp/bin/mpirun -v -d -TCP -f /tmp/<HP-MPI appfile name>

The HP-MPI appfile is dynamically generated by the application and contains entries of the form:

-h <nodename> -np 1 -e PATH=<long path from the environment> -e LD_LIBRARY_PATH=<long path> -e MPI_WORKDIR=/tmp <executable program name> node -mpiw hp -pic ethernet -mport 10.0.1.1:10.0.1.1:<port_number>:0
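
For illustration, a complete appfile for a two-node run could look roughly like the following; the host names, paths and the executable name here are made up, only the shape of the entries matches what the application generates:

-h node01 -np 1 -e PATH=/opt/app/bin:/usr/bin -e MPI_WORKDIR=/tmp /opt/app/bin/solver -mpiw hp -pic ethernet
-h node02 -np 1 -e PATH=/opt/app/bin:/usr/bin -e MPI_WORKDIR=/tmp /opt/app/bin/solver -mpiw hp -pic ethernet

mpirun -f reads this file and starts the processes described by each line on the given host.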


Then it tries to launch the mpid daemons on the other nodes allotted to the job as follows:

..../sge/bin/lx24-amd64/qrsh -verbose -V -nostdin -inherit 10.0.1.2 ...../lnamd64/hp/bin/mpid 1 0 33685506 10.0.1.1 49321 23228 ..../lnamd64/hp

This is the qrsh command that loops and eventually fails with "could not get configuration from the qmaster".

On the other hand, the same call works without any problem when made with the OS-provided rsh.
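
For reference, the manual test described in point 2) of the quoted mail below amounts to something like this in a second shell on 10.0.1.1 (the JOB_ID value is only an example; the mpid arguments are copied from the qrsh call above):

export JOB_ID=4711        # ID of the running parallel job (example value)
export ARC=lx24-amd64     # architecture string expected by qrsh -inherit
export SGE_TASK_ID=1
..../sge/bin/lx24-amd64/qrsh -verbose -V -nostdin -inherit 10.0.1.2 \
  ...../lnamd64/hp/bin/mpid 1 0 33685506 10.0.1.1 49321 23228 ..../lnamd64/hp

Executed this way the call works, and the application continues starting up.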


The only environment variable that I set in my job script is:

export PATH=$TMPDIR:$PATH
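
$TMPDIR comes first so that the rsh wrapper which the tight integration places there is found before the system rsh. The parallel environment is set up along the lines of the stock SGE mpi templates; the output of qconf -sp for it looks roughly like this (the template paths depend on the installation):

pe_name           mpich
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /usr/sge/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    TRUE

startmpi.sh -catch_rsh links the rsh wrapper into $TMPDIR, and control_slaves TRUE is what allows the inherited qrsh to start the mpids on the slave nodes.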


cheers

/Dev



Reuti <reuti at staff.uni-marburg.de> wrote:

Hi,

On 30.01.2008 at 18:40, Dev wrote:

> I'm trying to force the use of the SGE rsh wrapper (on top of qrsh)
> to launch an HP-MPI program by setting up a normal tightly
> integrated mpich parallel environment. Things start up OK on the
> master node of the MPI job, but when it starts to spawn the mpids
> on the slave nodes using qrsh instead of the normal rsh, things
> start going crazy!
>
> 1) qrsh goes to 50-60 % CPU usage for some time and then finally
> returns an error, so HP-MPI complains that there is some remote
> connection problem.
>
> 2) While the qrsh call to spawn the mpids was in progress, I
> manually executed the same qrsh call as was being initiated by the
> MPI application, in another shell on the same node (after setting
> $JOB_ID, $ARC and $SGE_TASK_ID=1 as required by qrsh -V -inherit),
> and found that the call actually worked. In fact the application
> then continued starting up, with all MPI processes launched
> correctly!
>
> 3) To dig a bit further, I turned on SGE debugging with "dl 1" and
> found that the qrsh call from the master node of the MPI job
> attempted the following:
>
> requesting global and node01.cluster
> to->comp_host,to->comp_name,to->comp_id,/qmaster/1
> error:cl_commlib_get_endpoint_status failed:"can't find connection"
>
> Then it says:
>
> getting configuration: unable to connect to qmaster using port 6444
> on host 
>
> Is it something very simple that I'm missing?

Which version of HP-MPI do you use?

What does your mpirun command line look like in your job script?

Which environment variables are you setting in your job script?

-- Reuti





 
       


