[GE users] setting up mpich2 pe + qrsh

Reuti reuti at staff.uni-marburg.de
Thu Feb 10 13:23:49 GMT 2005

Hi there,

Quoting jeroen.m.kleijer at philips.com:

> Though my NIS configuration is correct it wouldn't use the sge_qmaster 
> setting provided via services (via NIS) so I had to edit settings.sh and 
> adjust:
> to

Did you adjust nsswitch.conf so that NIS is also used to look up services 
from the NIS server?
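
A quick way to check this is to look at the "services" line in 
nsswitch.conf. The following is only a minimal sketch: the function name 
services_uses_nis is made up for illustration, and it assumes the usual 
"database: source source ..." layout of the file.

```shell
# Hypothetical helper (not part of SGE or NIS): succeed if the "services"
# database in an nsswitch.conf-style file lists "nis" as a source.
# Assumes the usual "services: files nis" layout with whitespace separators.
services_uses_nis() {
    conf="${1:-/etc/nsswitch.conf}"
    grep -Eq '^[[:space:]]*services:([[:space:]].*)?[[:space:]]nis([[:space:]]|$)' "$conf"
}

# Usage (commented out, since the result depends on the local machine):
# services_uses_nis && echo "services are looked up via NIS"
```

If "nis" is listed there, "getent services sge_qmaster" should show the 
entry served by NIS.
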
> The qrsh messages are gone now and I'm a bit further down the road but I 
> do have one question left regarding your startmpi.sh script.
> In this script you generate a (random) port number for the smpd processes.
> How do you notify the script which you submit (after SGE has started the 
> pe through startmpi.sh) of the randomly generated port number?
> As far as I can tell this variable is not known outside of the startmpi.sh 
> script so when I do 'qsub <some script>'
> where <somescript> has the line: mpiexec -np $NPSLOTS -p $SGE_PORTID 
> -machinefile $TMPDIR/machines cpi.
> This fails because SGE_PORTID is not known in this script but mpiexec 
> needs to know at which port the smpd processes are running.

Well, first of all I wasn't sure what would happen when two users start an 
smpd on the same node. Maybe smpd selects a port randomly on its own to 
avoid conflicts. That is why I stated that it's not a complete Howto: 
there are still some gaps in the MPICH2 documentation (the daemonless version 
isn't mentioned at all up to now).

But then the problem would be that one user can't have two jobs in two 
different smpd rings on one node. Two smpds should run and listen on 
different ports to avoid conflicts between the two jobs. So I got the idea to 
calculate a port number from the job number you got. Of course, this has to be 
the same in start_proc_args, in the script which uses mpiexec, and in 
stop_proc_args. With SGE_PORTID=$((JOB_ID % 500 + 12000)) you can do it, as 
long as your job turnaround stays below 500, i.e. two jobs whose numbers differ 
by a multiple of 500 never run at the same time. The range can of course be 
widened. To be completely on the safe side, you would also need to test on all 
nodes beforehand whether the port is free at all.
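
As a minimal sketch of that calculation: the function name sge_portid is 
only made up for illustration, and the fallback job number 1234 is there 
just so the snippet can be tried outside of a job. Inside a job, SGE sets 
JOB_ID itself.

```shell
# Hypothetical helper: map an SGE job number to an smpd port, using the
# formula from above (a window of 500 ports starting at 12000).
sge_portid() {
    echo $(( $1 % 500 + 12000 ))
}

# JOB_ID is set by SGE inside a job; 1234 is only a fallback for a dry run.
SGE_PORTID=$(sge_portid "${JOB_ID:-1234}")
export SGE_PORTID
echo "smpd port for job ${JOB_ID:-1234}: $SGE_PORTID"

# Two job numbers that differ by a multiple of 500 collide:
echo "job 1234 -> $(sge_portid 1234), job 1734 -> $(sge_portid 1734)"
```

The same calculation would go into start_proc_args, the job script and 
stop_proc_args, so that "mpiexec -np $NPSLOTS -p $SGE_PORTID ..." in the job 
script uses the port the smpds were started with. The sketch does not test 
whether the port is actually free on the nodes.
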

So, put the calculation of SGE_PORTID in the script as it's done in the 
demo script I supplied, and it should work. If you don't like the daemons at 
all, you may look at the daemonless startup:


CU - Reuti
