[GE users] setting up mpich2 pe + qrsh

jeroen.m.kleijer at philips.com
Thu Feb 10 12:41:50 GMT 2005


Hi,

I've figured out what the problem was.
The message:
 'error: getting configuration: unable to send message to qmaster using
 port 0 on host "<qmastername>": no valid port number'

comes from qrsh with the -inherit option.
Because I need the -inherit option for the PE, I checked why it failed.
The manual (which I should have consulted earlier) mentions an
SGE_QMASTER_PORT variable. I checked in my PE start script whether this
variable was set and, to my surprise, it wasn't.
Reading a little further on the web, I found that the port is either taken
from this variable, which is set when sourcing settings.[c]sh, or looked up
in the services file (which in my case is distributed via NIS).
Although my NIS configuration is correct, qrsh wouldn't use the sge_qmaster
entry provided via services (via NIS), so I had to edit settings.sh and
change:
unset SGE_QMASTER_PORT
unset SGE_EXECD_PORT
to
SGE_QMASTER_PORT=536 ; export SGE_QMASTER_PORT
SGE_EXECD_PORT=537 ; export SGE_EXECD_PORT
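
For illustration, a check along these lines in a PE start script would
show which port the SGE clients are going to see. It is only a sketch and
not part of the scripts discussed here: the function name
resolve_qmaster_port is hypothetical, and it assumes getent and awk are
available on the execution hosts.

```shell
#!/bin/sh
# Hypothetical helper: print the qmaster port visible to this script.
resolve_qmaster_port() {
    # 1. An explicitly set SGE_QMASTER_PORT wins.
    if [ -n "${SGE_QMASTER_PORT:-}" ]; then
        echo "$SGE_QMASTER_PORT"
        return 0
    fi
    # 2. Otherwise ask the services database (files, NIS, ...).
    #    getent prints e.g. "sge_qmaster  536/tcp"; keep only the number.
    port=$(getent services sge_qmaster 2>/dev/null |
           awk '{ split($2, a, "/"); print a[1]; exit }')
    if [ -n "$port" ]; then
        echo "$port"
        return 0
    fi
    # 3. Nothing found: the "port 0" situation from the error message.
    return 1
}

if port=$(resolve_qmaster_port); then
    echo "qmaster port: $port" >&2
else
    echo "SGE_QMASTER_PORT unset and no sge_qmaster service entry" >&2
fi
```

Calling this at the top of the start script, before the first qrsh, makes
a missing port visible in the job's output instead of surfacing only as
the qrsh error.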

The qrsh messages are gone now and I'm a bit further down the road, but I
do have one question left regarding your startmpi.sh script.
In this script you generate a (random) port number for the smpd processes.
How do you pass that randomly generated port number to the job script that
is submitted (and runs after SGE has started the PE through startmpi.sh)?
As far as I can tell this variable is not known outside of the startmpi.sh
script. So when I do 'qsub <somescript>', where <somescript> contains the
line:
  mpiexec -np $NPSLOTS -p $SGE_PORTID -machinefile $TMPDIR/machines cpi
it fails, because SGE_PORTID is not set in that script, while mpiexec
needs to know on which port the smpd processes are listening.

How did you tackle this problem?
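
One possible way around this, sketched here purely as an assumption and
not taken from startmpi.sh, is to stop using a random number and derive
the port from $JOB_ID, which SGE sets for the job. The PE start script and
the job script can then each compute the same value without passing
anything between them. The function name job_port and the range
20000-24999 are hypothetical choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map a job id to a port in an unprivileged range.
# The range 20000-24999 is an arbitrary choice for this sketch.
job_port() {
    echo $(( $1 % 5000 + 20000 ))
}

# JOB_ID is set by SGE inside a job; default to 0 so the sketch also
# runs outside of SGE.
SGE_PORTID=$(job_port "${JOB_ID:-0}")
export SGE_PORTID
echo "smpd port for this job: $SGE_PORTID"
```

The same lines would go into both scripts. Two jobs whose ids differ by a
multiple of 5000 would end up on the same port, so this only avoids
clashes between jobs that are reasonably close together in the queue.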

Kind regards,

Jeroen Kleijer

Date: Thu, 10 Feb 2005 12:09:38 +0100
From: Reuti <reuti at staff.uni-marburg.de>
Content-Type: text/plain; charset=ISO-8859-1
Subject: Re: [GE users] setting up mpich2 pe + qrsh


Hi,

the -inherit option tells qrsh that it's running in an already set up job
environment. It can't be used from the command line to start a job, so the
behaviour you see there is okay.

Do the output files created by your job, i.e. the pe... and po... files,
list the hosts for the parallel job properly (via the cat command in the
start procedure)? Are these the correct hostnames?

CU - Reuti

Quoting Jeroen Kleijer <jeroen.kleijer at xs4all.nl>:

> 
> Hi
> 
> I'm running SGE6.0u3.
> This is my first attempt at setting up a p.e. environment so I don't
> have any other parallel applications running with SGE.
> 
> Qrsh works properly. When run by hand it gives me a remote shell but
> running it the same way as in the script by hand gives me an error about
> the JOBID not being set.
> 
> Kind regards,
> 
> Jeroen Kleijer
> 
> On Thu, Feb 10, 2005 at 12:04:07AM +0100, Reuti wrote:
> > Hi,
> > 
> > which SGE version are you using? When you run other parallel
> > applications, the qrsh is working as it should?
> > 
> > CU - Reuti
> > 
> > Quoting Jeroen Kleijer <jeroen.kleijer at xs4all.nl>:
> > 
> > > 
> > > Hi all,
> > > 
> > > I'm setting up an MPICH2 parallel environment with tight integration
> > > according to the hints given by Reuti in post:
> > > http://gridengine.sunsource.net/servlets/ReadMsg?msgId=2291&listName=users
> > > 
> > > I compiled mpich2 (with the PGI compiler suite), created a parallel
> > > environment mpich2 which in turn runs the script startmpich2.sh as
> > > done by Reuti. (it had some minor errors in it but these were easily
> > > fixed)
> > > 
> > > The problem I'm running into at the moment is that I want to use the
> > > smpd solution provided in the post and thus, the startmpich2.sh
> > > script needs to do a qrsh to every machine in the $machines file and
> > > start a smpd daemon.
> > > 
> > > With every qrsh I run from startmpich2.sh I get the following error:
> > > 
> > > error: getting configuration: unable to send message to qmaster using
> > > port 0 on host "<qmastername>": no valid port number
> > > error:
> > > Cannot get configuration from qmaster
> > > 
> > > The qrsh command in the script looks like this:
> > > $SGE_ROOT/bin/$ARC/qrsh -V -inherit $node "/cadappl/mpich2/1.0/bin/smpd -s -port $SGE_PORTID"
> > > 
> > > It doesn't really matter what command I use instead of smpd, I've
> > > tried doing a simple mkdir /tmp/$SGE_PORTID and it gave me the same
> > > error message.
> > > 
> > > Has anyone seen this message before?
> > > 
> > > Jeroen Kleijer
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > > 
