[GE users] setting up mpich2 pe + qrsh

Jeroen Kleijer jeroen.kleijer at xs4all.nl
Thu Feb 10 18:36:29 GMT 2005


On Thu, Feb 10, 2005 at 05:51:01PM +0100, Reuti wrote:
> Jeroen,
> 
> the scripts were written for bash; are you using ksh? Changing 
> shell_start_mode for the queues to unix_behavior might resolve this issue and 
> explain the typos you found.

We do indeed use the Korn-shell round here, basically because we're
coming from a Solaris environment and are gradually phasing in Linux. 

<snip>
> 
> If you specify -inherit, a hostname is allowed. Please have a look at the 
> -inherit option on the qsub man page. So the $node is not resolved as it should 
> be - do you really see the literal $ in the output? Can you also echo it and 
> check the .pe and .po files?

Sorry, apparently I didn't read the -inherit section of the manual page
as well as I should have.
The code in the startmpi.sh script looks like this:
(I'm not able to cut and paste here, as I'm working on two different
machines, so please forgive any typos)

for node in `cat $machines` ; do
  # Create the scratch directory on every node except the local one.
  [ "$myhost" != "$node" ] && $SGE_ROOT/bin/$ARC/qrsh -inherit -v \
SGE_QMASTER_PORT -v PWD $node "mkdir $TMPDIR" >> /tmp/stdout 2>&1

  # Log the command that was just run (debugging only).
  echo "`date` : [ \"$myhost\" != \"$node\" ] && $SGE_ROOT/bin/$ARC/qrsh \
-inherit -v SGE_QMASTER_PORT -v PWD $node \"mkdir $TMPDIR\" >> \
/tmp/stdout 2>&1 " >> /tmp/stdout

  # Start smpd on the node, listening on the job-specific port.
  $SGE_ROOT/bin/$ARC/qrsh -inherit -v SGE_QMASTER_PORT $node \
"/cadappl/mpich2/1.0/bin/smpd -s -port $SGE_PORTID" >> /tmp/stdout 2>&1

  # Log the smpd command as well (debugging only).
  echo "`date` : $SGE_ROOT/bin/$ARC/qrsh -inherit -v SGE_QMASTER_PORT $node \
\"/cadappl/mpich2/1.0/bin/smpd -s -port $SGE_PORTID\" >> /tmp/stdout \
2>&1 " >> /tmp/stdout
done

This is basically the same code you had, but both commands are also
echoed for debugging.
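
The echo-for-debugging pattern above can be folded into a small wrapper
that also records the exit status of each command. This is only a sketch:
the helper name run_logged and the LOG variable are hypothetical and not
part of SGE or of the startmpi.sh script, and it assumes a POSIX-style
shell (ksh or bash).

```shell
#!/bin/sh
# Hypothetical helper: log a command line, run it, then log its exit status.
# LOG defaults to the file used in the loop above; override it if needed.
LOG=${LOG:-/tmp/stdout}

run_logged() {
    # Record the command line before running it.
    echo "`date` : $*" >> "$LOG"
    # Run the command; its stdout and stderr go to the same log.
    "$@" >> "$LOG" 2>&1
    rc=$?
    # Record the exit status so a failing command is easy to spot.
    echo "`date` : exit status $rc" >> "$LOG"
    return $rc
}

# Possible use inside the loop (not executed here):
#   run_logged $SGE_ROOT/bin/$ARC/qrsh -inherit -v SGE_QMASTER_PORT $node "mkdir $TMPDIR"
```

Because the command is run through "$@", the quoted remote command is
passed to qrsh as a single argument, just as in the loop above.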

/tmp/stdout on my execution hosts shows (among other messages) the
following:

error: executing task of job 2212 failed:
Thu Feb 10 19:35:56 CET 2005 : [ nlcftcs14 != nlcftcs13 ] &&
/home/sge/bin/lx24-amd64/qrsh -inherit -v SGE_QMASTER_PORT nlcftcs13
"mkdir /volumes/scratch/2212.1.batch.q"
error: executing task of job 2212 failed:
Thu Feb 10 19:35:56 CET 2005 : /home/sge/bin/lx24-amd64/qrsh -inherit 
-v SGE_QMASTER_PORT nlcftcs13 "/cadappl/mpich2/1.0/bin/smpd -s -port
12212"

What particularly annoys me are the "error: executing task of job"
messages because they don't give any clarity as to what is going wrong.

To me the command looks correct but like I said, I'm new to SGE so maybe
I'm overlooking something.

Cheers,

Jeroen Kleijer

> 
> > A possibility would be to do:
> > qrsh -V -inherit -q batch.q@$node "command"
> > but this would mean that I would have to open the batch.q queue for 
> > interactive sessions, something I'm not looking forward to.
> 
> At this point you already have the queue, since your job is already running. 
> There is no need to specify it.
> 
> Cheers - Reuti
> 
> 
> 
> > Is it possible to run a qrsh command directly on a specified node 
> > (thereby defeating the purpose of SGE scheduling, I know)? 
> > Or do I still have to use a "regular" rsh command, also something I'm not 
> > looking forward to?
> > 
> > Met vriendelijke groeten / Kind regards
> > 
> > Jeroen Kleijer
> > Unix Systeembeheer
> > Philips Applied Technologies
> > 
> > jeroen.m.kleijer+FromInterNet at philips.com
> > 2005-02-10 03:52 PM
> > Please respond to users
> >  
> >         To:     users at gridengine.sunsource.net
> >         cc:     (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
> >         Subject:        Re: [GE users] setting up mpich2 pe + qrsh
> > 
> > Hi Reuti, 
> > 
> > My nsswitch.conf uses for "services: files nis" so it should be able to 
> > use the NIS file yet somehow it doesn't. 
> > 
> > I liked the idea of the smpd daemon mode you described so that was the one 
> > I wanted to go with (and still do). 
> > I'll try the idea of calculating the SGE_PORTID in the jobscript as well. 
> > (overlooked that one) 
> > 
> > Thanks for all the help so far. 
> > 
> > Met vriendelijke groeten / Kind regards
> > 
> > Jeroen Kleijer
> > Unix Systeembeheer
> > Philips Applied Technologies 
> > 
> > Reuti <reuti at staff.uni-marburg.de> 
> > 2005-02-10 02:23 PM 
> > Please respond to users 
> >         
> >         To:        users at gridengine.sunsource.net 
> >         cc:        (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS) 
> >         Subject:        Re: [GE users] setting up mpich2 pe + qrsh 
> > 
> > Hi there,
> > 
> > Quoting jeroen.m.kleijer at philips.com:
> > 
> > <snip>
> > > Though my NIS configuration is correct, it wouldn't use the sge_qmaster 
> > > setting provided via services (via NIS), so I had to edit settings.sh and 
> > > change:
> > > unset SGE_QMASTER_PORT
> > > unset SGE_EXECD_PORT
> > > to
> > > SGE_QMASTER_PORT=536 ; export SGE_QMASTER_PORT
> > > SGE_EXECD_PORT=537 ; export SGE_EXECD_PORT
> > 
> > Did you adjust nsswitch.conf so that NIS is also used to get services 
> > from the NIS server?
> > 
> > > The qrsh messages are gone now and I'm a bit further down the road, but I 
> > > do have one question left regarding your startmpi.sh script.
> > > In this script you generate a (random) port number for the smpd processes.
> > > How do you notify the script which you submit (after SGE has started the 
> > > pe through startmpi.sh) of the randomly generated port number?
> > > As far as I can tell this variable is not known outside of the startmpi.sh 
> > > script, so when I do 'qsub <some script>',
> > > where <some script> has the line: mpiexec -np $NPSLOTS -p $SGE_PORTID 
> > > -machinefile $TMPDIR/machines cpi,
> > > this fails because SGE_PORTID is not known in this script but mpiexec 
> > > needs to know at which port the smpd processes are running.
> > 
> > Well, first of all I wasn't sure what would happen when two users start an 
> > smpd on one node. Maybe smpd selects a port randomly on its own to 
> > avoid conflicts. Therefore I stated that it's not a complete Howto, since 
> > there are still some gaps in the MPICH2 documentation (the daemonless 
> > version isn't mentioned at all so far).
> > 
> > But then the problem would be that one user can't have two jobs in two 
> > different smpd rings on one node. Two smpds should run and listen on 
> > different ports to avoid conflicts between the two jobs. So I got the 
> > idea to calculate a port number from the job number you got. This of course 
> > has to be the same in start_proc_args, the script which uses mpiexec, and 
> > stop_proc_args. 
> > With SGE_PORTID=$((JOB_ID % 500 + 12000)) you can do it, as long as the job 
> > numbers of concurrently running jobs span fewer than 500. The range can of 
> > course be widened. To be completely on the safe side, you would also need 
> > to test on all nodes beforehand whether the port is free at all.
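
As a small illustration of the job-number formula quoted above, the
calculation can be put in a helper so that start_proc_args, the job script
and stop_proc_args all derive the same port. This is a sketch only: the
function name port_for_job and its optional range and base arguments are
hypothetical, and it does not check whether the port is actually free on
the nodes.

```shell
#!/bin/sh
# Sketch: derive the smpd port from the job number, as in the formula above.
# port_for_job is a hypothetical helper name, not part of SGE or MPICH2.
port_for_job() {
    # $1 = job id, $2 = size of the port range (default 500),
    # $3 = base port (default 12000)
    echo $(( $1 % ${2:-500} + ${3:-12000} ))
}

# JOB_ID is set by SGE inside a running job; the fallback 2212 is the job
# number from the log in this thread and is used only for a dry run.
JOB_ID=${JOB_ID:-2212}
SGE_PORTID=`port_for_job "$JOB_ID"`
export SGE_PORTID
echo "job $JOB_ID -> port $SGE_PORTID"   # dry run prints: job 2212 -> port 12212
```

Note that job numbers 500 apart map to the same port (for example 2212
and 2712 both give 12212), which is the limitation mentioned above.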
> > 
> > So, put the calculation of SGE_PORTID in the script as it's done in the 
> > demo script I supplied, and it should work. If you don't like the daemons 
> > at all, you may look at the daemonless startup:
> > 
> > http://gridengine.sunsource.net/servlets/ReadMsg?msgId=23231&listName=users
> > 
> > 
> > CU - Reuti
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
