[GE users] setting up mpich2 pe + qrsh

Reuti reuti at staff.uni-marburg.de
Fri Feb 11 10:20:40 GMT 2005



Moin moin Jeroen,

Quoting Jeroen Kleijer <jeroen.kleijer at xs4all.nl>:

<snip>
> error: executing task of job 2212 failed:
> Thu Feb 10 19:35:56 CET 2005 : [ nlcftcs14 != nlcftcs13 ] &&
> /home/sge/bin/lx24-amd64/qrsh -inherit -v SGE_QMASTER_PORT nlcftcs13
> "mkdir /volumes/scratch/2212.1.batch.q"
> error: executing task of job 2212 failed:
> Thu Feb 10 19:35:56 CET 2005 : /home/sge/bin/lx24-amd64/qrsh -inherit 
> -v SGE_QMASTER_PORT nlcftcs13 "/cadappl/mpich2/1.0/bin/smpd -s -port
> 12212"

The error with the mkdir seems to be my fault, as I keep mixing up 5.3 and 6.0, 
both of which we still use. In 6.0 the directory may already exist, since TMPDIR 
contains the queue name (under 5.3 it contains part of the node name instead). 
Since MPICH2 doesn't need the directory at all, simply delete the mkdir line in 
start_proc_args and the matching removal in stop_proc_args. We only need it for 
one special application on our 5.3 cluster.
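
If the scratch directory is needed after all, one option is to create it with
"mkdir -p", which does not fail when the directory already exists. This is only
a sketch and not part of the Howto scripts; the function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper for start_proc_args (not from the original scripts):
# create a scratch directory without failing if it is already present.
make_scratch_dir() {
    # -p: create missing parents and do not complain about an existing directory
    mkdir -p "$1"
}
```

It could be called as: make_scratch_dir "$TMPDIR"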

As for starting smpd: is the directory with your MPICH2 installation also 
available on the nodes, e.g. by sharing "/cadappl" via NFS?
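
A quick way to check this is to test on each node whether the binary is there
and executable. The helper below is a hypothetical sketch, not something from
the Howto scripts:

```shell
#!/bin/sh
# Hypothetical check (name chosen for illustration): report whether an
# executable is visible at the given path on the machine where this runs.
check_executable() {
    if [ -x "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or not executable: $1" >&2
        return 1
    fi
}
```

Run on a node, e.g.: check_executable /cadappl/mpich2/1.0/bin/smpd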

Cheers - Reuti


> What particularly annoys me are the "error: executing task of job"
> messages because they don't give any clarity as to what is going wrong.
> 
> To me the command looks correct but like I said, I'm new to SGE so maybe
> I'm overlooking something.
> 
> Cheers,
> 
> Jeroen Kleijer
> 
> > 
> > > A possibility would be to do:
> > > qrsh -V -inherit -q batch.q@$node "command"
> > > but this would mean that I would have to open the batch.q queue for 
> > > interactive sessions, something I'm not looking forward to.
> > 
> > At this point you already have the queue, since your job is already running.
> > No need to specify it.
> > 
> > Cheers - Reuti
> > 
> > 
> > 
> > > Is there a possibility to do a qrsh command directly to a specified node?
> > > (and thereby defeating the purpose of SGE scheduling, I know)
> > > Or do I still have to do a "regular" rsh command, also something I'm not
> > > looking forward to.
> > > 
> > > Met vriendelijke groeten / Kind regards
> > > 
> > > Jeroen Kleijer
> > > Unix Systeembeheer
> > > Philips Applied Technologies
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > jeroen.m.kleijer+FromInterNet at philips.com
> > > 2005-02-10 03:52 PM
> > > Please respond to users
> > >  
> > >         To:     users at gridengine.sunsource.net
> > >         cc:     (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
> > >         Subject:        Re: [GE users] setting up mpich2 pe + qrsh
> > >         Classification: 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > Hi Reuti, 
> > > 
> > > My nsswitch.conf uses "services: files nis", so it should be able to
> > > use the NIS file, yet somehow it doesn't.
> > > 
> > > I liked the idea of the smpd daemon mode you described, so that was the one
> > > I wanted to go with (and still do).
> > > I'll try the idea of calculating the SGE_PORTID in the jobscript as well
> > > (overlooked that one).
> > > 
> > > Thanks for all the help so far. 
> > > 
> > > Met vriendelijke groeten / Kind regards
> > > 
> > > Jeroen Kleijer
> > > Unix Systeembeheer
> > > Philips Applied Technologies 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > Reuti <reuti at staff.uni-marburg.de> 
> > > 2005-02-10 02:23 PM 
> > > Please respond to users 
> > >         
> > >         To:        users at gridengine.sunsource.net 
> > >         cc:        (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS) 
> > >         Subject:        Re: [GE users] setting up mpich2 pe + qrsh 
> > >         Classification:         
> > > 
> > > 
> > > 
> > > 
> > > Hi there,
> > > 
> > > Quoting jeroen.m.kleijer at philips.com:
> > > 
> > > <snip>
> > > > Though my NIS configuration is correct it wouldn't use the sge_qmaster
> > > > setting provided via services (via NIS) so I had to edit settings.sh and
> > > > adjust:
> > > > unset SGE_QMASTER_PORT
> > > > unset SGE_EXECD_PORT
> > > > to
> > > > SGE_QMASTER_PORT=536 ; export SGE_QMASTER_PORT
> > > > SGE_EXECD_PORT=537 ; export SGE_EXECD_PORT
> > > 
> > > Did you adjust nsswitch.conf so that NIS is also used to get services
> > > from the NIS server?
> > > 
> > > > The qrsh messages are gone now and I'm a bit further down the road but I
> > > > do have one question left regarding your startmpi.sh script.
> > > > In this script you generate a (random) port number for the smpd processes.
> > > > How do you notify the script which you submit (after SGE has started the
> > > > pe through startmpi.sh) of the randomly generated port number?
> > > > As far as I can tell this variable is not known outside of the startmpi.sh
> > > > script so when I do 'qsub <some script>'
> > > > where <some script> has the line: mpiexec -np $NPSLOTS -p $SGE_PORTID
> > > > -machinefile $TMPDIR/machines cpi.
> > > > This fails because SGE_PORTID is not known in this script but mpiexec
> > > > needs to know at which port the smpd processes are running.
> > > 
> > > Well, first of all I wasn't sure what would happen when two users start
> > > an smpd on one node. Maybe smpd selects a port randomly on its own to
> > > avoid conflicts. That's why I stated that it's not a complete Howto:
> > > there are still some gaps in the MPICH2 documentation (the daemonless
> > > version isn't mentioned at all so far).
> > > 
> > > But then the problem would be that one user can't have two jobs in two
> > > different smpd rings on one node. Two smpds should run and listen on
> > > different ports to avoid conflicts between the two jobs. So I got the
> > > idea to calculate a port number from the job number you got. This of
> > > course has to be the same in start_proc_args, in the script which uses
> > > mpiexec, and in stop_proc_args.
> > > With SGE_PORTID=$((JOB_ID % 500 + 12000)) you can do it, as long as jobs
> > > running at the same time have job numbers less than 500 apart. This can
> > > of course be adjusted for a wider range. To be completely on the safe
> > > side, you would also need to test on all nodes beforehand whether the
> > > port is free at all.
> > > 
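
The port calculation quoted above can be wrapped in a small function that
start_proc_args, the job script and stop_proc_args all call, so the three stay
in step. This is only a sketch: both function names are hypothetical, and the
port test assumes a bash built with /dev/tcp support and only detects listeners
reachable on the loopback address of the local node.

```shell
#!/bin/bash
# Hypothetical helper: map a job number onto the port range 12000-12499,
# using the same formula as in the mail: JOB_ID % 500 + 12000.
sge_portid() {
    echo $(( $1 % 500 + 12000 ))
}

# Hypothetical and rough check whether a port is free on the local node.
# Assumption: bash with /dev/tcp support. Returns 0 if nothing accepts a
# connection on 127.0.0.1:<port>, non-zero if something is listening there.
port_is_free() {
    ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Example use inside a job script:
#   SGE_PORTID=$(sge_portid "$JOB_ID")
#   port_is_free "$SGE_PORTID" || echo "port $SGE_PORTID is busy" >&2
```

For job 2212 this yields port 12212, which matches the "-port 12212" in the
smpd command line of the error message. Job numbers 500 apart (e.g. 2212 and
2712) map to the same port, which is the limit mentioned above.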
> > > So, put the calculation of the SGE_PORTID in the script as it's done in
> > > the demo script I supplied, and it should work. If you don't like the
> > > daemons at all, you may look at the daemonless startup:
> > > 
> > > http://gridengine.sunsource.net/servlets/ReadMsg?msgId=23231&listName=users
> > > 
> > > 
> > > CU - Reuti
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > > 
> > > 
> > > 






