[GE users] setting up mpich2 pe + qrsh

jeroen.m.kleijer at philips.com jeroen.m.kleijer at philips.com
Fri Feb 11 10:35:32 GMT 2005


Hi Reuti,

Sorry to keep bothering you with this but you (along with Ron Chen, 
Andreas Haas and a host of others who I can't possibly name all) seem to 
be the most active on this mailing list. (for which I am really grateful)

I checked, and $TMPDIR (which is /volumes/scratch/<jobid>.batch.q) is 
created on the starting host of the job (usually nlcftcs14). The 
directory doesn't get created on the other nodes (nlcftcs12 or 13), 
neither by SGE itself nor by the startmpi.sh script.
I'll comment out the mkdir entry.

As for MPICH2, the /cadappl directory is indeed shared via NFS and 
accessible on all systems, so I'm a bit at a loss as to where the message
"error: executing task of job <jobid> failed:"
comes from, with nothing else to go along with it. It seems to be related 
to qrsh, but I can run the same command with 'regular' rsh just fine.
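
For example, this is the kind of comparison I mean (commands taken from 
the failing job's output, quoted further below in this thread):

  # works:
  rsh nlcftcs13 "/cadappl/mpich2/1.0/bin/smpd -s -port 12212"

  # fails, printing only "error: executing task of job <jobid> failed:":
  /home/sge/bin/lx24-amd64/qrsh -inherit -v SGE_QMASTER_PORT nlcftcs13 \
      "/cadappl/mpich2/1.0/bin/smpd -s -port 12212"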

Met vriendelijke groeten / Kind regards

Jeroen Kleijer
Unix Systeembeheer
Philips Applied Technologies









Reuti <reuti at staff.uni-marburg.de>
2005-02-11 11:20 AM
Please respond to users
 
        To:     users at gridengine.sunsource.net
        cc:     (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
        Subject:        Re: [GE users] setting up mpich2 pe + qrsh
        Classification: 




Moin moin Jeroen,

Quoting Jeroen Kleijer <jeroen.kleijer at xs4all.nl>:

<snip>
> error: executing task of job 2212 failed:
> Thu Feb 10 19:35:56 CET 2005 : [ nlcftcs14 != nlcftcs13 ] &&
> /home/sge/bin/lx24-amd64/qrsh -inherit -v SGE_QMASTER_PORT nlcftcs13
> "mkdir /volumes/scratch/2212.1.batch.q"
> error: executing task of job 2212 failed:
> Thu Feb 10 19:35:56 CET 2005 : /home/sge/bin/lx24-amd64/qrsh -inherit 
> -v SGE_QMASTER_PORT nlcftcs13 "/cadappl/mpich2/1.0/bin/smpd -s -port
> 12212"

The error with the mkdir seems to be my fault, as I always mix up 5.3 and 
6.0, both of which we still use. In 6.0 the directory may already be 
there, since TMPDIR reflects the queue name (which under 5.3 contains 
part of the node name). But as you don't need it at all for MPICH2 to 
work, simply delete the creation line in start_proc_args and the removal 
in stop_proc_args. Also, we only need it for one special application on 
the 5.3 cluster.
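
For reference, the lines I mean look roughly like this (only a sketch, 
reconstructed from the error output you posted; your modified 
start_proc_args may differ in the details):

  # in start_proc_args: the per-node scratch directory creation to delete
  for node in `cut -d" " -f1 $TMPDIR/machines | sort -u`; do
      [ "`hostname`" != "$node" ] && \
          qrsh -inherit -v SGE_QMASTER_PORT $node "mkdir $TMPDIR"
  done

  # and the matching cleanup in stop_proc_args:
  #   qrsh -inherit -v SGE_QMASTER_PORT $node "rm -rf $TMPDIR"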

Regarding the start of smpd: is the directory with your MPICH2 
installation also present on the nodes, e.g. by sharing "/cadappl" via 
NFS?
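
(A quick way to check, only a sketch using the hostnames and path from 
your mails:

  for node in nlcftcs12 nlcftcs13 nlcftcs14; do
      rsh $node "ls -l /cadappl/mpich2/1.0/bin/smpd"
  done

If the binary shows up on every node, a missing installation is not the 
problem.)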

Cheers - Reuti


> What particularly annoys me are the "error: executing task of job"
> messages because they don't give any clarity as to what is going wrong.
> 
> To me the command looks correct but like I said, I'm new to SGE so maybe
> I'm overlooking something.
> 
> Cheers,
> 
> Jeroen Kleijer
> 
> > 
> > > A possibility would be to do:
> > > qrsh -V -inherit -q batch.q@$node "command"
> > > but this would mean that I would have to open the batch.q queue for 
> > > interactive sessions, something I'm not looking forward to.
> > 
> > At this point you already got the queue, since your job is already 
> > running. No need to specify it.
> > 
> > Cheers - Reuti
> > 
> > 
> > 
> > > Is there a possibility to do a qrsh command directly to a specified 
> > > node? (and thereby defeating the purpose of SGE scheduling, I know)
> > > Or do I still have to do a "regular" rsh command, also something I'm 
> > > not looking forward to?
> > > 
> > > Met vriendelijke groeten / Kind regards
> > > 
> > > Jeroen Kleijer
> > > Unix Systeembeheer
> > > Philips Applied Technologies
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > jeroen.m.kleijer+FromInterNet at philips.com
> > > 2005-02-10 03:52 PM
> > > Please respond to users
> > > 
> > >         To:     users at gridengine.sunsource.net
> > >         cc:     (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
> > >         Subject:        Re: [GE users] setting up mpich2 pe + qrsh
> > >         Classification: 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > Hi Reuti, 
> > > 
> > > My nsswitch.conf uses "services: files nis", so it should be able to 
> > > use the NIS services map, yet somehow it doesn't. 
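> > > 
> > > (To double-check what the resolver actually returns, assuming getent 
> > > is available on the nodes, something like the following separates the 
> > > nsswitch path from NIS itself:
> > > 
> > >     getent services sge_qmaster
> > >     ypcat services | grep sge
> > > 
> > > If ypcat shows the entry but getent returns nothing, the services 
> > > lookup via nsswitch.conf is indeed being skipped.)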
> > > 
> > > I liked the idea of the smpd daemon mode you described, so that was 
> > > the one I wanted to go with (and still do). 
> > > I'll try the idea of calculating the SGE_PORTID in the jobscript as 
> > > well (overlooked that one). 
> > > 
> > > Thanks for all the help so far. 
> > > 
> > > Met vriendelijke groeten / Kind regards
> > > 
> > > Jeroen Kleijer
> > > Unix Systeembeheer
> > > Philips Applied Technologies 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > Reuti <reuti at staff.uni-marburg.de> 
> > > 2005-02-10 02:23 PM 
> > > Please respond to users 
> > > 
> > >         To:        users at gridengine.sunsource.net 
> > >         cc:        (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS) 
> > >         Subject:        Re: [GE users] setting up mpich2 pe + qrsh 
> > >         Classification: 
> > > 
> > > 
> > > 
> > > 
> > > Hi there,
> > > 
> > > Quoting jeroen.m.kleijer at philips.com:
> > > 
> > > <snip>
> > > > Though my NIS configuration is correct, it wouldn't use the 
> > > > sge_qmaster setting provided via services (via NIS), so I had to 
> > > > edit settings.sh and adjust:
> > > > unset SGE_QMASTER_PORT
> > > > unset SGE_EXECD_PORT
> > > > to
> > > > SGE_QMASTER_PORT=536 ; export SGE_QMASTER_PORT
> > > > SGE_EXECD_PORT=537 ; export SGE_EXECD_PORT
> > > 
> > > did you adjust the nsswitch.conf, so that services are also looked up 
> > > via NIS?
> > > 
> > > > The qrsh messages are gone now and I'm a bit further down the road, 
> > > > but I do have one question left regarding your startmpi.sh script.
> > > > In this script you generate a (random) port number for the smpd 
> > > > processes. How do you notify the script which you submit (after SGE 
> > > > has started the pe through startmpi.sh) of the randomly generated 
> > > > port number? As far as I can tell this variable is not known outside 
> > > > of the startmpi.sh script, so when I do 'qsub <some script>' where 
> > > > <some script> contains the line
> > > > mpiexec -np $NPSLOTS -p $SGE_PORTID -machinefile $TMPDIR/machines cpi
> > > > it fails because SGE_PORTID is not known in that script, but mpiexec 
> > > > needs to know at which port the smpd processes are running.
> > > 
> > > Well, first of all I wasn't sure what happens when two users start an 
> > > smpd on one node. Maybe a port is selected randomly by smpd on its own 
> > > to avoid conflicts. That's why I said it's not a complete Howto, since 
> > > there are still some gaps in the MPICH2 documentation (the daemonless 
> > > version isn't mentioned there at all so far).
> > > 
> > > But then the problem would be that one user can't have two jobs in two 
> > > different smpd rings on one node. Two smpds should run and listen on 
> > > different ports to avoid conflicts between the two jobs. So I got the 
> > > idea to calculate a port number from the job number you got. This has 
> > > to be the same, of course, in start_proc_args, in the script which 
> > > calls mpiexec, and in stop_proc_args. With 
> > > SGE_PORTID=$((JOB_ID % 500 + 12000)) you can do it, as long as you 
> > > don't have a turnaround of more than 500 jobs; the range can of course 
> > > be widened. To be completely on the safe side, you would also need to 
> > > test beforehand on all nodes whether the port is actually free.
> > > 
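> > > Spelled out, a minimal sketch (only illustrative; adapt the range to 
> > > your site): the same line goes into start_proc_args, stop_proc_args 
> > > and the job script, so that all of them agree on the port:
> > > 
> > >     # derive a per-job port from the job number (SGE sets $JOB_ID)
> > >     SGE_PORTID=$((JOB_ID % 500 + 12000))
> > >     export SGE_PORTID
> > > 
> > > and in the job script itself something like:
> > > 
> > >     # $NSLOTS is set by SGE to the number of granted slots
> > >     mpiexec -np $NSLOTS -p $SGE_PORTID -machinefile $TMPDIR/machines cpi
> > > 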
> > > So, put the calculation of the SGE_PORTID in the script like it's done 
> > > in the demo script I supplied, and it should work. If you don't like 
> > > the daemons at all, you may look at the daemonless startup:
> > > 
> > > http://gridengine.sunsource.net/servlets/ReadMsg?msgId=23231&listName=users

> > > 
> > > 
> > > CU - Reuti
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net





More information about the gridengine-users mailing list