[GE users] Basic steps Mpich & SGE loose or tight integration
reuti at staff.uni-marburg.de
Thu Aug 24 20:57:34 BST 2006
On 24.08.2006 at 20:10, Marcel Mohr wrote:
> Dear SGE Users
> I have SGE 6.0 and MPICH 2.0 on a small Beowulf cluster.
> Both work well separately ;-)
> First of all, I guess MPICH2 with mpd doesn't work well.
> So I need the smpd version, which I also compiled and which works well
> WITHOUT SGE.
> I downloaded Reuti's scripts
> and modified start_mpich2.c to use rsh instead of ssh.
To use rsh or ssh? You must stay with plain rsh to get the rsh-wrapper
from SGE working.
> I configured my mpich PE:
> pe_name mpich
> slots 10
> user_lists NONE
> xuser_lists NONE
> start_proc_args /usr/local/SGE/mpich2_smpd/startmpich2.sh -catch_rsh \
> $pe_hostfile /usr/local/mpich2-smpd
> stop_proc_args /usr/local/SGE/mpich2_smpd/stopmpich2.sh $pe_hostfile \
Don't include the $pe_hostfile here. The line should read:
stop_proc_args /usr/local/SGE/mpich2_smpd/stopmpich2.sh -catch_rsh /
> allocation_rule $pe_slots
How many CPUs are in each machine? $pe_slots will allocate all of the
requested slots on one machine. Maybe $round_robin will help in your
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
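If you follow the $round_robin suggestion, the allocation-related entries of the PE (changed with `qconf -mp mpich`) might look like the fragment below. This is only a sketch: it keeps the other values from the configuration quoted above, leaves out start_proc_args and stop_proc_args, and whether $round_robin or $fill_up fits better depends on how many CPUs each node has.

```
pe_name            mpich
slots              10
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
```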
> and can run jobs with
> qsub -p 0 -pe mpich 1 mpich2.sh
> which work well, but ONLY if I use 1 processor.
> If I use 2 or more, they wait in the cluster queue (I guess forever)
> with the notification
> cannot run in PE "mpich" because it only offers 4 slots
> Thinking the error might be caused by the start script,
> I tried to submit it as a job myself:
>> cat job.startmpich
> #$ -cwd
> sh startmpich2.sh -catch_rsh $pe_hostfile /usr/local/mpich2-smpd
In the job script you will need just the mpiexec call.
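A minimal job script along these lines might look like the sketch below. It rests on assumptions not confirmed in this thread: that startmpich2.sh (run by SGE as start_proc_args) leaves the machine list in $TMPDIR/machines, that mpiexec lives in /usr/local/mpich2-smpd/bin, and that ./mpihello is a hypothetical MPI program. Outside an SGE job (no $JOB_ID) it only prints the command it would run.

```shell
#!/bin/sh
#$ -cwd
#$ -pe mpich 2
# Sketch of a job script for the smpd-based PE. The PE's start_proc_args
# has already started the smpd daemons, so the script only calls mpiexec.
# Assumptions (adapt to your setup):
#   - the MPICH2 installation is in /usr/local/mpich2-smpd
#   - the machine list is written to $TMPDIR/machines by startmpich2.sh
#   - ./mpihello is a hypothetical MPI program
MPICH2_ROOT=${MPICH2_ROOT:-/usr/local/mpich2-smpd}
MPI_PROG=${MPI_PROG:-./mpihello}

# Print the mpiexec command line for this job; NSLOTS is set by SGE.
mpiexec_cmd() {
    echo "$MPICH2_ROOT/bin/mpiexec -n ${NSLOTS:-1} -machinefile ${TMPDIR:-/tmp}/machines $MPI_PROG"
}

if [ -n "$JOB_ID" ]; then
    # Inside SGE: replace the shell with mpiexec.
    # Word splitting is intended here; assumes no spaces in the paths.
    exec $(mpiexec_cmd)
else
    # Outside SGE: dry run, only show what would be executed.
    mpiexec_cmd
fi
```

Depending on how your copy of startmpich2.sh starts the smpd daemons (for example on a job-specific port), mpiexec may need further options.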
>> qsub -pe mpich 1 job.startmpich:
> and obtain:
> startmpich2.sh: check for smpd daemons (1 of 10)
> startmpich2.sh: missing smpd on david007
> startmpich2.sh: check for smpd daemons (2 of 10)
> startmpich2.sh: found running smpd on david007
> startmpich2.sh: got all 1 of 1 nodes
> so they seem to work.
> (Actually I get the following error from the stop script:
> /usr/local/SGE/mpich2_smpd/stopmpich2.sh: line 117: /home/SGE_spool//david007/active_jobs/1657.1/pe_hostfile/bin/smpd: Not a directory)
> but that's not too important.