[GE users] Basic steps Mpich & SGE loose or tight integration

Reuti reuti at staff.uni-marburg.de
Thu Aug 24 20:57:34 BST 2006


Hi,

On 24.08.2006 at 20:10, Marcel Mohr wrote:

> Dear SGE Users
>
> I have SGE 6.0 and Mpich 2.0 on a little Beowulf cluster.
> Both work well separately ;-)
>
> First of all, I guess mpich with mpd doesn't work well.
>
> So I need the smpd version, which I also compiled and which works
> well WITHOUT SGE.
>
> I downloaded Reuti's scripts
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> and modified start_mpich2.c to use rsh instead of ssh

To use rsh or ssh? You must stay with plain rsh to get the rsh wrapper
from SGE working.

>
> configured my mpich pe:
>
> pe_name           mpich
> slots             10
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/SGE/mpich2_smpd/startmpich2.sh -catch_rsh \
>                   $pe_hostfile /usr/local/mpich2-smpd
> stop_proc_args    /usr/local/SGE/mpich2_smpd/stopmpich2.sh $pe_hostfile \
>                   /usr/local/mpich2-smpd

Don't include the $pe_hostfile here. The line should read:

stop_proc_args /usr/local/SGE/mpich2_smpd/stopmpich2.sh -catch_rsh /usr/local/mpich2-smpd

> allocation_rule   $pe_slots

How many CPUs are in each machine? $pe_slots will allocate all of the
requested slots on one machine, so a job asking for more slots than a
single host provides will stay pending. Maybe $round_robin will help in
your case.
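
To check how a given allocation_rule actually spread the slots, one can
summarise the hostfile SGE hands to the job. The following is only a
minimal sketch, assuming the usual pe_hostfile layout of one line per
host with the host name in the first column and the slot count in the
second; the function name slots_per_host is made up for illustration.

```shell
#!/bin/sh
# Print "host slots" for every host in a pe_hostfile, sorted by host name.
# Assumed layout per line: <host> <slots> <queue> <processor range>
slots_per_host() {
    awk '{ slots[$1] += $2 } END { for (h in slots) print h, slots[h] }' "$1" | sort
}

# Inside a parallel job SGE exports PE_HOSTFILE; do nothing when it is unset.
if [ -n "${PE_HOSTFILE:-}" ]; then
    slots_per_host "$PE_HOSTFILE"
fi
```

With $pe_slots the output can only ever contain one host; with
$round_robin a multi-slot request may show several.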

> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> and can run jobs
> qsub -p 0 -pe mpich 1 mpich2.sh
> which work well, but ONLY if I use 1 processor.
>
> If I use 2 or more they wait in cluster queue (I guess forever)  
> with the notification
> cannot run in PE "mpich" because it only offers 4 slots
>
> Because I thought the error might be caused by the start script,
> I tried submitting it as a job:
>
>> cat job.startmpich
> #$ -cwd
> sh startmpich2.sh -catch_rsh $pe_hostfile /usr/local/mpich2-smpd

In the job script you will need just the mpiexec call; SGE runs the
start and stop scripts itself through the PE definition.
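
As a rough sketch of such a job script (not taken from the howto
verbatim): the snippet below writes a file mpich2.sh that only sets the
PATH and calls mpiexec. It assumes the MPICH2 prefix from the PE
definition above, assumes that the start script leaves its machine file
in $TMPDIR/machines, and uses ./mpihello as a placeholder program name.
Depending on how the start script launches the daemons, mpiexec may need
further options, so please compare with the howto.

```shell
#!/bin/sh
# Write a minimal job script; the PE start/stop scripts handle smpd themselves.
cat > mpich2.sh <<'EOF'
#!/bin/sh
#$ -cwd
# Assumption: same MPICH2 (smpd) prefix as in the PE definition.
PATH=/usr/local/mpich2-smpd/bin:$PATH
export PATH
# Assumption: the start script left its machine file in $TMPDIR/machines.
# ./mpihello is a placeholder for the MPI program.
mpiexec -n "$NSLOTS" -machinefile "$TMPDIR/machines" ./mpihello
EOF

# Submit with, for example:
#   qsub -pe mpich 2 mpich2.sh
```

NSLOTS and TMPDIR are set by SGE for the running job, so the slot count
given to qsub does not have to be repeated inside the script.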

-- Reuti

>> qsub -pe mpich 1 job.startmpich:
>
> and obtain:
> startmpich2.sh: check for smpd daemons (1 of 10)
> startmpich2.sh: missing smpd on david007
> startmpich2.sh: check for smpd daemons (2 of 10)
> startmpich2.sh: found running smpd on david007
> startmpich2.sh: got all 1 of 1 nodes
>
>
> so they seem to work.
>
> (Actually I get the following error from the stop script:
> /usr/local/SGE/mpich2_smpd/stopmpich2.sh: line 117:
> /home/SGE_spool//david007/active_jobs/1657.1/pe_hostfile/bin/smpd: Not a directory )
> but that's not too important.
