[GE users] Re: mpich <-> sge --> controlling hosts machinefile

Gerolf Ziegenhain gerolf.ziegenhain at googlemail.com
Thu Jul 5 11:07:51 BST 2007


I was able to narrow it down further:

ssh lc12 cat /tmp/244224.1.q_mpich/machines
lc12
lc12
lc19
lc19
lc14
lc14
lc13
lc13

The master node of the 8-processor job has a good-looking machinefile.

From this, the currently running process creates the following machinefile,
in which lc12 appears only once and lc19 three times:
lc12.rhrk.uni-kl.de 0 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
lc19 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
lc19 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
lc14 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
lc14 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
lc13 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
lc13 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
lc19 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
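To compare the two files quickly, the entries per host can be counted. This is only a rough sketch: the function name count_per_host is made up for illustration, and it assumes that the host name is the first column and that the domain part (as in lc12.rhrk.uni-kl.de) can be ignored.

```shell
# Hypothetical helper: count entries per host in a machinefile.
# Assumes the host name is the first whitespace-separated column;
# anything after the first dot (the domain) is dropped before counting.
count_per_host() {
    awk '{ split($1, parts, "."); n[parts[1]]++ }
         END { for (h in n) print h, n[h] }' "$1" | sort
}
```

Run on the machines file in /tmp this gives 2 for every host; run on the generated file above it gives 1 for lc12, 2 each for lc13 and lc14, and 3 for lc19.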

Where is this second machinefile created? Is this done by
$SGE_ROOT/mpi/startmpi.sh -catch_rsh $pe_hostfile
?
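For reference, this is roughly how I understand a machines file to be derived from $pe_hostfile. It is a minimal sketch, not the actual code from startmpi.sh: the function name expand_pe_hostfile is hypothetical, and it assumes each pe_hostfile line starts with the host name followed by the number of granted slots.

```shell
# Hypothetical helper: print each host once per granted slot.
# Assumes pe_hostfile lines of the form:
#   <host> <slots> <queue> <processor range>
expand_pe_hostfile() {
    while read -r host slots _rest; do
        # Skip empty lines.
        [ -n "$host" ] || continue
        i=0
        while [ "$i" -lt "$slots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}
```

Inside a job this could be called as expand_pe_hostfile "$PE_HOSTFILE". With two slots per host it would print every host twice, which matches the machines file in /tmp but not the file the running process ends up with.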

/BR: Gerolf




2007/7/4, Gerolf Ziegenhain <ziegen at rhrk.uni-kl.de>:
>
> Hi,
>
> Maybe it is a very stupid question, but: How do I control the number of
> jobs per node? Consider the following hardware: 38 nodes with two processors
> on each. When I start a job with -pe mpich 8 there should be 4 nodes used
> with 2 jobs on each. What do I have to do in order to achieve this?
>
> My parallel environment is configured like this:
> qconf -sp mpich
> pe_name           mpich
> slots             60
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/N1GE/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args    /opt/N1GE/mpi/stopmpi.sh
> allocation_rule   $fill_up
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> My mpich-queue has limits:
> np_load_av=1
> np_load_sh=1
> n_slots=2
>
> However, if I start a job, something like this shows up in the
> PI1234-file:
> lc12.rhrk.uni-kl.de 0 prog
> lc19 1 prog
> lc19 1 prog
> lc19 1 prog
> lc14 1 prog
> lc14 1 prog
> lc13 1 prog
> lc13 1 prog
>
> So there are three jobs on lc19, which has only two CPUs. One of
> these three jobs would better be running on lc12. How can I fix this?
>
>
> Thanks in advance:
>    Gerolf
>
>
>
>
> --
> Dipl. Phys. Gerolf Ziegenhain
> Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern -
> Germany
> Web: gerolf.ziegenhain.com
>
>


-- 
Dipl. Phys. Gerolf Ziegenhain
Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern - Germany
Web: gerolf.ziegenhain.com