[GE users] mpich <-> sge --> controlling hosts machinefile

Chris Dagdigian dag at sonsorol.org
Wed Jul 4 20:28:49 BST 2007



Not sure if this totally answers your question, but you can influence
the host selection process by adjusting the allocation_rule setting in
your parallel environment configuration.

For instance, you have $fill_up configured, which is why your parallel
slots are being packed onto as few nodes as possible. Changing to
$round_robin will spread them out among as many machines as possible.
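
To make the difference concrete, here is a toy Python model of the two
rules. It is only a sketch: the real scheduler also weighs load, queue
sort order and other settings, and the host names (lc01 ... lc38), the
allocate() helper and the fixed 2 slots per host are assumptions made up
for illustration.

```python
def allocate(hosts, slots_per_host, requested, rule):
    """Toy model of two PE allocation rules (not SGE's real scheduler)."""
    granted = {h: 0 for h in hosts}
    remaining = requested
    if rule == "$fill_up":
        # Fill each host completely before moving on to the next one.
        for h in hosts:
            take = min(slots_per_host, remaining)
            granted[h] = take
            remaining -= take
            if remaining == 0:
                break
    elif rule == "$round_robin":
        # Hand out one slot per host per pass until the request is met.
        while remaining > 0:
            progressed = False
            for h in hosts:
                if remaining == 0:
                    break
                if granted[h] < slots_per_host:
                    granted[h] += 1
                    remaining -= 1
                    progressed = True
            if not progressed:
                break
    else:
        raise ValueError("unsupported rule: " + rule)
    if remaining > 0:
        raise RuntimeError("not enough free slots for the request")
    # Report only the hosts that actually received slots.
    return {h: n for h, n in granted.items() if n > 0}


# Hypothetical cluster: 38 hosts with 2 slots each, job asks for 8 slots.
hosts = ["lc%02d" % i for i in range(1, 39)]
print(allocate(hosts, 2, 8, "$fill_up"))      # 4 hosts, 2 slots each
print(allocate(hosts, 2, 8, "$round_robin"))  # 8 hosts, 1 slot each
```

In this model, -pe mpich 8 with $fill_up lands on 4 hosts with 2 tasks
each, provided every host really offers exactly 2 slots.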

For your main symptom:

If your parallel jobs are running more than 2 tasks per node, then
something may be off with your slot count. Perhaps SGE is detecting
multi-core CPUs on your 2-way boxes and setting slots=4 on each node.
Posting the config of the queue "mpich-queue" may help get to the
bottom of this, as I'm not sure about the n_slots "limit" you are
referring to.
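
A quick way to spot this kind of oversubscription is to count the lines
per host in the generated machinefile. The Python sketch below does that
for the listing quoted further down. The helper names are hypothetical,
and stripping the domain part so that short and fully qualified host
names compare equal is an assumption about how your hosts are named.

```python
from collections import Counter

# Machinefile contents as quoted in the original message below.
MACHINEFILE = """\
lc12.rhrk.uni-kl.de 0 prog
lc19 1 prog
lc19 1 prog
lc19 1 prog
lc14 1 prog
lc14 1 prog
lc13 1 prog
lc13 1 prog
"""


def tasks_per_host(text):
    """Count machinefile lines per host, using the short host name."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumption: "lc12.rhrk.uni-kl.de" and "lc12" are the same host.
        counts[fields[0].split(".")[0]] += 1
    return counts


def oversubscribed(counts, slots_per_host):
    """Return the hosts that run more tasks than slots_per_host."""
    return {h: n for h, n in counts.items() if n > slots_per_host}


counts = tasks_per_host(MACHINEFILE)
print(dict(counts))
print(oversubscribed(counts, 2))  # {'lc19': 3}
```

If a host shows up here with more tasks than it has CPUs, the slot count
the scheduler sees for that host is the first thing to check.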


Regards,
Chris



On Jul 4, 2007, at 3:14 PM, Gerolf Ziegenhain wrote:

> Hi,
>
> Maybe it is a very stupid question, but: How do I control the  
> number of jobs per node? Consider the following hardware: 38 nodes  
> with two processors on each. When I start a job with -pe mpich 8  
> there should be 4 nodes used with 2 jobs on each. What do I have to  
> do in order to achieve this?
>
> My parallel environment is configured like this:
> qconf -sp mpich
> pe_name           mpich
> slots             60
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/N1GE/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args    /opt/N1GE/mpi/stopmpi.sh
> allocation_rule   $fill_up
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> My mpich-queue has limits:
> np_load_av=1
> np_load_sh=1
> n_slots=2
>
> However if I start a job, something like this will happen in the  
> PI1234-file:
> lc12.rhrk.uni-kl.de 0 prog
> lc19 1 prog
> lc19 1 prog
> lc19 1 prog
> lc14 1 prog
> lc14 1 prog
> lc13 1 prog
> lc13 1 prog
>
> So there are three jobs on lc19, which has only two CPUs. One of
> these three jobs would be better off running on lc12. How can I fix
> this?
>
>
> Thanks in advance:
>    Gerolf
>
>
>
>
> -- 
> Dipl. Phys. Gerolf Ziegenhain
> Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern  
> - Germany
> Web: gerolf.ziegenhain.com
>
>
