[GE users] mpich <-> sge --> controlling hosts machinefile

Gerolf Ziegenhain mail.gerolf at ziegenhain.com
Wed Jul 4 20:35:56 BST 2007



Thanks for the very quick reply ;)

With allocation_rule = $round_robin I get one job per node, which increases
the communication overhead between nodes. So maybe allocation_rule = 2 (a
fixed two slots per host) would be the best choice in my case?
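
To make the difference between the three rules concrete, here is a toy model
in Python. It is only a sketch, not SGE's scheduler: it assumes every host is
idle with 2 free slots and that hosts are considered in hostlist order, and it
ignores load thresholds and seq_no. The function name allocate is made up for
this illustration.

```python
def allocate(hosts, slots_per_host, nslots, rule):
    """Toy model of a PE allocation_rule; returns {host: slots granted}.

    rule is "$fill_up", "$round_robin", or a positive int (fixed slots/host).
    Assumption: all hosts are idle and have the same number of free slots.
    """
    granted = {}
    remaining = nslots
    if rule == "$fill_up":
        # Use up each host completely before moving on to the next one.
        for host in hosts:
            take = min(slots_per_host, remaining)
            if take == 0:
                break
            granted[host] = take
            remaining -= take
    elif rule == "$round_robin":
        # Hand out one slot per host per pass until the request is met.
        progressed = True
        while remaining > 0 and progressed:
            progressed = False
            for host in hosts:
                if remaining > 0 and granted.get(host, 0) < slots_per_host:
                    granted[host] = granted.get(host, 0) + 1
                    remaining -= 1
                    progressed = True
    elif isinstance(rule, int) and rule > 0:
        # Exactly `rule` slots on every host that takes part in the job.
        if rule > slots_per_host or nslots % rule != 0:
            raise ValueError("request cannot be satisfied with this fixed rule")
        for host in hosts[: nslots // rule]:
            granted[host] = rule
            remaining -= rule
    else:
        raise ValueError("unknown allocation rule: %r" % (rule,))
    if remaining > 0:
        raise ValueError("not enough free slots")
    return granted


HOSTS = ["lc10", "lc11", "lc12", "lc13", "lc14", "lc15", "lc18", "lc19"]
for rule in ("$fill_up", "$round_robin", 2):
    print(rule, allocate(HOSTS, 2, 8, rule))
```

In this idealized case $fill_up and the fixed rule 2 both give 4 hosts with 2
slots each, while $round_robin gives 8 hosts with 1 slot each. The first two
only differ when a host has a single slot free: $fill_up would still use that
slot, the fixed rule would skip the host.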

This is the configuration of the queue:
qconf -sq q_mpich
qname                 q_mpich
hostlist              lc10 lc11 lc12 lc13 lc14 lc15 lc18 lc19
seq_no                21,[@b_hosts=22],[@x_hosts=23]
load_thresholds       np_load_avg=1,np_load_short=1,n_slots=2, \
                      [@b_hosts=np_load_avg=1,np_load_short=1,n_slots=2], \
                      [@x_hosts=np_load_avg=1,np_load_short=1,n_slots=2]
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               mpich
rerun                 TRUE
slots                 2
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            ziegen,[@x_hosts=big]
xuser_lists           matlab matlab1 thor
subordinate_list      NONE
complex_values        synchron=0,virtual_free=3G,n_slots=2, \
                      [@b_hosts=synchron=0,virtual_free=5G,n_slots=2], \
                      [@x_hosts=synchron=0,virtual_free=17G,n_slots=2]
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 100:00:00
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                2G,[@b_hosts=4G],[@x_hosts=16G]
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                3G,[@b_hosts=5G],[@x_hosts=17G]
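
To check a generated process list against slots = 2, the entries per host can
be tallied. A minimal sketch in Python, using the PI1234 listing from my first
mail (quoted below) as input. It assumes the MPICH ch_p4 procgroup convention
as I understand it: the first line names the master host and its number counts
processes in addition to the master itself, every other line adds that many
processes on the named host. The function names are made up for illustration.

```python
from collections import Counter


def tasks_per_host(procgroup_text):
    """Count processes per host in a procgroup-style listing."""
    counts = Counter()
    rows = [ln.split() for ln in procgroup_text.splitlines() if ln.strip()]
    for index, fields in enumerate(rows):
        host = fields[0].split(".")[0]  # lc12.rhrk.uni-kl.de -> lc12
        extra = int(fields[1])
        # Assumption: the first line is the master host, which already runs
        # one process, so its number is "additional" processes only.
        counts[host] += (extra + 1) if index == 0 else extra
    return counts


def oversubscribed(counts, slots=2):
    """Return only the hosts that run more processes than they have slots."""
    return {host: n for host, n in counts.items() if n > slots}


LISTING = """\
lc12.rhrk.uni-kl.de 0 prog
lc19 1 prog
lc19 1 prog
lc19 1 prog
lc14 1 prog
lc14 1 prog
lc13 1 prog
lc13 1 prog
"""

counts = tasks_per_host(LISTING)
print(dict(counts))
print(oversubscribed(counts))
```

For the listing above this reports lc19 as the only host above 2 processes,
with lc12 running just the master.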


/BR:
   Gerolf


2007/7/4, Chris Dagdigian <dag at sonsorol.org>:
>
>
> Not sure if this totally answers your question but you can play with
> the host selection process by adjusting your $allocation_rule in your
> parallel environment configuration.
>
> For instance, you have $fill_up configured which is why your parallel
> slots are being packed on as few nodes as possible. Changing to
> $round_robin will spread it out among as many machines as possible.
>
> For your main symptom:
>
> If your parallel jobs are running more than 2 tasks per node then
> something may be off with your slot count - perhaps SGE is detecting
> multi-core CPUs on your 2-way boxes and setting slots=4 on each node.
> Posting the config of the queue "mpich-queue" may help get to the
> bottom of this as I'm not sure about the n_slots "limit" you are
> referring to.
>
>
> Regards,
> Chris
>
>
>
> On Jul 4, 2007, at 3:14 PM, Gerolf Ziegenhain wrote:
>
> > Hi,
> >
> > Maybe it is a very stupid question, but: How do I control the
> > number of jobs per node? Consider the following hardware: 38 nodes
> > with two processors on each. When I start a job with -pe mpich 8
> > there should be 4 nodes used with 2 jobs on each. What do I have to
> > do in order to achieve this?
> >
> > My parallel environment is configured like this:
> > qconf -sp mpich
> > pe_name           mpich
> > slots             60
> > user_lists        NONE
> > xuser_lists       NONE
> > start_proc_args   /opt/N1GE/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > stop_proc_args    /opt/N1GE/mpi/stopmpi.sh
> > allocation_rule   $fill_up
> > control_slaves    TRUE
> > job_is_first_task FALSE
> > urgency_slots     min
> >
> > My mpich-queue has limits:
> > np_load_avg=1
> > np_load_short=1
> > n_slots=2
> >
> > However, if I start a job, something like this shows up in the
> > PI1234 file:
> > lc12.rhrk.uni-kl.de 0 prog
> > lc19 1 prog
> > lc19 1 prog
> > lc19 1 prog
> > lc14 1 prog
> > lc14 1 prog
> > lc13 1 prog
> > lc13 1 prog
> >
> > So there are three jobs on lc19, which has only two CPUs. One of
> > these three jobs would be better off running on lc12. How can I fix
> > this?
> >
> >
> > Thanks in advance:
> >    Gerolf
> >
> >
> >
> >
> > --
> > Dipl. Phys. Gerolf Ziegenhain
> > Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern
> > - Germany
> > Web: gerolf.ziegenhain.com
> >
> >
>


-- 
Dipl. Phys. Gerolf Ziegenhain
Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern - Germany
Web: gerolf.ziegenhain.com


