[GE users] Node allocation considering network topology

Richard Ems r.ems at gmx.net
Sat Mar 4 19:00:13 GMT 2006



Reuti, I tried as you suggested, but it still doesn't work.
Jobs don't get started and "qstat -j nnn" says:

...
cannot run in PE "mpich_09" because it only offers 0 slots
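
The job requests 8 slots in the PE, i.e. the equivalent of:

  qsub -pe mpich_09 8 RUN-SGE-test.sh

(plus the usual -cwd and mail options; the full details are in the qstat
output below).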

The queue is configured with 8 processors and 1 slot. I set the number
of slots to 8 and then I get:

...
cannot run in PE "mpich_09" because it only offers 64 slots

What's happening?
Should I set 8 processors or 8 slots on the queue? Or both?
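
My guess -- please correct me if I am wrong -- is that the queue-level
"slots" is a per-host limit for each host in @cluster09, so with 8 nodes:

  8 hosts x 8 slots per host = 64 slots in cluster09.q

which might be where the 64 in the message comes from. But then I don't
understand why a request for only 8 slots is still refused.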

Thanks again, Richard


Here are the configurations:

# qconf -sq cluster09.q
qname                 cluster09.q
hostlist              @cluster09
seq_no                9
load_thresholds       np_load_avg=1.50
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            8
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               mpich_09
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY



# qconf -sp mpich_09
pe_name           mpich_09
slots             8
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $pe_slots
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min
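
One thing I am unsure about is the allocation_rule: as far as I understand,
$pe_slots means all requested slots must come from a single host, so an
8-slot request would need one node with 8 free slots. If the job is meant
to spread over the 8 nodes of the subcluster, I would have expected the PE
to look more like this (an untested sketch with one slot per node; $fill_up
or $round_robin would be the other options):

pe_name           mpich_09
slots             8
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   1
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min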



# qstat -j 841
==============================================================
job_number:                 841
exec_file:                  job_scripts/841
submission_time:            Sat Mar  4 19:45:23 2006
owner:                      ems
uid:                        501
group:                      users
gid:                        100
sge_o_home:                 /net/fs02/home/ems
sge_o_log_name:             ems
sge_o_path:
/opt/sge/bin/lx24-x86:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/java/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /net/fs02/home/ems/SGE/test
sge_o_host:                 fs02
account:                    sge
cwd:                        /net/fs02/home/ems/SGE/test
path_aliases:               /tmp_mnt/ * * /
mail_options:               abes
mail_list:                  ems at fs02
notify:                     FALSE
job_name:                   RUN-SGE-test.sh
priority:                   600
jobshare:                   0
env_list:
script_file:                RUN-SGE-test.sh
parallel environment:  mpich_09 range: 8
scheduling info:            queue instance "c_para@cn09001" dropped because it is temporarily not available
                            queue instance "c_para@cn10001" dropped because it is temporarily not available
                            queue instance "c_para@cn21001" dropped because it is overloaded: np_load_avg=0.460000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn13001" dropped because it is overloaded: np_load_avg=0.460000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn16001" dropped because it is overloaded: np_load_avg=0.455000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn17001" dropped because it is overloaded: np_load_avg=0.440000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn12001" dropped because it is overloaded: np_load_avg=0.465000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn15001" dropped because it is overloaded: np_load_avg=0.485000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn18001" dropped because it is overloaded: np_load_avg=0.495000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn20001" dropped because it is overloaded: np_load_avg=0.460000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn24001" dropped because it is overloaded: np_load_avg=0.485000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn14001" dropped because it is overloaded: np_load_avg=0.475000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn22001" dropped because it is overloaded: np_load_avg=0.490000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn11001" dropped because it is overloaded: np_load_avg=0.485000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn19001" dropped because it is overloaded: np_load_avg=0.485000 (no load adjustment) >= 0.25
                            queue instance "c_para@cn23001" dropped because it is overloaded: np_load_avg=0.285000 (no load adjustment) >= 0.25
                            queue instance "cluster12.q@cn12001" dropped because it is disabled
                            queue instance "cluster12.q@cn12002" dropped because it is disabled
                            queue instance "cluster12.q@cn12003" dropped because it is disabled
                            queue instance "cluster12.q@cn12004" dropped because it is disabled
                            queue instance "cluster12.q@cn12005" dropped because it is disabled
                            queue instance "cluster12.q@cn12006" dropped because it is disabled
                            queue instance "cluster12.q@cn12007" dropped because it is disabled
                            queue instance "cluster12.q@cn12008" dropped because it is disabled
                            queue instance "cluster11.q@cn11001" dropped because it is disabled
                            queue instance "cluster11.q@cn11002" dropped because it is disabled
                            queue instance "cluster11.q@cn11003" dropped because it is disabled
                            queue instance "cluster11.q@cn11004" dropped because it is disabled
                            queue instance "cluster11.q@cn11005" dropped because it is disabled
                            queue instance "cluster11.q@cn11006" dropped because it is disabled
                            queue instance "cluster11.q@cn11007" dropped because it is disabled
                            queue instance "cluster11.q@cn11008" dropped because it is disabled
                            cannot run in queue instance "cluster10.q@cn10007" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q@cn10003" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q@cn10004" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q@cn10006" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q@cn10005" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q@cn10008" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q@cn10002" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q@cn10001" because PE "mpich_09" is not in pe list
                            cannot run in PE "mpich_09" because it only offers 64 slots
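
For reference, here is my reading of the setup Reuti describes below,
written out as concrete (untested) commands; the hostgroup, PE and node
names for the second subcluster are only examples:

# one hostgroup per subcluster
qconf -ahgrp @cluster09        # hostlist: cn09001 ... cn09008
qconf -ahgrp @cluster10        # hostlist: cn10001 ... cn10008

# one PE per subcluster; the names only need to match a common wildcard
qconf -ap mpich_09             # slots 8, allocation_rule as sketched above
qconf -ap mpich_10

# one queue per subcluster, each tying one hostgroup to one PE
qconf -aq cluster09.q          # hostlist @cluster09, pe_list mpich_09
qconf -aq cluster10.q          # hostlist @cluster10, pe_list mpich_10

# submit against the wildcard, so all slots come from a single subcluster
qsub -pe "mpich_*" 8 RUN-SGE-test.sh

# optionally let serial jobs fill one subcluster first
qconf -mq cluster09.q          # seq_no 10
qconf -mq cluster10.q          # seq_no 20
qconf -msconf                  # queue_sort_method seqno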




Reuti wrote:
> Hi Richard,
> 
> you can use a setup like the one described here:
> 
> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=10915
> 
> (now it's working as intended). To summarize, you will need:
> 
> - two hostgroups
> - two PEs with names which can be matched by using a wildcard, like
>   "mpich_a", "mpich_b" (allocation will happen just inside the hostgroups,
>   so it depends on your application which allocation rule you prefer in
>   these PEs, and it will not change this setup)
> - two queues, each with one hostgroup attached and one PE attached
> - request e.g. "-pe mpich* 4"
> 
> This way the jobs will only get slots from one of the two queues.
> 
> To direct serial jobs to one part of the cluster until it's full, you
> could give the queues different sequence numbers (e.g. 10 for one part:
> seq_no 0,[@group1=10], and 20 for the other in the queue definition) and
> set queue_sort_method to seqno in the scheduler configuration
> (qconf -msconf). For more than two subclusters, you just need more
> hostgroups/PEs/queues and continue with 30, 40,...
> 
> -- Reuti
> 
> 
> On 11.02.2006 at 16:33, Richard Ems wrote:
> 
>> Hi all!
>>
>> We have a cluster composed of several "subclusters". Each subcluster has
>> 8 nodes and is connected through its own switch to the master switch.
>>
>>
>>         subcluster 1                         subcluster 2         ...
>> n11 n12 n13 n14 n15 n16 n17 n18      n21 n22 n23 n24 n25 n26 n27 n28
>>  |   |   |   |   |   |   |   |        |   |   |   |   |   |   |   |
>>  |   |   |   |   |   |   |   |        |   |   |   |   |   |   |   |
>> -------------------------------      -------------------------------
>>         switch 1                             switch 2
>> -------------------------------      -------------------------------
>>            |                                    |
>>            |                                    |
>>           ----------------------------------------
>>                         master switch
>>           ----------------------------------------
>>                               |
>>                               |
>>                        -------------
>>                         master node
>>                        -------------
>>
>> One of the applications running on the cluster needs 8 nodes. We want to
>> configure the queue (queues?) to allocate a full subcluster to such a
>> job and not to spill over into another subcluster.
>>
>> I'm sure this is somehow possible, but I was not able to find how!
>>
>> Later, jobs needing 4/2/1 nodes will also be running on the cluster.
>> How can we configure a queue to "fill up" (PE fill_up?) a subcluster
>> before selecting free slots from the next subcluster? Not doing so could
>> mean having two 1-slot jobs running on n11 and n21 and so blocking an
>> 8-slot job!
>>
>> Many thanks, Richard
>>
