[GE users] Node allocation considering network topology

Richard Ems r.ems at gmx.net
Sat Mar 4 20:50:56 GMT 2006



Reuti wrote:

>> But the reported error "cannot run in PE "mpich_09" because it only
>> offers 0 slots" will still be there. I got this error with slots=1.
>> What's happening here?
>>
> 
> Are there serial jobs already running on these machines in this
> queue? (You could even remove "BATCH INTERACTIVE" to get a pure parallel
> queue; it depends on your setup, of course.) - Reuti

There were jobs running in other queues, but now nothing is running on
any of the 8 nodes in cluster09, and the load is near 0 on all of them.

qstat -j nnn still reports the same error: "cannot run in PE "mpich_09"
because it only offers 0 slots".


# qstat -j 841
==============================================================
job_number:                 841
exec_file:                  job_scripts/841
submission_time:            Sat Mar  4 19:45:23 2006
owner:                      ems
uid:                        501
group:                      users
gid:                        100
sge_o_home:                 /net/fs02/home/ems
sge_o_log_name:             ems
sge_o_path:                 /opt/sge/bin/lx24-x86:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/java/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /net/fs02/home/ems/SGE/test
sge_o_host:                 fs02
account:                    sge
cwd:                        /net/fs02/home/ems/SGE/test
path_aliases:               /tmp_mnt/ * * /
mail_options:               abes
mail_list:                  ems at fs02
notify:                     FALSE
job_name:                   RUN-SGE-test.sh
priority:                   600
jobshare:                   0
env_list:
script_file:                RUN-SGE-test.sh
parallel environment:  mpich_09 range: 8
scheduling info:            queue instance "c_para at cn13001" dropped because it is overloaded: np_load_avg=0.490000 (= 0.490000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn24001" dropped because it is overloaded: np_load_avg=0.465000 (= 0.465000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn14001" dropped because it is overloaded: np_load_avg=0.455000 (= 0.455000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn22001" dropped because it is overloaded: np_load_avg=0.485000 (= 0.485000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn11001" dropped because it is overloaded: np_load_avg=0.455000 (= 0.455000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn19001" dropped because it is overloaded: np_load_avg=0.460000 (= 0.460000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn23001" dropped because it is overloaded: np_load_avg=0.455000 (= 0.455000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn18001" dropped because it is overloaded: np_load_avg=0.450000 (= 0.450000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn21001" dropped because it is overloaded: np_load_avg=0.460000 (= 0.460000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn20001" dropped because it is overloaded: np_load_avg=0.480000 (= 0.480000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn15001" dropped because it is overloaded: np_load_avg=0.470000 (= 0.470000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn12001" dropped because it is overloaded: np_load_avg=0.465000 (= 0.465000 + 0.50 * 0.000000 with nproc=1) >= 0.25
                            queue instance "c_para at cn16001" dropped because it is overloaded: np_load_avg=0.357500 (= 0.325000 + 0.50 * 0.130000 with nproc=1) >= 0.25
                            queue instance "c_para at cn17001" dropped because it is overloaded: np_load_avg=0.425000 (= 0.360000 + 0.50 * 0.260000 with nproc=1) >= 0.25
                            queue instance "cluster12.q at cn12001" dropped because it is disabled
                            queue instance "cluster12.q at cn12002" dropped because it is disabled
                            queue instance "cluster12.q at cn12003" dropped because it is disabled
                            queue instance "cluster12.q at cn12004" dropped because it is disabled
                            queue instance "cluster12.q at cn12005" dropped because it is disabled
                            queue instance "cluster12.q at cn12006" dropped because it is disabled
                            queue instance "cluster12.q at cn12007" dropped because it is disabled
                            queue instance "cluster12.q at cn12008" dropped because it is disabled
                            queue instance "cluster11.q at cn11001" dropped because it is disabled
                            queue instance "cluster11.q at cn11002" dropped because it is disabled
                            queue instance "cluster11.q at cn11003" dropped because it is disabled
                            queue instance "cluster11.q at cn11004" dropped because it is disabled
                            queue instance "cluster11.q at cn11005" dropped because it is disabled
                            queue instance "cluster11.q at cn11006" dropped because it is disabled
                            queue instance "cluster11.q at cn11007" dropped because it is disabled
                            queue instance "cluster11.q at cn11008" dropped because it is disabled
                            cannot run in queue instance "cluster10.q at cn10006" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q at cn10005" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q at cn10008" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q at cn10003" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q at cn10007" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q at cn10002" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q at cn10004" because PE "mpich_09" is not in pe list
                            cannot run in queue instance "cluster10.q at cn10001" because PE "mpich_09" is not in pe list
                            cannot run in PE "mpich_09" because it only offers 0 slots
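With this many scheduling messages it can help to tally the queue instances by the reason the scheduler gave. Below is a minimal Python sketch, assuming the message wording seen in the output above (other Grid Engine versions may phrase it differently); `summarize_scheduling_info` is a hypothetical helper, not part of SGE. A tally like this makes it easy to check which hosts appear at all: for example, none of the cn09xxx (cluster09) nodes is listed in the output above.

```python
import re
from collections import defaultdict

# Hypothetical helper: the regexes assume the wording of the
# `qstat -j <job>` scheduling info shown in this post.
DROPPED = re.compile(r'queue instance "([^"]+)" dropped because it is (\w+)')
NOT_IN_PE_LIST = re.compile(
    r'cannot run in queue instance "([^"]+)" because PE "([^"]+)" is not in pe list'
)
PE_SLOTS = re.compile(r'cannot run in PE "([^"]+)" because it only offers (\d+) slots')


def summarize_scheduling_info(text):
    """Return ({reason: [queue instances]}, {pe_name: offered_slots})."""
    # The output may be line-wrapped (e.g. by a mail client), so collapse
    # all whitespace and match each message as one logical line.
    flat = " ".join(text.split())
    by_reason = defaultdict(list)
    for qi, reason in DROPPED.findall(flat):
        by_reason[reason].append(qi)
    for qi, pe in NOT_IN_PE_LIST.findall(flat):
        by_reason[f"PE {pe} not in pe_list"].append(qi)
    pe_slots = {pe: int(n) for pe, n in PE_SLOTS.findall(flat)}
    return dict(by_reason), pe_slots


# A few lines taken from the (wrapped) output above.
sample = '''
scheduling info:            queue instance "c_para at cn13001" dropped
because it is overloaded: np_load_avg=0.490000 (= 0.490000 + 0.50 *
0.000000 with nproc=1) >= 0.25
                            queue instance "cluster12.q at cn12001" dropped
because it is disabled
                            cannot run in queue instance
"cluster10.q at cn10006" because PE "mpich_09" is not in pe list
                            cannot run in PE "mpich_09" because it only
offers 0 slots
'''

reasons, pe_slots = summarize_scheduling_info(sample)
for reason, qis in sorted(reasons.items()):
    print(f"{reason}: {len(qis)} -> {', '.join(qis)}")
print(pe_slots)
```

Matching on the message text rather than on fixed columns keeps the sketch working whether the lines were wrapped by a mail client or pasted straight from a terminal.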





More information about the gridengine-users mailing list