[GE users] Node allocation considering network topology

Reuti reuti at staff.uni-marburg.de
Sun Mar 5 13:48:23 GMT 2006


Just to be sure: are you running 6.0u7? Earlier releases had a bug in
wildcard handling.

Also, can you please post your hostgroup configuration? To me it
looks like it should work. - Reuti
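
In case it helps, here is a rough sketch for collecting that
information in one go. The helper name dump_pe_setup is made up for
illustration; the qconf options used (-shgrpl, -shgrp, -spl, -sp, -sql,
-sq) are the usual "show" options. The grep only keeps the first line
of each attribute, so long lists that continue over several lines will
be cut short.

```shell
#!/bin/sh
# Sketch only: dump the Grid Engine objects that matter for PE/queue matching.
# dump_pe_setup is a hypothetical helper name, not part of Grid Engine.
# The installed version is usually shown in the first line of `qstat -help`.
dump_pe_setup() {
    if ! command -v qconf >/dev/null 2>&1; then
        echo "qconf not found in PATH" >&2
        return 1
    fi
    # Host groups and their members
    for hg in $(qconf -shgrpl); do
        echo "== hostgroup $hg"
        qconf -shgrp "$hg"
    done
    # Parallel environment definitions
    for pe in $(qconf -spl); do
        echo "== PE $pe"
        qconf -sp "$pe"
    done
    # For every cluster queue: which hosts, which PEs, how many slots
    for q in $(qconf -sql); do
        echo "== queue $q"
        qconf -sq "$q" | grep -E '^(hostlist|pe_list|slots)'
    done
}

# Usage on a submit or admin host:
#   dump_pe_setup > pe_setup.txt
```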


On 05.03.2006 at 14:28, Richard Ems wrote:

> Reuti wrote:
>> Strange: what was your qsub command? - Reuti
>
> The first qsub command was:
>
> # qsub -pe mpich* 8 ~/job1.sh
>
> I also tried with
>
> # qsub -pe mpich* 2 ~/job1.sh
>
> but with the same results "cannot run in PE "mpich_09" because it only
> offers 0 slots"
>
>
>
> # cat job1.sh
> echo hostname=`hostname`
> sleep 30
>
>
>
> # qstat -g c
> CLUSTER QUEUE                  CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
> -------------------------------------------------------------------------------
> all.q                            -NA-      0      0      0      0      0
> cluster09.q                      0.00      0      8      8      0      0
> cluster10.q                      0.45      0      7      8      1      0
> cluster11.q                      0.47      0      0      8      0      8
> cluster12.q                      0.47      0      0      8      0      8
> c_para                           0.47     15      0     15     15      0
>
>
>
> # qstat -j 854
> ==============================================================
> job_number:                 854
> exec_file:                  job_scripts/854
> submission_time:            Sun Mar  5 14:21:05 2006
> owner:                      ems
> uid:                        501
> group:                      users
> gid:                        100
> sge_o_home:                 /net/fs02/home/ems
> sge_o_log_name:             ems
> sge_o_path:                 /opt/sge/bin/lx24-x86:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/java/bin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /net/fs02/home/ems
> sge_o_host:                 fs02
> account:                    sge
> mail_list:                  ems at fs02.
> notify:                     FALSE
> job_name:                   job1.sh
> jobshare:                   0
> env_list:
> script_file:                /net/fs02/home/ems/job1.sh
> parallel environment:  mpich* range: 2
> scheduling info:            queue instance "c_para@cn11001" dropped because it is overloaded: np_load_avg=0.480000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn19001" dropped because it is overloaded: np_load_avg=0.475000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn23001" dropped because it is overloaded: np_load_avg=0.435000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn18001" dropped because it is overloaded: np_load_avg=0.490000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn21001" dropped because it is overloaded: np_load_avg=0.450000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn20001" dropped because it is overloaded: np_load_avg=0.465000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn15001" dropped because it is overloaded: np_load_avg=0.490000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn12001" dropped because it is overloaded: np_load_avg=0.485000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn10001" dropped because it is overloaded: np_load_avg=0.485000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn13001" dropped because it is overloaded: np_load_avg=0.450000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn17001" dropped because it is overloaded: np_load_avg=0.470000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn24001" dropped because it is overloaded: np_load_avg=0.455000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn14001" dropped because it is overloaded: np_load_avg=0.480000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn16001" dropped because it is overloaded: np_load_avg=0.465000 (no load adjustment) >= 0.25
>                             queue instance "c_para@cn22001" dropped because it is overloaded: np_load_avg=0.375000 (no load adjustment) >= 0.25
>                             queue instance "cluster12.q@cn12001" dropped because it is disabled
>                             queue instance "cluster12.q@cn12002" dropped because it is disabled
>                             queue instance "cluster12.q@cn12003" dropped because it is disabled
>                             queue instance "cluster12.q@cn12004" dropped because it is disabled
>                             queue instance "cluster12.q@cn12005" dropped because it is disabled
>                             queue instance "cluster12.q@cn12006" dropped because it is disabled
>                             queue instance "cluster12.q@cn12007" dropped because it is disabled
>                             queue instance "cluster12.q@cn12008" dropped because it is disabled
>                             queue instance "cluster11.q@cn11001" dropped because it is disabled
>                             queue instance "cluster11.q@cn11002" dropped because it is disabled
>                             queue instance "cluster11.q@cn11003" dropped because it is disabled
>                             queue instance "cluster11.q@cn11004" dropped because it is disabled
>                             queue instance "cluster11.q@cn11005" dropped because it is disabled
>                             queue instance "cluster11.q@cn11006" dropped because it is disabled
>                             queue instance "cluster11.q@cn11007" dropped because it is disabled
>                             queue instance "cluster11.q@cn11008" dropped because it is disabled
>                             cannot run in queue instance "cluster10.q@cn10005" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10006" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10003" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10007" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10002" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10001" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10008" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10004" because PE "mpich_09" is not in pe list
>                             cannot run in PE "mpich_09" because it only offers 0 slots
>                             cannot run in queue instance "cluster09.q@cn09006" because PE "mpich_10" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09005" because PE "mpich_10" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09003" because PE "mpich_10" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09002" because PE "mpich_10" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09007" because PE "mpich_10" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09004" because PE "mpich_10" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09001" because PE "mpich_10" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09008" because PE "mpich_10" is not in pe list
>                             cannot run in PE "mpich_10" because it only offers 0 slots
>                             cannot run in queue instance "cluster09.q@cn09006" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09005" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09003" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09002" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09007" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09004" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09001" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09008" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10005" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10006" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10003" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10007" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10002" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10001" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10008" because PE "mpich_11" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10004" because PE "mpich_11" is not in pe list
>                             cannot run in PE "mpich_11" because it only offers 0 slots
>                             cannot run in queue instance "cluster09.q@cn09006" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09005" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09003" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09002" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09007" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09004" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09001" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster09.q@cn09008" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10005" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10006" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10003" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10007" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10002" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10001" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10008" because PE "mpich_12" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10004" because PE "mpich_12" is not in pe list
>                             cannot run in PE "mpich_12" because it only offers 0 slots
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
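
The scheduling info quoted above is long, so a per-reason count may be
easier to read. The following is only a sketch: summarize_sched_info is
a made-up helper name, the patterns are taken from the messages in this
job's output, and it assumes each reason arrives on one unwrapped line,
as when piping qstat directly.

```shell
#!/bin/sh
# Sketch only: count the distinct "scheduling info" reasons from
# `qstat -j <job_id>` output read on stdin.
# summarize_sched_info is a hypothetical helper name, not part of Grid Engine.
summarize_sched_info() {
    # Reduce each known message to a short label, drop all other lines,
    # then count identical labels.
    sed -n \
        -e 's/.*dropped because it is overloaded.*/overloaded/p' \
        -e 's/.*dropped because it is disabled.*/disabled/p' \
        -e 's/.*because PE "\([^"]*\)" is not in pe list.*/PE \1 not in pe_list/p' \
        -e 's/.*cannot run in PE "\([^"]*\)" because it only offers 0 slots.*/PE \1 offers 0 slots/p' |
    LC_ALL=C sort | uniq -c | sed 's/^[[:space:]]*//'
}

# Usage:
#   qstat -j 854 | summarize_sched_info
```

Each output line is a count followed by the label, one line per
distinct reason and PE name.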
