[GE users] Node allocation considering network topology

Reuti reuti at staff.uni-marburg.de
Sat Mar 4 20:59:35 GMT 2006


> hostlist              @cluster09

contains all 8 machines? What is "qstat -g c" showing? - Reuti
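
(For reference, assuming a plain SGE 6.x setup, the objects involved here can be
inspected with the standard client commands - the names @cluster09, mpich_09 and
c_para are simply the ones appearing in this thread:)

  # hosts actually resolved by the host group used in the queue's hostlist
  qconf -shgrp @cluster09

  # summary of used/available slots per cluster queue
  qstat -g c

  # PE definition: check the "slots" entry and the allocation_rule
  qconf -sp mpich_09

  # queue definition: qtype (BATCH INTERACTIVE vs. a pure parallel queue),
  # pe_list, slots and load_thresholds
  qconf -sq c_para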


On 04.03.2006, at 21:50, Richard Ems wrote:

> Reuti wrote:
>
>>> But the reported error "cannot run in PE "mpich_09" because it only
>>> offers 0 slots" will still be there. I got this error with slots=1.
>>> What's happening here?
>>>
>>
>> Are there serial jobs already running on these machines in this
>> queue (you could even remove "BATCH INTERACTIVE" to get a pure
>> parallel queue - depends on your setup, of course)? - Reuti
>
> There were jobs running on other queues, but now nothing is running
> on any of the 8 nodes from cluster09; loads are near 0 on all nodes.
>
> qstat -j nnn still reports the same error: "cannot run in PE "mpich_09"
> because it only offers 0 slots".
>
>
> # qstat -j 841
> ==============================================================
> job_number:                 841
> exec_file:                  job_scripts/841
> submission_time:            Sat Mar  4 19:45:23 2006
> owner:                      ems
> uid:                        501
> group:                      users
> gid:                        100
> sge_o_home:                 /net/fs02/home/ems
> sge_o_log_name:             ems
> sge_o_path:                 /opt/sge/bin/lx24-x86:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/java/bin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /net/fs02/home/ems/SGE/test
> sge_o_host:                 fs02
> account:                    sge
> cwd:                        /net/fs02/home/ems/SGE/test
> path_aliases:               /tmp_mnt/ * * /
> mail_options:               abes
> mail_list:                  ems at fs02
> notify:                     FALSE
> job_name:                   RUN-SGE-test.sh
> priority:                   600
> jobshare:                   0
> env_list:
> script_file:                RUN-SGE-test.sh
> parallel environment:  mpich_09 range: 8
> scheduling info:            queue instance "c_para@cn13001" dropped because it is overloaded: np_load_avg=0.490000 (= 0.490000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn24001" dropped because it is overloaded: np_load_avg=0.465000 (= 0.465000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn14001" dropped because it is overloaded: np_load_avg=0.455000 (= 0.455000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn22001" dropped because it is overloaded: np_load_avg=0.485000 (= 0.485000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn11001" dropped because it is overloaded: np_load_avg=0.455000 (= 0.455000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn19001" dropped because it is overloaded: np_load_avg=0.460000 (= 0.460000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn23001" dropped because it is overloaded: np_load_avg=0.455000 (= 0.455000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn18001" dropped because it is overloaded: np_load_avg=0.450000 (= 0.450000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn21001" dropped because it is overloaded: np_load_avg=0.460000 (= 0.460000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn20001" dropped because it is overloaded: np_load_avg=0.480000 (= 0.480000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn15001" dropped because it is overloaded: np_load_avg=0.470000 (= 0.470000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn12001" dropped because it is overloaded: np_load_avg=0.465000 (= 0.465000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn16001" dropped because it is overloaded: np_load_avg=0.357500 (= 0.325000 + 0.50 * 0.130000 with nproc=1) >= 0.25
>                             queue instance "c_para@cn17001" dropped because it is overloaded: np_load_avg=0.425000 (= 0.360000 + 0.50 * 0.260000 with nproc=1) >= 0.25
>                             queue instance "cluster12.q@cn12001" dropped because it is disabled
>                             queue instance "cluster12.q@cn12002" dropped because it is disabled
>                             queue instance "cluster12.q@cn12003" dropped because it is disabled
>                             queue instance "cluster12.q@cn12004" dropped because it is disabled
>                             queue instance "cluster12.q@cn12005" dropped because it is disabled
>                             queue instance "cluster12.q@cn12006" dropped because it is disabled
>                             queue instance "cluster12.q@cn12007" dropped because it is disabled
>                             queue instance "cluster12.q@cn12008" dropped because it is disabled
>                             queue instance "cluster11.q@cn11001" dropped because it is disabled
>                             queue instance "cluster11.q@cn11002" dropped because it is disabled
>                             queue instance "cluster11.q@cn11003" dropped because it is disabled
>                             queue instance "cluster11.q@cn11004" dropped because it is disabled
>                             queue instance "cluster11.q@cn11005" dropped because it is disabled
>                             queue instance "cluster11.q@cn11006" dropped because it is disabled
>                             queue instance "cluster11.q@cn11007" dropped because it is disabled
>                             queue instance "cluster11.q@cn11008" dropped because it is disabled
>                             cannot run in queue instance "cluster10.q@cn10006" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10005" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10008" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10003" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10007" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10002" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10004" because PE "mpich_09" is not in pe list
>                             cannot run in queue instance "cluster10.q@cn10001" because PE "mpich_09" is not in pe list
>                             cannot run in PE "mpich_09" because it only offers 0 slots
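
All of the c_para instances listed above are being dropped because np_load_avg
exceeds the queue's load_thresholds (the ">= 0.25" in every message), and the
cluster10/11/12 queues are either disabled or do not have mpich_09 in their
pe_list, so no queue instance is left to offer slots for the PE. Assuming a
standard queue configuration, the threshold can be checked and, if 0.25 is not
intended, raised again (np_load_avg=1.75 is the usual Grid Engine default); a
rough sketch:

  # show the threshold the scheduler compares np_load_avg against
  qconf -sq c_para | grep load_thresholds

  # modify the queue in the editor and set e.g. the default again:
  #   load_thresholds       np_load_avg=1.75
  qconf -mq c_para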

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



