[GE users] Node allocation considering network topology

Reuti reuti at staff.uni-marburg.de
Sat Mar 4 21:24:36 GMT 2006


On 04.03.2006, at 21:59, Reuti wrote:

>> hostlist              @cluster09
>
> contains all 8 machines? What is "qstat -g c" showing? - Reuti
>

I forgot: or is the same PE attached by accident to another queue? - Reuti
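
PS: To narrow this down, a few checks along these lines should show
where mpich_09 is referenced and what the hostgroup really contains
(names taken from this thread, commands untested here):

   # list every cluster queue and the PEs it references
   for q in $(qconf -sql); do echo "== $q"; qconf -sq $q | grep pe_list; done

   # PE definition: total slots and allocation_rule
   qconf -sp mpich_09

   # hosts behind the hostgroup used in the queue's hostlist
   qconf -shgrp @cluster09

   # available/used/total slots per cluster queue
   qstat -g c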


>
> On 04.03.2006, at 21:50, Richard Ems wrote:
>
>> Reuti wrote:
>>
>>>> But the reported error "cannot run in PE "mpich_09" because it only
>>>> offers 0 slots" will still be there. I got this error with slots=1.
>>>> What's happening here?
>>>>
>>>
>>> Are there serial jobs already running on these machines in this
>>> queue (you could even remove "BATCH INTERACTIVE" to get a pure
>>> parallel queue - depends on your setup of course)? - Reuti
>>
>> There were jobs running on other queues, but now nothing is running
>> on any of the 8 nodes from cluster09; loads are near 0 on all nodes.
>>
>> qstat -j nnn still reports the same error: "cannot run in PE "mpich_09" because it only offers 0 slots".
>>
>>
>> # qstat -j 841
>> ==============================================================
>> job_number:                 841
>> exec_file:                  job_scripts/841
>> submission_time:            Sat Mar  4 19:45:23 2006
>> owner:                      ems
>> uid:                        501
>> group:                      users
>> gid:                        100
>> sge_o_home:                 /net/fs02/home/ems
>> sge_o_log_name:             ems
>> sge_o_path:                 /opt/sge/bin/lx24-x86:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/java/bin
>> sge_o_shell:                /bin/bash
>> sge_o_workdir:              /net/fs02/home/ems/SGE/test
>> sge_o_host:                 fs02
>> account:                    sge
>> cwd:                        /net/fs02/home/ems/SGE/test
>> path_aliases:               /tmp_mnt/ * * /
>> mail_options:               abes
>> mail_list:                  ems@fs02
>> notify:                     FALSE
>> job_name:                   RUN-SGE-test.sh
>> priority:                   600
>> jobshare:                   0
>> env_list:
>> script_file:                RUN-SGE-test.sh
>> parallel environment:  mpich_09 range: 8
>> scheduling info:            queue instance "c_para@cn13001" dropped because it is overloaded: np_load_avg=0.490000 (= 0.490000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn24001" dropped because it is overloaded: np_load_avg=0.465000 (= 0.465000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn14001" dropped because it is overloaded: np_load_avg=0.455000 (= 0.455000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn22001" dropped because it is overloaded: np_load_avg=0.485000 (= 0.485000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn11001" dropped because it is overloaded: np_load_avg=0.455000 (= 0.455000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn19001" dropped because it is overloaded: np_load_avg=0.460000 (= 0.460000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn23001" dropped because it is overloaded: np_load_avg=0.455000 (= 0.455000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn18001" dropped because it is overloaded: np_load_avg=0.450000 (= 0.450000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn21001" dropped because it is overloaded: np_load_avg=0.460000 (= 0.460000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn20001" dropped because it is overloaded: np_load_avg=0.480000 (= 0.480000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn15001" dropped because it is overloaded: np_load_avg=0.470000 (= 0.470000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn12001" dropped because it is overloaded: np_load_avg=0.465000 (= 0.465000 + 0.50 * 0.000000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn16001" dropped because it is overloaded: np_load_avg=0.357500 (= 0.325000 + 0.50 * 0.130000 with nproc=1) >= 0.25
>>                             queue instance "c_para@cn17001" dropped because it is overloaded: np_load_avg=0.425000 (= 0.360000 + 0.50 * 0.260000 with nproc=1) >= 0.25
>>                             queue instance "cluster12.q@cn12001" dropped because it is disabled
>>                             queue instance "cluster12.q@cn12002" dropped because it is disabled
>>                             queue instance "cluster12.q@cn12003" dropped because it is disabled
>>                             queue instance "cluster12.q@cn12004" dropped because it is disabled
>>                             queue instance "cluster12.q@cn12005" dropped because it is disabled
>>                             queue instance "cluster12.q@cn12006" dropped because it is disabled
>>                             queue instance "cluster12.q@cn12007" dropped because it is disabled
>>                             queue instance "cluster12.q@cn12008" dropped because it is disabled
>>                             queue instance "cluster11.q@cn11001" dropped because it is disabled
>>                             queue instance "cluster11.q@cn11002" dropped because it is disabled
>>                             queue instance "cluster11.q@cn11003" dropped because it is disabled
>>                             queue instance "cluster11.q@cn11004" dropped because it is disabled
>>                             queue instance "cluster11.q@cn11005" dropped because it is disabled
>>                             queue instance "cluster11.q@cn11006" dropped because it is disabled
>>                             queue instance "cluster11.q@cn11007" dropped because it is disabled
>>                             queue instance "cluster11.q@cn11008" dropped because it is disabled
>>                             cannot run in queue instance "cluster10.q@cn10006" because PE "mpich_09" is not in pe list
>>                             cannot run in queue instance "cluster10.q@cn10005" because PE "mpich_09" is not in pe list
>>                             cannot run in queue instance "cluster10.q@cn10008" because PE "mpich_09" is not in pe list
>>                             cannot run in queue instance "cluster10.q@cn10003" because PE "mpich_09" is not in pe list
>>                             cannot run in queue instance "cluster10.q@cn10007" because PE "mpich_09" is not in pe list
>>                             cannot run in queue instance "cluster10.q@cn10002" because PE "mpich_09" is not in pe list
>>                             cannot run in queue instance "cluster10.q@cn10004" because PE "mpich_09" is not in pe list
>>                             cannot run in queue instance "cluster10.q@cn10001" because PE "mpich_09" is not in pe list
>>                             cannot run in PE "mpich_09" because it only offers 0 slots
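
For completeness: the c_para lines above are just that queue being
dropped on its own load_thresholds (np_load_avg >= 0.25; the 0.50
factor is presumably the np_load_avg entry of job_load_adjustments,
decayed over load_adjustment_decay_time) and are unrelated to the
mpich_09 problem. To check those settings, and to see which queue is
supposed to provide the mpich_09 slots, something like the following
should do (the queue name in the last command is only a placeholder,
untested here):

   # per-queue threshold the adjusted load is compared against
   qconf -sq c_para | grep load_thresholds

   # scheduler-wide load adjustment and decay settings
   qconf -ssconf | egrep 'job_load_adjustments|load_adjustment_decay_time'

   # the queue referencing mpich_09: pe_list, slots and hostlist
   qconf -sq <queue_with_mpich_09> | egrep 'pe_list|slots|hostlist'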

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



