[GE users] Allocate cores in the same enclosure

craffi dag at sonsorol.org
Wed Dec 17 13:59:32 GMT 2008


What is the output of "qstat -f"? Could it be that your queues are all
in an error state and thus have no free slots to offer?
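
A rough way to check for that, as a sketch only: the helper below (the name queues_in_error is made up for illustration) reads "qstat -f" style text on stdin and prints the queue instances whose state column contains "E". It assumes that queue-instance lines start with queue@host and that the state, when one is set, is the sixth whitespace-separated field.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine.
# Prints queue instances whose state column contains "E" (error).
# Assumptions: queue-instance lines begin with "queue@host", and the state,
# when present, is the sixth whitespace-separated field of that line.
queues_in_error() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# On a live cluster, pipe the real output in:
#   qstat -f | queues_in_error
# "qstat -f -explain E" should then show why an instance is in that state.
```

If this prints every instance of test1 and test2, that would be consistent with both PEs offering 0 slots.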

-Chris


On Dec 17, 2008, at 8:48 AM, Sofia Bassil wrote:

> Hi,
>
> OK. I think I have followed the instructions exactly, but I get the  
> error so something must be missing. Here is what I set up:
>
> 2 queues, test1 and test2
> 2 PEs, test_1 and test_2
> 2 hostgroups, @testSub1 and @testSub2
> (sorry about the poor naming)
>
> # qconf -sq test1
> qname test1 #test2
> hostlist @testSub1 #@testSub2
> seq_no 0
> load_thresholds np_load_avg=1.50
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list test_1 #test_2
> rerun FALSE
> slots 4
> tmpdir /tmp
> shell /bin/bash
> prolog NONE
> epilog NONE
> shell_start_mode unix_behavior
> starter_method.....terminate_method=NONE
> notify 00:00:60
> owner_list....calendar=NONE
> initial_state default
> s_rt....h_vmem=INFINITY
>
> The only thing that differs in the queue test2 is qname, hostlist,  
> and pe_list which are set to test2, @testSub2, and test_2  
> respectively. There are 4 cores per machine and a total of 8  
> machines in each hostgroup.
>
> # qconf -sp test_1
> pe_name test_1
> slots 24
> user_lists NONE
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule $round_robin
> control_slaves FALSE
> job_is_first_rank FALSE
> urgency_slots min
>
> The only thing that differs in the PE test_2 is pe_name which is set  
> to test_2.
>
> # qconf -shgrp @testSub1
> group_name @testSub1
> hostlist host1.my.domain host2.my.domain host3.my.domain host4.my.domain \
>            host5.my.domain host6.my.domain host7.my.domain host8.my.domain
>
> In testSub2 the group_name and the hostnames differ.
>
> When I run a job like this:
> $ qsub -pe "test*" 1 ./testSub.sh
> I get the error:
> cannot run in PE "test_1" because it only offers 0 slots
> cannot run in PE "test_2" because it only offers 0 slots
>
> $ qstat -j <jobid>|grep -v dropped
> ...
> sge_o_shell: /bin/bash
> sge_o_workdir: /home/me
> sge_o_host: myhost
> account: sge
> mail_list: me at myhost.my.domain
> notify: FALSE
> job_name: testSub.sh
> jobshare: 0
> env_list:
> script_file: ./testSub.sh
> parallel environment: test* range: 1
>                                         cannot run in queue "anotherq" because PE "test_1" is not in pe list
>                                         cannot run in PE "test_1" because it only offers 0 slots
>                                         cannot run in queue "anotherq" because PE "test_2" is not in pe list
>                                         cannot run in PE "test_2" because it only offers 0 slots
>
> $ cat testSub.sh
> echo hostname=`hostname`
> sleep 30
>
> //Sofia
>
>
> Gerald Ragghianti wrote:
>>
>> Hi Sofia,
>> Yes, I think that the solution described at
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=21159
>> is correct for your case.  We have a similar situation where we need to
>> group nodes according to which backend interconnect they use.  This
>> solution does work, but you will need to make sure that the queue
>> instances for each node are included in the two cluster queues that you
>> created (one queue for each PE that you have).
>>
>> - Gerald
>>
>> Sofia Bassil wrote:
>>
>>> Hi Chris and thanks for the reply. I will gladly supply configuration
>>> information, but my first question is whether this solution is even
>>> relevant for my problem.
>>>
>>> //Sofia
>>>
>>>
>>> craffi wrote:
>>>
>>>
>>>> 2 questions --
>>>>
>>>>   - What is the output of "qstat -f" ?
>>>>   - Have you attached the PEs to any queues?
>>>>
>>>> -Chris
>>>>
>>>>
>>>> On Dec 15, 2008, at 10:53 AM, Sofia Bassil wrote:
>>>>
>>>>
>>>>
>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to set up a node allocation scheme depending on network
>>>>> layout, following this thread:
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=21153
>>>>>
>>>>> I can't get it to work even though I have the same configuration as far
>>>>> as I can see. I am not using the same version of Grid Engine though; I
>>>>> am using GE 6.1u4.
>>>>>
>>>>> My cluster basically consists of a few blade enclosures and I want it to
>>>>> be possible to ask for cores in one enclosure, so that you can utilize
>>>>> the better bandwidth within the enclosure. My plan is to set up all
>>>>> enclosures with one hostgroup, one queue, and one PE each, like in the
>>>>> example.
>>>>>
>>>>> When I run a job like this:
>>>>> $ qsub -pe "test*" 1 ./testSub.sh
>>>>> I get the error:
>>>>> cannot run in PE "test_1" because it only offers 0 slots
>>>>> cannot run in PE "test_2" because it only offers 0 slots
>>>>>
>>>>> My first question is, is it still possible to configure this (in my
>>>>> version and in 6.2)?
>>>>> My second question is, for my problem, is this the right solution?
>>>>>
>>>>> Sincerely,
>>>>> Sofia Bassil
>>>>>
>>>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92775
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>>
>>
>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92973
