[GE users] Strange problem with resource quotas in 6.2u5

reuti reuti at staff.uni-marburg.de
Sun Mar 14 13:24:01 GMT 2010


Hi,

Am 14.03.2010 um 11:34 schrieb icaci:

> I think I've narrowed down the issue to the following hypothesis:
>
> Imagine the following scenario. There are 3 hosts A, B and C, and
> two cluster queues long.q and short.q configured as follows:
>
> short.q:
> seq_no  0,[A=100],[B=101],[C=102]
> slots   1
> pe_list mpi
> h_rt    1:0:0
>
> long.q:
> seq_no  0,[A=200],[B=201],[C=202]
> slots   1
> pe_list mpi
> h_rt    4:0:0
>
> Let the following RQS be in place:
> {
>   name         max_2_slots
>   description  NONE
>   enabled      TRUE
>   limit        users {*} to slots=2
> }
>
> This setup works fine as long as users submit serial jobs only.  
> Parallel jobs with h_rt <= 1:0:0 also execute fine. But the  
> following request:
>
> qsub -pe mpi 2 -l h_rt=2:0:0
>
> would give
>
> cannot run because it exceeds limit "username/////" in rule "max_2_slots/1"
>
> even if there are no other running jobs that belong to the same user.
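
For anyone who wants to reproduce this on a test cluster, a minimal
sketch (assuming the queue and RQS setup quoted above is already in
place, and that schedd_job_info is enabled so qstat shows the
scheduling messages):

  # the quota set and the user's current consumption against it
  qconf -srqs max_2_slots
  qquota -u $USER

  # submitted one at a time, serial and short parallel jobs run fine
  echo "sleep 60" | qsub -l h_rt=0:30:0
  echo "sleep 60" | qsub -pe mpi 2 -l h_rt=0:30:0

  # with no other jobs of this user left running, a parallel job that
  # only fits into long.q (h_rt > 1h) still stays in "qw" ...
  echo "sleep 60" | qsub -pe mpi 2 -l h_rt=2:0:0

  # ... and its scheduling info shows the bogus "exceeds limit ...
  # max_2_slots/1" message although qquota reports no usage
  qstat -j <job_id>
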
>
> But if long.q's sequence numbers are modified to be lower than
> short.q's, the job runs fine. It runs even when only just enough
> long.q instances with free slots have lower sequence numbers
> than the first short.q instance.
>
> The issue only affects parallel jobs and it doesn't matter if  
> queue_sort_method is set to "load" or to "seqno" in the scheduler's  
> configuration.
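
To play with the sequence-number workaround described above, one could
(sketch only, queue names as in the example) swap the per-host seq_no
values so that the long.q instances sort before the short.q ones, and
double-check what the scheduler is sorting by:

  # show the scheduler configuration, look at the queue_sort_method line
  qconf -ssconf | grep queue_sort_method

  # edit both cluster queues and exchange the seq_no lists, e.g. give
  # long.q 100-102 and short.q 200-202
  qconf -mq long.q
  qconf -mq short.q
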
>
> I tried to debug sge_qmaster by running it with dl 2 and dl 10, but
> the debug print code in parallel_limit_slots_by_time() segfaults
> the program:
>
>  28919  17654    worker001     checking limit slots
>  28920  17654    worker001 --> rqs_set_dynamical_limit() {
>  28921  17654    worker001 <-- rqs_set_dynamical_limit() ../libs/sched/sge_resource_quota_schedd.c 123 }
>  28922  17654    worker001 --> parallel_limit_slots_by_time() {
>  28923  17654    worker001     RD: 1
> t@10 (l@10) signal SEGV (no mapping at the fault address) in strlen at 0xfffffd7fff094b70
> 0xfffffd7fff094b70: strlen+0x0040:      movq     (%rsi),%rax
> dbx: read of 8 bytes at address 7fffffff failed
> dbx: warning: No frame with source found
> (dbx) where
> current thread: t@10
> dbx: read of 8 bytes at address 7fffffff failed
> =>[1] strlen(0x0, 0x0, 0xfffffd7ffd3eb4e0, 0x73, 0x0, 0x2220), at 0xfffffd7fff094b70
>   [2] _ndoprnt(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff0f04da
>   [3] vsnprintf(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff0f1af1
>   [4] rmon_mprintf_va(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x674c6a
>   [5] rmon_mprintf_info(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x674901
>   [6] parallel_limit_slots_by_time(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x53bbe4
>   [7] parallel_rqs_slots_by_time(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x53c6b1
>   [8] 0xfffffd7ffd3ec91c(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd3ec91c
>   [9] 0x1276e08(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x1276e08
> OS is SunOS fs001 5.10 Generic_142901-03 i86pc i386 i86p
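
If someone wants to reproduce the crash: the debug output is controlled
by the SGE_DEBUG_LEVEL environment variable, and the "dl" helper
mentioned above is the small shell function shipped as
$SGE_ROOT/util/dl.sh (dl.csh for csh users) that sets it. A rough
sketch, assuming a Bourne-type shell and a qmaster you can afford to
crash (the binary path/arch below is only an example):

  # stop the regular qmaster first, then in an interactive shell:
  . $SGE_ROOT/util/dl.sh                # defines the dl function
  dl 2                                  # exports SGE_DEBUG_LEVEL (and, IIRC, SGE_ND)
  $SGE_ROOT/bin/sol-amd64/sge_qmaster   # runs with debug output enabled

Judging from the stack trace, a NULL string seems to be handed through
rmon_mprintf_info() down to vsnprintf()/strlen(); Solaris libc, unlike
glibc, does not tolerate NULL for a %s argument, which would explain
why the crash only shows up when the debug printing is switched on.
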
>
> I'm not quite sure how to file an issue in the tracker and would be  
> grateful if some of the people here more closely associated with  
> the project could do it.
>
> I will try to dig deeper into the issue as soon as I manage to
> compile the code under Solaris.
>
> Regards,
>
> Hristo
>
> On 09.03.2010, at 16:16, reuti wrote:
>
>> Hi,
>>
>> Am 08.03.2010 um 11:03 schrieb icaci:
>>
>>> <snip>
>>> Resources just became available and I was able to conduct some
>>> tests. The results are beyond my comprehension. What I did was to
>>> submit the same sleeper parallel job, which requires 56 slots and
>>> h_rt=47:59:59. The RQS in place are:
>>>
>>> {
>>>  name         usr_long
>>>  description  Limits imposed on ordinary users
>>>  enabled      TRUE
>>>  limit        users {*} queues *_long.q to slots=48
>>> }
>>> {
>>>  name         usr_med+long
>>>  description  Limits imposed on ordinary users
>>>  enabled      TRUE
>>>  limit        users {*} queues *_med.q,*_long.q to slots=64
>>> }
>>>
>>> This setup works and the job ends in p_med.q as expected. But if I
>>> change usr_med+long to
>>> limit users {*} queues * to slots=64
>>> or to
>>> limit users {*} queues *.q to slots=64
>>> or just to
>>> limit users {*} to slots=64
>>> I get
>>> cannot run because it exceeds limit "hristo/////" in rule "usr_med+long/1"
>>>
>>> It might be connected somehow to issue 2538, but changing
>>> queue_sort_method to load does not make the job run.
>>>
>>> Either I don't understand how RQS filter matching works, or I should
>>> do some debugging. For now I will stick to specifying the full list
>>> of queues, which does the trick.
>>
>> maybe you can add to the issue. It's still not clear to me whether it's
>> related to the queue_sort_method and/or the usage of a wildcard.
>>
>> -- Reuti
>>
>>
>>> Thanks for your time,
>>>
>>> Hristo
>>>
>>>>
>>>>> Best regards,
>>>>>
>>>>> Hristo
>>>>>
>>>>>>
>>>>>>> But when I try to submit a simple 56-slot parallel job with
>>>>>>> something like:
>>>>>>>
>>>>>>> echo "sleep 30" | qsub -pe ompix8 56 -l h_rt=47:59:59
>>>>>>>
>>>>>>> the job stays in "qw" state and qstat shows the following:
>>>>>>> ...
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users/total"
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users/long"
>>>>>>> ...
>>>>>>> The 56-slot requirement clearly exceeds the 48-slot limit from the
>>>>>>> "users/long" rule, but for some obscure reason SGE thinks that it
>>>>>>> also exceeds the 64-slot limit from the "users/total" rule.
>>>>>>>
>>>>>>> I tried to split the ruleset into two separate rules:
>>>>>>>
>>>>>>> {
>>>>>>> name         users_long
>>>>>>> description  Limits imposed on ordinary users
>>>>>>> enabled      TRUE
>>>>>>> limit        users {*} queues *_long.q to slots=48
>>>>>>> }
>>>>>>> {
>>>>>>> name         users_total
>>>>>>> description  Limits imposed on ordinary users
>>>>>>> enabled      TRUE
>>>>>>> limit        users {*} to slots=64
>>>>>>> }
>>>>>>>
>>>>>>> Still no luck:
>>>>>>> ...
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>>>>>> ...
>>>>>>>
>>>>>>> The job runs fine if I disable the users_total rule.
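
One thing worth checking while the job is stuck in "qw" (a sketch, not
verified on 6.2u5): qquota should show how many slots the scheduler
thinks are already booked against each rule, which would make it clear
whether users_total is counted wrongly or merely reported wrongly:

  # per-user view of all resource quota rules and their current usage
  qquota -u hristo

  # the two rule sets side by side for comparison
  qconf -srqs users_long,users_total
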
>>>>>>>
>>>>>>> We used to run 6.2u2_1 before we upgraded to 6.2u5, and a colleague
>>>>>>> of mine insists that he was able to run 56-slot jobs before the
>>>>>>> upgrade. Have I stumbled upon a bug in 6.2u5, or did I miss the
>>>>>>> point in setting up my resource quotas?
>>>>>>>
>>>>>>> Any help would be greatly appreciated.
>>>>>>>
>>>>>>> Hristo
>
> --
> Dr Hristo Iliev
> Monte Carlo research group
> Faculty of Physics, University of Sofia
> 5 James Bourchier blvd, 1164 Sofia, Bulgaria
> http://cluster.phys.uni-sofia.bg/hristo/
