[GE users] Strange problem with resource quotas in 6.2u5

reuti reuti at staff.uni-marburg.de
Sun Mar 14 16:08:33 GMT 2010


Sorry, I sent that from the wrong window:

although you already changed queue_sort_method, I think the whole
thing is related to the issue I mentioned, http://gridengine.sunsource.net/issues/show_bug.cgi?id=2538
  , which might result in different side effects.

Does it work better if you remove the RQS and add it again?
And maybe you can add your findings to the above issue.
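
A minimal sketch of what I mean, assuming the rule set in question is the
max_2_slots example from your mail (dump it first so the same definition
can be re-added unchanged):

qconf -srqs max_2_slots > max_2_slots.rqs   # save the current rule set
qconf -drqs max_2_slots                     # delete it
qconf -Arqs max_2_slots.rqs                 # add it back from the saved file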

-- Reuti


On 14.03.2010 at 11:34, icaci wrote:

> Hi,
>
> I think I've narrowed down the issue to the following hypothesis:
>
> Imagine the following scenario. There are three hosts A, B and C, and
> two cluster queues, long.q and short.q, configured as follows:
>
> short.q:
> seq_no  0,[A=100],[B=101],[C=102]
> slots   1
> pe_list mpi
> h_rt    1:0:0
>
> long.q:
> seq_no  0,[A=200],[B=201],[C=202]
> slots   1
> pe_list mpi
> h_rt    4:0:0
>
> Let the following RQS be in place:
> {
>  name         max_2_slots
>  description  NONE
>  enabled      TRUE
>  limit        users {*} to slots=2
> }
>
> This setup works fine as long as users submit serial jobs only.
> Parallel jobs with h_rt <= 1:0:0 also execute fine. But the
> following request:
>
> qsub -pe mpi 2 -l h_rt=2:0:0
>
> would give
>
> cannot run because it exceeds limit "username/////" in rule
> "max_2_slots/1"
>
> even if there are no other running jobs that belong to the same user.
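>
> (A minimal way to see that message, assuming the job got ID 42 and
> schedd_job_info is enabled in the scheduler configuration, is:
>
> qstat -j 42
>
> and looking at the "scheduling info" section.)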
>
> But if long.q's sequence numbers are modified to be lower than
> short.q's, the job will run fine. It will run even if just enough
> instances of long.q with free slots have lower sequence numbers than
> the first short.q instance.
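>
> As a sketch of that change, using the host names from the example above
> (the value string is just an illustration; qconf -mq long.q for
> interactive editing works just as well):
>
> qconf -mattr queue seq_no "0,[A=50],[B=51],[C=52]" long.q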
>
> The issue only affects parallel jobs and it doesn't matter if
> queue_sort_method is set to "load" or to "seqno" in the scheduler's
> configuration.
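>
> For reference, the current setting can be checked and changed with:
>
> qconf -ssconf | grep queue_sort_method    # show the current value
> qconf -msconf                             # edit the scheduler configuration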
>
> I tried to debug sge_qmaster by running it with dl 2 and dl 10, but the
> debug print code in parallel_limit_slots_by_time() segfaults the
> program:
>
> 28919  17654    worker001     checking limit slots
> 28920  17654    worker001 --> rqs_set_dynamical_limit() {
> 28921  17654    worker001 <-- rqs_set_dynamical_limit() ../libs/sched/sge_resource_quota_schedd.c 123 }
> 28922  17654    worker001 --> parallel_limit_slots_by_time() {
> 28923  17654    worker001     RD: 1
> t@10 (l@10) signal SEGV (no mapping at the fault address) in strlen at 0xfffffd7fff094b70
> 0xfffffd7fff094b70: strlen+0x0040:      movq     (%rsi),%rax
> dbx: read of 8 bytes at address 7fffffff failed
> dbx: warning: No frame with source found
> (dbx) where
> current thread: t@10
> dbx: read of 8 bytes at address 7fffffff failed
> =>[1] strlen(0x0, 0x0, 0xfffffd7ffd3eb4e0, 0x73, 0x0, 0x2220), at 0xfffffd7fff094b70
>  [2] _ndoprnt(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff0f04da
>  [3] vsnprintf(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff0f1af1
>  [4] rmon_mprintf_va(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x674c6a
>  [5] rmon_mprintf_info(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x674901
>  [6] parallel_limit_slots_by_time(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x53bbe4
>  [7] parallel_rqs_slots_by_time(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x53c6b1
>  [8] 0xfffffd7ffd3ec91c(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd3ec91c
>  [9] 0x1276e08(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x1276e08
>
> OS is SunOS fs001 5.10 Generic_142901-03 i86pc i386 i86pc
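>
> For reference, a rough sketch of how the daemon was run with debugging
> enabled (the "default" cell name and the sol-amd64 directory are just my
> setup and may differ elsewhere):
>
> . $SGE_ROOT/default/common/settings.sh
> . $SGE_ROOT/util/dl.sh      # defines the dl function (sets SGE_DEBUG_LEVEL)
> dl 2                        # or dl 10
> dbx $SGE_ROOT/bin/sol-amd64/sge_qmaster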
>
> I'm not quite sure how to file an issue in the tracker and would be
> grateful if some of the people here who are more closely associated
> with the project could do it.
>
> I will try to dig deeper into the issue as soon as I manage to
> compile the code under Solaris.
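>
> (The plan, roughly, following the usual build sequence from the source
> tree; the exact aimk invocation will likely need adjusting for this
> machine:
>
> cd gridengine/source
> ./aimk -only-depend && scripts/zerodepend && ./aimk depend
> ./aimk
> )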
>
> Regards,
>
> Hristo
>
> On 09.03.2010, at 16:16, reuti wrote:
>
>> Hi,
>>
>> On 08.03.2010 at 11:03, icaci wrote:
>>
>>> <snip>
>>> Resources just became available and I was able to run some tests.
>>> The results are beyond my comprehension. What I did was submit the
>>> same sleeper parallel job, which requires 56 slots and
>>> h_rt=47:59:59. The RQS in place are:
>>>
>>> {
>>> name         usr_long
>>> description  Limits imposed on ordinary users
>>> enabled      TRUE
>>> limit        users {*} queues *_long.q to slots=48
>>> }
>>> {
>>> name         usr_med+long
>>> description  Limits imposed on ordinary users
>>> enabled      TRUE
>>> limit        users {*} queues *_med.q,*_long.q to slots=64
>>> }
>>>
>>> This setup works and the job ends in p_med.q as expected. But if I
>>> change usr_med+long to
>>> limit users {*} queues * to slots=64
>>> or to
>>> limit users {*} queues *.q to slots=64
>>> or just to
>>> limit users {*} to slots=64
>>> I get
>>> cannot run because it exceeds limit "hristo/////" in rule "usr_med+long/1"
>>>
>>> It might be connected somehow to issue 2538, but changing
>>> queue_sort_method to load does not make the job run.
>>>
>>> Either I don't understand how RQS filter matching works, or I need
>>> to do some debugging. For now I will stick to specifying the full
>>> list of queues, which does the trick.
>>
>> maybe you can add this to the issue. It's still not clear to me whether
>> it is related to queue_sort_method and/or the usage of a wildcard.
>>
>> -- Reuti
>>
>>
>>> Thanks for your time,
>>>
>>> Hristo
>>>
>>>>
>>>>> Best regards,
>>>>>
>>>>> Hristo
>>>>>
>>>>>>
>>>>>>> But when I try to submit a simple 56-slot parallel job with
>>>>>>> something like:
>>>>>>>
>>>>>>> echo "sleep 30" | qsub -pe ompix8 56 -l h_rt=47:59:59
>>>>>>>
>>>>>>> the job stays in "qw" state and qstat shows the following:
>>>>>>> ...
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users/total"
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users/long"
>>>>>>> ...
>>>>>>> The 56-slot requirement clearly exceeds the 48-slot limit from the
>>>>>>> "users/long" rule, but for some obscure reason SGE thinks that it
>>>>>>> also exceeds the 64-slot limit from the "users/total" rule.
>>>>>>>
>>>>>>> I tried to split the ruleset into two separate rules:
>>>>>>>
>>>>>>> {
>>>>>>> name         users_long
>>>>>>> description  Limits imposed on ordinary users
>>>>>>> enabled      TRUE
>>>>>>> limit        users {*} queues *_long.q to slots=48
>>>>>>> }
>>>>>>> {
>>>>>>> name         users_total
>>>>>>> description  Limits imposed on ordinary users
>>>>>>> enabled      TRUE
>>>>>>> limit        users {*} to slots=64
>>>>>>> }
>>>>>>>
>>>>>>> Still no luck:
>>>>>>> ...
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule
>>>>>>> "users_total/1"
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule
>>>>>>> "users_total/1"
>>>>>>> cannot run because it exceeds limit "hristo/////" in rule
>>>>>>> "users_total/1"
>>>>>>> ...
>>>>>>>
>>>>>>> The job runs fine if I disable the users_total rule.
>>>>>>>
>>>>>>> We used to run 6.2u2_1 before we upgraded to 6.2u5, and a colleague
>>>>>>> of mine insists that he was able to run 56-slot jobs before the
>>>>>>> upgrade. Have I stumbled upon a bug in 6.2u5, or have I missed the
>>>>>>> point in setting up my resource quotas?
>>>>>>>
>>>>>>> Any help would be greatly appreciated.
>>>>>>>
>>>>>>> Hristo
>>>>>>> --
>>>>>>> Dr Hristo Iliev
>>>>>>> Monte Carlo research group
>>>>>>> Faculty of Physics, University of Sofia
>>>>>>> 5 James Bourchier blvd, 1164 Sofia, Bulgaria
>>>>>>> http://cluster.phys.uni-sofia.bg/hristo/
>
> --
> Dr Hristo Iliev
> Monte Carlo research group
> Faculty of Physics, University of Sofia
> 5 James Bourchier blvd, 1164 Sofia, Bulgaria
> http://cluster.phys.uni-sofia.bg/hristo/
>
