[GE users] Strange problem with resource quotas in 6.2u5

icaci hristo@mc.phys.uni-sofia.bg
Sun Mar 14 23:38:43 GMT 2010


On 14.03.2010, at 18:08, reuti wrote:

> Sorry, wrong window sent:
>
> although you already changed queue_sort_method, I think the whole
> issue is related to the one I mentioned in http://gridengine.sunsource.net/issues/show_bug.cgi?id=2538
> which might result in different side effects.
>
>

I'm still not quite convinced that our issue stems from bug #2538.

> Does it work better when you remove the RQS and add it again?
> And maybe you can add your findings to the above issue.

No, it doesn't work any better if I remove the RQS and then re-add it. It doesn't even work if I remove all the other RQSes. The bad behaviour is very persistent and only affects parallel jobs when they are matched by a rule that selects more than one cluster queue. However, I've just found a workaround: one has to specify the destination queue explicitly with "-q long.q".
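
For instance, taking the toy scenario quoted below, the failing request goes through if it is submitted as something along these lines:

echo "sleep 30" | qsub -pe mpi 2 -l h_rt=2:0:0 -q long.q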

Regards,

Hristo

>
> -- Reuti
>
>
> On 14.03.2010, at 11:34, icaci wrote:
>
>> Hi,
>>
>> I think I've narrowed down the issue to the following hypothesis:
>>
>> Imagine the following scenario. There are three hosts A, B and C, and
>> two cluster queues, long.q and short.q, configured as follows:
>>
>> short.q:
>> seq_no  0,[A=100],[B=101],[C=102]
>> slots   1
>> pe_list mpi
>> h_rt    1:0:0
>>
>> long.q:
>> seq_no  0,[A=200],[B=201],[C=202]
>> slots   1
>> pe_list mpi
>> h_rt    4:0:0
>>
>> Let the following RQS be in place:
>> {
>> name         max_2_slots
>> description  NONE
>> enabled      TRUE
>> limit        users {*} to slots=2
>> }
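>>
>> (A rule set like this can be added interactively with "qconf -arqs
>> max_2_slots", or loaded from a file, say a hypothetical
>> max_2_slots.rqs, with "qconf -Arqs max_2_slots.rqs".)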
>>
>> This setup works fine as long as users submit serial jobs only.
>> Parallel jobs with h_rt <= 1:0:0 also execute fine. But the
>> following request:
>>
>> qsub -pe mpi 2 -l h_rt=2:0:0
>>
>> would give
>>
>> cannot run because it exceeds limit "username/////" in rule "max_2_slots/1"
>>
>> even if there are no other running jobs that belong to the same user.
>>
>> But if long.q's sequence numbers are modified to be lower than
>> short.q's, the job will run fine. It will run even if just enough
>> instances of long.q with free slots have lower sequence numbers than
>> the first short.q instance.
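>>
>> For illustration, the kind of change I mean, applied with
>> "qconf -mq long.q" (the values 50-52 are arbitrary, just lower than
>> short.q's 100-102):
>>
>> long.q:
>> seq_no  0,[A=50],[B=51],[C=52]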
>>
>> The issue only affects parallel jobs and it doesn't matter if
>> queue_sort_method is set to "load" or to "seqno" in the scheduler's
>> configuration.
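>>
>> (queue_sort_method is a scheduler configuration parameter and can be
>> switched between the two values with, e.g., "qconf -msconf".)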
>>
>> I tried to debug sge_qmaster by running it with dl 2 and dl 10, but
>> the debug print code in parallel_limit_slots_by_time() segfaults the
>> program:
>>
>> 28919  17654    worker001     checking limit slots
>> 28920  17654    worker001 --> rqs_set_dynamical_limit() {
>> 28921  17654    worker001 <-- rqs_set_dynamical_limit() ../libs/sched/sge_resource_quota_schedd.c 123 }
>> 28922  17654    worker001 --> parallel_limit_slots_by_time() {
>> 28923  17654    worker001     RD: 1
>> t@10 (l@10) signal SEGV (no mapping at the fault address) in strlen at 0xfffffd7fff094b70
>> 0xfffffd7fff094b70: strlen+0x0040:      movq     (%rsi),%rax
>> dbx: read of 8 bytes at address 7fffffff failed
>> dbx: warning: No frame with source found
>> (dbx) where
>> current thread: t@10
>> dbx: read of 8 bytes at address 7fffffff failed
>> =>[1] strlen(0x0, 0x0, 0xfffffd7ffd3eb4e0, 0x73, 0x0, 0x2220), at 0xfffffd7fff094b70
>>   [2] _ndoprnt(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff0f04da
>>   [3] vsnprintf(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff0f1af1
>>   [4] rmon_mprintf_va(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x674c6a
>>   [5] rmon_mprintf_info(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x674901
>>   [6] parallel_limit_slots_by_time(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x53bbe4
>>   [7] parallel_rqs_slots_by_time(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x53c6b1
>>   [8] 0xfffffd7ffd3ec91c(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd3ec91c
>>   [9] 0x1276e08(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x1276e08
>> OS is SunOS fs001 5.10 Generic_142901-03 i86pc i386 i86p
>>
>> I'm not quite sure how to file an issue in the tracker and would be
>> grateful if someone here more closely associated with the project
>> could do it.
>>
>> I'll try to dig deeper into the issue as soon as I manage to compile
>> the code under Solaris.
>>
>> Regards,
>>
>> Hristo
>>
>> On 09.03.2010, at 16:16, reuti wrote:
>>
>>> Hi,
>>>
>>> On 08.03.2010, at 11:03, icaci wrote:
>>>
>>>> <snip>
>>>> Resources just became available and I was able to conduct some
>>>> tests. The results are beyond my comprehension. What I did was
>>>> submit the same sleeper parallel job, which requires 56 slots and
>>>> h_rt=47:59:59. The RQSes in place are:
>>>>
>>>> {
>>>> name         usr_long
>>>> description  Limits imposed on ordinary users
>>>> enabled      TRUE
>>>> limit        users {*} queues *_long.q to slots=48
>>>> }
>>>> {
>>>> name         usr_med+long
>>>> description  Limits imposed on ordinary users
>>>> enabled      TRUE
>>>> limit        users {*} queues *_med.q,*_long.q to slots=64
>>>> }
>>>>
>>>> This setup works and the job ends in p_med.q as expected. But if I
>>>> change usr_med+long to
>>>> limit users {*} queues * to slots=64
>>>> or to
>>>> limit users {*} queues *.q to slots=64
>>>> or just to
>>>> limit users {*} to slots=64
>>>> I get
>>>> cannot run because it exceeds limit "hristo/////" in rule "usr_med+long/1"
>>>>
>>>> It might somehow be connected to issue 2538, but changing
>>>> queue_sort_method to load does not make the job run.
>>>>
>>>> Either I don't understand how RQS filter matching works, or I should
>>>> do some debugging. For now I'll stick to specifying the full list of
>>>> queues, which does the trick.
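>>>>
>>>> That is, something along the lines of the following, where *_short.q
>>>> is only a guess at the remaining queue names here:
>>>>
>>>> limit        users {*} queues *_short.q,*_med.q,*_long.q to slots=64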
>>>
>>> Maybe you can add to the issue. It's still not clear to me whether it's
>>> related to the queue_sort_method and/or the usage of a wildcard.
>>>
>>> -- Reuti
>>>
>>>
>>>> Thanks for your time,
>>>>
>>>> Hristo
>>>>
>>>>>>>> But when I try to submit a simple 56-slot parallel job with
>>>>>>>> something like:
>>>>>>>>
>>>>>>>> echo "sleep 30" | qsub -pe ompix8 56 -l h_rt=47:59:59
>>>>>>>>
>>>>>>>> the job stays in "qw" state and qstat shows the following:
>>>>>>>> ...
>>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users/total"
>>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users/long"
>>>>>>>> ...
>>>>>>>> The 56-slot requirement clearly exceeds the 48-slot limit from the
>>>>>>>> "users/long" rule, but for some obscure reason SGE thinks that it
>>>>>>>> also exceeds the 64-slot limit from the "users/total" rule.
>>>>>>>>
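>>>>>>>> (For context, roughly what the combined "users" rule set looks
>>>>>>>> like, reconstructed from the rule names above; the *_long.q
>>>>>>>> filter and the description are my guesses:)
>>>>>>>>
>>>>>>>> {
>>>>>>>> name         users
>>>>>>>> description  Limits imposed on ordinary users
>>>>>>>> enabled      TRUE
>>>>>>>> limit        name long users {*} queues *_long.q to slots=48
>>>>>>>> limit        name total users {*} to slots=64
>>>>>>>> }
>>>>>>>>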
>>>>>>>> I tried to split the rule set into two separate rule sets:
>>>>>>>>
>>>>>>>> {
>>>>>>>> name         users_long
>>>>>>>> description  Limits imposed on ordinary users
>>>>>>>> enabled      TRUE
>>>>>>>> limit        users {*} queues *_long.q to slots=48
>>>>>>>> }
>>>>>>>> {
>>>>>>>> name         users_total
>>>>>>>> description  Limits imposed on ordinary users
>>>>>>>> enabled      TRUE
>>>>>>>> limit        users {*} to slots=64
>>>>>>>> }
>>>>>>>>
>>>>>>>> Still no luck:
>>>>>>>> ...
>>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>>>>>>> ...
>>>>>>>>
>>>>>>>> The job runs fine if I disable the users_total rule.
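>>>>>>>>
>>>>>>>> (To reproduce the toggle: edit the set with "qconf -mrqs
>>>>>>>> users_total" and set "enabled FALSE", or remove it entirely with
>>>>>>>> "qconf -drqs users_total".)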
>>>>>>>>
>>>>>>>> We used to run 6.2u2_1 before we upgraded to 6.2u5, and a colleague
>>>>>>>> of mine insists that he was able to run 56-slot jobs before the
>>>>>>>> upgrade. Have I stumbled upon a bug in 6.2u5, or did I miss the
>>>>>>>> point in setting up my resource quotas?
>>>>>>>>
>>>>>>>> Any help would be greatly appreciated.
>>>>>>>>
>>>>>>>> Hristo

--
Dr Hristo Iliev
Monte Carlo research group
Faculty of Physics, University of Sofia
5 James Bourchier blvd, 1164 Sofia, Bulgaria
http://cluster.phys.uni-sofia.bg/hristo/
