[GE users] Strange problem with resource quotas in 6.2u5

icaci hristo at mc.phys.uni-sofia.bg
Sun Mar 14 10:34:17 GMT 2010


Hi,

I think I've narrowed down the issue to the following hypothesis:

Imagine the following scenario: there are three hosts A, B, and C, and two cluster queues, long.q and short.q, configured as follows:

short.q:
seq_no  0,[A=100],[B=101],[C=102]
slots   1
pe_list mpi
h_rt    1:0:0

long.q:
seq_no  0,[A=200],[B=201],[C=202]
slots   1
pe_list mpi
h_rt    4:0:0

Let the following RQS be in place:
{
  name         max_2_slots
  description  NONE
  enabled      TRUE
  limit        users {*} to slots=2
}
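
For anyone who wants to compare this against their own cluster, the whole configuration can be dumped with qconf (the queue and rule names are just the ones from my example above):

qconf -sq short.q | egrep 'seq_no|slots|pe_list|h_rt'
qconf -sq long.q | egrep 'seq_no|slots|pe_list|h_rt'
qconf -srqs max_2_slots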

This setup works fine as long as users submit serial jobs only. Parallel jobs with h_rt <= 1:0:0 also execute fine. But the following request:

qsub -pe mpi 2 -l h_rt=2:0:0

would give

cannot run because it exceeds limit "username/////" in rule "max_2_slots/1"

even if there are no other running jobs that belong to the same user.
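
For reference, the scheduler's reasoning can be read back with qstat -j (this needs schedd_job_info set to true in the scheduler configuration; 42 below is just a placeholder job id):

echo "sleep 60" | qsub -pe mpi 2 -l h_rt=2:0:0
qstat -j 42 | grep "cannot run"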

But if long.q's sequence numbers are modified to be lower than short.q's, the job runs fine. In fact, it runs even when only just enough long.q instances with free slots (enough to satisfy the slot request) have sequence numbers lower than that of the first short.q instance.
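
The sequence numbers can be flipped by editing the queue with qconf; the exact values below are arbitrary, only the relative order with respect to short.q matters:

qconf -mq long.q
# then change
#   seq_no  0,[A=200],[B=201],[C=202]
# to something like
#   seq_no  0,[A=50],[B=51],[C=52]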

The issue affects parallel jobs only, and it makes no difference whether queue_sort_method is set to "load" or to "seqno" in the scheduler configuration.
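
For completeness, the sort method can be inspected and switched like this (neither value makes a difference here):

qconf -ssconf | grep queue_sort_method
qconf -msconf   # edit queue_sort_method to either seqno or load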

I tried to debug sge_qmaster by running it with "dl 2" and "dl 10", but the debug print code in parallel_limit_slots_by_time() segfaults the program:

 28919  17654    worker001     checking limit slots
 28920  17654    worker001 --> rqs_set_dynamical_limit() {
 28921  17654    worker001 <-- rqs_set_dynamical_limit() ../libs/sched/sge_resource_quota_schedd.c 123 }
 28922  17654    worker001 --> parallel_limit_slots_by_time() {
 28923  17654    worker001     RD: 1
t@10 (l@10) signal SEGV (no mapping at the fault address) in strlen at 0xfffffd7fff094b70
0xfffffd7fff094b70: strlen+0x0040:      movq     (%rsi),%rax
dbx: read of 8 bytes at address 7fffffff failed
dbx: warning: No frame with source found
(dbx) where
current thread: t@10
dbx: read of 8 bytes at address 7fffffff failed
=>[1] strlen(0x0, 0x0, 0xfffffd7ffd3eb4e0, 0x73, 0x0, 0x2220), at 0xfffffd7fff094b70 
  [2] _ndoprnt(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff0f04da 
  [3] vsnprintf(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff0f1af1 
  [4] rmon_mprintf_va(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x674c6a 
  [5] rmon_mprintf_info(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x674901 
  [6] parallel_limit_slots_by_time(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x53bbe4 
  [7] parallel_rqs_slots_by_time(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x53c6b1 
  [8] 0xfffffd7ffd3ec91c(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd3ec91c 
  [9] 0x1276e08(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x1276e08

OS is SunOS fs001 5.10 Generic_142901-03 i86pc i386 i86p
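
Judging from the trace, a NULL string seems to reach a %s conversion in the debug print, and Solaris libc's strlen() then faults on it. Here is a generic C sketch of that failure mode and of the kind of guard that would avoid it (this is not the actual SGE code; debug_print and limit_name are made-up names):

#include <stdio.h>

/* Solaris libc runs strlen() on %s arguments without a NULL check,
 * so a NULL string segfaults exactly like in the dbx trace above
 * (glibc would print "(null)" instead). */
static void debug_print(const char *limit_name)
{
    /* unguarded form -- faults inside strlen() when limit_name == NULL:
     *   printf("RD: checking limit %s\n", limit_name);
     */

    /* guarded form avoids the crash */
    printf("RD: checking limit %s\n", limit_name ? limit_name : "(null)");
}

int main(void)
{
    debug_print(NULL);  /* with the unguarded line enabled, this reproduces the strlen(0x0) crash */
    return 0;
}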

I'm not quite sure how to file an issue in the tracker and would be grateful if someone here who is more closely associated with the project could do it.

I will try to dig deeper into the issue as soon as I manage to compile the code under Solaris.

Regards,

Hristo
 
On 09.03.2010, at 16:16, reuti wrote:

> Hi,
> 
> On 08.03.2010 at 11:03, icaci wrote:
> 
>> <snip>
>> Resources just became available and I was able to conduct some tests.
>> The results are beyond my comprehension. What I did was to submit the
>> same sleeper parallel job, which requires 56 slots and h_rt=47:59:59.
>> The RQS in place are:
>> 
>> {
>>  name         usr_long
>>  description  Limits imposed on ordinary users
>>  enabled      TRUE
>>  limit        users {*} queues *_long.q to slots=48
>> }
>> {
>>  name         usr_med+long
>>  description  Limits imposed on ordinary users
>>  enabled      TRUE
>>  limit        users {*} queues *_med.q,*_long.q to slots=64
>> }
>> 
>> This setup works and the job ends up in p_med.q as expected. But if I
>> change usr_med+long to
>> limit users {*} queues * to slots=64
>> or to
>> limit users {*} queues *.q to slots=64
>> or just to
>> limit users {*} to slots=64
>> I get
>> cannot run because it exceeds limit "hristo/////" in rule "usr_med+long/1"
>> 
>> It might be connected somehow to issue 2538, but changing
>> queue_sort_method to "load" does not make the job run.
>> 
>> Either I don't understand how RQS filter matching works, or I should
>> do some debugging. For now I will stick to specifying the full list of
>> queues, which does the trick.
> 
> Maybe you can add this to the issue. It's still not clear to me whether
> it is related to queue_sort_method and/or the usage of a wildcard.
> 
> -- Reuti
> 
> 
>> Thanks for your time,
>> 
>> Hristo
>> 
>>> 
>>>> Best regards,
>>>> 
>>>> Hristo
>>>> 
>>>>> 
>>>>>> But when I try to submit a simple 56-slot parallel job with
>>>>>> something like:
>>>>>> 
>>>>>> echo "sleep 30" | qsub -pe ompix8 56 -l h_rt=47:59:59
>>>>>> 
>>>>>> the job stays in "qw" state and qstat shows the following:
>>>>>> ...
>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users/total"
>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users/long"
>>>>>> ...
>>>>>> The 56-slot requirement clearly exceeds the 48-slot limit from the
>>>>>> "users/long" rule, but for some obscure reason SGE thinks that it
>>>>>> also exceeds the 64-slot limit from the "users/total" rule.
>>>>>> 
>>>>>> I tried to split the ruleset into two separate rules:
>>>>>> 
>>>>>> {
>>>>>> name         users_long
>>>>>> description  Limits imposed on ordinary users
>>>>>> enabled      TRUE
>>>>>> limit        users {*} queues *_long.q to slots=48
>>>>>> }
>>>>>> {
>>>>>> name         users_total
>>>>>> description  Limits imposed on ordinary users
>>>>>> enabled      TRUE
>>>>>> limit        users {*} to slots=64
>>>>>> }
>>>>>> 
>>>>>> Still no luck:
>>>>>> ...
>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>>>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>>>>> ...
>>>>>> 
>>>>>> The job runs fine if I disable the users_total rule.
>>>>>> 
>>>>>> We used to run 6.2u2_1 before we upgraded to 6.2u5, and a colleague
>>>>>> of mine insists that he was able to run 56-slot jobs before the
>>>>>> upgrade. Have I stumbled upon a bug in 6.2u5, or did I miss the point
>>>>>> in setting up my resource quotas?
>>>>>> 
>>>>>> Any help would be greatly appreciated.
>>>>>> 
>>>>>> Hristo

--
Dr Hristo Iliev
Monte Carlo research group
Faculty of Physics, University of Sofia
5 James Bourchier blvd, 1164 Sofia, Bulgaria
http://cluster.phys.uni-sofia.bg/hristo/
