[GE users] SGE scheduler/qmaster performance

cjf001 john.foley at motorola.com
Thu Apr 22 19:23:57 BST 2010


OK, interesting....

The qping output shows that I've apparently got some problems:

root at lxadml2# qping -i 5 -info lxadml2 735 qmaster 1
04/22/2010 13:14:27:
SIRM version:             0.1
SIRM message id:          1
start time:               04/21/2010 08:25:26 (1271856326)
run time [s]:             103741
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 189
status:                   1
info:                     MAIN: E (103740.47) | signaler000: E (103619.03) | event_master000: E (0.05) | timer000: E (5.85) | worker000: E (5.84) | worker001: E (5.84) | listener000: E (0.24) | listener001: E (5.25) | scheduler000: E (5.84) | WARNING
malloc:                   arena(0) |ordblks(1) | smblks(0) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(0) | uordblks(0) | fordblks(0) | keepcost(0)
Monitor:                  disabled


"E"rrors all over the place! Status "1" means "One or more threads has reached warning timeout."
according to the man page. BTW, the status occasionally changes between "1" and "2" during
an interval run of qping.
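
For anyone wanting to watch these values over time, here is a minimal Python sketch that splits the "info:" line into per-thread entries. The pattern is inferred only from the output pasted above (name, one state letter, a number in parentheses), and reading that number as a time in seconds is an assumption, not something taken from the man page. The names parse_info and THREAD_RE are made up for this example.

```python
import re

# Assumed layout, inferred from the pasted output: "<thread>: <state letter> (<number>)"
THREAD_RE = re.compile(r"(\w+):\s+([A-Z])\s+\(([\d.]+)\)")

def parse_info(line):
    """Return a list of (thread, state, value) tuples from a qping 'info:' line."""
    # Drop the leading "info:" label if present so it is not mistaken for a thread name.
    rest = line.partition("info:")[2] if "info:" in line else line
    return [(name, state, float(value)) for name, state, value in THREAD_RE.findall(rest)]

sample = (
    "info:                     MAIN: E (103740.47) | signaler000: E (103619.03) | "
    "event_master000: E (0.05) | timer000: E (5.85) | worker000: E (5.84) | "
    "worker001: E (5.84) | listener000: E (0.24) | listener001: E (5.25) | "
    "scheduler000: E (5.84) | WARNING"
)

threads = parse_info(sample)

# List threads with the largest value first.
for name, state, value in sorted(threads, key=lambda t: t[2], reverse=True):
    print(f"{name:<18} {state} {value:>12.2f}")
```

Feeding it successive lines from "qping -i 5 -info ..." would make it easier to see which threads change state between intervals.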

So, it appears that all the threads are running slowly. I'll try turning profiling
on. I believe I've done that in the past, and it generates lots of output.
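
To have a baseline to compare against while profiling is on, the qstat test from the original message could be timed with a small Python sketch like the one below. It only measures wall-clock time per run and says nothing about where qmaster spends that time. The function name time_command and the default of three runs are arbitrary choices for this example.

```python
import subprocess
import sys
import time

def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time is of interest here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations

# Example call on a host where qstat is in the PATH (not executed here):
#   for d in time_command(["qstat"]):
#       print(f"{d:.2f} s")
```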

    Thanks,

      John


rayson wrote:
> Maybe you can try scheduler profiling or qping to dump the runtime
> status of qmaster.
>
> http://gridengine.info/2006/09/13/performance-profiling-information-added-to-cvs
> http://wiki.gridengine.info/wiki/index.php/GridEngine_qping
>
> Rayson
>
>
>
> On 4/22/10, cjf001 <john.foley at motorola.com> wrote:
>> SGEers:
>>
>> We're running SGEv6.2u2 here, and just this week I've started to
>> notice (and so have the users :(  ) very slow response from the
>> SGE qmaster. Most commands are very slow to respond, but the test
>> I'm using is simply running a qstat. Right now it's taking about
>> 20 seconds to respond.
>>
>> Now, one thing that *may* have changed recently is the number of
>> jobs in the system (ie, running + pending jobs). I never really
>> tracked this number before, but right now, with the ~20 second
>> qstat response time, we have about 7700 jobs in the system. This
>> could be a lot more than we're used to, as one of our groups has
>> been submitting a ton of jobs recently.
>>
>> So, my questions are....
>>
>> - does it make sense that the qmaster/scheduler response would slow
>>    down with more jobs in the system ?
>>
>> - does anyone else run a comparable system, with this many or more
>>    jobs in the system, and if so, what are your qstat times ?
>>
>> - if this doesn't make sense (ie, isn't normal), what should I be looking
>>    for ?  The qmaster's messages file doesn't show anything abnormal.
>>    The system's message file (RHELv5.2) doesn't show anything abnormal.
>>    The rest of the cluster/network/etc seems to be running normally (ie,
>>    doesn't appear to be any network/NIS/DNS type issues). Is there a
>>    way to narrow down where the time is being spent ?
>>
>>       Thanks for any thoughts !
>>
>>               John
>>
>>
>> FYI, "top" on the qmaster machine shows this right now....
>>
>>
>> top - 09:24:54 up 1 day,  1:00,  5 users,  load average: 1.48, 1.44, 1.38
>> Tasks: 107 total,   2 running, 105 sleeping,   0 stopped,   0 zombie
>> Cpu(s): 20.3%us, 14.6%sy,  0.0%ni, 57.6%id,  0.2%wa,  0.2%hi,  7.2%si,  0.0%st
>> Mem:   3954768k total,  1594728k used,  2360040k free,   179320k buffers
>> Swap: 10241428k total,        0k used, 10241428k free,   545956k cached
>>
>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>   6509 sgeadm    15   0  606m 380m 3224 S   76  9.8   1212:26 sge_qmaster
>>   6334 root      24   0  129m 2520 1428 S    0  0.1   0:51.74 automount
>>   9159 root      18   0 22568 3032 1448 S    0  0.1   0:00.02 qstat
>>   9165 root      15   0 18908 1412 1052 R    0  0.0   0:00.02 top
>>      1 root      15   0 10324  760  632 S    0  0.0   0:00.42 init
>>
>>
>>
>>
>>
>> --
>> ###########################################################################
>> # John Foley                          # Location:  IL93-E1-21S            #
>> # IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
>> # Antenna & Mechanical Simulation Grp #    Email: john.foley at motorola.com #
>> # Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
>> # 600 North US Highway 45             #      Fax: (847) 523-5767          #
>> # Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
>> ###########################################################################
>>                (this email sent using SeaMonkey on Windows)
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=254466
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
>



