[GE users] qmaster memory problem (leak/bug?)

whitingeric eric.whiting at inl.gov
Wed Jun 24 16:14:26 BST 2009


That issue describes the same problem. I should have searched the issue
tracker before I posted to the mailing list.
Thanks.
eric
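
For anyone who finds this thread later: the staged restart described in the
quoted message below (disable all execd hosts, restart qmaster, then re-enable
the hosts at about one per second) could be scripted roughly as follows. This
is only a sketch. It assumes the hosts are disabled and enabled through their
queue instances with `qmod -d` / `qmod -e`, which the thread does not confirm.
`enable_hosts_slowly`, `QMOD`, and `DELAY` are names made up for this example.

```shell
#!/bin/sh
# Sketch: re-enable queue instances one host at a time after a qmaster restart.
# Assumption: "disabling a host" means disabling its queue instances via qmod.

QMOD="${QMOD:-qmod}"   # command used to enable queues; override for a dry run
DELAY="${DELAY:-1}"    # seconds to wait between hosts (the thread used 1)

# enable_hosts_slowly HOST...
# Enables every queue instance on each host, pausing DELAY seconds in between.
enable_hosts_slowly() {
    for host in "$@"; do
        # "*@host" is the wildcard form for all queue instances on that host.
        $QMOD -e "*@${host}" || echo "failed to enable ${host}" >&2
        sleep "$DELAY"
    done
}
```

On a real cluster this would probably be called as
`enable_hosts_slowly $(qconf -sel)`, after `qmod -d '*'` and the qmaster
restart. It only works around the symptom; the thread points to issue 3050 for
the underlying problem.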


crei wrote:
> There were new issues reported. Do you think the described problem might
> be yours?
>
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=3050
>
> Christian
>
>
> On 06/24/09 16:17, whitingeric wrote:
>   
>> Messages below. They show 3 start/stop cycles. The final start
>> worked: I had disabled all execd hosts before the startup, then
>> slowly (1 per second, from a script) enabled the execd hosts again.
>>
>> 06/23/2009 12:20:50|  main|helios|I|controlled shutdown 6.2u2_1
>>
>> 06/23/2009 12:21:12|  main|helios|I|read job database with 30 entries in 0 seconds
>> 06/23/2009 12:21:12|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
>> 06/23/2009 12:21:12|  main|helios|I|qmaster hard descriptor limit is set to 8192
>> 06/23/2009 12:21:12|  main|helios|I|qmaster soft descriptor limit is set to 8192
>> 06/23/2009 12:21:12|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
>> 06/23/2009 12:21:12|  main|helios|I|qmaster will accept max. 99 dynamic event clients
>> 06/23/2009 12:21:12|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
>> 06/23/2009 12:21:12|worker|helios|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
>>
>> 06/23/2009 12:23:35|  main|helios|E|jvm thread is not running
>> 06/23/2009 12:23:44|  main|helios|I|controlled shutdown 6.2u2_1
>>
>> 06/23/2009 12:24:14|  main|helios|I|read job database with 29 entries in 0 seconds
>> 06/23/2009 12:24:14|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
>> 06/23/2009 12:24:14|  main|helios|I|qmaster hard descriptor limit is set to 8192
>> 06/23/2009 12:24:14|  main|helios|I|qmaster soft descriptor limit is set to 8192
>> 06/23/2009 12:24:14|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
>> 06/23/2009 12:24:14|  main|helios|I|qmaster will accept max. 99 dynamic event clients
>> 06/23/2009 12:24:14|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
>>
>> 06/23/2009 12:26:15|  main|helios|E|jvm thread is not running
>> 06/23/2009 12:26:26|  main|helios|I|controlled shutdown 6.2u2_1
>>
>> 06/23/2009 12:28:31|  main|helios|I|read job database with 29 entries in 0 seconds
>> 06/23/2009 12:28:31|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
>> 06/23/2009 12:28:31|  main|helios|I|qmaster hard descriptor limit is set to 8192
>> 06/23/2009 12:28:31|  main|helios|I|qmaster soft descriptor limit is set to 8192
>> 06/23/2009 12:28:31|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
>> 06/23/2009 12:28:31|  main|helios|I|qmaster will accept max. 99 dynamic event clients
>> 06/23/2009 12:28:31|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
>>
>> QPING INFO BELOW:
>>
>> $ qping -info helios 6444 qmaster 1
>> 06/24/2009 08:16:06:
>> SIRM version:             0.1
>> SIRM message id:          1
>> start time:               06/23/2009 12:28:31 (1245781711)
>> run time [s]:             71255
>> messages in read buffer:  0
>> messages in write buffer: 0
>> nr. of connected clients: 150
>> status:                   1
>> info:                     MAIN: E (71255.40) | signaler000: E (71255.33) | event_master000: E (0.70) | timer000: E (5.71) | worker000: E (0.76) | worker001: E (0.70) | listener000: E (2.34) | listener001: E (0.70) | scheduler000: E (7.69) | WARNING
>> malloc:                   arena(135168) |ordblks(1) | smblks(0) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(0) | uordblks(144) | fordblks(135024) | keepcost(135024)
>> Monitor:                  disabled
>>
>>
>>
>> crei wrote:
>>     
>>> Any information/warnings/errors in the qmaster messages file?
>>>
>>> What does qping -info against the qmaster daemon show?
>>>
>>>
>>> On 06/23/09 23:10, whitingeric wrote:
>>>> See below for a qmaster that looks lost...
>>>>
>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>  6642 sgeadmin  20   0 28.2g  13g 1968 S   77 83.2 207:39.01 sge_qmaster
>>>>
>>>> 6.2u2_1 -- installed about 2 months ago.
>>>>
>>>> qmaster usually runs about 20M of RSS -- then sometimes it starts to run
>>>> away -- like right now... See above for 28G VIRT 13G RSS.  (about 150
>>>> execd nodes)
>>>>
>>>> I kill sge and restart. Same thing -- it starts small and runs away.
>>>> Below you can see it run away....
>>>>
>>>>
>>>> # /etc/init.d/sgemaster  stop
>>>>    Shutting down Grid Engine qmaster
>>>>
>>>> # /etc/init.d/sgemaster  start
>>>>    starting sge_qmaster
>>>>
>>>>
>>>> # while(true);do ps -aeo 'user,pid,rss,cmd' |grep qmast |grep admin;sleep 10;done
>>>> sgeadmin 16493 6083100 /local/sge/bin/lx24-amd64/sge_qmaster
>>>> sgeadmin 16493 3261316 /local/sge/bin/lx24-amd64/sge_qmaster
>>>> sgeadmin 16493 5192468 /local/sge/bin/lx24-amd64/sge_qmaster
>>>> sgeadmin 16493 6947132 /local/sge/bin/lx24-amd64/sge_qmaster
>>>> sgeadmin 16493 8588192 /local/sge/bin/lx24-amd64/sge_qmaster
>>>> sgeadmin 16493 10310248 /local/sge/bin/lx24-amd64/sge_qmaster
>>>>
>>>>
>>>> Any help?
>>>>
>>>> I think the only way I have gotten it to restart is to disable all exec
>>>> nodes, restart sge, and then enable the compute nodes slowly. Not a very
>>>> scientific method, and not a real fix.
>>>>
>>>> Thanks.
>>>> eric
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203182
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203308
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>     
>
>
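
To put a number on the runaway, the growth rate can be computed from the `ps`
samples quoted above, which were taken 10 seconds apart. A small sketch,
assuming RSS is reported in KiB (as `ps -o rss` does on Linux); `rss_deltas`
is a hypothetical helper name, not an SGE tool.

```shell
#!/bin/sh
# rss_deltas INTERVAL
# Reads one RSS value (KiB) per line on stdin and prints the growth between
# consecutive samples in KiB per second, truncated to an integer.
# INTERVAL is the number of seconds between samples and must be non-zero.
rss_deltas() {
    awk -v interval="$1" '
        NR > 1 { printf "%d\n", ($1 - prev) / interval }
               { prev = $1 }
    '
}
```

Feeding it three consecutive samples from the thread,
`printf '3261316\n5192468\n6947132\n' | rss_deltas 10`, prints 193115 and
175466, so the process was growing by roughly 170 to 190 MiB per second.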

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203323

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


