[GE users] qmaster memory problem (leak/bug?)

crei crei at sun.com
Wed Jun 24 16:25:04 BST 2009


Hi Eric,

Can you try switching off the scheduler info messages by setting
schedd_job_info to false in the scheduler configuration (qconf -msconf)?

This feature was modified, and the problem may lie there.

(To get scheduler info for a job you can still use qalter -w p.)
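
If the interactive editor is inconvenient, the same setting can probably be
changed from a script. What follows is only a sketch under a few assumptions:
that qconf -ssconf prints the setting as a line of the form
"schedd_job_info <value>", and that qconf -Msconf <file> is available in your
version to load a scheduler configuration from a file. The helper name
set_job_info_false and the temporary file are made up for illustration.

```shell
#!/bin/sh
# Sketch: set schedd_job_info to false without the interactive editor.
# set_job_info_false is a hypothetical helper, not part of Grid Engine.

# Rewrite the schedd_job_info line of a scheduler configuration read from
# stdin; every other line passes through unchanged.
set_job_info_false() {
    sed 's/^schedd_job_info[[:space:]].*$/schedd_job_info false/'
}

# Only touch the cluster when qconf is actually on the PATH.
if command -v qconf >/dev/null 2>&1; then
    tmp=$(mktemp) || exit 1
    # Dump the current configuration, edit it, and load it back only if the
    # edited file really contains the new setting (guards against an empty dump).
    qconf -ssconf | set_job_info_false > "$tmp" &&
        grep -q '^schedd_job_info false$' "$tmp" &&
        qconf -Msconf "$tmp"   # assumed to be available; check qconf -help
    rm -f "$tmp"
fi
```

Afterwards, qconf -ssconf should list schedd_job_info as false. It is worth
keeping a copy of the original output of qconf -ssconf so the change can be
reverted.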

Regards,

Christian

On 06/24/09 17:14, whitingeric wrote:
> Same problem report -- I should have searched the issues before I posted
> to the email list.
> Thanks.
> eric
> 
> 
> crei wrote:
>> There were new issues reported. Do you think the described problem might
>> be yours?
>>
>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=3050
>>
>> Christian
>>
>>
>> On 06/24/09 16:17, whitingeric wrote:
>>
>>> Messages below. It looks like there are 3 start/stop cycles. The final
>>> start worked: I had disabled all execd hosts before the startup, then
>>> slowly (1 per second, in a script) re-enabled them.
>>>
>>> 06/23/2009 12:20:50|  main|helios|I|controlled shutdown 6.2u2_1
>>>
>>> 06/23/2009 12:21:12|  main|helios|I|read job database with 30 entries in 0 seconds
>>> 06/23/2009 12:21:12|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
>>> 06/23/2009 12:21:12|  main|helios|I|qmaster hard descriptor limit is set to 8192
>>> 06/23/2009 12:21:12|  main|helios|I|qmaster soft descriptor limit is set to 8192
>>> 06/23/2009 12:21:12|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
>>> 06/23/2009 12:21:12|  main|helios|I|qmaster will accept max. 99 dynamic event clients
>>> 06/23/2009 12:21:12|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
>>> 06/23/2009 12:21:12|worker|helios|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
>>>
>>> 06/23/2009 12:23:35|  main|helios|E|jvm thread is not running
>>> 06/23/2009 12:23:44|  main|helios|I|controlled shutdown 6.2u2_1
>>>
>>> 06/23/2009 12:24:14|  main|helios|I|read job database with 29 entries in 0 seconds
>>> 06/23/2009 12:24:14|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
>>> 06/23/2009 12:24:14|  main|helios|I|qmaster hard descriptor limit is set to 8192
>>> 06/23/2009 12:24:14|  main|helios|I|qmaster soft descriptor limit is set to 8192
>>> 06/23/2009 12:24:14|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
>>> 06/23/2009 12:24:14|  main|helios|I|qmaster will accept max. 99 dynamic event clients
>>> 06/23/2009 12:24:14|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
>>>
>>> 06/23/2009 12:26:15|  main|helios|E|jvm thread is not running
>>> 06/23/2009 12:26:26|  main|helios|I|controlled shutdown 6.2u2_1
>>>
>>> 06/23/2009 12:28:31|  main|helios|I|read job database with 29 entries in 0 seconds
>>> 06/23/2009 12:28:31|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
>>> 06/23/2009 12:28:31|  main|helios|I|qmaster hard descriptor limit is set to 8192
>>> 06/23/2009 12:28:31|  main|helios|I|qmaster soft descriptor limit is set to 8192
>>> 06/23/2009 12:28:31|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
>>> 06/23/2009 12:28:31|  main|helios|I|qmaster will accept max. 99 dynamic event clients
>>> 06/23/2009 12:28:31|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
>>>
>>> QPING INFO BELOW:
>>>
>>> $ qping -info helios 6444 qmaster 1
>>> 06/24/2009 08:16:06:
>>> SIRM version:             0.1
>>> SIRM message id:          1
>>> start time:               06/23/2009 12:28:31 (1245781711)
>>> run time [s]:             71255
>>> messages in read buffer:  0
>>> messages in write buffer: 0
>>> nr. of connected clients: 150
>>> status:                   1
>>> info:                     MAIN: E (71255.40) | signaler000: E (71255.33) | event_master000: E (0.70) | timer000: E (5.71) | worker000: E (0.76) | worker001: E (0.70) | listener000: E (2.34) | listener001: E (0.70) | scheduler000: E (7.69) | WARNING
>>> malloc:                   arena(135168) |ordblks(1) | smblks(0) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(0) | uordblks(144) | fordblks(135024) | keepcost(135024)
>>> Monitor:                  disabled
>>>
>>>
>>>
>>> crei wrote:
>>>
>>>> Any information/warnings/errors in the qmaster messages file?
>>>>
>>>> What shows a qping -info to the qmaster daemon?
>>>>
>>>>
>>>> On 06/23/09 23:10, whitingeric wrote:
>>>>
>>>>> See below for a qmaster that looks lost...
>>>>>
>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>  6642 sgeadmin  20   0 28.2g  13g 1968 S   77 83.2 207:39.01 sge_qmaster
>>>>>
>>>>> 6.2u2_1 -- installed about 2 months ago.
>>>>>
>>>>> qmaster usually runs about 20M of RSS -- then sometimes it starts to run
>>>>> away -- like right now... See above for 28G VIRT 13G RSS.  (about 150
>>>>> execd nodes)
>>>>>
>>>>> I kill sge and restart. Same thing -- it starts small and runs away.
>>>>> Below you can see it run away....
>>>>>
>>>>>
>>>>> # /etc/init.d/sgemaster  stop
>>>>>    Shutting down Grid Engine qmaster
>>>>>
>>>>> # /etc/init.d/sgemaster  start
>>>>>    starting sge_qmaster
>>>>>
>>>>>
>>>>> # while(true);do ps -aeo 'user,pid,rss,cmd' |grep qmast |grep admin;sleep 10;done
>>>>> sgeadmin 16493 6083100 /local/sge/bin/lx24-amd64/sge_qmaster
>>>>> sgeadmin 16493 3261316 /local/sge/bin/lx24-amd64/sge_qmaster
>>>>> sgeadmin 16493 5192468 /local/sge/bin/lx24-amd64/sge_qmaster
>>>>> sgeadmin 16493 6947132 /local/sge/bin/lx24-amd64/sge_qmaster
>>>>> sgeadmin 16493 8588192 /local/sge/bin/lx24-amd64/sge_qmaster
>>>>> sgeadmin 16493 10310248 /local/sge/bin/lx24-amd64/sge_qmaster
>>>>>
>>>>>
>>>>> Any help?
>>>>>
>>>>> I think the only way I have gotten it to restart is to disable all exec
>>>>> nodes, restart sge, and then enable the compute nodes slowly. Not a very
>>>>> scientific method, and not a real fix.
>>>>>
>>>>> Thanks.
>>>>> eric
>>>>>
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203182
>>>>>
>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Registered office:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Munich District Court (Amtsgericht Muenchen): HRB 161028
Managing Directors: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Chairman of the Supervisory Board: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203328

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


