[GE users] qmaster memory problem (leak/bug?)

crei crei at sun.com
Wed Jun 24 16:01:41 BST 2009


A new issue was reported recently. Do you think the problem it describes
might be the one you are seeing?

http://gridengine.sunsource.net/issues/show_bug.cgi?id=3050

Christian


On 06/24/09 16:17, whitingeric wrote:
> Messages below. It looks like there are 3 start/stop cycles. The final start
> worked: I had disabled all execd hosts before the startup, then I
> slowly (1 per second, in a script) enabled the execd hosts.
> 
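For anyone who needs to repeat that staged re-enable, here is a minimal sketch. It assumes the hosts were disabled with `qmod -d` and that enabling every queue instance on a host with `qmod -e '*@host'` is what is wanted. The function name `enable_hosts_slowly` and the `QMOD` and `DELAY` variables are made up for illustration and are not part of Grid Engine.

```shell
#!/bin/sh
# Sketch only: re-enable queue instances one host at a time.
# QMOD and DELAY are illustrative overrides (assumptions); point QMOD
# at a stub command for a dry run.
QMOD="${QMOD:-qmod}"
DELAY="${DELAY:-1}"

enable_hosts_slowly() {
    # Reads one host name per line on stdin; blank lines are skipped.
    while read -r host; do
        [ -n "$host" ] || continue
        # '*@host' selects all queue instances on that host.
        $QMOD -e "*@$host"
        sleep "$DELAY"
    done
}
```

One possible way to feed it is `qconf -sel | enable_hosts_slowly`, which walks the execution host list at one host per second.
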
> 06/23/2009 12:20:50|  main|helios|I|controlled shutdown 6.2u2_1
> 
> 06/23/2009 12:21:12|  main|helios|I|read job database with 30 entries in 0 seconds
> 06/23/2009 12:21:12|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
> 06/23/2009 12:21:12|  main|helios|I|qmaster hard descriptor limit is set to 8192
> 06/23/2009 12:21:12|  main|helios|I|qmaster soft descriptor limit is set to 8192
> 06/23/2009 12:21:12|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
> 06/23/2009 12:21:12|  main|helios|I|qmaster will accept max. 99 dynamic event clients
> 06/23/2009 12:21:12|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
> 06/23/2009 12:21:12|worker|helios|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
> 
> 06/23/2009 12:23:35|  main|helios|E|jvm thread is not running
> 06/23/2009 12:23:44|  main|helios|I|controlled shutdown 6.2u2_1
> 
> 06/23/2009 12:24:14|  main|helios|I|read job database with 29 entries in 0 seconds
> 06/23/2009 12:24:14|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
> 06/23/2009 12:24:14|  main|helios|I|qmaster hard descriptor limit is set to 8192
> 06/23/2009 12:24:14|  main|helios|I|qmaster soft descriptor limit is set to 8192
> 06/23/2009 12:24:14|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
> 06/23/2009 12:24:14|  main|helios|I|qmaster will accept max. 99 dynamic event clients
> 06/23/2009 12:24:14|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
> 
> 06/23/2009 12:26:15|  main|helios|E|jvm thread is not running
> 06/23/2009 12:26:26|  main|helios|I|controlled shutdown 6.2u2_1
> 
> 06/23/2009 12:28:31|  main|helios|I|read job database with 29 entries in 0 seconds
> 06/23/2009 12:28:31|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
> 06/23/2009 12:28:31|  main|helios|I|qmaster hard descriptor limit is set to 8192
> 06/23/2009 12:28:31|  main|helios|I|qmaster soft descriptor limit is set to 8192
> 06/23/2009 12:28:31|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
> 06/23/2009 12:28:31|  main|helios|I|qmaster will accept max. 99 dynamic event clients
> 06/23/2009 12:28:31|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
> 
> QPING INFO BELOW:
> 
> $ qping -info helios 6444 qmaster 1
> 06/24/2009 08:16:06:
> SIRM version:             0.1
> SIRM message id:          1
> start time:               06/23/2009 12:28:31 (1245781711)
> run time [s]:             71255
> messages in read buffer:  0
> messages in write buffer: 0
> nr. of connected clients: 150
> status:                   1
> info:                     MAIN: E (71255.40) | signaler000: E (71255.33) | event_master000: E (0.70) | timer000: E (5.71) | worker000: E (0.76) | worker001: E (0.70) | listener000: E (2.34) | listener001: E (0.70) | scheduler000: E (7.69) | WARNING
> malloc:                   arena(135168) |ordblks(1) | smblks(0) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(0) | uordblks(144) | fordblks(135024) | keepcost(135024)
> Monitor:                  disabled
> 
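When watching a qmaster that may be about to run away, it can help to log a couple of fields from this output over time. The sketch below pulls `nr. of connected clients` and `run time [s]` out of `qping -info` text read on stdin. The field labels are taken from the output above; the function name `qping_summary` is hypothetical.

```shell
#!/bin/sh
# Sketch only: summarise two fields of `qping -info` output.
qping_summary() {
    # Reads the text on stdin; the value is the last field of each matching line.
    awk '
        /nr\. of connected clients:/ { clients = $NF }
        /run time \[s\]:/            { runtime = $NF }
        END { printf "clients=%s runtime_s=%s\n", clients, runtime }
    '
}
```

With the host and port from the command above, this could be run as `qping -info helios 6444 qmaster 1 | qping_summary`, for example once a minute alongside a memory check.
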
> 
> 
> crei wrote:
>> Any information/warnings/errors in the qmaster messages file?
>>
>> What does a qping -info against the qmaster daemon show?
>>
>>
>> On 06/23/09 23:10, whitingeric wrote:
>>   
>>> See below for a qmaster that looks lost...
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>  6642 sgeadmin  20   0 28.2g  13g 1968 S   77 83.2 207:39.01 sge_qmaster
>>>
>>> 6.2u2_1 -- installed about 2 months ago.
>>>
>>> qmaster usually runs about 20M of RSS -- then sometimes it starts to run
>>> away -- like right now... See above for 28G VIRT 13G RSS.  (about 150
>>> execd nodes)
>>>
>>> I kill sge and restart. Same thing -- it starts small and runs away.
>>> Below you can see it run away....
>>>
>>>
>>> # /etc/init.d/sgemaster  stop
>>>    Shutting down Grid Engine qmaster
>>>
>>> # /etc/init.d/sgemaster  start
>>>    starting sge_qmaster
>>>
>>>
>>> # while(true);do ps -aeo 'user,pid,rss,cmd' |grep qmast |grep admin;sleep 10;done
>>> sgeadmin 16493 6083100 /local/sge/bin/lx24-amd64/sge_qmaster
>>> sgeadmin 16493 3261316 /local/sge/bin/lx24-amd64/sge_qmaster
>>> sgeadmin 16493 5192468 /local/sge/bin/lx24-amd64/sge_qmaster
>>> sgeadmin 16493 6947132 /local/sge/bin/lx24-amd64/sge_qmaster
>>> sgeadmin 16493 8588192 /local/sge/bin/lx24-amd64/sge_qmaster
>>> sgeadmin 16493 10310248 /local/sge/bin/lx24-amd64/sge_qmaster
>>>
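To put a number on the growth, the samples from that loop can be turned into a rate. This is a rough sketch, assuming the third column is RSS in KiB (as `ps` reports it on Linux) and that the samples really are a fixed number of seconds apart; the function name `rss_growth` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print the RSS change between consecutive ps samples.
rss_growth() {
    # stdin: lines of "user pid rss_kib cmd" as printed by ps -eo user,pid,rss,cmd
    # $1:    seconds between samples (10 in the loop above)
    awk -v dt="$1" '
        NR > 1 { printf "%+.1f MiB/s\n", ($3 - prev) / 1024 / dt }
        { prev = $3 }
    '
}
```

Saving the loop output to a file and running `rss_growth 10 < samples.txt` (the file name is only an example) prints one rate per interval. For the samples quoted above, the steps after the initial drop work out to something on the order of 160 to 190 MiB per second, if the 10 second spacing holds.
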
>>>
>>> Any help?
>>>
>>> I think the only way I have got it to restart is to disable all exec
>>> nodes and restart sge and then enable compute nodes slowly... Not a real
>>> scientific method.. Not a real fix.
>>>
>>> Thanks.
>>> eric
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203182
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>     
>>
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203308
> 

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203319



