[GE users] qmaster memory problem (leak/bug?)

whitingeric eric.whiting at inl.gov
Wed Jun 24 15:17:38 BST 2009


The qmaster messages are below; they show 3 start/stop cycles. The final
start worked: I had disabled all execd hosts before the startup, then
slowly re-enabled them (1 per second, from a script).
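
The script itself is not shown here, so the following is only a minimal
sketch of such a staged re-enable. It assumes the hosts were disabled and
enabled through their queue instances ('*@host') with qmod, and that the
host list comes from `qconf -sel`; the function names and the DRY_RUN and
DELAY variables are hypothetical.

```shell
#!/bin/sh
# Sketch: re-enable queue instances one execution host at a time.
# Assumption (not from the post): hosts are enabled via `qmod -e '*@host'`.

run() {
    # With DRY_RUN set, print the command instead of executing it.
    if [ -n "$DRY_RUN" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

enable_hosts_slowly() {
    # stdin: one execution host name per line.
    while read -r host; do
        [ -n "$host" ] || continue
        run qmod -e "*@${host}"
        sleep "${DELAY:-1}"     # 1 second between hosts by default
    done
}

# Usage on the qmaster host (not executed here):
#   qconf -sel | enable_hosts_slowly
```

The same loop with `qmod -d` in place of `qmod -e` would disable the queue
instances before the restart.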

06/23/2009 12:20:50|  main|helios|I|controlled shutdown 6.2u2_1

06/23/2009 12:21:12|  main|helios|I|read job database with 30 entries in 0 seconds
06/23/2009 12:21:12|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
06/23/2009 12:21:12|  main|helios|I|qmaster hard descriptor limit is set to 8192
06/23/2009 12:21:12|  main|helios|I|qmaster soft descriptor limit is set to 8192
06/23/2009 12:21:12|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
06/23/2009 12:21:12|  main|helios|I|qmaster will accept max. 99 dynamic event clients
06/23/2009 12:21:12|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
06/23/2009 12:21:12|worker|helios|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object

06/23/2009 12:23:35|  main|helios|E|jvm thread is not running
06/23/2009 12:23:44|  main|helios|I|controlled shutdown 6.2u2_1

06/23/2009 12:24:14|  main|helios|I|read job database with 29 entries in 0 seconds
06/23/2009 12:24:14|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
06/23/2009 12:24:14|  main|helios|I|qmaster hard descriptor limit is set to 8192
06/23/2009 12:24:14|  main|helios|I|qmaster soft descriptor limit is set to 8192
06/23/2009 12:24:14|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
06/23/2009 12:24:14|  main|helios|I|qmaster will accept max. 99 dynamic event clients
06/23/2009 12:24:14|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)

06/23/2009 12:26:15|  main|helios|E|jvm thread is not running
06/23/2009 12:26:26|  main|helios|I|controlled shutdown 6.2u2_1

06/23/2009 12:28:31|  main|helios|I|read job database with 29 entries in 0 seconds
06/23/2009 12:28:31|  main|helios|E|error opening file "/local/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
06/23/2009 12:28:31|  main|helios|I|qmaster hard descriptor limit is set to 8192
06/23/2009 12:28:31|  main|helios|I|qmaster soft descriptor limit is set to 8192
06/23/2009 12:28:31|  main|helios|I|qmaster will use max. 8172 file descriptors for communication
06/23/2009 12:28:31|  main|helios|I|qmaster will accept max. 99 dynamic event clients
06/23/2009 12:28:31|  main|helios|I|starting up GE 6.2u2_1 (lx24-amd64)
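
The start/stop cycles in an excerpt like the one above can be counted
rather than eyeballed. This is a rough sketch: the two match strings are
taken from the log lines above, while the function name and the location
of the messages file are assumptions.

```shell
#!/bin/sh
# Sketch: count qmaster startups and controlled shutdowns in a messages file.

count_cycles() {
    # $1: path to the qmaster messages file
    # grep -c prints 0 (and exits non-zero) when nothing matches.
    starts=$(grep -c '|starting up GE ' "$1")
    stops=$(grep -c '|controlled shutdown ' "$1")
    printf 'startups=%s shutdowns=%s\n' "$starts" "$stops"
}

# Usage (path is an assumption based on the spool directory in the log):
#   count_cycles /local/sge/default/spool/qmaster/messages
```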

QPING INFO BELOW:

$ qping -info helios 6444 qmaster 1
06/24/2009 08:16:06:
SIRM version:             0.1
SIRM message id:          1
start time:               06/23/2009 12:28:31 (1245781711)
run time [s]:             71255
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 150
status:                   1
info:                     MAIN: E (71255.40) | signaler000: E (71255.33) | event_master000: E (0.70) | timer000: E (5.71) | worker000: E (0.76) | worker001: E (0.70) | listener000: E (2.34) | listener001: E (0.70) | scheduler000: E (7.69) | WARNING
malloc:                   arena(135168) |ordblks(1) | smblks(0) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(0) | uordblks(144) | fordblks(135024) | keepcost(135024)
Monitor:                  disabled
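
To track one of these values over time (for example the number of
connected clients while the execd hosts are re-enabled), a single field
can be pulled out of the qping output. A small sketch, assuming the
"label: value" layout shown above; the helper name qping_field is
hypothetical.

```shell
#!/bin/sh
# Sketch: extract one "label: value" field from `qping -info` output.

qping_field() {
    # $1: field label without the colon, e.g. "nr. of connected clients"
    # stdin: output of `qping -info <host> <port> qmaster 1`
    awk -v key="$1" '
        index($0, key ":") == 1 {
            val = substr($0, length(key) + 2)   # text after "label:"
            sub(/^[ \t]+/, "", val)             # drop the padding
            print val
            exit
        }'
}

# Usage (not executed here):
#   qping -info helios 6444 qmaster 1 | qping_field "nr. of connected clients"
```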



crei wrote:
> Any information/warnings/errors in the qmaster messages file?
>
> What shows a qping -info to the qmaster daemon?
>
>
> On 06/23/09 23:10, whitingeric wrote:
>   
>> See below for a qmaster that looks lost...
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  6642 sgeadmin  20   0 28.2g  13g 1968 S   77 83.2 207:39.01 sge_qmaster
>>
>> 6.2u2_1 -- installed about 2 months ago.
>>
>> qmaster usually runs about 20M of RSS -- then sometimes it starts to run
>> away -- like right now... See above for 28G VIRT 13G RSS.  (about 150
>> execd nodes)
>>
>> I kill sge and restart. Same thing -- it starts small and runs away.
>> Below you can see it run away....
>>
>>
>> # /etc/init.d/sgemaster  stop
>>    Shutting down Grid Engine qmaster
>>
>> # /etc/init.d/sgemaster  start
>>    starting sge_qmaster
>>
>>
>> # while(true);do ps -aeo 'user,pid,rss,cmd' |grep qmast |grep admin;sleep 10;done
>> sgeadmin 16493 6083100 /local/sge/bin/lx24-amd64/sge_qmaster
>> sgeadmin 16493 3261316 /local/sge/bin/lx24-amd64/sge_qmaster
>> sgeadmin 16493 5192468 /local/sge/bin/lx24-amd64/sge_qmaster
>> sgeadmin 16493 6947132 /local/sge/bin/lx24-amd64/sge_qmaster
>> sgeadmin 16493 8588192 /local/sge/bin/lx24-amd64/sge_qmaster
>> sgeadmin 16493 10310248 /local/sge/bin/lx24-amd64/sge_qmaster
>>
>>
>> Any help?
>>
>> I think the only way I have got it to restart is to disable all exec
>> nodes and restart sge and then enable compute nodes slowly... Not a real
>> scientific method.. Not a real fix.
>>
>> Thanks.
>> eric
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203182
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>     
>
>

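The RSS polling loop quoted above can also print the growth between
samples, which makes a runaway easier to spot. This is only a sketch: it
assumes a Linux procps `ps` (for `-C` and `-o rss=`) and a single
sge_qmaster process, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: print sge_qmaster RSS (KiB) and the change since the previous sample.

rss_delta() {
    # stdin: one RSS sample per line; prints "<rss> <signed delta>" per sample
    awk 'NR == 1 { prev = $1 }
         { printf "%d %+d\n", $1, $1 - prev; prev = $1 }'
}

watch_qmaster_rss() {
    # $1: sampling interval in seconds (default 10, as in the quoted loop)
    while true; do
        ps -C sge_qmaster -o rss=
        sleep "${1:-10}"
    done | rss_delta
}

# Usage (not executed here): watch_qmaster_rss 10
```
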
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203308
