[GE users] qmaster memory problem (leak/bug?)

whitingeric eric.whiting at inl.gov
Tue Jun 23 22:10:09 BST 2009


See below for a qmaster that looks lost...

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6642 sgeadmin  20   0 28.2g  13g 1968 S   77 83.2 207:39.01 sge_qmaster

This is SGE 6.2u2_1, installed about 2 months ago.

qmaster usually runs at about 20M of RSS, but sometimes it starts to run
away, like right now: the top output above shows 28.2G VIRT and 13G RES.
The cluster has about 150 execd nodes.

I kill sge and restart. Same thing -- it starts small and runs away.
Below you can see it run away....


# /etc/init.d/sgemaster  stop
   Shutting down Grid Engine qmaster

# /etc/init.d/sgemaster  start
   starting sge_qmaster


# while true; do ps -aeo 'user,pid,rss,cmd' | grep qmast | grep admin; sleep 10; done
sgeadmin 16493 6083100 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 3261316 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 5192468 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 6947132 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 8588192 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 10310248 /local/sge/bin/lx24-amd64/sge_qmaster
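
To put a number on the growth, the samples above can be fed through a small
filter that prints the RSS change between consecutive readings. This is only
a sketch: rss_deltas is a hypothetical helper name, and it assumes input in
the same 'user,pid,rss,cmd' column order used in the loop above, with ps
reporting RSS in kilobytes.

```shell
# Hypothetical helper: print the RSS change (in KB) between consecutive
# samples. Expects lines shaped like the output of
#   ps -aeo 'user,pid,rss,cmd'
# so the third whitespace-separated field is the RSS value.
rss_deltas() {
    awk 'NR > 1 { print $3 - prev } { prev = $3 }'
}

# Feed in the six samples captured above (one sample every 10 seconds).
rss_deltas <<'EOF'
sgeadmin 16493 6083100 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 3261316 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 5192468 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 6947132 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 8588192 /local/sge/bin/lx24-amd64/sge_qmaster
sgeadmin 16493 10310248 /local/sge/bin/lx24-amd64/sge_qmaster
EOF
```

Over the last four intervals the difference works out to between about 1.6
and 1.9 million KB per 10-second sample, so the growth is steady rather than
a one-time jump.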


Any help?

The only way I have gotten it to restart cleanly is to disable all exec
nodes, restart sge, and then enable the compute nodes slowly. That is not
a scientific method, and not a real fix.
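
For reference, the staged re-enable could be scripted roughly as follows.
This is a sketch under assumptions, not what I actually ran: staged_enable,
the batch size, and the pause are hypothetical, and it assumes the queue
instances are enabled with qmod -e using a "*@host" pattern. It defaults to a
dry run that only prints the commands.

```shell
# Hypothetical helper: enable queue instances one host at a time, pausing
# after every batch so qmaster is not hit by all execds at once.
# Usage: staged_enable BATCH_SIZE PAUSE_SECONDS HOST...
# With DRY_RUN=1 (the default) the qmod commands are printed, not executed.
staged_enable() {
    batch_size=$1
    pause=$2
    shift 2
    count=0
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "qmod -e *@$host"
        else
            qmod -e "*@$host"
        fi
        count=$((count + 1))
        # Sleep once each full batch has been enabled.
        if [ $((count % batch_size)) -eq 0 ]; then
            sleep "$pause"
        fi
    done
}
```

The intended sequence would be: disable every queue with qmod -d '*', restart
sgemaster, then run something like DRY_RUN=0 staged_enable 10 60 $(qconf -sel)
while watching the qmaster RSS.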

Thanks.
eric

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203182
