[GE users] execd behaviour in case of qmaster crash

ah_sunsource ahaupt at ifh.de
Thu Jun 11 15:25:09 BST 2009


Hi again,

I was able to reproduce the behaviour. With some jobs the qmaster consumes
really *huge* amounts of memory and crashes the system. This is the last
thing I saw in "top", with both RAM and swap exhausted:

top - 16:21:04 up 29 min,  1 user,  load average: 1.35, 0.34, 0.11
Tasks:  83 total,   2 running,  81 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us, 46.3%sy,  0.0%ni,  8.7%id, 44.6%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   4194304k total,  4185452k used,     8852k free,      260k buffers
Swap:  1052216k total,  1052216k used,        0k free,     1468k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 2808 sge       15   0 5103m 3.8g  548 S 41.6 96.1   0:44.50 sge_qmaster        

Is there a way to limit the memory consumption of the qmaster process?
Or is there a recommendation for how much memory a master host should
have installed to avoid swapping?
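
One generic way to keep a single process from exhausting RAM and swap is
an address-space limit (RLIMIT_AS, the same limit "ulimit -v" sets in a
shell), applied before the daemon is started. The following is only a
minimal sketch under assumptions: it is not a Grid Engine feature, the
helper names are made up for illustration, and how sge_qmaster reacts when
an allocation is refused has not been tested here.

```python
import resource
import subprocess


def capped_limits(limit_bytes, current):
    """Return the (soft, hard) pair to hand to setrlimit().

    Only the soft limit is changed, and it is never set above an
    existing finite hard limit.
    """
    _soft, hard = current
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)
    return (limit_bytes, hard)


def run_with_memory_cap(argv, limit_bytes):
    """Start argv with its virtual address space capped at limit_bytes.

    Hypothetical helper: the limit is inherited by anything the started
    process forks, so it also covers a daemon that detaches itself.
    """
    def apply_cap():
        # Runs in the child between fork() and exec().
        current = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS,
                           capped_limits(limit_bytes, current))

    return subprocess.Popen(argv, preexec_fn=apply_cap)
```

A call such as run_with_memory_cap(["/path/to/sge_qmaster"], 3 * 1024**3)
would start the binary with a 3 GiB cap; both the path and the value are
placeholders. With such a cap the process would run into failed
allocations instead of pushing the host into swap.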

Cheers,
Andreas

On Thu, 2009-06-11 at 09:57 +0200, ah_sunsource wrote:
> Hi Rayson,
> 
> On Thu, 2009-06-11 at 02:10 -0500, rayson wrote:
> > Hi,
> > 
> > So what is the exact state of the master when this happens?? Is the
> > machine up but the qmaster process dead??
> 
> The host is still up, but I cannot log in any more. The qmaster process
> still seems to be running, and its TCP socket is still reachable:
> 
> [hpbl1] ~ # telnet lolek-vm1 sge_qmaster
> Trying 141.34.32.95...
> Connected to lolek-vm1.
> Escape character is '^]'.
> ^]
> telnet> quit
> [hpbl1] ~ # getent services sge_qmaster
> sge_qmaster           538/tcp
> [hpbl1] ~ # qping -info lolek-vm1 538 qmaster 1
> endpoint lolek-vm1.ifh.de/qmaster/1 at port 538: can't find connection
> 
> Any communication simply hangs.
> 
> Cheers,
> Andreas
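
A note on the telnet result quoted above: a successful TCP connect only
shows that the kernel completed the handshake for a listening socket. It
does not show that the owning process is actually accepting or answering,
which would fit a qmaster that is stuck swapping. A minimal sketch of such
a connect-only probe (the function name is made up for illustration):

```python
import socket


def tcp_connect_ok(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes in time.

    This only tests the handshake; it says nothing about whether the
    service behind the port responds at the application level.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
```

For the case above, tcp_connect_ok("lolek-vm1", 538) would be expected to
succeed just as telnet did, while an application-level check such as
qping is what reveals the hang.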

-- 
| Andreas Haupt             | E-Mail: andreas.haupt at desy.de
|  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6          | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen          | Fax:    +49/33762/7-7216

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201552
