[GE users] SGE master eats up all memory and crashes

arnuschky arne.brutschy at ulb.ac.be
Thu Jul 22 09:57:06 BST 2010


Hi all,

I have an installation of GE 6.2u4 on Rocks/CentOS 5.3. It was migrated
from a 6.0 installation and now uses bdb spooling. qmaster starts up fine
and everything looks normal, but when I start adding jobs, qmaster starts
to eat up all memory:
        
          PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
        32418 sge       25   0 2492m 2.3g 9588 S 99.9 66.6   3:17.56 sge_qmaster   

This goes on until no memory is left and qmaster fails. There is nothing
in the logs apart from the expected:

        $ tail /opt/gridengine/default/spool/qmaster/messages
        07/22/2010 10:30:34|  main|headnode|I|qmaster hard descriptor limit is set to 8192
        07/22/2010 10:30:34|  main|headnode|I|qmaster soft descriptor limit is set to 8192
        07/22/2010 10:30:34|  main|headnode|I|qmaster will use max. 8172 file descriptors for communication
        07/22/2010 10:30:34|  main|headnode|I|qmaster will accept max. 99 dynamic event clients
        07/22/2010 10:30:34|  main|headnode|I|starting up GE 6.2u4 (lx26-x86)
        07/22/2010 10:53:29|worker|majorana|E|not enough memory to allocate 1048576 bytes in init_packbuffer
        07/22/2010 10:53:29|worker|majorana|C|realloc() failure
        07/22/2010 10:54:40|worker|majorana|C|realloc() failure
        07/22/2010 10:54:40|worker|majorana|E|error packing object with key "USER:abrutschy": can't allocate memory
        07/22/2010 10:54:40|worker|majorana|W|aborting transaction (rollback)
        07/22/2010 10:54:40|worker|majorana|E|not enough memory to allocate 1048576 bytes in init_packbuffer
        07/22/2010 10:54:40|worker|majorana|C|realloc() failure
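
The entries in the messages file above are pipe-separated (timestamp,
thread, host, a one-letter level, message text), so the E and C entries
can be pulled out of a long log with a few lines of script. The following
is only a minimal sketch: it assumes every entry has exactly the layout
seen in the excerpt, and the names LogEntry, parse_line and filter_levels
are hypothetical helpers, not part of Grid Engine.

```python
import sys
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    """One line of the qmaster messages file (layout assumed from the excerpt)."""
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split a line into its five fields; return None if it does not fit."""
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*(p.strip() for p in parts))


def filter_levels(lines: Iterable[str], levels=("E", "C")) -> List[LogEntry]:
    """Return the parsed entries whose level field is in `levels`."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.level in levels]


if __name__ == "__main__":
    # Usage: python filter_messages.py /path/to/qmaster/messages
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for entry in filter_levels(fh):
            print(entry.timestamp, entry.thread, entry.level, entry.text)
```

Run against the messages file, this prints only the E and C lines, which
makes it easier to find the first failure after a qmaster start.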

How can I debug this problem? What might be the cause? I remember that I
had problems during the migration: a mix-up between the internal and the
external hostname. In the end I had to patch gethostname to return the
internal name (headnode.local) and not the external one (headnode.fqdn),
otherwise qmaster wouldn't start. Might this be the reason? Why can't I
see any errors?

Arne

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=269640
