[GE users] SGE master eats up all memory and crashes

arnuschky arne.brutschy at ulb.ac.be
Thu Jul 22 11:03:53 BST 2010


Hi,

Actually, I think I have solved the problem. It appears to be the one reported in
http://gridengine.sunsource.net/issues/show_bug.cgi?id=3050

(I changed schedd_job_info from true to false using qconf -msconf.)
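
For anyone who wants to make the same change without the interactive editor, here is a rough sketch. It assumes the parameter is spelled schedd_job_info as in sched_conf(5), that qconf -ssconf and qconf -Msconf are available in your release, and that you have manager rights. The function name and the temporary file path are made up for illustration.

```shell
# Hypothetical helper: reads a scheduler configuration dump on stdin and
# writes it to stdout with the schedd_job_info value set to false.
# All other lines pass through unchanged.
disable_schedd_job_info() {
    sed -E 's/^(schedd_job_info[[:space:]]+).*/\1false/'
}

# Possible use on a live cluster (assumption: qconf is on PATH):
#   qconf -ssconf | disable_schedd_job_info > /tmp/sched.conf
#   qconf -Msconf /tmp/sched.conf
```

Note that with this setting at false, qstat -j no longer shows the scheduler's reasons why a job is still pending.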

Sorry for the noise,
Arne

On Thu, 2010-07-22 at 10:57 +0200, arnuschky wrote:
> Hi all,
> 
> I have an installation of GE 6.2u4 on Rocks/CentOS 5.3. It was migrated
> from a 6.0 installation and now uses bdb spooling. qmaster starts up fine
> and everything looks normal, but when I start adding jobs, qmaster starts
> to eat up all memory:
>         
>           PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
>         32418 sge       25   0 2492m 2.3g 9588 S 99.9 66.6   3:17.56 sge_qmaster   
> 
> This goes on until no memory is left and qmaster fails. Nothing in the
> logs apart from the expected:
> 
>         $ tail /opt/gridengine/default/spool/qmaster/messages
>         07/22/2010 10:30:34|  main|headnode|I|qmaster hard descriptor limit is set to 8192
>         07/22/2010 10:30:34|  main|headnode|I|qmaster soft descriptor limit is set to 8192
>         07/22/2010 10:30:34|  main|headnode|I|qmaster will use max. 8172 file descriptors for communication
>         07/22/2010 10:30:34|  main|headnode|I|qmaster will accept max. 99 dynamic event clients
>         07/22/2010 10:30:34|  main|headnode|I|starting up GE 6.2u4 (lx26-x86)
>         07/22/2010 10:53:29|worker|majorana|E|not enough memory to allocate 1048576 bytes in init_packbuffer
>         07/22/2010 10:53:29|worker|majorana|C|realloc() failure
>         07/22/2010 10:54:40|worker|majorana|C|realloc() failure
>         07/22/2010 10:54:40|worker|majorana|E|error packing object with key "USER:abrutschy": can't allocate memory
>         07/22/2010 10:54:40|worker|majorana|W|aborting transaction (rollback)
>         07/22/2010 10:54:40|worker|majorana|E|not enough memory to allocate 1048576 bytes in init_packbuffer
>         07/22/2010 10:54:40|worker|majorana|C|realloc() failure
> 
> How can I debug this problem? What might be the cause? I remember that I
> had problems migrating: a mix-up between the internal and the external
> hostname. In the end I had to patch gethostname to return the internal
> name (headnode.local) and not the external one (headnode.fqdn), or else
> qmaster wouldn't start. Might this be the reason? Why can't I see any
> errors?
> 
> Arne
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=269640
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

-- 
Arne Brutschy
Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles   Tel      +32 2 650 3168
Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=269654
