[GE users] sge_sched getting killed by oom

Chris Harwell supercrh at gmail.com
Thu Dec 27 15:52:55 GMT 2007


Hi,

I'm having some trouble with SGE. I am running GE 6.1u2 on an IBM x346 with 2
physical CPUs (HT enabled), 4GB of physical RAM, and 18GB of swap.

$ uname -a
Linux lcsge3.na.novartis.net 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -q redhat-release
redhat-release-4AS-5.5.x86_64

Three times now, sge_schedd has been killed by the Linux OOM killer. One
frustrating aspect is that the shadow masters do not seem to treat this as a
failure, so they do not take over. This leaves jobs in the pending queue that
ought to be scheduled.
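
Since nothing takes over when only the scheduler dies, a periodic check (from
cron, for example) could at least raise an alert. What follows is a minimal
sketch, not part of Grid Engine: it assumes a Linux /proc filesystem and that
the daemon's process name is sge_schedd. The function names process_name and
find_pids are hypothetical.

```python
import os


def process_name(status_text):
    """Return the value of the 'Name:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return None


def find_pids(name, proc_root="/proc"):
    """Return the sorted PIDs of processes whose Name field equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                if process_name(f.read()) == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open(); ignore it.
            continue
    return sorted(pids)
```

An empty result from find_pids("sge_schedd") would mean the scheduler is gone,
at which point the calling script could send mail or attempt a restart.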

So, are there any known memory leaks with the above configuration that could
be fixed by upgrading to the latest release (6.1u3, last I checked)? What is
the ballpark range for the amount of memory sge_schedd should consume?
Presumably this depends on the number of jobs, hosts, and queues; I have about
223 hosts. Should I just add some RAM to the box, or move the scheduler to
another box?
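
To tell a leak apart from memory use that simply tracks the workload, the
scheduler's footprint could be sampled over time. This is a rough sketch under
the assumption that /proc/<pid>/status reports VmSize and VmRSS in kB, as Linux
does. The names memory_kb and sample are hypothetical, and the PID has to be
supplied by the caller.

```python
import time


def memory_kb(status_text):
    """Extract VmSize and VmRSS (in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # Values look like "  123456 kB"; keep only the integer part.
            fields[key] = int(value.split()[0])
    return fields


def sample(pid, proc_root="/proc"):
    """Return (unix_time, VmSize_kB, VmRSS_kB) for one process."""
    with open("%s/%d/status" % (proc_root, pid)) as f:
        mem = memory_kb(f.read())
    return (int(time.time()), mem.get("VmSize"), mem.get("VmRSS"))
```

Logging one sample every few minutes next to the pending job count would show
whether memory follows the queue length or keeps growing regardless of it.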

                running #jobs/#slots
Owner        serial   parallel    total
---------------------------------------
user1         6/  6     0/  0     6/  6
user2        17/ 17     0/  0    17/ 17
user3         1/  1     0/  0     1/  1
user4       113/113     0/  0   113/113
user5         1/  1     0/  0     1/  1
user6         1/  1     0/  0     1/  1
---------------------------------------
Sum         139/139     0/  0   139/139


                waiting #jobs/#slots
Owner        serial   parallel    total
---------------------------------------
user7         0/  0     1/ 20     1/ 20
user4       596/596     0/  0   596/596
---------------------------------------
Sum         596/596     1/ 20   597/616


Should I provide more details about my specific SGE configuration?  Or could
someone point me to some documentation on reducing the memory usage of
sge_schedd?


Thanks for any pointers,
Chris


