[GE users] GridEngine v5.3p1 eating too much memory

Andy Schwierskott andy.schwierskott at sun.com
Wed Apr 27 11:08:10 BST 2005


Richard,

Memory leaks have been fixed in the 5.3 patch releases. The schedd-qmaster
protocol in 5.3 has also been fixed to avoid memory overhead in certain
situations.

Please always check the lists of fixes made in the patch releases, which are
on the HOWTO pages:

   http://gridengine.sunsource.net/project/gridengine/60patches.txt
   http://gridengine.sunsource.net/project/gridengine/53patches.txt

So 5.3p6 is the minimum you should choose, but why not upgrade to 6.0? 6.0u4
will be released next week or at the beginning of the week of 05/09.

Andy

> Hello,
>
> We are running GridEngine 5.3p1 (we never upgraded because we never had a
> problem), and we now have a problem.
>
> We have around 46 execution machines (totalling 130 CPUs), 8 submit hosts,
> and 1 qmaster, all running RedHat 8.0. We therefore have 130 queues in 'run'
> mode at any one time.
>
> When lots of jobs are submitted (300 or more), the sge_schedd process starts
> to consume memory at an alarming rate. With 331 jobs in the qstat output,
> and 130 running, sge_schedd occupied 55% of the memory according to 'top'.
> This, however, did not cause a problem.
>
> But when many more than 300 jobs are submitted (500 or 1000, for example),
> memory usage climbs until it uses up all of the 1GB RAM and the 2GB swap.
> Then either the machine ends the process itself, or the process kills the
> entire qmaster machine, which then has to be rebooted and sometimes
> powered off.
>
> Has anyone seen this problem before? Is it a bug, or just a bad, inefficient
> algorithm within the scheduler's source code?
>
> Is there a fix available in a later patch level?
>
> Our workaround for the moment is for our researchers to check the grid
> before they submit their jobs, but this is not ideal because I am also
> having to monitor it non-stop. I guess a better workaround would be for the
> researchers' scripts to run qstat and check the number of jobs before
> submitting new ones, but then they are basically writing their own
> scheduling software, when GridEngine is supposed to do it for them.
>
> Surely 1000 jobs and 130 queues isn't a lot, right?
>
> Any suggestions are very much appreciated.
>
> Thanks in advance,
> Richard Hobbs.
>
>
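
The interim workaround described in the quoted message (scripts that run
qstat and hold back new submissions while too many jobs are listed) could
look roughly like the sketch below. It is only an illustration under stated
assumptions: the names count_jobs, wait_for_room and MAX_JOBS are
hypothetical, the threshold of 300 is simply the figure mentioned above, and
the parsing assumes that each job line of plain qstat output starts with a
numeric job ID. Check the output format of your own installation first.

```python
import subprocess
import time

# Hypothetical threshold, taken from the ~300-job figure in the message above.
MAX_JOBS = 300


def count_jobs(qstat_output):
    """Count lines whose first field is a numeric job ID.

    Assumption: header and separator lines do not start with a number,
    so they are skipped automatically.
    """
    count = 0
    for line in qstat_output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            count += 1
    return count


def wait_for_room(max_jobs=MAX_JOBS, poll_seconds=60, run_qstat=None):
    """Block until qstat lists fewer than max_jobs jobs."""
    if run_qstat is None:
        # Default: call the real qstat with no options and capture its output.
        def run_qstat():
            result = subprocess.run(
                ["qstat"], capture_output=True, text=True, check=True
            )
            return result.stdout
    while count_jobs(run_qstat()) >= max_jobs:
        time.sleep(poll_seconds)
```

A submit script would call wait_for_room() before each qsub. This only
throttles submissions on the client side; it does nothing about the memory
use of sge_schedd itself, which is what the patch releases address.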
