[GE users] GridEngine v5.3p1 eating too much memory

Richard Hobbs richard.hobbs at crl.toshiba.co.uk
Wed Apr 27 10:43:16 BST 2005


Hello,

We are running GridEngine 5.3p1 (we never upgraded because we never had a
problem), and we now have a problem.

We have around 46 execution machines (totalling 130 CPUs), 8 submit hosts,
and 1 qmaster, all running RedHat 8.0. We therefore have 130 queues in 'run'
mode at any one time.

When lots of jobs are submitted (300 or more), the sge_schedd process starts
to consume memory at an alarming rate. With 331 jobs in the qstat output,
and 130 of them running, sge_schedd occupied 55% of the memory according to
'top'. This, however, did not cause a problem.

But when many more jobs are submitted, 500 or 1000 for example, the memory
usage goes so high that it uses up all of the 1GB RAM and the 2GB swap.
Either the machine ends the process itself, or the process takes down the
entire qmaster machine, which then has to be rebooted and sometimes powered
off.

Has anyone seen this problem before? Is it a bug, or just a bad, inefficient
algorithm within the scheduler's source code?

Is there a fix available in a later patch level?

Our workaround for the moment is for our researchers to check the grid
before they submit their jobs, but this is not ideal because I am also
having to monitor it non-stop. I guess a better workaround would be for the
researchers' scripts to run a qstat and check the number of jobs before
submitting new ones, but then they are basically writing their own
scheduling software, when GridEngine is supposed to do it for them.
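
To make that second workaround concrete, here is a rough sketch of what such
a wrapper might look like in Python. It is only an illustration under some
assumptions: that a plain 'qstat' lists every job on the grid with a numeric
job ID in the first column, and that 300 is a safe ceiling (taken from the
331-job case above, which did not cause a problem). The names count_jobs,
wait_for_room, run_qstat and submit_when_room are made up for this example.

```python
import subprocess
import time

# Assumed ceiling: roughly the job count that did not cause trouble above.
MAX_JOBS = 300


def count_jobs(qstat_output):
    """Count qstat lines whose first field is a numeric job ID.

    Assumes header and separator lines do not start with a number.
    This counts lines, so a job shown on several lines is counted
    once per line.
    """
    count = 0
    for line in qstat_output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            count += 1
    return count


def wait_for_room(get_output, limit=MAX_JOBS, poll_seconds=60, sleep=time.sleep):
    """Block until the job count reported by get_output() is below limit."""
    while count_jobs(get_output()) >= limit:
        sleep(poll_seconds)


def run_qstat():
    """Return the text printed by a plain 'qstat' call."""
    result = subprocess.run(["qstat"], capture_output=True, text=True, check=True)
    return result.stdout


def submit_when_room(script_path):
    """Wait until the grid is below the ceiling, then qsub one job script."""
    wait_for_room(run_qstat)
    subprocess.run(["qsub", script_path], check=True)
```

A researcher's script would then call submit_when_room("myjob.sh") in place
of calling qsub directly. Two researchers polling at the same moment could
still overshoot the ceiling between them, so this only softens the problem.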

Surely 1000 jobs and 130 queues isn't a lot, right?

Any suggestions are very much appreciated.

Thanks in advance,
Richard Hobbs.

-- 
Richard Hobbs (Systems Administrator)
Toshiba Research Europe Ltd. - Speech Technology Group
Web: http://www.toshiba-europe.com/research/
Email: richard.hobbs at crl.toshiba.co.uk
Tel: +44 1223 376964        Mobile: +44 7811 803377


