[GE users] scalability problems

Sean Dilda agrajag at dragaera.net
Thu Apr 1 19:46:01 BST 2004

On Wed, 2004-03-17 at 08:58, Bryan Bayerdorffer wrote:
> When there are more than about 5000 jobs pending, our qmaster loses contact 
> with all the exec hosts (about 100 hosts---all queues are shown as 
> "temporarily unavailable"), and no more jobs are dispatched.  The qmaster host 
> spends about 80% cpu time in iowait---probably disk io. since there's not much 
> network traffic.
> We are in the process of switching from LSF.  The same host is also the LSF 
> master, although the LSF queues are empty and there are only a handful of 
> slaves still running.  When SGE "hangs", I can still submit and run an LSF job 
> without delay, so the basics of the machine are ok.
> What can I tune to help this?  I read about max_jobs, but is that even in 
> SGEEE v5.3p4?  Also, I don't really want submits to fail, since that would be 
> perceived by users as a limitation compared to LSF.
> We routinely have ~20,000 jobs pending.  Help appreciated!

I had similar problems to yours.  My setup is a lot different than
yours, but I'm wondering if your problem might be related to mine.  If
you read my recent post
(http://gridengine.sunsource.net/servlets/ReadMsg?msgId=17237&listName=users),
you'll see that I used strace to determine that sge_qmaster was doing a
lot of disk I/O in its queue that it didn't need to be doing.  In my
case it was over NFS, which made it even worse.
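
If you want to check for the same thing on your side, here is a rough
sketch of how the strace output could be summarized.  It assumes you
captured a log with something like
"strace -f -e trace=file -o qmaster.strace -p <pid of sge_qmaster>"
(the log file name is just an example).  The script and its function
names are my own illustration, not part of SGE; all it does is count
how often each file path shows up in path-taking system calls.

```python
import re
import sys
from collections import Counter

# Matches the syscall name and first quoted path in strace lines such as
#   1234 open("/some/file", O_RDONLY) = 5
#   openat(AT_FDCWD, "/some/file", O_WRONLY) = 5
# Calls whose first argument is a descriptor (read, write) are skipped.
CALL_RE = re.compile(r'\b(\w+)\((?:AT_FDCWD, )?"([^"]+)"')


def count_file_calls(lines):
    """Count (syscall, path) pairs seen in an iterable of strace lines."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.search(line)
        if match:
            syscall, path = match.groups()
            counts[(syscall, path)] += 1
    return counts


def top_paths(lines, n=10):
    """Return the n most frequently touched paths, summed over syscalls."""
    per_path = Counter()
    for (_syscall, path), count in count_file_calls(lines).items():
        per_path[path] += count
    return per_path.most_common(n)


if __name__ == "__main__":
    # Usage: python strace_summary.py qmaster.strace
    with open(sys.argv[1]) as log:
        for path, count in top_paths(log):
            print(f"{count:8d}  {path}")
```

If a handful of spool files dominate the list, that would point toward
the same kind of unnecessary I/O I was seeing.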

When I fixed my spool files as Andy indicated, it significantly
reduced the load sge_qmaster was producing.  It reduced it enough that
SGE no longer seems to be dropping nodes.  (I had a test case that could
reliably get SGE to drop random nodes.)

I don't know if you're seeing the same problem or not, but I think it
might be worth looking into.
