[GE users] scalability problems
bryan.bayerdorffer at analog.com
Mon Apr 5 19:24:20 BST 2004
Sean Dilda wrote:
> On Wed, 2004-03-17 at 08:58, Bryan Bayerdorffer wrote:
>>When there are more than about 5000 jobs pending, our qmaster loses contact
>>with all the exec hosts (about 100 hosts---all queues are shown as
>>"temporarily unavailable"), and no more jobs are dispatched. The qmaster host
>>spends about 80% cpu time in iowait---probably disk io. since there's not much
>>We are in the process of switching from LSF. The same host is also the LSF
>>master, although the LSF queues are empty and there are only a handful of
>>slaves still running. When SGE "hangs", I can still submit and run an LSF job
>>without delay, so the basics of the machine are ok.
>>What can I tune to help this? I read about max_jobs, but is that even in
>>SGEEE v5.3p4? Also, I don't really want submits to fail, since that would be
>>perceived by users as a limitation compared to LSF.
>>We routinely have ~20,000 jobs pending. Help appreciated!
> I had similar problems to yours. My setup is a lot different than
> yours, but I'm wondering if your problem might be related to mine. If
> you read my recent post
> (http://gridengine.sunsource.net/servlets/ReadMsg?msgId=17237&listName=users),
> you'll see that I used strace to determine that sge_qmaster was doing a
> lot of disk I/O in its queue that it didn't need to be doing. In my case
> it was over NFS, which made it even worse.
> When I fixed my spool files like Andy indicated, it significantly
> reduced the load sge_qmaster was producing. It reduced it enough that
> SGE seems to no longer be dropping nodes. (I had a test case that could
> reliably get SGE to drop random nodes)
> I don't know if you're seeing the same problem or not, but I think it
> might be worth looking into.
Interesting. I looked at exec_hosts the other day and about five hosts had
reschedule_unknown != NONE (low job numbers). I meant to try the fix just
now, but all the files have fixed themselves without shutting down qmaster.
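
For anyone wanting to repeat the exec_hosts check, here is a rough Python
sketch. It assumes the classic flat-file spool, where each file in the
exec_hosts directory holds whitespace-separated "name value" lines, and that
the field of interest starts with reschedule_unknown. The exact field name and
the directory layout are assumptions, so adjust them to what your spool files
actually contain. The function name flagged_hosts is made up for illustration.

```python
from pathlib import Path


def flagged_hosts(spool_dir, key_prefix="reschedule_unknown"):
    """Return {file name: value} for spool files whose matching field is not NONE.

    Assumption: one "name value" pair per line, separated by whitespace.
    The key_prefix default is a guess at the field name, not a documented one.
    """
    flagged = {}
    for path in sorted(Path(spool_dir).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text(errors="replace").splitlines():
            parts = line.split(None, 1)
            if not parts or not parts[0].startswith(key_prefix):
                continue
            # A field with no value at all is reported too, as an empty string.
            value = parts[1].strip() if len(parts) > 1 else ""
            if value.upper() != "NONE":
                flagged[path.name] = value
    return flagged
```

Call it with the qmaster's exec_hosts spool directory and print the result.
It only reads the files, so it should be safe to run while qmaster is up.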