[GE users] scalability problems

Bryan Bayerdorffer bryan.bayerdorffer at analog.com
Mon Apr 5 19:24:20 BST 2004



Sean Dilda wrote:
> On Wed, 2004-03-17 at 08:58, Bryan Bayerdorffer wrote:
> 
>>When there are more than about 5000 jobs pending, our qmaster loses contact 
>>with all the exec hosts (about 100 hosts---all queues are shown as 
>>"temporarily unavailable"), and no more jobs are dispatched.  The qmaster host 
>>spends about 80% cpu time in iowait---probably disk I/O, since there's not much 
>>network traffic.
>>
>>We are in the process of switching from LSF.  The same host is also the LSF 
>>master, although the LSF queues are empty and there are only a handful of 
>>slaves still running.  When SGE "hangs", I can still submit and run an LSF job 
>>without delay, so the basics of the machine are ok.
>>
>>What can I tune to help this?  I read about max_jobs, but is that even in 
>>SGEEE v5.3p4?  Also, I don't really want submits to fail, since that would be 
>>perceived by users as a limitation compared to LSF.
>>
>>We routinely have ~20,000 jobs pending.  Help appreciated!
> 
> 
> I had similar problems to yours.  My setup is a lot different than
> yours, but I'm wondering if your problem might be related to mine.  If
> you read my recent post
> (http://gridengine.sunsource.net/servlets/ReadMsg?msgId=17237&listName=users),
> you'll see that I used strace to determine that sge_qmaster was doing a
> lot of disk I/O in its queue that it didn't need to be doing.  In my case
> it was over NFS, which made it even worse.
> 
> When I fixed my spool files like Andy indicated, it significantly
> reduced the load sge_qmaster was producing.  It reduced it enough that
> SGE seems to no longer be dropping nodes.  (I had a test case that could
> reliably get SGE to drop random nodes)
> 
> I don't know if you're seeing the same problem or not, but I think it
> might be worth looking into.

Interesting.  I looked at the exec_hosts files the other day, and about five 
hosts had reschedule_unknown set to something other than NONE (low job 
numbers).  I meant to try the fix just now, but all the files have corrected 
themselves without my having to shut down qmaster.
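
A minimal sketch of the check described above, for anyone who wants to repeat 
it.  It assumes each host file in the exec_hosts directory holds one 
"key value" pair per line and that the field name begins with 
reschedule_unknown; both the file layout and the function name 
find_flagged_hosts are assumptions, not something confirmed in this thread.

```python
import os


def find_flagged_hosts(spool_dir):
    """Return {file name: value} for host files whose reschedule_unknown*
    field is set to something other than NONE.

    Assumes one whitespace-separated "key value" pair per line (an
    assumption about the spool format, not a documented guarantee).
    """
    flagged = {}
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        with open(path, errors="replace") as fh:
            for line in fh:
                parts = line.split(None, 1)
                if len(parts) == 2 and parts[0].startswith("reschedule_unknown"):
                    value = parts[1].strip()
                    if value != "NONE":
                        flagged[name] = value
                    break  # only the first matching field per file is checked
    return flagged
```

Calling find_flagged_hosts() with the path to the qmaster's exec_hosts 
directory returns the hosts worth a closer look.  It only reads the files, so 
it should be safe to run while qmaster is up.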

