[GE users] scalability problems

Charu Chaubal Charu.Chaubal at Sun.COM
Thu Apr 1 20:15:17 BST 2004


One other comment: unless you need it for HA purposes (e.g., a shadow
master), it's highly recommended to make the qmaster_spool_dir
local and NOT over NFS; otherwise you could lose a lot of performance.

Regards,
	Charu
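
For anyone who wants to check where their spool directory actually lives,
here is a minimal Python sketch (not from the original thread). It assumes
a Linux host and a mount table in /proc/mounts format. The helper names,
the sample mount table, and the paths in it are made up for illustration;
substitute whatever directory your qmaster_spool_dir points to.

```python
import os

# Hypothetical mount table in /proc/mounts format:
#   device  mount_point  fs_type  options  dump  pass
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw 0 0
fileserver:/export/sge /opt/sge nfs rw,hard,intr 0 0
/dev/sdb1 /var/spool/sge ext3 rw 0 0
"""


def fs_type_for_path(path, mounts_text):
    """Return the fs type of the longest mount point that contains path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Compare whole path components so /opt does not match /optical.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same point shadow an earlier one.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


def is_nfs(path, mounts_text):
    """True if path sits on a filesystem whose type starts with 'nfs'."""
    fs_type = fs_type_for_path(path, mounts_text)
    return fs_type is not None and fs_type.startswith("nfs")
```

On a real host you would pass open("/proc/mounts").read() as mounts_text
instead of the sample table. The sketch does not resolve symlinks and does
not decode escaped characters in mount points, so treat a "local" answer
as a hint and not as proof.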

On Apr 1, 2004, at 10:46 AM, Sean Dilda wrote:

> On Wed, 2004-03-17 at 08:58, Bryan Bayerdorffer wrote:
>> When there are more than about 5000 jobs pending, our qmaster loses contact
>> with all the exec hosts (about 100 hosts---all queues are shown as
>> "temporarily unavailable"), and no more jobs are dispatched.  The qmaster host
>> spends about 80% cpu time in iowait---probably disk io, since there's not much
>> network traffic.
>>
>> We are in the process of switching from LSF.  The same host is also the LSF
>> master, although the LSF queues are empty and there are only a handful of
>> slaves still running.  When SGE "hangs", I can still submit and run an LSF job
>> without delay, so the basics of the machine are ok.
>>
>> What can I tune to help this?  I read about max_jobs, but is that even in
>> SGEEE v5.3p4?  Also, I don't really want submits to fail, since that would be
>> perceived by users as a limitation compared to LSF.
>>
>> We routinely have ~20,000 jobs pending.  Help appreciated!
>
> I had similar problems to yours.  My setup is a lot different from
> yours, but I'm wondering if your problem might be related to mine.  If
> you read my recent post
> (http://gridengine.sunsource.net/servlets/ReadMsg?msgId=17237&listName=users),
> you'll see that I used strace to determine that sge_qmaster was doing a
> lot of disk I/O in its queue that it didn't need to be doing.  In my
> case it was over NFS, which made it even worse.
>
> When I fixed my spool files like Andy indicated, it significantly
> reduced the load sge_qmaster was producing.  It reduced it enough that
> SGE seems to no longer be dropping nodes.  (I had a test case that could
> reliably get SGE to drop random nodes.)
>
> I don't know if you're seeing the same problem or not, but I think it
> might be worth looking into.
########################################################
# Charu V. Chaubal				# Phone: (650) 786-7672 (x87672)
# Grid Computing Technologist	# Fax:   (650) 786-4591
# Sun Microsystems, Inc.			# Email: charu.chaubal at sun.com
########################################################



