[GE users] V6.1 scheduler performance
macmccalla at hess.com
Fri Apr 11 14:17:28 BST 2008
Hi Andy, hope you are doing well. I am now on vacation for a week, but will be glad to share all info when I get back to work.
----- Original Message -----
From: Andy Schwierskott <andy.schwierskott at sun.com>
To: users at gridengine.sunsource.net <users at gridengine.sunsource.net>
Sent: Fri Apr 11 05:43:49 2008
Subject: Re: [GE users] V6.1 scheduler performance
That's indeed extremely interesting.
Usually, using resource quotas increases scheduling times, sometimes even
drastically. With 'little effort' (i.e. not exactly knowing how to optimize
the expression of a resource quota and not knowing the implementation
details) you can decrease the cluster throughput drastically.
We are doing a lot of research and investing significant effort to improve
these issues. Your report is extremely encouraging! Over time we also need
to improve our documentation, which should describe how to configure
well-behaved RQ sets and which rules are problematic.
Can you share the RQ definition you implemented, along with an overview of
your Grid setup?
Did you have scheduler monitoring activated in the past, and is it active
now? This would show whether the new RQ definition could have caused this
change.
> Just thought I would post some information about V6.1 scheduler
> behavior that I discovered today, in case it helps someone else. I
> recently upgraded our cluster from V6.0u7 to V6.1u3, in hopes that
> scheduling performance would improve but there was no perceptible change
> in scheduling rate. However, this morning I implemented a trivial
> resource quota set to limit the max number of user jobs per host. There
> had been no previous resource quota sets defined. Within 5 minutes, the
> scheduler cpu usage on our dedicated 4-cpu qmaster/scheduler machine was
> reduced from 100% of a processor (fairly typical behavior for the
> current workload), to 2-20% of a processor and time from submission to
> start for small jobs dropped from about 45 minutes to 15 to 30 seconds.
> No other change in configuration or workload took place (to my
> knowledge). The cluster context is fairly robust, with 2000 hosts, 12
> queues, 10 parallel environments, and about 9000 queue instances. I
> would be interested to know if any others have had similar experiences.
> Mac McCalla
> Geoscience Systems Development Advisor
> Hess Corporation
> One Allen Center
> 500 Dallas St.
> Houston, Texas 77002
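
For readers unfamiliar with the syntax, a rule of the kind described in the
quoted message (capping what each user may run on each host) can be written
as a resource quota set roughly like the one below. This is only an
illustrative sketch: the definition actually used on the cluster was not
posted in this thread, so the rule name, the description, the choice of
"slots" as the limited resource, and the value 4 are all assumptions.

```
{
   name         max_user_slots_per_host
   description  "Hypothetical example: per-user slot cap on every host"
   enabled      TRUE
   limit        users {*} hosts {*} to slots=4
}
```

The braces around * ask for the limit to be applied to each user and each
host individually rather than to all of them as a group. A set like this
would normally be added with "qconf -arqs" and inspected with "qconf -srqs".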