[GE users] Limit load on NFS server
beckerjes at mail.nih.gov
Tue May 12 20:21:19 BST 2009
> I have a cluster of about 30 compute nodes (~200 cores). The cluster has
> 8 NFS servers providing about 80TB of storage.
> If a single user starts 200 jobs doing (heavy) IO, over the network, on
> a single NFS server, it would not perform very well or it may even
> crash. I am trying to devise a method to limit the number of jobs
> accessing a single NFS server at once.
> At the moment my idea would be to create a set of consumable complex
> attributes, one for each nfs server, and have the users request one of
> them when submitting jobs doing IO on a particular NFS server. In this
> way the maximum number of jobs accessing a given NFS server at once can
> be limited.
> I don't like this idea very much though: if the jobs are just doing IO
> at the beginning of the script, this approach would stop other jobs from
> being executed even after the load on the NFS server is back to normal.
> So I was looking at some better way to dynamically limit the load on the
> NFS servers. Any suggestions?
I've been thinking about this as well. One thought I had was to create a
load_sensor, and set load_thresholds on the queues. I haven't tested this at
all, but perhaps something like this:
For each NFS server, there would be a complex called something like
"nfs_load_SERVERNAME". This would be updated by a load_sensor, most likely
running on the NFS servers. The queues would then set a load threshold on
each of these complexes, with whatever value is appropriate for your systems.
Thus, as the load on the NFS servers rises, the queues would trip the load
thresholds, and no new jobs would be dispatched to the queues. Unfortunately,
I suspect that this may cause some problems with load oscillation, as large
numbers of jobs are dispatched all at once. Perhaps if load_adjustments in
the scheduler are used this could be avoided.
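A minimal, untested sketch of such a load sensor is below, in Python. It
follows the usual load sensor protocol: execd writes a line to the sensor's
stdin each load report interval, the sensor answers with a "begin" / "end"
block of host:complex:value lines, and it exits when it reads "quit". The
assumptions here are mine and not something I have verified on a real cluster:
the NFS server runs sge_execd so it can host the sensor, the 1-minute load
average from /proc/loadavg is a good enough stand-in for "NFS load", the value
is reported under the "global" host so queues on the compute nodes can see it,
and the complex name is built from the short hostname.

```python
#!/usr/bin/env python3
"""Sketch of a Grid Engine load sensor reporting one NFS server's load."""
import socket
import sys


def read_load_average(path="/proc/loadavg"):
    """Return the 1-minute load average (assumed proxy for NFS load)."""
    with open(path) as f:
        return float(f.read().split()[0])


def format_report(complex_name, value, host="global"):
    """Build one load report block in the host:complex:value format."""
    return "begin\n{}:{}:{:.2f}\nend\n".format(host, complex_name, value)


def run_sensor(stdin, stdout, complex_name, get_load=read_load_average):
    """Answer each request line with a report; stop on 'quit' or EOF."""
    while True:
        line = stdin.readline()
        if line == "" or line.strip() == "quit":
            break
        stdout.write(format_report(complex_name, get_load()))
        stdout.flush()  # execd reads the report from a pipe


if __name__ == "__main__":
    # Hypothetical naming scheme: nfs_load_<short hostname of this server>.
    server = socket.gethostname().split(".")[0]
    run_sensor(sys.stdin, sys.stdout, "nfs_load_" + server)
```

The matching complex (for example nfs_load_nfs1, a made-up name) would still
have to be defined with qconf -mc, the script registered as the load_sensor in
the host configuration, and the threshold added to load_thresholds in each
queue.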
The downside to this is that you need to set these thresholds for all queues
you care about. Of course, you could do clever things by making the complex
FORCED, so users would have to request it, and thus be slightly aware of the
NHGRI Linux support (Digicon Contractor)