[GE users] Best practices in memory resource distribution?

txema_heredia txema.heredia at upf.edu
Thu Apr 8 15:17:01 BST 2010

Hi all,

I am the administrator of an SGE 6.1u4 cluster composed of 8 nodes, each with 8 cores and between 8 GB and 32 GB of memory.
Our users run a wide variety of jobs, from shabby handmade Perl scripts to well-known, well-made software like BLAST, so job length and memory requirements vary greatly, from a few MB to more than 20 GB.

Because of this, I have had several problems configuring our SGE scheduling policy, especially regarding memory usage, and I would be glad if you could shed some light on them.

First of all, my premises:

- I want to use as many cores as possible (ideally all 64)
- I want to use as much memory as possible (no jobs delayed because, for example, the high-memory nodes are filled with 8 low-memory jobs while the low-memory nodes are filled with a couple of high-memory jobs)
- I want a responsive system (no oversubscribed resources)
- I don't want jobs to be swapped out (especially running ones)
- I don't want jobs to be killed unexpectedly for memory reasons.

In order to achieve this, I have tried different approaches:

- Manage memory through h_vmem.
Set the h_vmem complex attribute as consumable, define it per host equal to the host's physical memory, and require users to request it or else accept the default value (6 GB).
This method is good because it ensures no swapping: jobs cannot use more than their reserved h_vmem, and if they try, they are killed immediately.
Since the jobs on a host together cannot use more than its physical memory, the system is always responsive and jobs don't interfere with each other (in memory terms). In addition, jobs are only killed for memory reasons through user error. Our users' jobs tend to come in batches of 100 to 1000 similar jobs, and I have taught them to use qacct to get the maxvmem of the first job of a batch and extrapolate it to the rest of the batch. On top of this, I have aliased qsub so that, when a job is submitted, it checks how much memory the job requests and sends it to the hostgroup where it is most likely to fit (8 GB nodes if h_vmem <= 1G, 16 GB nodes if <= 2G, 32 GB nodes otherwise).
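The routing rule of that qsub wrapper can be sketched roughly as follows. This is only an illustration in Python, not the actual alias: the hostgroup names (@nodes8g, @nodes16g, @nodes32g) and the function names are made up, and the suffix handling assumes the usual SGE convention that uppercase K/M/G are binary multipliers and lowercase k/m/g are decimal ones.

```python
import re

# Assumed multiplier convention: lowercase = decimal, uppercase = binary.
_MULTIPLIERS = {
    "": 1,
    "k": 1000, "K": 1024,
    "m": 1000**2, "M": 1024**2,
    "g": 1000**3, "G": 1024**3,
}


def parse_mem(spec: str) -> int:
    """Convert a memory request such as '1G' or '512M' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", spec.strip())
    if match is None:
        raise ValueError(f"unrecognised memory value: {spec!r}")
    number, suffix = match.groups()
    return int(number) * _MULTIPLIERS[suffix]


def pick_hostgroup(h_vmem: str) -> str:
    """Pick a (hypothetical) hostgroup using the thresholds quoted in the post."""
    requested = parse_mem(h_vmem)
    if requested <= parse_mem("1G"):
        return "@nodes8g"
    if requested <= parse_mem("2G"):
        return "@nodes16g"
    return "@nodes32g"


if __name__ == "__main__":
    for request in ("512M", "2G", "6G"):
        print(request, "->", pick_hostgroup(request))
```

A wrapper would then add the chosen hostgroup to the job's queue request before calling the real qsub.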
But this method has two big problems:

1 - It depends too much on users 'doing it right': if they do, everything works fine. But the attribute that reserves memory is the same one that kills jobs when exceeded, so users tend to request more memory than they need, over-reserving memory that other users' jobs could use. And for the first job of a batch, they need to reserve a standard chunk of memory (6 GB by default) so that it can finish without being killed and they can read off the maxvmem needed by the rest of the batch.

2 - Memory peaks: some jobs use a constant amount of memory. Others use an increasing amount over time. But many others use a roughly constant amount with peaks much higher than the mean. Those peaks force the job to be submitted with h_vmem large enough to cover the highest peak. For example, suppose a job runs for a week and usually uses 1 GB of memory, but peaks at 7.5 GB at the very end, so it has to be submitted with -l h_vmem=8G. On a node with 16 GB of RAM you can only schedule 2 of those jobs and nothing else. You then have a node with only 2 of its 8 cores in use and only 2 of its 16 GB of memory in use most of the time (the peaks don't even have to happen at the same time!). This is a HUGE undersubscription of resources.
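The arithmetic of that example can be written out as a small Python sketch. The function name and parameters are invented for illustration, and it assumes every job on the node requests the same h_vmem.

```python
def packing(node_mem_gb, node_cores, h_vmem_gb, typical_gb):
    """Return (jobs that fit, idle cores, GB typically in use) for one node.

    Jobs are limited both by the consumable h_vmem and by the core count.
    """
    jobs = min(node_cores, int(node_mem_gb // h_vmem_gb))
    return jobs, node_cores - jobs, jobs * typical_gb


if __name__ == "__main__":
    # The 16 GB node from the example: h_vmem=8G requested, about 1 GB used most of the time.
    jobs, idle_cores, typical_use = packing(
        node_mem_gb=16, node_cores=8, h_vmem_gb=8, typical_gb=1
    )
    print(f"{jobs} jobs fit, {idle_cores} cores idle, about {typical_use} of 16 GB in use")
```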

- Manage memory through suspend thresholds.
This method doesn't use h_vmem as a consumable (though you still set it, to stop jobs doing nasty things) and doesn't ask users how much memory their jobs will use (although that could be implemented as an additional attribute that does nothing). Any submitted job is scheduled as long as there are free slots, so we get 100% slot usage. Then, to control the host's memory usage, a suspend threshold is set on used memory, so that once the threshold is reached, some jobs are suspended until memory is available again.
But here come the problems:
Suspend thresholds suspend jobs once the threshold is reached and resume them once the load value drops back below it. The problem is that there is no way to hold a suspension in between. Take a threshold of 10 GB, for instance: once mem_used reaches 10 GB, the threshold triggers and N jobs are suspended. But after 1 minute (the suspend interval), mem_used is still 10 GB or more, because the memory of a suspended job is not freed. The threshold is still exceeded, more jobs are suspended, and we end up with every job on the host suspended.
To fix this, I wrote my own load sensor, which reports the host's used memory (from the free command) minus the memory used by processes in suspended state (from ps). Now, when 'real_used_memory' reaches the 10 GB threshold, some jobs are suspended, and after 1 minute 'real_used_memory' has dropped to, for instance, 6 GB, so the jobs are resumed. In the next interval 'real_used_memory' is back at 10 GB and over the threshold, the jobs are suspended again, and so on and so forth. We end up alternating between one interval under the threshold and one interval over it, with an uncontrollable amount of memory in use during the latter, which can lead the host to swap out running processes (which we don't want).
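The oscillation with the custom sensor can be reproduced with a toy simulation. This is a deliberately simplified Python model, not SGE itself: each job is assumed to hold a constant amount of memory, the reading excludes suspended jobs (as 'real_used_memory' does), and `nsuspend` here simply stands for the number of jobs suspended or resumed per interval.

```python
def simulate(job_mem_gb, threshold_gb, nsuspend, intervals):
    """Return the sensor reading seen at the start of each suspend interval."""
    running = list(job_mem_gb)
    suspended = []
    readings = []
    for _ in range(intervals):
        # Suspended jobs are excluded from the reading, as in the custom sensor.
        real_used = sum(running)
        readings.append(real_used)
        if real_used >= threshold_gb:
            # Threshold reached: suspend up to nsuspend running jobs.
            for _ in range(min(nsuspend, len(running))):
                suspended.append(running.pop())
        else:
            # Below the threshold: resume up to nsuspend suspended jobs.
            for _ in range(min(nsuspend, len(suspended))):
                running.append(suspended.pop())
    return readings


if __name__ == "__main__":
    # Five 2 GB jobs against a 10 GB threshold, two jobs handled per interval.
    print(simulate([2, 2, 2, 2, 2], threshold_gb=10, nsuspend=2, intervals=6))
    # -> [10, 6, 10, 6, 10, 6]
```

The reading flips between 10 and 6 on every interval, which is the suspend/resume ping-pong described above.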

So I have two questions:
- Is there a way to tell suspend_threshold "suspend jobs when a first threshold is exceeded, BUT only resume them once memory usage goes below a second, lower threshold"? (apart from writing my own daemon)

- What memory policy do you use or how do you handle this?
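To make the first question concrete, the two-threshold behaviour I have in mind is plain hysteresis. The Python sketch below shows only the decision rule; it is not an existing SGE option, and the function and parameter names are made up.

```python
def next_action(real_used_gb, suspend_at_gb, resume_below_gb):
    """Decide what to do for one interval, given the current sensor reading."""
    if resume_below_gb >= suspend_at_gb:
        raise ValueError("resume_below_gb must be lower than suspend_at_gb")
    if real_used_gb >= suspend_at_gb:
        return "suspend"
    if real_used_gb < resume_below_gb:
        return "resume"
    # Between the two thresholds: leave suspended jobs suspended.
    return "hold"


if __name__ == "__main__":
    for reading in (10, 6, 6, 4):
        print(reading, "->", next_action(reading, suspend_at_gb=10, resume_below_gb=6))
```

With a single threshold the "hold" band does not exist, so a reading of 6 GB resumes the jobs right away. The gap between the two thresholds would have to be at least as large as the memory the resumed jobs bring back.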

Thank you,


PS: I am also thinking about a method that combines a consumable mem_free (as a scheduling hint) with automatically setting h_vmem for the job (equal to the host's physical memory), but I am still working it out, and it would still need the suspend_thresholds.

