[GE users] virtual memory

bug bug at sas.upenn.edu
Tue Nov 2 19:18:17 GMT 2010

I have to give some background to get to my questions at the bottom...

I have a cluster of CentOS Linux machines, all with 8 cores, 8G of
memory and 2G of swap.  Exec nodes were crashing because of memory
exhaustion.  This has prompted me to implement hard memory limits and
make h_vmem consumable.  The main user application is Matlab Distributed
Computing Environment.

I set exec node attributes, in a for loop:
# qconf -rattr exechost complex_values slots=8,virtual_free=10G node01
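
The loop itself is not shown above. A minimal sketch of it, with
placeholder node names; print_node_limits is a made-up helper that only
prints the qconf commands so they can be reviewed before being run as
root:

```shell
#!/bin/sh
# Print one qconf command per exec host. Node names are placeholders.
print_node_limits() {
    for node in "$@"; do
        echo "qconf -rattr exechost complex_values slots=8,virtual_free=10G $node"
    done
}

print_node_limits node01 node02 node03
```

Once the output looks right, it can be piped to sh on the qmaster host.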

I modify the complex configuration, setting h_vmem consumable to YES:
# qconf -mc
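
To confirm the change took effect, the consumable column can be read
back from qconf -sc. A small sketch, assuming the usual column order
(name, shortcut, type, relop, requestable, consumable, default,
urgency); consumable_flag is a made-up helper name:

```shell
#!/bin/sh
# Read complex definitions on stdin and print the consumable column
# (sixth field) for the named complex. Assumes the standard column order.
consumable_flag() {
    awk -v name="$1" '$1 == name { print $6 }'
}

# On a live cluster:  qconf -sc | consumable_flag h_vmem
# Stand-in line so the sketch runs without Grid Engine:
echo "h_vmem h_vmem MEMORY <= YES YES 0 0" | consumable_flag h_vmem
```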

I set the default options for jobs in the
$SGE_ROOT/default/common/sge_request file:
-l h_vmem=2g
-l h_stack=128m

Note that if you do not set h_stack, Matlab and Python will refuse to start.

Users can still request whatever size they want, larger or smaller:
$ qsub -l h_vmem=4g -l h_stack=256m myjob.sh
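
A job can also print the limits it was actually given. This sketch
assumes Grid Engine applies h_vmem and h_stack as the ordinary
per-process address-space and stack limits that ulimit reports;
show_limits is a made-up name:

```shell
#!/bin/sh
# Print the address-space and stack limits of the current shell.
# ulimit reports kilobytes, or "unlimited" when no limit is set.
show_limits() {
    echo "vmem_kb=$(ulimit -v)"
    echo "stack_kb=$(ulimit -s)"
}

show_limits
```

Putting this at the top of myjob.sh makes it easy to check that a
request such as h_vmem=4g arrived as expected.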

I also enable job info by setting schedd_job_info to true:
# qconf -msconf

Now users can see how much memory their jobs are actually using, and get
some accounting info during or after the fact:
$ qstat -j $JOBNUM
$ qacct -j $JOBNUM

So, now I have a setup that will kill runaway jobs before they kill the
exec node.

If I look on the node where a job is running, I see:
6560 myuser  17  0 1477m 226m  45m S 99.8  2.8 14:39.41  MATLAB

The process is only using 226m of physical memory.  Yes, the virtual
memory allocation is 1477m, but I assume that most of that is on disk
or in dynamic libraries.
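
To put a number on the gap, the virtual and resident columns of the
process line above can be subtracted. A small sketch, assuming the
default top column order (VIRT fifth, RES sixth) and the same unit in
both columns; vm_gap is a made-up name:

```shell
#!/bin/sh
# Read top-style process lines on stdin and print VIRT minus RES.
# awk's "+ 0" keeps the leading number and drops the unit suffix,
# so both columns must use the same unit (megabytes here).
vm_gap() {
    awk '{ print ($5 + 0) - ($6 + 0) }'
}

echo "6560 myuser  17  0 1477m 226m  45m S 99.8  2.8 14:39.41  MATLAB" | vm_gap
# prints 1251
```

So roughly 1251m of the address space is mapped but not resident in RAM.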

Why is the virtual size so high?  Am I missing something?  Shouldn't the
hard limit be on the actual physical memory usage, not the virtual?  Is
vmem the only predictable metric for the memory footprint of a job
instance?  Is there a hard limit we can set on physical RAM, not virtual?

If we run five of these jobs, a node is full, but there are still free
cores and free physical RAM.  This is not optimal.  How are others
reining in their job memory usage effectively and still using the
system to the fullest?
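
The arithmetic behind that, as a sketch.  It assumes the 10G per-node
consumable is what the 2g default h_vmem request is counted against,
and that each job's resident size is about the 226m seen above:

```shell
#!/bin/sh
node_limit_mb=10240   # 10G consumable per node
job_request_mb=2048   # default h_vmem request (2g)
job_resident_mb=226   # RES of one MATLAB process
cores=8

jobs_by_vmem=$((node_limit_mb / job_request_mb))
idle_cores=$((cores - jobs_by_vmem))
resident_total_mb=$((jobs_by_vmem * job_resident_mb))

echo "jobs that fit: $jobs_by_vmem"
echo "idle cores: $idle_cores"
echo "resident memory in use: ${resident_total_mb}m of 8192m"
```

With these numbers the node fills at five jobs, leaving three cores
idle while the jobs together hold about 1130m of the 8G of RAM.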

Gavin W. Burris
Senior Systems Programmer
Information Security and Unix Systems
School of Arts and Sciences
University of Pennsylvania

