[GE users] SGE job memory limit problem
Frank R Korzeniewski
FRKorzeniewski at lbl.gov
Mon Feb 19 03:40:01 GMT 2007
I have a problem with SGE 6.0u8. I am trying to limit the memory
used by processes on a Linux cluster. I tried varying the memory
allocation parameters in SGE, and the only ones that had any effect
were s_vmem and h_vmem. The effect was not the one I wanted.
I have a test program that allocates memory until allocation fails,
then loops, referencing the allocated memory with a 4 KB stride.
This simulates a program that allocates and uses as much memory
as it can. When I run it from the command line it works fine:
it runs forever (effectively). When I run it under SGE, the job
gets terminated shortly after it hits the maximum memory allocation.
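
For readers who want to reproduce this, here is a minimal sketch of
such a test program. It is not the original source: the 1 MiB
allocation unit, the max_chunks cap and all function names are
assumptions made for illustration. Note also that on Linux, malloc
may keep succeeding well past physical memory unless a vmem limit or
strict overcommit is in force.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_BYTES  ((size_t)1024 * 1024) /* allocation unit, 1 MiB (assumption) */
#define STRIDE_BYTES ((size_t)4096)        /* one reference per 4 KB, as described */

/* Allocate CHUNK_BYTES blocks until malloc fails or max_chunks is reached.
 * The pointers are stored in chunks[]; returns the number of blocks obtained. */
size_t grab_memory(char **chunks, size_t max_chunks)
{
    size_t n = 0;

    assert(chunks != NULL);
    while (n < max_chunks) {
        char *p = malloc(CHUNK_BYTES);
        if (p == NULL)
            break;              /* allocation failed: stop growing */
        chunks[n++] = p;
    }
    return n;
}

/* Make one pass over all blocks, writing one byte every STRIDE_BYTES.
 * Returns the number of locations written. */
size_t touch_memory(char **chunks, size_t n, char value)
{
    size_t touched = 0;

    assert(chunks != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t off = 0; off < CHUNK_BYTES; off += STRIDE_BYTES) {
            chunks[i][off] = value;
            touched++;
        }
    }
    return touched;
}

/* Free every block obtained by grab_memory. */
void release_memory(char **chunks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        free(chunks[i]);
}
```

To mimic the behaviour described above, a main() would call
grab_memory with a very large max_chunks and then call touch_memory
in an endless loop.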
The problem is a miscommunication between daemons/execd/execd_ck_to_do.c
and daemons/shepherd/shepherd.c. It seems that for some reason
the execd wants to enforce both the cpu time limit and the memory
limit. If a hard limit is exceeded (cpu or memory), the execd sends
a SIGKILL to the job. If a soft limit is exceeded (cpu or memory),
the execd sends a SIGXCPU to the job. shepherd.c sets the
signal mask for the signals that it is going to handle, and there is
the problem: it does not do anything about the SIGXCPU signal.
When that signal comes along, the shepherd is killed. Now, my program
sets the SIGXCPU signal to be ignored, and in my command line
tests it does not die when I send it a SIGXCPU signal.
But when the shepherd dies, the execd sees it and sends a SIGKILL
to the process group. Bye bye my program.
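
The difference between the two cases can be shown outside SGE with a
small sketch. This is not shepherd code; the function names are
hypothetical. It only demonstrates that a process with the default
SIGXCPU disposition is terminated by the signal, while one that sets
it to SIG_IGN survives.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that sets SIGXCPU to ignored (ignore_signal != 0) or to the
 * default action (ignore_signal == 0), sends itself SIGXCPU, and exits 0 if
 * it is still alive. Returns the wait status collected by the parent. */
int run_child_with_sigxcpu(int ignore_signal)
{
    int status = 0;
    pid_t waited;
    pid_t pid = fork();

    assert(pid >= 0);
    if (pid == 0) {
        signal(SIGXCPU, ignore_signal ? SIG_IGN : SIG_DFL);
        raise(SIGXCPU);   /* default action terminates the process here */
        _exit(0);         /* reached only if the signal was ignored */
    }
    waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    return status;
}

/* Nonzero if the wait status says the child was terminated by SIGXCPU. */
int killed_by_sigxcpu(int status)
{
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGXCPU;
}
```

Calling run_child_with_sigxcpu(0) corresponds to the shepherd's
situation as described above, and run_child_with_sigxcpu(1) to my
test program's.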
Is it really the intent that the job be killed when the soft
memory limit is exceeded? I think this is rather draconian.
If this is just a bug, where do I report it? I have not been
using SGE for very long (a couple of months) and don't know
the procedure for reporting problems.
We are currently running a single job per node on our dual
cpu nodes to avoid load problems. We would like to go to
two or more depending on job memory requirements but we
cannot get the memory management under SGE to work
correctly. I have checked and the bug is still in the
update 10 sources. What do you think the chances are
of being able to get updated binaries of the shepherd?