[GE users] SGE job memory limit problem

Reuti reuti at staff.uni-marburg.de
Mon Feb 19 08:01:07 GMT 2007


Hi,

On 19.02.2007, at 04:40, Frank R Korzeniewski wrote:

>  Hi:
>   I have a problem with SGE 6.0u8. I am trying to limit the memory
>   used by processes on a Linux cluster. I tried varying the memory
>   allocation parameters in SGE, and the only ones that had an effect
>   were s_vmem and h_vmem. The effect was not one I really desired.
>   I have a test program that allocates memory until it fails, then
>   loops referencing the allocated memory at a 4 KB stride.
>   This simulates a program that allocates and uses as much memory
>   as it can. When I run this from the command line it works fine;
>   it runs forever (effectively). When I run it under SGE, the job
>   gets terminated a little after it hits the maximum memory allocation.
>
>   The problem is a miscommunication between
>   daemons/execd/execd_ck_to_do.c and daemons/shepherd/shepherd.c.
>   It seems that, for some reason, the execd wants to enforce both
>   the CPU time limit and the memory allocation limit.
>   If a hard limit (CPU or memory) is exceeded, the execd sends
>   a SIGKILL to the job. If a soft limit (CPU or memory) is exceeded,
>   the execd sends a SIGXCPU to the job. In shepherd.c, the signal
>   mask is set up for the signals it is going to handle, and there is
>   the problem: it does not do anything about the SIGXCPU signal.
>   When that signal arrives, the shepherd is killed. Now, my program
>   sets the SIGXCPU signal to be ignored, and in my command-line
>   tests it does not die when I send it a SIGXCPU signal.
>   When the shepherd dies, the execd sees it and sends a SIGKILL
>   to the process group. Bye bye, my program.
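
For reference, a test program along those lines might look like the
minimal C sketch below. The actual program is not shown in the thread,
so apart from the 4 KB touch stride and the ignored SIGXCPU, the
structure and chunk size here are assumptions based on the description:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK  (16 * 1024 * 1024)  /* allocate in 16 MB chunks (assumed size) */
#define STRIDE 4096                /* touch one byte per 4 KB page */

int main(void)
{
    static char *blocks[65536];
    size_t n = 0;

    /* Ignore the soft-limit notification, as the reported program does. */
    signal(SIGXCPU, SIG_IGN);

    /* Grab memory until allocation fails. */
    while (n < sizeof(blocks) / sizeof(blocks[0]) &&
           (blocks[n] = malloc(BLOCK)) != NULL)
        n++;

    printf("allocated %zu blocks of %d bytes\n", n, BLOCK);

    /* Keep the memory in use: touch every 4 KB page of every block, forever. */
    for (;;)
        for (size_t i = 0; i < n; i++)
            for (size_t off = 0; off < BLOCK; off += STRIDE)
                blocks[i][off]++;
}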

The SIGXCPU should be sent to the complete process group, including
the jobscript, so the signal also needs to be handled there. Do you
have something like:

trap '' xcpu

in your jobscript, so that handling the signal is left entirely to
the executing program?

-- Reuti


>   Is it really the intent that the job be killed when the soft
>   memory limit is exceeded? I think this is rather draconian.
>
>   If this is just a bug, where do I report it? I have not been
>   using SGE for very long (a couple of months) and don't know
>   the procedure for reporting problems.
>
>   We are currently running a single job per node on our dual-CPU
>   nodes to avoid load problems. We would like to go to two or more
>   jobs per node, depending on job memory requirements, but we
>   cannot get the memory management under SGE to work
>   correctly. I have checked, and the bug is still in the
>   update 10 sources. What do you think the chances are
>   of getting updated binaries of the shepherd?
>
>
> frank

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



