[GE users] SGE job memory limit problem

Reuti reuti at staff.uni-marburg.de
Mon Feb 19 17:03:35 GMT 2007


Hi,

Am 19.02.2007 um 17:36 schrieb Frank R Korzeniewski:

> hi:
>   My understanding of your reply is that it would be a good idea to
>   also protect any application by ignoring the XCPU signal. This
>   would be in addition to a fix to protect the shepherd. Thanks,
>   I will add this in.
>
>   Your reply also had me thinking. I can just put a wrapper program
>   around the shepherd: execd -> wrapper -> shepherd. I don't
>   have permissions on the SGE binary directory, so this solution will
>   have to wait until Tuesday, when I can talk to the sys admins. Thanks
>   for triggering this solution also.

AFAIK only the children of the shepherd will get the signal, i.e. the
jobscript (bash) and your program. I have never seen the shepherd
disappear because of a SIGXCPU.

reuti at node44:~> ps -e f -o pid,ppid,pgrp,command
   PID  PPID  PGRP COMMAND
1825     1  1825 /usr/sge/bin/lx24-x86/sge_execd
14663  1825 14663  \_ sge_shepherd-44513 -bg
14664 14663 14664      \_ /bin/sh /var/spool/sge/node44/job_scripts/44513
14665 14664 14664          \_ /home/reuti/ever

Process group 14664 will get the signal.
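
Since the whole process group gets the signal, the jobscript has to cope
with it too. A minimal sketch of what that could look like, assuming a
plain /bin/sh jobscript; the inner `sh -c` is only a hypothetical stand-in
for your program, and the `kill` lines merely simulate the signal that the
execd would send:

```shell
#!/bin/sh
# Ignore SIGXCPU in the jobscript's shell. An ignored signal stays
# ignored in child processes unless they install their own handler.
trap '' XCPU

# Simulate the soft-limit signal arriving at the jobscript itself.
kill -s XCPU "$$"
script_survived=yes

# Hypothetical stand-in for the real program: a child that receives
# SIGXCPU and inherits the "ignore" disposition from the jobscript.
child_out=$(sh -c 'kill -s XCPU "$$"; echo "child survived"')

echo "jobscript survived: $script_survived"
echo "$child_out"
```

A program that wants to react to the soft limit can still install its own
SIGXCPU handler; the trap only keeps the shell from dying first.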

-- Reuti


>
>
>
> frank
>
>
> ----- Original Message -----
> From: Reuti <reuti at staff.uni-marburg.de>
> Date: Monday, February 19, 2007 12:01 am
> Subject: Re: [GE users] SGE job memory limit problem
> To: users at gridengine.sunsource.net
>
>> Hi,
>>
>> Am 19.02.2007 um 04:40 schrieb Frank R Korzeniewski:
>>
>>>  hi:
>>>   I have a problem with SGE 6.0u8. I am trying to limit the memory
>>>   used by processes on a Linux cluster. I tried varying the memory
>>>   allocation parameters in SGE, and the only ones that had an effect
>>>   were s_vmem and h_vmem. The effect was not one I really desired.
>>>   I have a test program that allocates memory until allocation fails,
>>>   then loops referencing the allocated memory at 4 KB intervals.
>>>   This simulates a program that allocates and uses as much memory
>>>   as it can. When I run this from the command line it works fine:
>>>   it runs forever (effectively). When I run it under SGE the job
>>>   gets terminated a little after it hits the max memory allocation.
>>>
>>>   The problem is a miscommunication between
>>>   daemons/execd/execd_ck_to_do.c and daemons/shepherd/shepherd.c.
>>>   It seems that for some reason the execd wants to enforce both the
>>>   CPU time limit and the memory allocation limit. If the hard limits
>>>   are exceeded (CPU or memory), the execd sends a SIGKILL to the job.
>>>   If the soft limits are exceeded (CPU or memory), the execd sends a
>>>   SIGXCPU to the job. shepherd.c sets the signal mask for the signals
>>>   that it is going to handle. There is the problem: it does not do
>>>   anything about the SIGXCPU signal. When this comes along, the
>>>   shepherd is killed. Now, my program sets the SIGXCPU signal to be
>>>   ignored, and in my command line tests it does not die when I send
>>>   it a SIGXCPU signal. When the shepherd dies, the execd sees it and
>>>   sends a SIGKILL to the process group. Bye bye my program.
>>
>> The SIGXCPU should be sent to the complete process group, including
>> the jobscript. So the signal has to be handled there as well.
>>
>> Do you have something like:
>>
>> trap '' xcpu
>>
>> in your jobscript, so that the handling is completely up to the
>> executing program only?
>>
>> -- Reuti
>>
>>
>>>   Is it really the intent that the job be killed when the soft
>>>   memory limit is exceeded? I think this is rather draconian.
>>>
>>>   If this is just a bug, where do I report it? I have not been
>>>   using SGE for very long (a couple of months) and don't know
>>>   the procedure to report problems.
>>>
>>>   We are currently running a single job per node on our dual
>>>   cpu nodes to avoid load problems. We would like to go to
>>>   two or more depending on job memory requirements but we
>>>   cannot get the memory management under SGE to work
>>>   correctly. I have checked and the bug is still in the
>>>   update 10 sources. What do you think the chances are
>>>   of being able to get updated binaries of the shepherd?
>>>
>>>
>>> frank
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>



