[GE users] SGE job memory limit problem

Frank R Korzeniewski FRKorzeniewski at lbl.gov
Mon Feb 19 16:36:12 GMT 2007


hi:
  My understanding of your reply is that it would be a good idea to
  also protect any application by ignoring the XCPU signal. This
  would be in addition to a fix to protect the shepherd. Thanks,
  I will add this in.
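
  A minimal sketch of that job-script protection, built on the trap
  line from the reply. The inner sh -c command is only a hypothetical
  stand-in for the real application: it sends itself the signal that
  execd would send on a soft-limit violation.

```shell
#!/bin/sh
# Ignore SIGXCPU in the job script. The ignored disposition is inherited
# by every process the script starts, unless that process changes it.
trap '' XCPU

# Stand-in for the real application (an assumption for illustration):
# it signals itself with SIGXCPU and then carries on.
result=$(sh -c 'kill -s XCPU $$; echo "still running"')
echo "$result"
```

  Without the trap line, the inner shell would be terminated by the
  signal before reaching its echo.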

  Your reply also got me thinking. I can just put a wrapper program
  around the shepherd, so the chain becomes execd -> wrapper ->
  shepherd. I don't have permissions on the SGE binary directory, so
  this solution will have to wait until Tuesday, when I can talk to
  the sys admins. Thanks for prompting this solution as well.
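
  A minimal sketch of such a wrapper, assuming the real shepherd
  binary has been renamed and that it leaves an inherited SIGXCPU
  disposition untouched, as the description of shepherd.c in the
  original message suggests. The path in REAL_SHEPHERD and the
  function name start_shepherd are hypothetical.

```shell
#!/bin/sh
# Hypothetical location of the renamed original shepherd binary;
# adjust to the local installation.
REAL_SHEPHERD="${REAL_SHEPHERD:-/path/to/sge_shepherd.real}"

# Ignore SIGXCPU, then replace this process with the real shepherd.
# Using exec keeps the PID that execd started, so execd still sees
# the shepherd as its direct child.
start_shepherd() {
    trap '' XCPU
    exec "$REAL_SHEPHERD" "$@"
}

# Hand over only if the real binary is present and executable.
[ -x "$REAL_SHEPHERD" ] && start_shepherd "$@"
```

  Since an ignored signal stays ignored across exec, anything the
  shepherd starts would also inherit the ignored SIGXCPU unless it
  sets its own disposition.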




frank


----- Original Message -----
From: Reuti <reuti at staff.uni-marburg.de>
Date: Monday, February 19, 2007 12:01 am
Subject: Re: [GE users] SGE job memory limit problem
To: users at gridengine.sunsource.net

> Hi,
> 
> On 19.02.2007 at 04:40, Frank R Korzeniewski wrote:
> 
> >  hi:
> >   I have a problem with SGE 6.0u8. I am trying to limit the memory
> >   used by processes on a Linux cluster. I tried varying the memory
> >   allocation parameters in SGE, and the only ones that had an effect
> >   were s_vmem and h_vmem. The effect was not the one I wanted.
> >   I have a test program that allocates memory until allocation fails,
> >   then loops, referencing the allocated memory at a 4 KB stride.
> >   This simulates a program that allocates and uses as much memory
> >   as it can. When I run it from the command line it works fine:
> >   it runs (effectively) forever. When I run it under SGE, the job
> >   gets terminated shortly after it hits the maximum memory allocation.
> >
> >   The problem is a miscommunication between
> >   daemons/execd/execd_ck_to_do.c and daemons/shepherd/shepherd.c.
> >   It seems that for some reason the execd wants to enforce both the
> >   CPU time limit and the memory allocation limit. If the hard limits
> >   are exceeded (CPU or memory), the execd sends a SIGKILL to the job.
> >   If the soft limits are exceeded (CPU or memory), the execd sends a
> >   SIGXCPU to the job. shepherd.c sets the signal mask for the signals
> >   that it is going to handle, and there is the problem: it does not
> >   do anything about the SIGXCPU signal. When this signal comes along,
> >   the shepherd is killed. Now my program sets the SIGXCPU signal to
> >   be ignored, and in my command-line tests it does not die when I
> >   send it a SIGXCPU signal. When the shepherd dies, the execd sees it
> >   and sends a SIGKILL to the process group. Bye bye my program.
> 
> The SIGXCPU should be sent to the complete process group, including
> the jobscript. So handling of the signal is necessary there as well.
> 
> Do you have something like:
> 
> trap '' xcpu
> 
> in your jobscript, so that the handling is left entirely to the
> executing program?
> 
> -- Reuti
> 
> 
> >   Is it really the intent that the job be killed when the soft
> >   memory limit is exceeded? I think this is rather draconian.
> >
> >   If this is just a bug, where do I report it? I have not been
> >   using SGE for very long (a couple of months) and don't know
> >   the procedure for reporting problems.
> >
> >   We are currently running a single job per node on our dual-CPU
> >   nodes to avoid load problems. We would like to go to two or
> >   more, depending on job memory requirements, but we cannot get
> >   the memory management under SGE to work correctly. I have
> >   checked, and the bug is still in the update 10 sources. What do
> >   you think the chances are of being able to get updated binaries
> >   of the shepherd?
> >
> >
> > frank
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> 




