[GE users] SGE job memory limit problem

Frank R Korzeniewski FRKorzeniewski at lbl.gov
Mon Feb 19 03:40:01 GMT 2007

  I have a problem with SGE 6.0u8. I am trying to limit the memory
  used by processes on a Linux cluster. I tried varying the memory
  allocation parameters in SGE, and the only ones that had any effect
  were s_vmem and h_vmem. The effect was not the one I wanted.
  I have a test program that allocates memory until allocation fails,
  then loops, touching the allocated memory at a 4 KB stride.
  This simulates a program that allocates and uses as much memory
  as it can. When I run it from the command line it works fine:
  it runs (effectively) forever. When I run it under SGE the job
  is terminated shortly after it reaches the maximum memory allocation.
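
  For readers who want to reproduce this, a test program of the kind
  described above might look roughly like the sketch below. It is not
  the original program: the function names, the chunk size, and the
  max_chunks cap (added so the sketch can be tried safely) are all
  illustrative assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

#define PAGE_STRIDE 4096 /* 4 KB stride, as in the description above */

/* Allocate chunk-sized blocks until malloc fails or max_chunks is
 * reached (the cap is an assumption of this sketch). Returns the
 * number of blocks stored in blocks[]. */
size_t grab_memory(void **blocks, size_t max_chunks, size_t chunk)
{
    size_t n = 0;
    while (n < max_chunks) {
        void *p = malloc(chunk);
        if (p == NULL)
            break; /* allocation failed: the memory limit was hit */
        blocks[n++] = p;
    }
    return n;
}

/* Write one byte in every 4 KB span of every block so the pages are
 * really used. Returns the number of bytes written. */
size_t touch_memory(void **blocks, size_t n, size_t chunk)
{
    size_t touched = 0;
    for (size_t i = 0; i < n; i++) {
        volatile unsigned char *p = blocks[i];
        for (size_t off = 0; off < chunk; off += PAGE_STRIDE) {
            p[off] = 1;
            touched++;
        }
    }
    return touched;
}

/* Free everything obtained by grab_memory. */
void release_memory(void **blocks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        free(blocks[i]);
}
```

  To mimic the behaviour described, a main() would call grab_memory()
  with a very large cap and then call touch_memory() in an endless loop.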

  The problem is a miscommunication between daemons/execd/execd_ck_to_do.c
  and daemons/shepherd/shepherd.c. It seems that for some reason
  the execd wants to enforce both the CPU time limit and the memory limit.
  If a hard limit (CPU or memory) is exceeded, the execd sends
  a SIGKILL to the job. If a soft limit (CPU or memory) is exceeded,
  the execd sends a SIGXCPU to the job. shepherd.c sets the
  signal mask for the signals that the shepherd is going to handle, and
  that is where the problem lies: it does nothing about SIGXCPU, so
  when that signal arrives the shepherd is killed. My program
  sets SIGXCPU to be ignored, and in my command-line
  tests it does not die when I send it a SIGXCPU.
  But when the shepherd dies, the execd notices and sends a SIGKILL
  to the process group. Bye bye my program.
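
  The job-side handling mentioned above (setting SIGXCPU to be ignored)
  can be sketched as follows. This is only an illustration using the
  standard signal() and raise() calls; the function names are made up
  for the example, and it does not reproduce anything from shepherd.c.

```c
#include <signal.h>

/* Set the disposition of SIGXCPU to "ignore" for this process.
 * Returns 0 on success, -1 if signal() reports an error. */
int ignore_sigxcpu(void)
{
    return signal(SIGXCPU, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Send SIGXCPU to ourselves. If the signal is ignored, control
 * returns here and we report 1. With the default disposition the
 * process would be terminated instead and never return. */
int survives_sigxcpu(void)
{
    if (raise(SIGXCPU) != 0)
        return 0; /* the signal could not be sent */
    return 1;
}
```

  A process that calls ignore_sigxcpu() first survives a later
  "kill -XCPU <pid>" from the command line, which matches the
  command-line test described above. A process that leaves SIGXCPU at
  its default disposition is terminated by the same signal.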

  Is it really the intent that the job be killed when the soft
  memory limit is exceeded? I think this is rather draconian.

  If this is just a bug, where do I report it? I have not been
  using SGE for very long (a couple of months) and don't know
  the procedure for reporting problems.

  We are currently running a single job per node on our dual-CPU
  nodes to avoid load problems. We would like to go to
  two or more, depending on job memory requirements, but we
  cannot get the memory management under SGE to work
  correctly. I have checked, and the bug is still in the
  update 10 sources. What do you think the chances are
  of being able to get updated binaries of the shepherd?

