[GE users] SGE job memory limit problem

Frank R Korzeniewski FRKorzeniewski at lbl.gov
Mon Feb 19 18:34:13 GMT 2007


hi:
  Okay, i just learned a bunch about SGE. Things were not quite working the
  way i thought they did.

  The execution flow is: execd -> shepherd -> csh -> mem.c
  It is csh that exits on the XCPU signal; the trap does not work with
  csh. I don't understand why the shepherd is not killed by the XCPU,
  since it does nothing to stop that signal.
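
  The csh exit is just the default action for SIGXCPU: a process that
  neither ignores nor handles the signal is terminated. A minimal sketch
  in POSIX sh (not from the thread; the child shell here is only a
  stand-in for the job's csh) that shows the effect:

```shell
#!/bin/sh
# Sketch: a child shell with no trap for XCPU sends the signal to itself
# and is terminated by the default action, so the echo never runs.
status=0
sh -c 'kill -s XCPU $$; echo "not reached"' || status=$?

# A shell reports death-by-signal as an exit status greater than 128.
echo "child exit status: $status"
```

  The exact number depends on the platform's signal numbering, so it is
  printed rather than assumed.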

  The SGE configuration has shell_start_mode=posix_compliant, so the shell
  parameter of the queue configuration is used. All the queues are
  configured for /bin/csh, so it's my own fault.

  I changed the queue to shell=/bin/sh and added the trap to my .profile
  like you suggested. Now it works fine. Thank you very much for the
  help.
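
  For anyone following along, the fix can be sketched like this in POSIX
  sh. It is only an illustration, not the actual .profile or job script:
  the inner sh -c is a hypothetical stand-in for the real program. It
  relies on the fact that a signal set to be ignored stays ignored in
  child processes.

```shell
#!/bin/sh
# Ignore SIGXCPU in the job's shell. An ignored signal stays ignored
# across fork and exec, so commands started from here inherit that
# setting unless they install their own handler.
trap '' XCPU

# Stand-in for the real program: it sends XCPU to itself and keeps going.
result=$(sh -c 'kill -s XCPU $$; echo "child survived"')
echo "$result"

# The shell itself also ignores the signal.
kill -s XCPU $$
echo "script survived"
```

  With the trap line removed, the same child is terminated by the signal
  instead of printing its message.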




frank


----- Original Message -----
From: Reuti <reuti at staff.uni-marburg.de>
Date: Monday, February 19, 2007 9:03 am
Subject: Re: [GE users] SGE job memory limit problem
To: users at gridengine.sunsource.net

> Hi,
> 
> On 19.02.2007 at 17:36, Frank R Korzeniewski wrote:
> 
> > hi:
> >   my understanding of your reply is that it would be a good idea to
> >   also protect any application by ignoring the XCPU signal. This
> >   would be in addition to a fix to protect the shepherd. Thanks
> >   i will add this in.
> >
> >   Your reply also had me thinking. I can just put a wrapper program
> >   around the shepherd. So execd -> wrapper -> shepherd. I don't
> >   have permissions on the SGE binary directory so this solution will
> >   have to wait for Tuesday when i can talk to the sys admins. Thanks
> >   for triggering this solution also.
> 
> AFAIK only the kids of the shepherd will get the signal, hence the
> jobscript (bash) and your program. I never saw the shepherd
> disappearing because of a sigxcpu.
> 
> reuti at node44:~> ps -e f -o pid,ppid,pgrp,command
>   PID  PPID  PGRP COMMAND
> 1825     1  1825 /usr/sge/bin/lx24-x86/sge_execd
> 14663  1825 14663  \_ sge_shepherd-44513 -bg
> 14664 14663 14664      \_ /bin/sh /var/spool/sge/node44/job_scripts/44513
> 14665 14664 14664          \_ /home/reuti/ever
> 
> Process group 14664 will get the signal.
> 
> -- Reuti
> 
> 
> >
> >
> >
> > frank
> >
> >
> > ----- Original Message -----
> > From: Reuti <reuti at staff.uni-marburg.de>
> > Date: Monday, February 19, 2007 12:01 am
> > Subject: Re: [GE users] SGE job memory limit problem
> > To: users at gridengine.sunsource.net
> >
> >> Hi,
> >>
> >> On 19.02.2007 at 04:40, Frank R Korzeniewski wrote:
> >>
> >>>  hi:
> >>>   I have a problem with SGE 6.0u8. I am trying to limit the memory
> >>>   used by processes on a linux cluster. I tried varying the memory
> >>>   allocation parameters in SGE and the only ones that had an effect
> >>>   were s_vmem and h_vmem. The effect was not one i really desired.
> >>>   I have a test program that allocates memory till it fails, then
> >>>   it loops referencing the allocated memory with a 4k byte span.
> >>>   This simulates a program that allocates and uses as much memory
> >>>   as it can. When i run this from the command line it works fine.
> >>>   It runs forever (effectively). When i run it under SGE the job
> >>>   gets terminated a little after it hits the max memory allocation.
> >>>
> >>>   The problem is a miscommunication between
> >>>   daemons/execd/execd_ck_to_do.c and daemons/shepherd/shepherd.c.
> >>>   It seems that for some reason the execd wants to enforce both
> >>>   cpu time limit and memory allocation. If the hard limits are
> >>>   exceeded (cpu or memory) the execd sends a SIGKILL to the job.
> >>>   If the soft limits are exceeded (cpu or memory) the execd sends
> >>>   a SIGXCPU to the job. In shepherd.c it sets the signal mask for
> >>>   the signals that it is going to handle. There is the problem.
> >>>   It does not do anything about the SIGXCPU signal. When this
> >>>   comes along the shepherd is killed. Now my program sets the
> >>>   SIGXCPU signal to be ignored, and in my command line tests it
> >>>   does not die when i send it a SIGXCPU signal. When the shepherd
> >>>   dies the execd sees it and sends a SIGKILL to the process group.
> >>>   Bye bye my program.
> >>
> >> the SIGXCPU should be sent to the complete process group, including
> >> the jobscript. So handling of the signal is necessary there as well.
> >>
> >> Do you have something like:
> >>
> >> trap '' xcpu
> >>
> >> in your jobscript, so that the handling is completely up to the
> >> executing program only?
> >>
> >> -- Reuti
> >>
> >>
> >>>   Is it really the intent that the job be killed when the soft
> >>>   memory limit is exceeded? I think this is rather draconian.
> >>>
> >>>   If this is just a bug, where do i report it? I have not been
> >>>   using SGE for very long (couple of months) and don't know
> >>>   the procedure to report problems.
> >>>
> >>>   We are currently running a single job per node on our dual
> >>>   cpu nodes to avoid load problems. We would like to go to
> >>>   two or more depending on job memory requirements but we
> >>>   cannot get the memory management under SGE to work
> >>>   correctly. I have checked and the bug is still in the
> >>>   update 10 sources. What do you think the chances are
> >>>   of being able to get updated binaries of the shepherd?
> >>>
> >>>
> >>> frank
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> 
> 




