[GE issues] [Issue 3280] New - Daemonized job processes will not be killed on exceeding a resource limit

lindig michael.lindig at informatik.tu-chemnitz.de
Wed Aug 18 09:33:30 BST 2010


http://gridengine.sunsource.net/issues/show_bug.cgi?id=3280
                 Issue #|3280
                 Summary|Daemonized job processes will not be killed on
                        |exceeding a resource limit
               Component|gridengine
                 Version|6.2u4
                Platform|PC
                     URL|
              OS/Version|Linux
                  Status|NEW
       Status whiteboard|
                Keywords|
              Resolution|
              Issue type|DEFECT
                Priority|P3
            Subcomponent|execution
             Assigned to|pollinger
             Reported by|lindig






------- Additional comments from lindig at sunsource.net Wed Aug 18 01:33:29 -0700 2010 -------
Hi,

we have some user jobs that daemonize sub-processes (with Perl). When such a job exceeds its resource limits, the job itself is killed by SGE,
but the daemonized processes keep running :(.

I think killing the job should involve checking the entire process tree (e.g. with 'ps -Hle'). In addition to killing the parent job process,
it should also be checked whether any child processes survived the signal, and those should be killed as well.
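The idea could be sketched roughly as follows. This is only a minimal illustration of walking the process table to find every descendant of the job's shepherd process, not actual SGE code; the function names (descendants, job_process_tree) and the use of Python are my own assumptions.

```python
import subprocess
from collections import defaultdict

def descendants(root_pid, pid_ppid_pairs):
    """Collect all descendant PIDs of root_pid from a list of (pid, ppid) pairs."""
    children = defaultdict(list)
    for pid, ppid in pid_ppid_pairs:
        children[ppid].append(pid)
    result, stack = [], [root_pid]
    while stack:
        pid = stack.pop()
        for child in children[pid]:
            result.append(child)
            stack.append(child)
    return result

def job_process_tree(root_pid):
    """Snapshot the live process table with ps and return root_pid's descendants.

    The caller could then send SIGKILL to each returned PID, children first,
    so that nothing is reparented to init before it is killed.
    """
    out = subprocess.check_output(["ps", "-e", "-o", "pid=,ppid="], text=True)
    pairs = [tuple(map(int, line.split())) for line in out.splitlines() if line.strip()]
    return descendants(root_pid, pairs)
```

Note that this snapshot approach is still racy (a process can fork between the snapshot and the kill), which is why schedulers typically also need a more robust tagging mechanism; but even a tree walk like this would catch the daemonized Perl children shown below.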

Here is an example (a qrsh interactive session). First, the tree while everything is OK:

F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S   400  5307     1  0  77   0 - 13884 -      ?        00:05:18   sge_execd
4 S   400 13798  5307  0  78   0 -  3789 wait   ?        00:00:00     sge_shepherd
4 S     0 13800 13798  0  77   0 -  9357 -      ?        00:00:00       sshd
5 S   368 13802 13800  0  81   5 -  9758 -      ?        00:00:01         sshd
0 S   368 13803 13802  0  81   5 - 13831 -      pts/0    00:00:00           tcsh
0 S   368 13871 13803  0  82   5 - 20594 wait   pts/0    00:00:00             perl
0 S   368 13926 13871  0  80   5 - 112436 -     pts/0    00:00:23               SolverManager.e
0 S   368 13988 13926  0  82   5 - 20563 pipe_w pts/0    00:00:00                 perl
0 S   368 14044 13926  0  82   5 - 28323 wait   ?        00:00:03                 perl
0 S   368 16829 14044  0  80   5 -  2976 -      ?        00:00:00                   mpirun
0 S   368 16838 16829  0  80   5 -  8598 -      ?        00:00:00                     mpid
0 R   368 16866 16838 99  90   5 - 277070 -     ?        01:13:09                       solver-hpmpi.ex
0 R   368 16867 16838 99  90   5 - 265205 -     ?        01:13:44                       solver-hpmpi.ex
0 R   368 16868 16838 99  90   5 - 228104 -     ?        01:13:44                       solver-hpmpi.ex
0 R   368 16869 16838 99  90   5 - 229421 -     ?        01:13:43                       solver-hpmpi.ex
0 R   368 16870 16838 99  90   5 - 264099 -     ?        01:13:44                       solver-hpmpi.ex
0 R   368 16871 16838 99  90   5 - 234746 -     ?        01:13:44                       solver-hpmpi.ex
0 R   368 16872 16838 99  90   5 - 238574 -     ?        01:13:44                       solver-hpmpi.ex
0 R   368 16873 16838 99  90   5 - 234029 -     ?        01:13:44                       solver-hpmpi.ex
0 R   368 16874 16838 99  90   5 - 232440 -     ?        01:13:42                       solver-hpmpi.ex
0 R   368 16875 16838 99  90   5 - 245918 -     ?        01:13:41                       solver-hpmpi.ex
0 R   368 16876 16838 99  90   5 - 231869 -     ?        01:13:44                       solver-hpmpi.ex
0 R   368 16877 16838 99  90   5 - 233679 -     ?        01:13:44                       solver-hpmpi.ex

Now, after h_rt runs out and GE kills the job, we have this situation:

F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S   368 14044     1  0  82   5 - 28323 wait   ?        00:00:03   perl
0 S   368 16829 14044  0  80   5 -  2976 -      ?        00:00:00     mpirun
0 S   368 16838 16829  0  80   5 -  8598 -      ?        00:00:00       mpid
0 R   368 16866 16838 99  90   5 - 277070 -     ?        01:45:05         solver-hpmpi.ex
0 R   368 16867 16838 99  90   5 - 265205 -     ?        01:45:46         solver-hpmpi.ex
0 R   368 16868 16838 99  90   5 - 228104 -     ?        01:45:46         solver-hpmpi.ex
0 R   368 16869 16838 99  90   5 - 229421 -     ?        01:45:46         solver-hpmpi.ex
0 R   368 16870 16838 99  90   5 - 264099 -     ?        01:45:46         solver-hpmpi.ex
0 R   368 16871 16838 99  90   5 - 234746 -     ?        01:45:46         solver-hpmpi.ex
0 R   368 16872 16838 99  90   5 - 238574 -     ?        01:45:46         solver-hpmpi.ex
0 R   368 16873 16838 99  90   5 - 234029 -     ?        01:45:46         solver-hpmpi.ex
0 R   368 16874 16838 99  90   5 - 232440 -     ?        01:45:45         solver-hpmpi.ex
0 R   368 16875 16838 99  90   5 - 245918 -     ?        01:45:43         solver-hpmpi.ex
0 R   368 16876 16838 99  90   5 - 231869 -     ?        01:45:46         solver-hpmpi.ex
0 R   368 16877 16838 99  90   5 - 233679 -     ?        01:45:46         solver-hpmpi.ex

As you can see, the init process (PID 1) is now the parent of the surviving processes.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=275143
