[GE users] Re: [GE SGE-6.0u4: Job Suspend does not work for child processes.

Reuti reuti at staff.uni-marburg.de
Mon Mar 12 15:07:21 GMT 2007



Hi Mark,

suspension isn't supported for parallel jobs in SGE by design.
But anyway: AFAIK only the master node will get the signals
delivered. What the application does with these signals depends
on the implementation. The default action for SIGUSR2 (on Linux) is to
terminate the application, but there already seems to be something in
OpenMPI that catches the signal. Is there anything in the OpenMPI
documentation? Does OpenMPI handle suspension, or does SIGUSR2 mean
something special to OpenMPI, like restarting the daemons?
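
To illustrate the difference between the default action and a caught
signal, here is a minimal sketch in plain sh. It is not taken from SGE or
OpenMPI; the two inline `sh -c` scripts are only hypothetical stand-ins for
an application. The first has no handler and is terminated by SIGUSR2, the
second installs a trap and carries on.

```sh
#!/bin/sh
# Stand-in application WITHOUT a handler: it signals itself with SIGUSR2
# and is terminated by the default action before the echo can run.
sh -c 'kill -s USR2 $$; echo "not reached"'
status_default=$?

# Stand-in application WITH a handler: the trap runs and the script continues.
out=$(sh -c 'trap "echo caught" USR2; kill -s USR2 $$; echo survived')

# An exit status above 128 means the child was killed by a signal.
echo "exit status without trap: $status_default"
echo "output with trap: $out"
```

Whether a real MPI application survives therefore depends on whether it (or
its launcher) installs such a handler.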

-- Reuti


On 12.03.2007 at 15:13, Olesen, Mark wrote:

> Hi Reuti,
>
>> the signals are sent to the complete process group, so they must be
>> caught in the shell and the program/threads.
>
> I'm still not sure why the shell trap doesn't seem to be working, but
> maybe we can address that later.
>
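A possible explanation, which is only an assumption since the job script
is not shown: POSIX shells do not run a trap while they are waiting for a
foreground command; the trap runs only after that command has finished. If
the script starts mpirun in the foreground, the trap would appear to do
nothing until mpirun exits. A sketch of the usual workaround follows, with
`sleep` as a hypothetical stand-in for the real payload and a subshell
standing in for whoever sends the signal:

```sh
#!/bin/sh
# Sketch: run the payload in the background and wait for it, so that a
# trapped SIGUSR2 is handled at once instead of after the payload ends.
caught=0
trap 'caught=1' USR2

sleep 5 &                 # stand-in for the real payload (e.g. mpirun)
payload=$!

# Stand-in for the external sender: signal this shell after one second.
( sleep 1; kill -s USR2 "$$" ) &

wait "$payload"
wait_status=$?            # above 128 when wait was interrupted by the trapped signal

kill "$payload" 2>/dev/null    # tidy up the stand-in payload
wait "$payload" 2>/dev/null

echo "caught=$caught wait_status=$wait_status"
```

With `wait`, the shell returns from waiting as soon as the trapped signal
arrives and runs the trap right away; the payload itself keeps running
unless the trap does something about it.
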
> What I do find interesting is that sending the signal to the process
> group led to the error message about the daemon failing to start
> (strange, it is already running). It unfortunately also leaves the
> mpirun hanging about.
>
> I also tested killing the process group by hand, or killing just the
> mpirun itself.
>
> Here is an example of the processes (cut down to the essentials):
>
> # /bin/ps -e f -o user,pid,ppid,pgrp,command
>
> USER       PID  PPID  PGRP COMMAND
> cfdadmin  5449     1  5449 .../sge_execd
> cfdadmin  5641  5449  5449  \_ /bin/sh .../qloadsensor
> cfdadmin 17383  5449 17383  \_ sge_shepherd-39520 -bg
> olesenm  17397 17383 17397  |   \_ -sh .../spool/.../job_scripts/39520
> olesenm  18212 17397 17397  |       \_ mpirun APP
> olesenm  18213 18212 17397  |           \_ qrsh -inherit ...
> olesenm  18217 18213 17397  |           |   \_ /usr/bin/ssh -n ...
> olesenm  18214 18212 17397  |           \_ qrsh -inherit ...
> olesenm  18218 18214 17397  |               \_ /usr/bin/ssh -n ...
> cfdadmin 18215  5449 18215  \_ sge_shepherd-39520 -bg
> root     18216 18215 18216      \_ sshd: olesenm [priv]
> olesenm  18220 18216 18216          \_ sshd: olesenm@notty
> olesenm  18221 18220 18221              \_ .../qrsh_starter ...
> olesenm  18228 18221 18228                  \_ .../orted ...
> olesenm  18229 18228 18228                      \_ APP
> olesenm  18230 18228 18228                      \_ APP
>
>
> If I kill the process group
> # kill -12 -17397
>
> mpirun: Forwarding signal 12 to job[dealc12:18212] ERROR: A daemon on node dealc12 failed to start as expected.
> [dealc12:18212] ERROR: There may be more information available from
> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [dealc12:18212] ERROR: If the problem persists, please restart the
> [dealc12:18212] ERROR: Grid Engine PE job
> [dealc12:18212] The daemon received a signal 12.
> [dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
> [dealc12:18212] ERROR: There may be more information available from
> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [dealc12:18212] ERROR: If the problem persists, please restart the
> [dealc12:18212] ERROR: Grid Engine PE job
> [dealc12:18212] The daemon received a signal 12.
>
> After waiting some time, the process table shows that the mpirun for
> this process group is still hanging about:
>
> USER       PID  PPID  PGRP COMMAND
> olesenm  18212     1 17397 mpirun APP
>
> The error message about starting a daemon on a running job seems a bit
> odd. Why is it trying to start a new daemon?
>
> As an alternative, I also tried sending the signal directly to the
> mpirun instead and not the process group. This yields the following
> message:
>
> mpirun: Forwarding signal 12 to jobmpirun noticed that job rank 0 with PID 6616 on node dealc20 exited on signal 12 (User defined signal 2).
> 3 additional processes aborted (not shown)
>
> After this, there is no mpirun or application left running.
> This reflects my expectations a bit better, but is obviously not the
> way it is supposed to be used.
> Does anyone else have experience with OpenMPI, or should I direct this
> to the OpenMPI list(s)?
>
> /mark
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
