[GE users] Re: [GE SGE-6.0u4: Job Suspend does not work for child processes.

Olesen, Mark Mark.Olesen at arvinmeritor.com
Mon Mar 12 14:13:12 GMT 2007


Hi Reuti,

> the signals are sent to the complete process group, so they must be
> caught in the shell and the program/threads.

I'm still not sure why the shell trap doesn't seem to be working, but maybe
we can address that later.
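
For reference, a minimal sketch of the kind of shell trap meant here. This is
not the actual job script; the names got_usr2 and on_usr2 are made up for
illustration, and the script signals itself as a stand-in for the signal that
SGE would deliver. One subtlety that may or may not be related: a POSIX shell
runs a trap action only after the foreground command it is waiting on has
returned, so a trap in a script blocked on a long-running command does not fire
right away.

```shell
#!/bin/sh
# Hypothetical job-script fragment: note SIGUSR2 instead of being killed by it.
got_usr2=0

on_usr2() {
    # Runs once the shell is back in control, i.e. after the current
    # foreground command has finished.
    got_usr2=1
}
trap on_usr2 USR2

# Stand-in for the signal delivered by the batch system:
# the script signals only itself.
kill -s USR2 $$

echo "got_usr2=$got_usr2"
```

Without the trap line, the default action for SIGUSR2 would terminate the
script at the kill.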

What I do find interesting is that sending the signal to the process group
leads to an error message about the daemon failing to start (strange, since it
is already running). Unfortunately, it also leaves the mpirun hanging about.

I also tested killing the process group by hand, and killing just the mpirun
itself.

Here is an example of the processes (cut down to the essentials):

# /bin/ps -e f -o user,pid,ppid,pgrp,command

USER       PID  PPID  PGRP COMMAND
cfdadmin  5449     1  5449 .../sge_execd
cfdadmin  5641  5449  5449  \_ /bin/sh .../qloadsensor
cfdadmin 17383  5449 17383  \_ sge_shepherd-39520 -bg
olesenm  17397 17383 17397  |   \_ -sh .../spool/.../job_scripts/39520
olesenm  18212 17397 17397  |       \_ mpirun APP
olesenm  18213 18212 17397  |           \_ qrsh -inherit ...
olesenm  18217 18213 17397  |           |   \_ /usr/bin/ssh -n ...
olesenm  18214 18212 17397  |           \_ qrsh -inherit ...
olesenm  18218 18214 17397  |               \_ /usr/bin/ssh -n ...
cfdadmin 18215  5449 18215  \_ sge_shepherd-39520 -bg
root     18216 18215 18216      \_ sshd: olesenm [priv]
olesenm  18220 18216 18216          \_ sshd: olesenm@notty
olesenm  18221 18220 18221              \_ .../qrsh_starter ...
olesenm  18228 18221 18228                  \_ .../orted ...
olesenm  18229 18228 18228                      \_ APP
olesenm  18230 18228 18228                      \_ APP


If I kill the process group
# kill -12 -17397

mpirun: Forwarding signal 12 to job
[dealc12:18212] ERROR: A daemon on node dealc12 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
[dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
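
For reference, the negative argument in the kill above addresses process group
17397 rather than a single PID. A minimal sketch of building the same kind of
command by signal name instead of number, assuming Linux numbering (where 12 is
SIGUSR2, as the mpirun output further down also reports) and using the current
shell as a stand-in for the job script. It only prints the command and does not
send anything:

```shell
#!/bin/sh
# Stand-in PID: this shell instead of the real job script.
pid=$$

# Look up the process group ID (the PGRP column in the ps listing).
pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')

# Translate the signal number into its name; signal numbers are
# platform-dependent, so 12 == USR2 is an assumption about Linux here.
signame=$(kill -l 12)

# A negative target means "every process in this group"; the "--" keeps
# the leading minus from being parsed as an option.
echo "kill -s $signame -- -$pgid"
```

Dropping the minus sign in front of the ID would signal only that one process.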

After waiting some time, the process table shows that the mpirun for this
process group is still hanging about:

USER       PID  PPID  PGRP COMMAND
olesenm  18212     1 17397 mpirun APP

The error message about starting a daemon on a running job seems a bit odd.
Why is it trying to start a new daemon?

As an alternative, I also tried sending the signal directly to the mpirun
instead and not the process group. This yields the following message:

mpirun: Forwarding signal 12 to job
mpirun noticed that job rank 0 with PID 6616 on node dealc20 exited on signal 12 (User defined signal 2).
3 additional processes aborted (not shown)

After this, there is no mpirun or application left running.
This matches my expectations a bit better, but is obviously not the way it
is supposed to be used.
Does anyone else have experience with openmpi, or should I direct this to
the openmpi list(s)?

/mark
