[GE users] What happens after a qdel?

aeszter Ansgar.Esztermann at mpi-bpc.mpg.de
Fri Oct 8 14:41:30 BST 2010

On Oct 6, 2010, at 18:20 , aeszter wrote:

>> And please check whether this is working. If not, you can define:
> It isn't. The processes on the slave nodes are killed, but the ones on the master remain.

In the meantime, I've found out why:
- mpd calls setpgrp() to move mpdman into a process group of its own; likewise, mpdman moves the client process into a group of its own.
  Thus, any attempt to kill the whole process tree below the shepherd by pgid is doomed to fail.
- When an mpd exits (even after SIGKILL), its mpdman processes detect this and send a SIGKILL to their clients (and to themselves). This mechanism removes the processes on the slave nodes. On the master node, the mpdman processes detect a ring failure (probably because of the killed processes on the other nodes, but that's just guesswork) and send a SIGTERM (and only a SIGTERM) to the clients. In our case, the client application's attempt at a graceful shutdown involves MPI calls, which then hang because of the broken mpd ring.
  By the time the mpd exits, mpdman has long since terminated, so it cannot send the SIGKILL (as it did on the slave nodes).
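
The process-group effect described in the first point can be reproduced without mpd. The following is only a minimal sketch in Python, assuming a POSIX system: a stand-in "shepherd" process starts a child that calls setpgrp(), the shepherd's whole group is then killed by pgid, and the child is checked afterwards. The function name and the two helper one-liners are made up for this illustration; they are not part of Grid Engine or MPICH.

```python
import os
import signal
import subprocess
import sys


def survives_group_kill():
    """Return True if a child that called setpgrp() outlives a kill of its parent's group."""
    # Stand-in for mpdman/client: leave the inherited group, report our pid, then idle.
    child_code = "import os, time; os.setpgrp(); print(os.getpid(), flush=True); time.sleep(30)"
    # Stand-in for the shepherd: start the child above, then idle.
    shepherd_code = (
        "import subprocess, sys, time; "
        f"subprocess.Popen([sys.executable, '-c', {child_code!r}]); "
        "time.sleep(30)"
    )
    # start_new_session=True puts the shepherd in a fresh session and group,
    # so the group kill below cannot hit this script itself.
    shepherd = subprocess.Popen(
        [sys.executable, "-c", shepherd_code],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,
    )
    # The child prints its pid only after setpgrp(), so it has already left the group.
    child_pid = int(shepherd.stdout.readline())

    # Kill everything in the shepherd's process group (pgid == shepherd pid).
    os.killpg(shepherd.pid, signal.SIGKILL)
    shepherd.wait()
    shepherd.stdout.close()

    try:
        os.kill(child_pid, 0)  # signal 0 only tests whether the pid exists
        alive = True
    except ProcessLookupError:
        alive = False

    if alive:
        os.kill(child_pid, signal.SIGKILL)  # clean up the survivor
    return alive


if __name__ == "__main__":
    print("child survived the group kill:", survives_group_kill())
```

The child is addressed by pid at the end precisely because its pgid no longer matches the shepherd's.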

>> terminate_method /some/path/killkids.sh $job_pid
>> from http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=280046 and http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=280796

Thanks, that did it.
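
The contents of killkids.sh are not shown in this thread. As a rough illustration of the general idea only, here is a hypothetical Python stand-in that finds processes by walking parent pids from $job_pid instead of relying on the pgid. The function names are invented for this sketch, and it assumes a ps that accepts `-e -o pid= -o ppid=`. Note that a process whose parent has already exited is reparented and will no longer be found this way.

```python
import os
import signal
import subprocess


def descendants(root_pid, pid_ppid_pairs):
    """Return all pids below root_pid, given (pid, ppid) pairs from a process listing."""
    children = {}
    for pid, ppid in pid_ppid_pairs:
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found


def process_snapshot():
    """Read (pid, ppid) for every process via ps, with headers suppressed."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(int(field) for field in line.split())
            for line in out.splitlines() if line.strip()]


def kill_tree(job_pid, sig=signal.SIGKILL):
    """Signal every descendant of job_pid, then job_pid itself."""
    for pid in descendants(job_pid, process_snapshot()) + [job_pid]:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited between the snapshot and the kill
```

Such a script would be invoked with the pid that terminate_method passes in as $job_pid, for example `kill_tree(int(sys.argv[1]))` after importing sys.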


Ansgar Esztermann
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105

