[GE users] What happens after a qdel?
Ansgar.Esztermann at mpi-bpc.mpg.de
Wed Oct 6 13:55:46 BST 2010
>> However, the processes on the other nodes seem to get a SIGKILL. Is this a feature of SGE?
> Yes, as long as it's a tightly integrated job and you don't start the slaves via traditional ssh/rsh but through SGE's `qrsh -inherit ...`, so that SGE is aware of the existence of the slave processes. To check this you can do:
Yes, that's exactly what we are doing.
> Nevertheless, some parallel libraries shut down automatically when the master process is gone. This would also work when the jobs are not tightly integrated into SGE, but the accounting would then be wrong.
In our case, the application gets a SIGTERM (from the PE, I guess) and initiates its own shut-down sequence. However, before it can complete this, the processes on the slave nodes are gone. Then, the master node processes will busy-wait for MPI messages from the non-existent processes.
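To illustrate the failure mode, here is a minimal Python sketch (not our application's code, which is an MPI program) of a shutdown path that does not wait forever for peers that may already have been killed. The names `on_sigterm`, `wait_for_peers` and the `peers_done` callback are hypothetical and only stand in for whatever the real application uses to detect that the slave processes have answered.

```python
import signal
import threading
import time

# Set by the SIGTERM handler; the main loop checks it to begin shutting down.
shutdown_requested = threading.Event()


def on_sigterm(signum, frame):
    """Record that a shutdown was requested; do no heavy work in the handler."""
    shutdown_requested.set()


def wait_for_peers(peers_done, timeout_s, poll_interval_s=0.05):
    """Poll peers_done() until it returns True or timeout_s elapses.

    Returns True if the peers answered in time and False otherwise, so the
    caller can abandon the orderly shutdown instead of waiting indefinitely
    for processes that no longer exist.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if peers_done():
            return True
        time.sleep(poll_interval_s)
    return False


# Install the handler so that a SIGTERM only sets the flag.
signal.signal(signal.SIGTERM, on_sigterm)
```

With a bounded wait like this, the master could give up and exit once the timeout expires rather than spinning on messages from processes that are already gone.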
> Did you set up MPICH2 with tight integration for the mpd startup method? With the upcoming Hydra startup method in MPICH2, it will be much easier, as the tight integration is already built into MPICH2.
We are not actually using MPICH2, but IntelMPI. It is MPICH2-based, but it might take some time until upstream features make it into the final product.
>> Is it configurable?
> There is an entry "terminate_method" in the queue definition, which could do other things than killing by process group (default) or additional group id (configurable).
Oh, I hadn't seen that. Currently, it's set to NONE, which probably explains why processes on the master node do not get a SIGKILL.
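As a rough idea of what a custom terminate method could do instead of an immediate kill, the following Python sketch sends SIGTERM first and falls back to SIGKILL only after a grace period. This is an assumption on my part and not something we have deployed: the function name `terminate_with_grace` and its parameters are hypothetical, and how the job's PID reaches the script (I believe queue_conf(5) describes a `$job_pid` variable) should be checked against the man page.

```python
import os
import signal
import time


def terminate_with_grace(pid, grace_s=30.0, poll_interval_s=0.5):
    """Send SIGTERM to pid, wait up to grace_s seconds, then send SIGKILL.

    Returns the last signal that was delivered, or None if the process
    did not exist in the first place.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return None

    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        try:
            # Signal 0 delivers nothing; it only checks that the PID exists.
            os.kill(pid, 0)
        except ProcessLookupError:
            return signal.SIGTERM
        time.sleep(poll_interval_s)

    # Grace period is over: force termination.
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return signal.SIGTERM
    return signal.SIGKILL
```

A small wrapper script would call this function with the PID it receives on its command line. Note that it signals a single process only; the default behaviour mentioned above acts on the whole process group.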
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105