[GE users] MPICH process groups
reuti at staff.uni-marburg.de
Tue Dec 13 13:47:30 GMT 2005
On 13.12.2005 at 09:31, Jeroen M. Kleijer wrote:
> Ok, seems I've spoken too soon again.
> MPICH's mpirun does acknowledge the hostfile format "<node>:<cpus>"
> as well as the machine file format "<node> <node> <node>". (I made a
> typo in the function PeHostFile2HostFile, causing it to omit the ":",
> so MPICH took the number of cpus for a host as well.)
> But the hanging semaphores still remain, meaning a user still has to
> run mpiclean on every node every time he issues a qdel command.
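For reference, a minimal sketch of producing the "<node>:<cpus>" format from the PE hostfile. It assumes the usual pe_hostfile layout (hostname, slot count, queue, processor range, one host per line); the function name pe_hostfile_to_machinefile is hypothetical and is not the PeHostFile2HostFile function mentioned above.

```shell
# Hypothetical helper: convert an SGE pe_hostfile into MPICH's
# "<node>:<cpus>" hostfile format.
# $1: path to the pe_hostfile (normally available as $PE_HOSTFILE)
pe_hostfile_to_machinefile() {
    # Each input line is assumed to be: host nslots queue processor_range
    while read -r host nslots _rest; do
        # Skip empty lines
        [ -n "$host" ] || continue
        printf '%s:%s\n' "$host" "$nslots"
    done < "$1"
}
```

In a PE start script this could be called as pe_hostfile_to_machinefile "$PE_HOSTFILE" > "$TMPDIR/machines", assuming the job then passes that file to mpirun.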
Yes, this is a little bit odd, as you would have to execute
cleanipcs on all nodes of the parallel job. Unfortunately, qrsh is
not allowed in the stop script of the PE. The cleanest solution would
be to compile MPICH without shared memory support, as the advantage
on dual-CPU nodes isn't that big (okay, it depends on the application).
And on a big SMP machine you could simply do a local cleanipcs, maybe
with some logic inside to avoid removing resources still needed by
other jobs of the same user. cleanipcs simply removes all of a user's
semaphores and shared memory segments.
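A rough sketch of what such a local cleanup could look like. It assumes the Linux ipcs output layout (id in the second column, owner in the third); the function names are made up for illustration, and this is not the cleanipcs script shipped with MPICH. Like cleanipcs, it removes everything owned by the user, so the logic to spare other jobs of the same user would still have to be added.

```shell
# Hypothetical helper: read `ipcs -s` or `ipcs -m` output on stdin and
# print the ids of all entries owned by the user given as $1.
# Assumes the Linux column layout: key, id, owner, ...
list_ipc_ids() {
    awk -v owner="$1" '$3 == owner && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Hypothetical helper: remove ALL semaphore arrays and shared memory
# segments of user $1 on the local node (no check for other jobs).
clean_user_ipcs() {
    user=$1
    for id in $(ipcs -s | list_ipc_ids "$user"); do
        ipcrm -s "$id"
    done
    for id in $(ipcs -m | list_ipc_ids "$user"); do
        ipcrm -m "$id"
    done
}
```

It would be called as clean_user_ipcs "$USER" on the node in question.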
As you have only the binaries - maybe you can ask the vendor to
provide a different version.
For dynamically linked applications I had an idea, but I never
finished it, as we avoid the use of shared memory ourselves:
Another approach (if MPICH shuts down properly when the program
receives a SIGTERM): you could use the notify mechanism and change the
notification signal to TERM (in qconf -mconf) instead of the default
USR2, and then submit the job with:
qsub -notify ...
(But then I would also suggest trapping the signal in the jobscript,
as otherwise the jobscript might terminate, and with it the job, before
the MPI program has shut down safely on its own:
trap '' TERM
[those are two single quotes, not one double quote] )
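Put together, a jobscript along these lines might look like the sketch below. The mpirun line is only a placeholder and is left commented out; the program name and the $TMPDIR/machines file are assumptions about the local setup. Note that an ignored signal is inherited by child processes, so this relies on the MPI program installing its own SIGTERM handler.

```shell
#!/bin/sh
# Ask SGE to send the notification signal before the job is killed
# (same effect as "qsub -notify").
#$ -notify

# Ignore SIGTERM in the jobscript itself, so the shell stays alive
# while the MPI program performs its own shutdown.
trap '' TERM

# Placeholder launch line (hypothetical program and machine file):
# mpirun -np "$NSLOTS" -machinefile "$TMPDIR/machines" ./my_mpi_app
```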