[GE users] MPICH process groups

Reuti reuti at staff.uni-marburg.de
Tue Dec 13 13:47:30 GMT 2005


On 13.12.2005 at 09:31, Jeroen M. Kleijer wrote:

>
> Ok, seems I've spoken too soon again.
> MPICH's mpirun does accept the hostfile format "<node>:<cpus>"
> as well as the machine file format "<node> <node> <node>". (I made a
> typo in the function PeHostFile2HostFile, which caused it to omit the
> ":", so MPICH took the number of cpus to be a host as well.)
> But the hanging semaphores still remain, meaning a user still has to
> run mpiclean on every node every time he issues a qdel command.
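
For reference, the conversion to the "<node>:<cpus>" format mentioned above could look roughly like the following. This is only a sketch, not your actual PeHostFile2HostFile: the function name is made up, and it assumes that each line of the PE hostfile starts with "<hostname> <slots>" and that any further columns can be ignored.

```shell
# Hypothetical helper: convert an SGE PE hostfile into "<node>:<cpus>" lines.
# Assumption: each input line starts with "<hostname> <slots>";
# any further columns are ignored.
pe_hostfile_to_hostfile() {
    while read -r host slots _rest; do
        # Skip blank lines.
        [ -n "$host" ] || continue
        printf '%s:%s\n' "$host" "$slots"
    done < "$1"
}
```

In a PE start script this would typically be called with the PE hostfile as its argument and the output redirected to the machine file handed to mpirun.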


Yes, this is a little bit odd, as you would have to execute
cleanipcs on all nodes of the parallel job. Unfortunately, qrsh is
not allowed in the stop script of the PE. The cleanest solution would
be to compile MPICH without shared memory support, as the advantage
on dual-CPU nodes isn't so big (okay, depends on the application).
And on a big SMP machine you could simply do a local cleanipcs, maybe
with some logic inside to avoid removing IPC resources that are still
needed by other jobs of the same user. AFAIR, cleanipcs simply removes
all IPC resources of a user.
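
One possible form of such logic, as a rough sketch only: pick out the shared memory segments of a user to which no process is attached any longer. The function name is made up, and the column layout (key, shmid, owner, perms, bytes, nattch) is the one printed by "ipcs -m" on Linux, which is an assumption about your nodes. Semaphores have no attach count, so they would need a different criterion.

```shell
# Print the ids of shared memory segments that belong to user $1 and have
# no process attached (nattch = 0). Reads "ipcs -m" style output on stdin.
# Assumption: Linux column layout: key shmid owner perms bytes nattch [status]
list_unattached_shm() {
    awk -v user="$1" '$1 ~ /^0x/ && $3 == user && $6 == 0 { print $2 }'
}

# Removing them is left commented out, so nothing is deleted by accident:
# ipcs -m | list_unattached_shm "$USER" | while read -r id; do ipcrm -m "$id"; done
```

A segment with zero attached processes can still belong to a job that is just starting up, so this is a heuristic and not a guarantee.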

As you have only the binaries, maybe you can ask the vendor to
provide a different version.

For dynamically linked applications I had an idea, but I never
finished it, as we avoid the use of shared memory ourselves:

http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=9377


Another approach (if MPICH shuts down properly when the program gets
a SIGTERM): you could use the notify mechanism and set the notification
signal in the cluster configuration (qconf -mconf) to:

execd_params   NOTIFY_KILL=TERM

instead of the default USR2, and then submit the job with:

qsub -notify ...

(But then I would also suggest trapping the signal in the jobscript,
as otherwise the jobscript itself might be terminated, ending the job
before the MPI program has made a safe shutdown on its own:

trap '' TERM

[these are two single quotes, not one double quote] )
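
Put together, a jobscript could look roughly like this. It is only a sketch: it assumes NOTIFY_KILL=TERM is set in execd_params as described above, and the function name and the MPI program in the example call are placeholders.

```shell
#!/bin/sh
#$ -notify
# Sketch of a jobscript; assumes NOTIFY_KILL=TERM is set in execd_params.
# run_mpi_job and ./my_mpi_program are placeholders.

run_mpi_job() {
    # Ignore SIGTERM in the jobscript itself, so that it is still alive
    # while the MPI program shuts down.
    trap '' TERM
    "$@"
}

# Example call inside a real job (commented out here):
# run_mpi_job mpirun -np "$NSLOTS" -machinefile "$TMPDIR/machines" ./my_mpi_program
```

Note that a signal ignored this way stays ignored in the child processes unless they install their own handler, so this only helps if the MPI program really catches SIGTERM itself.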

-- Reuti


