[GE users] MPICH2 job deletion

Alan Carriou Alan.Carriou at jet.uk
Wed Apr 27 16:00:39 BST 2005


On our grid, we have SGE 6.0u3 and MPICH2 1.0.1.
With the smpd daemonless startup, we have a problem: when we delete a 
running MPI job, the MPI processes are not killed.
The slots are freed, the job is reported as finished, and the mpiexec and ssh 
processes on the first node are killed, but the MPI processes themselves 
stay alive. This happens with both qdel and qmon. The 
qmaster/messages file only says:

04/27/2005 15:49:07|qmaster|testgrid-3|W|job 51.1 failed on host 
testgrid-4.jet.uk assumedly after job because: job 51.1 died through 
signal KILL (9)
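
For anyone who wants to check a node for such leftovers, here is a minimal 
sketch in Python. It rests on an assumption that is not confirmed anywhere 
in this report: that an MPI process which survives the deletion ends up 
reparented to PID 1. The function names and the "cpi" program name in the 
usage note are only illustrative.

```python
import subprocess


def find_orphans(ps_output, user, name_fragment):
    """Return (pid, args) for processes owned by `user` whose parent is PID 1.

    `ps_output` is expected to be the text printed by
    `ps -eo pid,ppid,user,args`, including its header row.
    Assumption: a surviving MPI process has been reparented to PID 1.
    """
    orphans = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)  # pid, ppid, user, then the full command
        if len(fields) < 4:
            continue
        pid, ppid, owner, args = fields
        if owner == user and ppid == "1" and name_fragment in args:
            orphans.append((int(pid), args))
    return orphans


def current_ps_output():
    """Run ps on the local node and return its output as text."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,user,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Run on each execution host after the qdel, something like 
find_orphans(current_ps_output(), "alan", "cpi") would list the candidates. 
Note that ps may print a numeric uid instead of a long user name, in which 
case the user filter has to be adapted.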

In case it is relevant, we use ssh instead of rsh to connect to the 
other hosts.

With the daemon-based startup, job deletion works fine. With either 
startup method, an MPI job that ends normally causes no problem.

Does anyone have an idea?

