[GE users] MPICH2 job deletion

Reuti reuti at staff.uni-marburg.de
Wed Apr 27 16:09:13 BST 2005


Hi Alan,

did you set up SGE to use ssh in its config, and/or did you just avoid
setting "MPIEXEC_RSH=rsh"?
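
A quick way to check both is something like the sketch below. It assumes
qconf is in your PATH on a submit or admin host and that the remote-startup
entries in the cluster configuration use the usual names (rsh_command,
rsh_daemon and so on); report_rsh is just a helper name made up for this
example, not part of SGE or MPICH2.

```shell
#!/bin/sh
# Sketch only: show which remote-shell settings are in effect.
# report_rsh is a hypothetical helper, not an SGE or MPICH2 command.
report_rsh() {
    if [ -n "${MPIEXEC_RSH:-}" ]; then
        echo "MPIEXEC_RSH=${MPIEXEC_RSH}"
    else
        echo "MPIEXEC_RSH is unset"
    fi
}

# What the job environment would hand to mpiexec.
report_rsh

# What SGE itself is configured to use for remote startup.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf | grep -E '^(rsh|rlogin|qlogin)_(command|daemon)' \
        || echo "no rsh/rlogin/qlogin entries in the global configuration"
else
    echo "qconf not found; run this on an SGE submit or admin host"
fi
```

Running the same report_rsh call inside the job script shows what the job
actually sees, which may differ from your login shell.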

CU - Reuti


Alan Carriou wrote:
> Hi
> 
> On our grid, we have SGE 6.0u3 and MPICH2 1.0.1.
> Using the smpd daemonless startup, we have a problem: when we delete a 
> running MPI job, the MPI processes are not killed.
> The slots are freed, the job is reported as finished, and the mpiexec and 
> ssh processes on the first node are killed, but the MPI processes 
> themselves are still alive. This happens both with qdel and qmon. The 
> qmaster/messages file says only:
> 
> 04/27/2005 15:49:07|qmaster|testgrid-3|W|job 51.1 failed on host 
> testgrid-4.jet.uk assumedly after job because: job 51.1 died through 
> signal KILL (9)
> 
> In case it explains something, we use ssh instead of rsh to connect to 
> other hosts.
> 
> Using the daemon-based startup, the job deletion works fine. And with 
> either startup method, the normal end of an MPI job causes no problem.
> 
> Does anyone have an idea?
> 
> Thanks,
> Alan
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

