[GE users] Rogue MPI processes even with tight integration

Reuti reuti at staff.uni-marburg.de
Mon Sep 3 13:37:15 BST 2007


Hi,

On 03.09.2007, at 11:22, Chris Rudge wrote:

> I'm having trouble with MPI processes left running after a job has
> exited. I'm using mpich-mx and have set up tight integration. I'm
> fairly confident the tight integration is working, as every process
> of the job is accounted for, which didn't happen before I set this up.
>
> My understanding of tight integration is that it should deal with the
> accounting (as it does) and also allow SGE to properly track, and
> destroy, all processes of the job.
>
> An example of things failing is a job which exceeded its walltime
> (h_rt) limit: the job has been deleted, but some of its processes on
> the slave nodes are still running. This was an 8-process job with
> processes spread across nodes comp52, comp53, comp54 and comp55.
>
> I've given some qacct and process information below. Can anyone help
> with getting SGE to track, and kill, MPI processes properly?
>
> Regards,
> Chris
>
> <snip>
>
> There are still processes running on comp53 and comp55
>
> comp53:~ # ps -ef | grep aph11
> aph11     9743     1  0 Aug07 ?        00:00:00 /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp53/active_jobs/860854.1/2.comp53
> aph11     9769  9743 98 Aug07 ?        26-10:10:20 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     9770     1  0 Aug07 ?        00:00:00 /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp53/active_jobs/860854.1/3.comp53
> aph11     9814  9770 99 Aug07 ?        26-14:44:59 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     9815  9769  0 Aug07 ?        00:00:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     9816  9815  0 Aug07 ?        00:01:03 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     9837     1  0 Aug07 ?        00:00:00 /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp53/active_jobs/860854.1/4.comp53
> aph11     9838  9814  0 Aug07 ?        00:00:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     9839  9838  0 Aug07 ?        00:00:28 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     9863  9837 98 Aug07 ?        26-11:14:29 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     9884  9863  0 Aug07 ?        00:00:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     9885  9884  0 Aug07 ?        00:00:20 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
>
>
>
> comp55:~ # ps -ef | grep aph11
> aph11     3788     1  0 Aug07 ?        00:00:00 /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp55/active_jobs/860854.1/2.comp55
> aph11     3812  3788 99 Aug07 ?        26-18:09:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     3834  3812  0 Aug07 ?        00:00:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
> aph11     3835  3834  0 Aug07 ?        00:00:19 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt

Can you please post the output of the following command (note the
space before the f) for a running job, from both the master node of
the MPI job and a slave node:

ps -e f -o pid,ppid,pgrp,command --cols=500
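
In that output the pgrp column is the interesting part: as far as I
know, Grid Engine delivers its kill signals to the job's process
group (together with the additional group ID it attaches for process
tracking), so any mass3_0.3 process that has moved into its own
process group, or been re-parented to init, would escape the cleanup.

As a stopgap you could signal the leftover group by hand. A minimal
sketch, assuming the escaped processes report a PGID of 9743 in the
listing above (a placeholder; substitute whatever the ps command
actually shows):

# 9743 is a placeholder PGID - take the real one from the pgrp column
kill -TERM -- -9743    # ask the whole process group to terminate
sleep 10
kill -KILL -- -9743    # force-kill any stragglers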

-- Reuti
