[GE users] Rogue MPI processes even with tight integration

Chris Rudge chris.rudge at astro.le.ac.uk
Mon Sep 3 10:22:44 BST 2007


Hi,

I'm having trouble with MPI processes being left running after a job has
exited. I'm using mpich-mx and have set up tight integration. I'm fairly
confident that the tight integration is working, as every process of the
job is now accounted for, which didn't happen before I set this up.

My understanding of tight integration is that it should deal with the
accounting (as it does) and also allow SGE to properly track, and
destroy, all processes of the job.

An example of things failing is a job which exceeded its walltime
(h_rt) limit. The job has been deleted, but some of its processes on
the slave nodes are still running. This was an 8-process job with
processes spread across nodes comp52, comp53, comp54 and comp55.

I've given some qacct and process information below. Can anyone help
with getting SGE to track, and kill, MPI processes properly?

Regards,
Chris


 # qacct -j 860854
==============================================================
qname        mpi.q               
hostname     comp52.
group        UNKNOWN             
owner        aph11               
project      tag                 
department   tag                 
jobname      clouds_mass3_0.3    
jobnumber    860854              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Jan  1 01:00:00 1970
start_time   Tue Aug  7 15:10:00 2007
end_time     Tue Aug 28 11:09:54 2007
granted_pe   mpich-mx            
slots        8                   
failed       100 : assumedly after job
exit_status  137                 
ru_wallclock 1799994      
ru_utime     1784702      
ru_stime     1025         
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    49306137            
ru_majflt    14                  
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     1456291             
ru_nivcsw    27803676            
cpu          1785776      
mem          527761.903        
io           0.000             
iow          0.000             
maxvmem      1.000G
==============================================================
qname        mpi.q               
hostname     comp54.
group        UNKNOWN             
owner        aph11               
project      tag                 
department   tag                 
jobname      clouds_mass3_0.3    
jobnumber    860854              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Jan  1 01:00:00 1970
start_time   Tue Aug  7 15:10:00 2007
end_time     Tue Aug 28 11:09:55 2007
granted_pe   mpich-mx            
slots        8                   
failed       100 : assumedly after job
exit_status  137                 
ru_wallclock 1799995      
ru_utime     1797017      
ru_stime     435          
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    54590619            
ru_majflt    5                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     472569              
ru_nivcsw    12832548            
cpu          1797475      
mem          530395.325        
io           0.000             
iow          0.000             
maxvmem      1023.945M
==============================================================
qname        mpi.q               
hostname     comp53.
group        UNKNOWN             
owner        aph11               
project      tag                 
department   tag                 
jobname      clouds_mass3_0.3    
jobnumber    860854              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Jan  1 01:00:00 1970
start_time   Tue Aug  7 15:10:02 2007
end_time     Tue Aug 28 11:09:56 2007
granted_pe   mpich-mx            
slots        8                   
failed       100 : assumedly after job
exit_status  129                 
ru_wallclock 1799994      
ru_utime     0            
ru_stime     0            
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    809                 
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     551                 
ru_nivcsw    7                   
cpu          1794260      
mem          535992.025        
io           0.000             
iow          0.000             
maxvmem      1.009G
==============================================================
qname        mpi.q               
hostname     comp53.
group        UNKNOWN             
owner        aph11               
project      tag                 
department   tag                 
jobname      clouds_mass3_0.3    
jobnumber    860854              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Jan  1 01:00:00 1970
start_time   Tue Aug  7 15:10:01 2007
end_time     Tue Aug 28 11:09:56 2007
granted_pe   mpich-mx            
slots        8                   
failed       100 : assumedly after job
exit_status  129                 
ru_wallclock 1799995      
ru_utime     0            
ru_stime     0            
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    809                 
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     574                 
ru_nivcsw    9                   
cpu          1797147      
mem          536452.052        
io           0.000             
iow          0.000             
maxvmem      1.009G
==============================================================
qname        mpi.q               
hostname     comp53.
group        UNKNOWN             
owner        aph11               
project      tag                 
department   tag                 
jobname      clouds_mass3_0.3    
jobnumber    860854              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Jan  1 01:00:00 1970
start_time   Tue Aug  7 15:10:01 2007
end_time     Tue Aug 28 11:09:56 2007
granted_pe   mpich-mx            
slots        8                   
failed       100 : assumedly after job
exit_status  129                 
ru_wallclock 1799995      
ru_utime     0            
ru_stime     0            
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    809                 
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     563                 
ru_nivcsw    6                   
cpu          1796885      
mem          536547.283        
io           0.000             
iow          0.000             
maxvmem      1.009G
==============================================================
qname        mpi.q               
hostname     comp53.
group        UNKNOWN             
owner        aph11               
project      tag                 
department   tag                 
jobname      clouds_mass3_0.3    
jobnumber    860854              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Jan  1 01:00:00 1970
start_time   Tue Aug  7 15:10:00 2007
end_time     Tue Aug 28 11:09:56 2007
granted_pe   mpich-mx            
slots        8                   
failed       100 : assumedly after job
exit_status  129                 
ru_wallclock 1799996      
ru_utime     0            
ru_stime     0            
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    809                 
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     588                 
ru_nivcsw    22                  
cpu          1797499      
mem          536807.154        
io           0.000             
iow          0.000             
maxvmem      1.009G
==============================================================
qname        mpi.q               
hostname     comp55.
group        UNKNOWN             
owner        aph11               
project      tag                 
department   tag                 
jobname      clouds_mass3_0.3    
jobnumber    860854              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Jan  1 01:00:00 1970
start_time   Tue Aug  7 15:10:01 2007
end_time     Tue Aug 28 11:09:58 2007
granted_pe   mpich-mx            
slots        8                   
failed       100 : assumedly after job
exit_status  129                 
ru_wallclock 1799997      
ru_utime     0            
ru_stime     0            
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    809                 
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     525                 
ru_nivcsw    9                   
cpu          1796485      
mem          532394.681        
io           0.000             
iow          0.000             
maxvmem      1.003G
==============================================================
qname        mpi.q               
hostname     comp55.
group        UNKNOWN             
owner        aph11               
project      tag                 
department   tag                 
jobname      clouds_mass3_0.3    
jobnumber    860854              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Jan  1 01:00:00 1970
start_time   Tue Aug  7 15:10:00 2007
end_time     Tue Aug 28 11:09:58 2007
granted_pe   mpich-mx            
slots        8                   
failed       100 : assumedly after job
exit_status  129                 
ru_wallclock 1799998      
ru_utime     0            
ru_stime     0            
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    809                 
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     579                 
ru_nivcsw    9                   
cpu          1794346      
mem          531650.892        
io           0.000             
iow          0.000             
maxvmem      1.003G
==============================================================
qname        mpi.q               
hostname     comp55.
group        UNKNOWN             
owner        aph11               
project      tag                 
department   tag                 
jobname      clouds_mass3_0.3    
jobnumber    860854              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Tue Aug  7 15:09:44 2007
start_time   Tue Aug  7 15:09:57 2007
end_time     Tue Aug 28 11:09:58 2007
granted_pe   mpich-mx            
slots        8                   
failed       100 : assumedly after job
exit_status  137                 
ru_wallclock 1800001      
ru_utime     0            
ru_stime     0            
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    6582                
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     1016                
ru_nivcsw    61                  
cpu          118          
mem          2.425             
io           0.000             
iow          0.000             
maxvmem      404.578M
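
To make the records above easier to compare, here is a minimal Python
sketch that splits qacct output into one dictionary per record and
decodes exit_status. It assumes the usual shell convention that a status
above 128 means 128 plus a signal number, so 137 would be SIGKILL (9)
and 129 would be SIGHUP (1); that reading is an assumption, not
something the output itself states. The function names and the
abbreviated SAMPLE text are illustrative only.

```python
# Signal names for the two statuses seen in this job; others fall back to a number.
SIGNAL_NAMES = {1: "SIGHUP", 9: "SIGKILL"}

# Abbreviated excerpt of the qacct output above (most fields omitted).
SAMPLE = """\
==============================================================
qname        mpi.q
hostname     comp52.
jobnumber    860854
exit_status  137
ru_wallclock 1799994
==============================================================
qname        mpi.q
hostname     comp53.
jobnumber    860854
exit_status  129
ru_wallclock 1799994
==============================================================
qname        mpi.q
hostname     comp55.
jobnumber    860854
exit_status  137
ru_wallclock 1800001
"""


def parse_qacct(text):
    """Split `qacct -j` output into a list of {field: value} dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("=========="):
            # A separator line starts a new record.
            if "jobnumber" in current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            current[parts[0]] = parts[1].strip()
    if "jobnumber" in current:
        records.append(current)
    return records


def describe_exit(status):
    """Assumed convention: a status above 128 is 128 + signal number."""
    status = int(status)
    if status > 128:
        signo = status - 128
        return SIGNAL_NAMES.get(signo, "signal %d" % signo)
    return "exit %d" % status


records = parse_qacct(SAMPLE)
for rec in records:
    hours = int(rec["ru_wallclock"]) / 3600
    print(rec["hostname"], rec["exit_status"],
          describe_exit(rec["exit_status"]), "%.1f h" % hours)
```

Feeding it the full output of `qacct -j 860854` instead of SAMPLE should
give one line per record.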



There are still processes running on comp53 and comp55:



comp53:~ # ps -ef | grep aph11
aph11     9743     1  0 Aug07 ?        00:00:00 /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp53/active_jobs/860854.1/2.comp53
aph11     9769  9743 98 Aug07 ?        26-10:10:20 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     9770     1  0 Aug07 ?        00:00:00 /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp53/active_jobs/860854.1/3.comp53
aph11     9814  9770 99 Aug07 ?        26-14:44:59 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     9815  9769  0 Aug07 ?        00:00:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     9816  9815  0 Aug07 ?        00:01:03 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     9837     1  0 Aug07 ?        00:00:00 /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp53/active_jobs/860854.1/4.comp53
aph11     9838  9814  0 Aug07 ?        00:00:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     9839  9838  0 Aug07 ?        00:00:28 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     9863  9837 98 Aug07 ?        26-11:14:29 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     9884  9863  0 Aug07 ?        00:00:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     9885  9884  0 Aug07 ?        00:00:20 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt



comp55:~ # ps -ef | grep aph11
aph11     3788     1  0 Aug07 ?        00:00:00 /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp55/active_jobs/860854.1/2.comp55
aph11     3812  3788 99 Aug07 ?        26-18:09:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     3834  3812  0 Aug07 ?        00:00:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     3835  3834  0 Aug07 ?        00:00:19 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
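
The listings above can be checked mechanically. The following is a rough
Python sketch, not an SGE tool, that reads `ps -ef` output, looks for
qrsh_starter processes whose parent is PID 1 (i.e. they have been
re-parented to init) and collects everything still running beneath
them. The function names are made up for illustration, and the sample
is the comp55 listing from this message.

```python
# The comp55 listing from above, used as sample input.
SAMPLE = """\
aph11     3788     1  0 Aug07 ?        00:00:00 /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp55/active_jobs/860854.1/2.comp55
aph11     3812  3788 99 Aug07 ?        26-18:09:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     3834  3812  0 Aug07 ?        00:00:00 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
aph11     3835  3834  0 Aug07 ?        00:00:19 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_0.3/mass3_0.3 input.txt
"""


def parse_ps(text):
    """Parse `ps -ef` lines into (pid, ppid, command) tuples."""
    procs = []
    for line in text.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[1].isdigit() and fields[2].isdigit():
            procs.append((int(fields[1]), int(fields[2]), fields[7]))
    return procs


def orphaned_trees(procs, marker="qrsh_starter"):
    """Map each starter with PPID 1 to the sorted PIDs of its descendants."""
    children = {}
    for pid, ppid, _ in procs:
        children.setdefault(ppid, []).append(pid)

    trees = {}
    for pid, ppid, cmd in procs:
        if ppid == 1 and marker in cmd:
            # Walk the process tree below this starter.
            stack, found = list(children.get(pid, [])), []
            while stack:
                p = stack.pop()
                found.append(p)
                stack.extend(children.get(p, []))
            trees[pid] = sorted(found)
    return trees


trees = orphaned_trees(parse_ps(SAMPLE))
for starter, pids in trees.items():
    print("qrsh_starter", starter, "still has descendants:", pids)
```

On a live node the input would come from `ps -ef` itself rather than a
pasted string. Matching on PPID 1 is only a heuristic for "the parent
has gone away" and may need adjusting on other systems.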







-- 
Dr Chris Rudge
chris.rudge at astro.le.ac.uk

UKAFF Facility Manager & Dept. Research Computing Manager
Dept of Physics & Astronomy
University of Leicester
LE1 7RH

web.  www.ukaff.ac.uk
Tel.  +44 (0)116 2523331
Fax.  +44 (0)116 2231283
Mob.  +44 (0)794 1379420

