[GE users] Rogue MPI processes even with tight integration

Reuti reuti at staff.uni-marburg.de
Mon Sep 3 20:59:44 BST 2007


Well, this looks perfect. All child processes of each qrsh_starter
share the same process group ID. Do you also see this when you issue a
qdel? There was a race condition with h_cpu, but even then the
qrsh_starter disappeared and only an idling process was left behind.
With h_rt, I have never seen this behavior before.
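
The check behind that statement can be sketched in a few lines of Python. This is only an illustration, not part of SGE: the helper names parse_ps, descendants and starter_groups are made up here, and the sample rows are copied from the master listing quoted below with the wrapped lines joined. It parses `ps -e f -o pid,ppid,pgrp,command` output and reports, for each qrsh_starter, the set of process group IDs found among its descendants.

```python
from collections import defaultdict


def parse_ps(text):
    """Parse `ps -e f -o pid,ppid,pgrp,command` output into a list of dicts."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4 or not all(p.isdigit() for p in parts[:3]):
            continue  # skip the header, blank lines and anything malformed
        pid, ppid, pgrp = (int(p) for p in parts[:3])
        # Drop the forest decoration ("|   \_ ") in front of the command.
        command = parts[3].lstrip("|\\_ ")
        procs.append({"pid": pid, "ppid": ppid, "pgrp": pgrp, "command": command})
    return procs


def descendants(procs, root_pid):
    """Return every process below root_pid in the parent/child tree."""
    children = defaultdict(list)
    for p in procs:
        children[p["ppid"]].append(p)
    found, stack = [], [root_pid]
    while stack:
        for child in children[stack.pop()]:
            found.append(child)
            stack.append(child["pid"])
    return found


def starter_groups(procs):
    """Map each qrsh_starter PID to the process group IDs of its descendants."""
    result = {}
    for p in procs:
        argv = p["command"].split()
        # Only the executable name counts; rsh lines merely mention qrsh_starter.
        if argv and argv[0].endswith("/qrsh_starter"):
            result[p["pid"]] = {d["pgrp"] for d in descendants(procs, p["pid"])}
    return result


# Rows taken from the master listing (one shepherd tree, wrapped lines joined).
SAMPLE = r"""
 9418 23665  9418  \_ sge_shepherd-861188 -bg
 9419  9418  9419  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
 9429  9419  9429  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/1.comp63
 9454  9429  9454  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9478  9454  9454  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9479  9478  9454  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
"""

groups = starter_groups(parse_ps(SAMPLE))
print(groups)  # {9429: {9454}}
```

A single process group ID per qrsh_starter, as here, is the pattern visible in the listing; more than one value in a set would point at a process that had moved to a group of its own.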

Did you apply the Myrinet scripts from the mpi folder (which I
wouldn't recommend using with the latest version of the Myrinet
software)?

-- Reuti


On 03.09.2007 at 14:47, Chris Rudge wrote:

> On Mon, 2007-09-03 at 14:37 +0200, Reuti wrote:
>
>> ps -e f -o pid,ppid,pgrp,command --cols=500
>>
>
> On the master:
>
> 23665     1 23665 /usr/local/sge6.0/bin/lx24-amd64/sge_execd
>  9278 23665  9278  \_ sge_shepherd-861188 -bg
>  9292  9278  9292  |   \_ /bin/csh /usr/local/sge6.0/default/spool/comp63/job_scripts/861188
>  9316  9292  9292  |       \_ perl -S -w /usr/local/mpich-mx/path/bin/mpirun.ch_mx.pl --mx-kill 15 -r -np 8 -machinefile /tmp/mpi/861188.1.mpi.q/machines /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9353  9316  9292  |           \_ perl -S -w /usr/local/mpich-mx/path/bin/mpirun.ch_mx.pl --mx-kill 15 -r -np 8 -machinefile /tmp/mpi/861188.1.mpi.q/machines /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9354  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit comp54.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=0 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.154 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9427  9354  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -p 49466 comp54.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/1.comp54'
>  9430  9427  9292  |           |       \_ [rsh] <defunct>
>  9355  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp54.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=1 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.154 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9518  9355  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 49480 comp54.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/3.comp54'
>  9356  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp54.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=2 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.154 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9456  9356  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 49473 comp54.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/2.comp54'
>  9357  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp58.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=3 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.158 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9426  9357  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 52516 comp58.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp58/active_jobs/861188.1/1.comp58'
>  9358  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp63.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=4 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.163 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9455  9358  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 34876 comp63.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/2.comp63'
>  9359  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp63.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=5 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.163 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9504  9359  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 34887 comp63.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/3.comp63'
>  9360  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp63.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=6 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.163 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9428  9360  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 34867 comp63.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/1.comp63'
>  9361  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp63.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=7 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.163 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
>  9553  9361  9292  |               \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 34901 comp63.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/4.comp63'
>  9418 23665  9418  \_ sge_shepherd-861188 -bg
>  9419  9418  9419  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
>  9429  9419  9429  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/1.comp63
>  9454  9429  9454  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9478  9454  9454  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9479  9478  9454  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9420 23665  9420  \_ sge_shepherd-861188 -bg
>  9421  9420  9421  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
>  9477  9421  9477  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/2.comp63
>  9503  9477  9503  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9527  9503  9503  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9528  9527  9503  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9422 23665  9422  \_ sge_shepherd-861188 -bg
>  9423  9422  9423  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
>  9526  9423  9526  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/3.comp63
>  9552  9526  9552  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9595  9552  9552  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9596  9595  9552  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9424 23665  9424  \_ sge_shepherd-861188 -bg
>  9425  9424  9425      \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
>  9555  9425  9555          \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/4.comp63
>  9600  9555  9600              \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9621  9600  9600                  \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>  9622  9621  9600                      \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>
>
>
>
> and on a slave (yes, it's correct that there are only 3 processes
> running on this slave)
>
> 15981     1 15981 /usr/local/sge6.0/bin/lx24-amd64/sge_execd
>   320 15981   320  \_ sge_shepherd-861188 -bg
>   321   320   321  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
>   326   321   326  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/1.comp54
>   350   326   350  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>   372   350   350  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>   373   372   350  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>   322 15981   322  \_ sge_shepherd-861188 -bg
>   323   322   323  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
>   371   323   371  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/2.comp54
>   397   371   397  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>   424   397   397  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>   425   424   397  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>   324 15981   324  \_ sge_shepherd-861188 -bg
>   325   324   325  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
>   400   325   400  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/3.comp54
>   444   400   444  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>   465   444   444  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
>   466   465   444  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
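
The remark about only three processes on the slave can be cross-checked against the master listing by counting the `qrsh -inherit` calls per target host. A minimal sketch, assuming the listing is available as text: ranks_per_host is a hypothetical helper name, and the sample keeps only the start of each of the eight qrsh command lines from the master listing.

```python
import re
from collections import Counter


def ranks_per_host(ps_text):
    """Count `qrsh -inherit` launches per target host in a ps listing."""
    # Matches "qrsh -inherit <host>" and "qrsh -inherit -nostdin <host>".
    pattern = re.compile(r"qrsh -inherit(?: -nostdin)? (\S+)")
    return Counter(pattern.findall(ps_text))


# Start of each qrsh command from the master listing (MXMPI_ID 0 to 7);
# the rest of every command line is left out here.
SAMPLE = """
qrsh -inherit comp54.star.le.ac.uk cd
qrsh -inherit -nostdin comp54.star.le.ac.uk cd
qrsh -inherit -nostdin comp54.star.le.ac.uk cd
qrsh -inherit -nostdin comp58.star.le.ac.uk cd
qrsh -inherit -nostdin comp63.star.le.ac.uk cd
qrsh -inherit -nostdin comp63.star.le.ac.uk cd
qrsh -inherit -nostdin comp63.star.le.ac.uk cd
qrsh -inherit -nostdin comp63.star.le.ac.uk cd
"""

counts = ranks_per_host(SAMPLE)
for host, n in sorted(counts.items()):
    print(host, n)
# comp54.star.le.ac.uk 3
# comp58.star.le.ac.uk 1
# comp63.star.le.ac.uk 4
```

Three of the eight ranks go to comp54, which agrees with the three qrsh_starter trees in the slave listing, and four go to comp63, which agrees with the four on the master.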
