[GE users] Rogue MPI processes even with tight integration

Chris Rudge chris.rudge at astro.le.ac.uk
Mon Sep 3 13:47:56 BST 2007


On Mon, 2007-09-03 at 14:37 +0200, Reuti wrote:

> ps -e f -o pid,ppid,pgrp,command --cols=500
> 

On the master:

23665     1 23665 /usr/local/sge6.0/bin/lx24-amd64/sge_execd
 9278 23665  9278  \_ sge_shepherd-861188 -bg
 9292  9278  9292  |   \_ /bin/csh /usr/local/sge6.0/default/spool/comp63/job_scripts/861188
 9316  9292  9292  |       \_ perl -S -w /usr/local/mpich-mx/path/bin/mpirun.ch_mx.pl --mx-kill 15 -r -np 8 -machinefile /tmp/mpi/861188.1.mpi.q/machines /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9353  9316  9292  |           \_ perl -S -w /usr/local/mpich-mx/path/bin/mpirun.ch_mx.pl --mx-kill 15 -r -np 8 -machinefile /tmp/mpi/861188.1.mpi.q/machines /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9354  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit comp54.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=0 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.154 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9427  9354  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -p 49466 comp54.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/1.comp54'
 9430  9427  9292  |           |       \_ [rsh] <defunct>
 9355  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp54.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=1 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.154 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9518  9355  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 49480 comp54.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/3.comp54'
 9356  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp54.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=2 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.154 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9456  9356  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 49473 comp54.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/2.comp54'
 9357  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp58.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=3 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.158 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9426  9357  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 52516 comp58.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp58/active_jobs/861188.1/1.comp58'
 9358  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp63.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=4 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.163 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9455  9358  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 34876 comp63.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/2.comp63'
 9359  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp63.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=5 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.163 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9504  9359  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 34887 comp63.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/3.comp63'
 9360  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp63.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=6 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.163 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9428  9360  9292  |           |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 34867 comp63.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/1.comp63'
 9361  9316  9292  |           \_ /usr/local/sge6.0/bin/lx24-amd64/qrsh -inherit -nostdin comp63.star.le.ac.uk cd /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb && exec env MXMPI_MASTER=comp63 MXMPI_PORT=34826 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1 MXMPI_MAGIC=5984009 MXMPI_ID=7 MXMPI_NP=8 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.1.163 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb "input.txt"
 9553  9361  9292  |               \_ /usr/local/sge6.0/utilbin/lx24-amd64/rsh -n -p 34901 comp63.star.le.ac.uk exec '/usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/4.comp63'
 9418 23665  9418  \_ sge_shepherd-861188 -bg
 9419  9418  9419  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
 9429  9419  9429  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/1.comp63
 9454  9429  9454  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9478  9454  9454  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9479  9478  9454  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9420 23665  9420  \_ sge_shepherd-861188 -bg
 9421  9420  9421  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
 9477  9421  9477  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/2.comp63
 9503  9477  9503  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9527  9503  9503  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9528  9527  9503  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9422 23665  9422  \_ sge_shepherd-861188 -bg
 9423  9422  9423  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
 9526  9423  9526  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/3.comp63
 9552  9526  9552  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9595  9552  9552  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9596  9595  9552  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9424 23665  9424  \_ sge_shepherd-861188 -bg
 9425  9424  9425      \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
 9555  9425  9555          \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp63/active_jobs/861188.1/4.comp63
 9600  9555  9600              \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9621  9600  9600                  \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
 9622  9621  9600                      \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt




And on a slave (yes, it is correct that only 3 processes are
running on this slave):

15981     1 15981 /usr/local/sge6.0/bin/lx24-amd64/sge_execd
  320 15981   320  \_ sge_shepherd-861188 -bg
  321   320   321  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
  326   321   326  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/1.comp54
  350   326   350  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  372   350   350  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  373   372   350  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  322 15981   322  \_ sge_shepherd-861188 -bg
  323   322   323  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
  371   323   371  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/2.comp54
  397   371   397  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  424   397   397  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  425   424   397  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  324 15981   324  \_ sge_shepherd-861188 -bg
  325   324   325  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
  400   325   400  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/3.comp54
  444   400   444  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  465   444   444  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  466   465   444  |                   \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
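
Listings like the two above can also be checked mechanically rather than by eye. The following is only a rough sketch, not something from the original post: it assumes the output comes from the quoted `ps -e f -o pid,ppid,pgrp,command` call, and the helper names (`parse_ps`, `under_shepherd`, `rogue_pids`) are made up for illustration. It reports application processes that have no sge_shepherd ancestor, which is one plausible sign of a process that escaped the tight integration. The sample is an abbreviated excerpt of the slave listing, plus one hypothetical orphan (pid 999) added purely to exercise the check.

```python
import re


def parse_ps(text):
    """Map pid -> (ppid, pgrp, command) from `ps -e f -o pid,ppid,pgrp,command` output."""
    procs = {}
    for line in text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or not all(f.isdigit() for f in fields[:3]):
            continue  # skip blank lines and any header row
        pid, ppid, pgrp = (int(f) for f in fields[:3])
        # Drop the forest decoration ("|   \_ ") that ps prepends to the command.
        command = re.sub(r"^[|\s]*\\_\s+", "", fields[3]).strip()
        procs[pid] = (ppid, pgrp, command)
    return procs


def under_shepherd(procs, pid):
    """Return True if some ancestor of pid is an sge_shepherd process."""
    seen = set()
    while pid in procs and pid not in seen:
        seen.add(pid)
        ppid = procs[pid][0]
        if ppid in procs and procs[ppid][2].startswith("sge_shepherd"):
            return True
        pid = ppid
    return False


def rogue_pids(procs, app):
    """Pids whose executable name ends with `app` but which lack an sge_shepherd ancestor."""
    return sorted(
        pid
        for pid, (_, _, cmd) in procs.items()
        if cmd and cmd.split()[0].endswith(app) and not under_shepherd(procs, pid)
    )


# Abbreviated excerpt of the slave listing; pid 999 is a hypothetical orphan
# (reparented to init) that does not appear in the original output.
sample = r"""
15981     1 15981 /usr/local/sge6.0/bin/lx24-amd64/sge_execd
  320 15981   320  \_ sge_shepherd-861188 -bg
  321   320   321  |   \_ /usr/local/sge6.0/utilbin/lx24-amd64/rshd -l
  326   321   326  |       \_ /usr/local/sge6.0/utilbin/lx24-amd64/qrsh_starter /usr/local/sge6.0/default/spool/comp54/active_jobs/861188.1/1.comp54
  350   326   350  |           \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  372   350   350  |               \_ /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
  999     1   999 /home/tag/aph11/P-Gadget2/cb1_2clouds_mass3_hb/mass3_hb input.txt
"""

procs = parse_ps(sample)
print(rogue_pids(procs, "mass3_hb"))
```

Run against the full listings in this message, the same check should come back empty, since every mass3_hb process shown sits below an sge_shepherd. Matching on the first word of the command keeps the qrsh and mpirun lines, which only mention the binary in their arguments, out of the result.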




-- 
Dr Chris Rudge
chris.rudge at astro.le.ac.uk

UKAFF Facility Manager & Dept. Research Computing Manager
Dept of Physics & Astronomy
University of Leicester
LE1 7RH

web.  www.ukaff.ac.uk
Tel.  +44 (0)116 2523331
Fax.  +44 (0)116 2231283
Mob.  +44 (0)794 1379420

