[GE users] Problem with suspend/resume methods

Gerd Marquardt marquardt at rrzn.uni-hannover.de
Thu Mar 9 17:01:36 GMT 2006


I want to write suspend and resume methods that also suspend and resume
our MPI jobs and do some additional work on the remote nodes. We have
16 nodes with 4 CPUs each.
A test that triggers the suspend and resume methods shows that one
suspend method but five resume methods are started. This makes it
difficult to write global methods.
Here is the process tree:
31926 ?        S      0:44 /cluster/sge/bin/lx24-amd64/sge_execd
31928 ?        S      0:05  \_ /bin/ksh /cluster/sge/util/resources/loadsensors/notshared.sh
25029 ?        S      0:00  \_ sge_shepherd-10579 -bg
25061 ?        S      0:00  |   \_ -ksh /cluster/sge/default/spool/lcn02/job_scripts/10579
25132 ?        S      0:00  |   |   \_ /bin/sh /usr/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun -machinefile /tmp/10579.1.all.q/machi
25160 ?        S      0:00  |   |       \_ ///usr/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -rsh -np 8 -hostfile /tmp/10579.1.a
25161 ?        S      0:00  |   |           \_ /cluster/sge/bin/lx24-amd64/qrsh -inherit lcn02 cd /home/zzzzmq/hello; /usr/bin/env
25238 ?        S      0:00  |   |           |   \_ /cluster/sge/utilbin/lx24-amd64/rsh -p 58550 lcn02.rrzn.uni-hannover.de exec '/c
25246 ?        Z      0:00  |   |           |       \_ [rsh <defunct>]
25162 ?        S      0:00  |   |           \_ /cluster/sge/bin/lx24-amd64/qrsh -inherit lcn02 cd /home/zzzzmq/hello; /usr/bin/env
25235 ?        S      0:00  |   |           |   \_ /cluster/sge/utilbin/lx24-amd64/rsh -p 58546 lcn02.rrzn.uni-hannover.de exec '/c
25244 ?        Z      0:00  |   |           |       \_ [rsh <defunct>]
25163 ?        S      0:00  |   |           \_ /cluster/sge/bin/lx24-amd64/qrsh -inherit lcn02 cd /home/zzzzmq/hello; /usr/bin/env
25233 ?        S      0:00  |   |           |   \_ /cluster/sge/utilbin/lx24-amd64/rsh -p 58545 lcn02.rrzn.uni-hannover.de exec '/c
25248 ?        Z      0:00  |   |           |       \_ [rsh <defunct>]
25164 ?        S      0:00  |   |           \_ /cluster/sge/bin/lx24-amd64/qrsh -inherit lcn02 cd /home/zzzzmq/hello; /usr/bin/env
25240 ?        S      0:00  |   |           |   \_ /cluster/sge/utilbin/lx24-amd64/rsh -p 58558 lcn02.rrzn.uni-hannover.de exec '/c
25254 ?        Z      0:00  |   |           |       \_ [rsh <defunct>]
25165 ?        S      0:00  |   |           \_ /cluster/sge/bin/lx24-amd64/qrsh -inherit lcn01 cd /home/zzzzmq/hello; /usr/bin/env
25239 ?        S      0:00  |   |           |   \_ /cluster/sge/utilbin/lx24-amd64/rsh -p 45080 lcn01.rrzn.uni-hannover.de exec '/c
25255 ?        Z      0:00  |   |           |       \_ [rsh <defunct>]
25166 ?        S      0:00  |   |           \_ /cluster/sge/bin/lx24-amd64/qrsh -inherit lcn01 cd /home/zzzzmq/hello; /usr/bin/env
25234 ?        S      0:00  |   |           |   \_ /cluster/sge/utilbin/lx24-amd64/rsh -p 45073 lcn01.rrzn.uni-hannover.de exec '/c
25241 ?        Z      0:00  |   |           |       \_ [rsh <defunct>]
25167 ?        S      0:00  |   |           \_ /cluster/sge/bin/lx24-amd64/qrsh -inherit lcn01 cd /home/zzzzmq/hello; /usr/bin/env
25237 ?        S      0:00  |   |           |   \_ /cluster/sge/utilbin/lx24-amd64/rsh -p 45078 lcn01.rrzn.uni-hannover.de exec '/c
25249 ?        Z      0:00  |   |           |       \_ [rsh <defunct>]
25168 ?        S      0:00  |   |           \_ /cluster/sge/bin/lx24-amd64/qrsh -inherit lcn01 cd /home/zzzzmq/hello; /usr/bin/env
25236 ?        S      0:00  |   |               \_ /cluster/sge/utilbin/lx24-amd64/rsh -p 45074 lcn01.rrzn.uni-hannover.de exec '/c
25242 ?        Z      0:00  |   |                   \_ [rsh <defunct>]
25269 ?        R      2:20  |   \_ /bin/sh /cluster/sge/cluh/suspend_mpi_test.sh 25061 10579
25378 ?        R      0:07  |   \_ /bin/sh /cluster/sge/cluh/resume_mpi_test.sh 25061 10579
25225 ?        S      0:00  \_ sge_shepherd-10579 -bg
25228 ?        S      0:00  |   \_ /cluster/sge/utilbin/lx24-amd64/rshd -l
25243 ?        S      0:00  |   |   \_ /cluster/sge/utilbin/lx24-amd64/qrsh_starter /cluster/sge/default/spool/lcn02/active_jobs/10
25250 ?        RL     3:10  |   |       \_ /home/zzzzmq/hello/./hello_loop.x
25382 ?        R      0:07  |   \_ /bin/sh /cluster/sge/cluh/resume_mpi_test.sh 25250 10579
25226 ?        S      0:00  \_ sge_shepherd-10579 -bg
25229 ?        S      0:00  |   \_ /cluster/sge/utilbin/lx24-amd64/rshd -l
25247 ?        S      0:00  |   |   \_ /cluster/sge/utilbin/lx24-amd64/qrsh_starter /cluster/sge/default/spool/lcn02/active_jobs/10
25251 ?        RL     3:09  |   |       \_ /home/zzzzmq/hello/./hello_loop.x
25379 ?        R      0:07  |   \_ /bin/sh /cluster/sge/cluh/resume_mpi_test.sh 25251 10579
25227 ?        S      0:00  \_ sge_shepherd-10579 -bg
25231 ?        S      0:00  |   \_ /cluster/sge/utilbin/lx24-amd64/rshd -l
25245 ?        S      0:00  |   |   \_ /cluster/sge/utilbin/lx24-amd64/qrsh_starter /cluster/sge/default/spool/lcn02/active_jobs/10
25252 ?        RL     2:59  |   |       \_ /home/zzzzmq/hello/./hello_loop.x
25380 ?        R      0:08  |   \_ /bin/sh /cluster/sge/cluh/resume_mpi_test.sh 25252 10579
25230 ?        S      0:00  \_ sge_shepherd-10579 -bg
25232 ?        S      0:00      \_ /cluster/sge/utilbin/lx24-amd64/rshd -l
25253 ?        S      0:00      |   \_ /cluster/sge/utilbin/lx24-amd64/qrsh_starter /cluster/sge/default/spool/lcn02/active_jobs/10
25262 ?        RL     2:30      |       \_ /home/zzzzmq/hello/./hello_loop.x
25381 ?        R      0:07      \_ /bin/sh /cluster/sge/cluh/resume_mpi_test.sh 25262 10579

How can I ensure that only one resume method is started?
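
One possible workaround, given only as a minimal sketch under
assumptions: since each sge_shepherd in the tree above starts its own
copy of the resume script, the script itself could guard the global
work with an atomic lock so that only the first copy acts. The helper
names run_once and release_once, the lock directory location, and the
reading of the second script argument (10579 above) as the job id are
assumptions made for illustration; they are not Grid Engine features.

```sh
#!/bin/sh
# Hypothetical helpers (not part of Grid Engine): let only the first
# of several concurrently started method scripts do the global work.

# run_once JOB_ID LOCK_ROOT COMMAND [ARGS...]
# Runs COMMAND only if this caller is the first to create the lock.
run_once() {
    job_id=$1
    lock_root=$2
    shift 2
    lock_dir="$lock_root/resume_lock.$job_id"
    # mkdir either creates the directory or fails if it already exists,
    # so exactly one of several concurrent callers succeeds.
    if mkdir "$lock_dir" 2>/dev/null; then
        "$@"
        return $?
    fi
    # Another copy already holds the lock: do nothing.
    return 0
}

# release_once JOB_ID LOCK_ROOT
# Removes the lock, e.g. from the suspend method, so that the next
# resume cycle can take it again.
release_once() {
    rmdir "$2/resume_lock.$1" 2>/dev/null
    return 0
}
```

Inside resume_mpi_test.sh this could be called as
`run_once "$2" /tmp my_global_resume "$1"`, where my_global_resume is a
placeholder for the actual work. A lock under /tmp only collapses the
copies started on one host; covering all nodes would require a lock
directory on a file system shared by them, which is a further
assumption about the site.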

 
 Gerd Marquardt

 
