[GE users] SGE6.2u4 and mvapich2_mpd tight integration problem when qdel

reuti reuti at staff.uni-marburg.de
Thu Nov 26 00:30:06 GMT 2009


Hi,

Am 24.11.2009 um 14:47 schrieb godroom:

> Hi experts,
>
> I configured my mvapich2_mpd tight integration as described in
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> and the MPI job starts and runs well, but there is a problem when the
> job is deleted (or finishes).
>
> I think the mpd daemons disappear before stopmpich2.sh runs.

Exactly this can happen. The mpd on the master node of the job may already have been killed, so the mpds on all other nodes quit as well because they lose their communication partner. The processes you found are then detached, and SGE thinks there is nothing left on those nodes, because the mpd, and with it the shepherd, is already gone. Hence it will never kill anything by the additional group id. There is already an RFE to perform, in such cases, a safety kill on all last known additional group ids on the slave nodes. For now you can try this:

In the queue definition:

$ qconf -sq all.q
...
terminate_method      /home/reuti/killkids.sh $job_pid
...
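The interval 20000 to 21000 used in the script below has to match the gid_range of your cluster configuration; you can look it up with qconf (the range shown here is only an example, not necessarily yours):

$ qconf -sconf | grep gid_range
gid_range                    20000-20100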

A script getkids.sh:

#!/bin/sh
#
# Argument $1 must be the process id (i.e. $job_pid)
#

# Additional group id which SGE attached to the job (adjust 20000-21000 to your gid_range)
group=`awk '/^Groups/ { for (i=2;i<=NF;i++) if ($i>=20000 && $i<=21000) { print $i }}' /proc/$1/status`

# Print the pid of every process whose supplementary groups contain this id
for process in /proc/[0-9]*; do
    awk '/^Groups/ { for (i=2;i<=NF;i++) if ($i==group) { print process }}' process=${process##*/} group=$group $process/status
done
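To see what getkids.sh matches on, you can inspect the Groups line of any process belonging to a running job, e.g. the mpiexec pid 16350 from your listing below; the additional group id 20001 shown here is purely illustrative:

$ grep ^Groups /proc/16350/status
Groups: 100 20001

getkids.sh then prints the pids of all processes whose Groups line still contains this id.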

And a script killkids.sh:

#!/bin/sh
#
# Argument $1 must be the process id (i.e. $job_pid)
#

# Give the regular mpd shutdown and stop procedure a moment to finish.
sleep 5

# Kill every process which still carries the job's additional group id.
/home/reuti/getkids.sh $1 | xargs kill -9

exit 0


These two scripts can of course be combined into one by defining getkids as a shell function; a sketch follows below. The script first determines the additional group id of the job (adjust the interval 20000 to 21000 to the gid_range you use), and then collects all processes which are still attached to this additional group id.
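A minimal sketch of such a combined script, with getkids turned into a shell function (again assuming the 20000 to 21000 interval; adjust it to your gid_range):

#!/bin/sh
#
# Combined killkids.sh: argument $1 must be the process id (i.e. $job_pid)
#

getkids()
{
    # Additional group id which SGE attached to the job's processes
    group=`awk '/^Groups/ { for (i=2;i<=NF;i++) if ($i>=20000 && $i<=21000) { print $i }}' /proc/$1/status`

    # All processes whose supplementary groups still contain this id
    for process in /proc/[0-9]*; do
        awk '/^Groups/ { for (i=2;i<=NF;i++) if ($i==group) { print process }}' process=${process##*/} group=$group $process/status
    done
}

# Give the regular stop procedure a moment before sweeping up leftovers.
sleep 5

getkids $1 | xargs kill -9

exit 0

terminate_method in the queue definition would then point to this single script instead of killkids.sh.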

-- Reuti


> Do you have any idea how I can clean up the nodes after qdel using
> this stopmpich2.sh?
>
> I list the configs and outputs below.
>
> Thanks in advance.
> KC
>
> 1. PE setting
>
> pe_name            mvapich2_mpd
> slots              99999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /applic/sge/mpich2_mpd/startmpich2.sh -catch_rsh \
>                    $pe_hostfile \
>                    /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1
> stop_proc_args     /applic/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
>                    /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
>
> 2. The output ( #$ -j y )
>
> -catch_rsh /applic/sge/default/spool/s0006/active_jobs/140.1/pe_hostfile /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1
> s0006:8
> s0007:8
> startmpich2.sh: check for local mpd daemon (1 of 10)
> /applic/sge/bin/lx24-amd64/qrsh -inherit -V s0006 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> startmpich2.sh: check for local mpd daemon (2 of 10)
> startmpich2.sh: check for mpd daemons (1 of 10)
> /applic/sge/bin/lx24-amd64/qrsh -inherit -V s0007 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 58530 -n
> startmpich2.sh: check for mpd daemons (2 of 10)
> startmpich2.sh: check for mpd daemons (3 of 10)
> startmpich2.sh: check for mpd daemons (4 of 10)
> startmpich2.sh: got all 2 of 2 nodes
> Hello World from Node 0.
> Hello World from Node 1.
> Hello World from Node 3.
> Hello World from Node 4.
> Hello World from Node 6.
> Hello World from Node 5.
> Hello World from Node 14.
> Hello World from Node 7.
> Hello World from Node 8.
> Hello World from Node 9.
> Hello World from Node 10.
> Hello World from Node 11.
> Hello World from Node 2.
> Hello World from Node 13.
> Hello World from Node 12.
> Hello World from Node 15.
> -catch_rsh /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1
> mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_sgeadmin_sge_140.undefined); possible causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>     mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MVAPICH2 User Guide.
>
> 3. s0006 server's process list when running
>  8551     1  8551 /applic/sge/bin/lx24-amd64/sge_execd
> 16279  8551 16279  \_ sge_shepherd-141 -bg
> 16346 16279 16346  |   \_ /bin/bash /applic/sge/default/spool/s0006/job_scripts/141
> 16350 16346 16346  |       \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpiexec -machinefile /tmp/141.1.all.q/machines -n 16 ./mpihello
> 16312  8551 16312  \_ sge_shepherd-141 -bg
> 16313 16312 16313      \_ /applic/sge/utilbin/lx24-amd64/qrsh_starter /applic/sge/default/spool/s0006/active_jobs/141.1/1.s0006
> 16322 16313 16322          \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16351 16322 16351              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16370 16351 16370              |   \_ ./mpihello
> 16352 16322 16352              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16363 16352 16363              |   \_ ./mpihello
> 16353 16322 16353              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16364 16353 16364              |   \_ ./mpihello
> 16354 16322 16354              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16365 16354 16365              |   \_ ./mpihello
> 16355 16322 16355              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16366 16355 16366              |   \_ ./mpihello
> 16356 16322 16356              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16367 16356 16367              |   \_ ./mpihello
> 16357 16322 16357              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16368 16357 16368              |   \_ ./mpihello
> 16358 16322 16358              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16369 16358 16369                  \_ ./mpihello
>  9482     1  9482 /tmp/netserver -L s0006 -p 15000
>  9614     1  9614 xinetd -stayalive -pidfile /var/run/xinetd.pid
> 10027     1     1 [ldlm_bl_02]
> 14494     1     1 [ldlm_cb_02]
> 16303     1 16280 /applic/sge/bin/lx24-amd64/qrsh -inherit -V s0006 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
> 16328     1 16280 /applic/sge/bin/lx24-amd64/qrsh -inherit -V s0007 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
>
> 4. s0007 server's process list when running
>  8585     1  8585 /applic/sge/bin/lx24-amd64/sge_execd
> 11491  8585 11491  \_ sge_shepherd-141 -bg
> 11492 11491 11492      \_ /applic/sge/utilbin/lx24-amd64/qrsh_starter /applic/sge/default/spool/s0007/active_jobs/141.1/1.s0007
> 11499 11492 11499          \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
> 11500 11499 11500              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
> 11508 11500 11508              |   \_ ./mpihello
> 11501 11499 11501              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
> 11509 11501 11509              |   \_ ./mpihello
> 11502 11499 11502              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
> 11510 11502 11510              |   \_ ./mpihello
> 11503 11499 11503              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
> 11511 11503 11511              |   \_ ./mpihello
> 11504 11499 11504              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
> 11512 11504 11512              |   \_ ./mpihello
> 11505 11499 11505              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
> 11513 11505 11513              |   \_ ./mpihello
> 11506 11499 11506              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
> 11514 11506 11514              |   \_ ./mpihello
> 11507 11499 11507              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
> 11515 11507 11515                  \_ ./mpihello
>
> 5. s0006 server's process list after qdel
>
>  8551     1  8551 /applic/sge/bin/lx24-amd64/sge_execd
>
>
> 6. s0007 server's process list after qdel
> 11508     1 11508 ./mpihello
> 11509     1 11509 ./mpihello
> 11510     1 11510 ./mpihello
> 11511     1 11511 ./mpihello
> 11512     1 11512 ./mpihello
> 11513     1 11513 ./mpihello
> 11514     1 11514 ./mpihello
> 11515     1 11515 ./mpihello
>
