[GE users] SGE6.2u4 and mvapich2_mpd tight integration problem when qdel

godroom godroom3 at gmail.com
Tue Nov 24 13:47:51 GMT 2009


Hi experts,

I configured mvapich2_mpd tight integration following the HOWTO ( http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html ). The MPI job starts and runs fine, but there is a problem when the job is deleted (or finishes).

I suspect the mpd daemons have already disappeared before stopmpich2.sh runs.

Do you have any idea how I can clean up the nodes after qdel using stopmpich2.sh?

I list the configs and outputs below.

Thanks in advance.
KC

1. PE setting

pe_name            mvapich2_mpd
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /applic/sge/mpich2_mpd/startmpich2.sh -catch_rsh \
                   $pe_hostfile \
                   /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1
stop_proc_args     /applic/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
                   /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
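(For discussion: since stop_proc_args may find the mpd ring already gone after qdel, one workaround I am considering is a defensive per-host cleanup driven by the pe_hostfile. This is only a hypothetical sketch, not from the HOWTO; the `ssh`/`pkill` fallback and the DRY_RUN switch are my own assumptions.)

```shell
#!/bin/sh
# Hypothetical sketch: emit (or run) a cleanup command on every host
# listed in an SGE pe_hostfile. Each pe_hostfile line looks like:
#   "<host> <slots> <queue> <processor-range>"
# With DRY_RUN=1 the commands are only printed, not executed.
emit_cleanup() {
    hostfile=$1
    while read -r host slots rest; do
        [ -z "$host" ] && continue
        # Assumed fallback: kill any of this user's leftover mpd daemons.
        cmd="pkill -u $USER -f mpd"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "ssh $host $cmd"
        else
            ssh "$host" "$cmd"
        fi
    done < "$hostfile"
}
```

Whether running this from stop_proc_args is too late (i.e. whether SGE still lets the stop procedure reach the slave hosts after qdel) is exactly my question.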


2. The output ( #$ -j y )

-catch_rsh /applic/sge/default/spool/s0006/active_jobs/140.1/pe_hostfile /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1
s0006:8
s0007:8
startmpich2.sh: check for local mpd daemon (1 of 10)
/applic/sge/bin/lx24-amd64/qrsh -inherit -V s0006 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
startmpich2.sh: check for local mpd daemon (2 of 10)
startmpich2.sh: check for mpd daemons (1 of 10)
/applic/sge/bin/lx24-amd64/qrsh -inherit -V s0007 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 58530 -n
startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: check for mpd daemons (3 of 10)
startmpich2.sh: check for mpd daemons (4 of 10)
startmpich2.sh: got all 2 of 2 nodes
Hello World from Node 0.
Hello World from Node 1.
Hello World from Node 3.
Hello World from Node 4.
Hello World from Node 6.
Hello World from Node 5.
Hello World from Node 14.
Hello World from Node 7.
Hello World from Node 8.
Hello World from Node 9.
Hello World from Node 10.
Hello World from Node 11.
Hello World from Node 2.
Hello World from Node 13.
Hello World from Node 12.
Hello World from Node 15.
-catch_rsh /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1
mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_sgeadmin_sge_140.undefined); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MVAPICH2 User Guide.

3. s0006 server's process list when running
 8551     1  8551 /applic/sge/bin/lx24-amd64/sge_execd
16279  8551 16279  \_ sge_shepherd-141 -bg
16346 16279 16346  |   \_ /bin/bash /applic/sge/default/spool/s0006/job_scripts/141
16350 16346 16346  |       \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpiexec -machinefile /tmp/141.1.all.q/machines -n 16 ./mpihello
16312  8551 16312  \_ sge_shepherd-141 -bg
16313 16312 16313      \_ /applic/sge/utilbin/lx24-amd64/qrsh_starter /applic/sge/default/spool/s0006/active_jobs/141.1/1.s0006
16322 16313 16322          \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16351 16322 16351              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16370 16351 16370              |   \_ ./mpihello
16352 16322 16352              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16363 16352 16363              |   \_ ./mpihello
16353 16322 16353              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16364 16353 16364              |   \_ ./mpihello
16354 16322 16354              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16365 16354 16365              |   \_ ./mpihello
16355 16322 16355              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16366 16355 16366              |   \_ ./mpihello
16356 16322 16356              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16367 16356 16367              |   \_ ./mpihello
16357 16322 16357              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16368 16357 16368              |   \_ ./mpihello
16358 16322 16358              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16369 16358 16369                  \_ ./mpihello
 9482     1  9482 /tmp/netserver -L s0006 -p 15000
 9614     1  9614 xinetd -stayalive -pidfile /var/run/xinetd.pid
10027     1     1 [ldlm_bl_02]
14494     1     1 [ldlm_cb_02]
16303     1 16280 /applic/sge/bin/lx24-amd64/qrsh -inherit -V s0006 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd
16328     1 16280 /applic/sge/bin/lx24-amd64/qrsh -inherit -V s0007 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n

4. s0007 server's process list when running
 8585     1  8585 /applic/sge/bin/lx24-amd64/sge_execd
11491  8585 11491  \_ sge_shepherd-141 -bg
11492 11491 11492      \_ /applic/sge/utilbin/lx24-amd64/qrsh_starter /applic/sge/default/spool/s0007/active_jobs/141.1/1.s0007
11499 11492 11499          \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
11500 11499 11500              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
11508 11500 11508              |   \_ ./mpihello
11501 11499 11501              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
11509 11501 11509              |   \_ ./mpihello
11502 11499 11502              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
11510 11502 11510              |   \_ ./mpihello
11503 11499 11503              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
11511 11503 11511              |   \_ ./mpihello
11504 11499 11504              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
11512 11504 11512              |   \_ ./mpihello
11505 11499 11505              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
11513 11505 11513              |   \_ ./mpihello
11506 11499 11506              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
11514 11506 11514              |   \_ ./mpihello
11507 11499 11507              \_ python2.4 /applic/compilers/pgi/linux86-64/9.0-4/mpi/mvapich2/1.2p1/bin/mpd -h s0006 -p 40808 -n
11515 11507 11515                  \_ ./mpihello

5. s0006 server's process list after qdel

 8551     1  8551 /applic/sge/bin/lx24-amd64/sge_execd


6. s0007 server's process list after qdel
11508     1 11508 ./mpihello
11509     1 11509 ./mpihello
11510     1 11510 ./mpihello
11511     1 11511 ./mpihello
11512     1 11512 ./mpihello
11513     1 11513 ./mpihello
11514     1 11514 ./mpihello
11515     1 11515 ./mpihello
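(Note the leftovers above are re-parented to init, PID 1. As a last resort I could have an epilog reap them by parent PID and command name. The following is only a sketch of that idea, assuming "ps -eo pid,ppid,comm"-style input; the function name and pattern are mine, not from the HOWTO.)

```shell
# Hypothetical sketch: print PIDs of orphaned job processes, i.e.
# processes whose parent is init (PPID 1) and whose command matches
# a pattern. Reads "pid ppid comm" columns on stdin.
orphan_pids() {
    pattern=$1
    awk -v pat="$pattern" '$2 == 1 && $3 ~ pat { print $1 }'
}
```

An epilog could then pass the result to kill, e.g. `ps -eo pid,ppid,comm | orphan_pids mpihello | xargs -r kill` -- but I would prefer the stop procedure to handle this properly.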

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=229017
