[GE users] pvm tight integration help

Greg A clusterman at gmail.com
Thu Jun 15 20:47:38 BST 2006



We are having some difficulty getting PVM tight integration to work and we
are hoping someone can help.

Our test grid has a parallel queue set up with a couple of different PVM
environments defined.  We created one to test loose integration and one to
test tight integration.  We followed the recipe that Reuti wrote, but for
some reason our qrsh hangs and the jobs don't start on the slave nodes.
Instead, it tries to transfer the job to another node, and this repeats
until all nodes error out.

We are using the tester_tight script along with the hello code downloaded
from the site.  We've also tried our own PVM scripts and code, but haven't
had any success there either.

Here is the "ps" output I captured on the master node after submitting a PVM
job.

# qsub -pe pvm-sf 4 tester_tight.sh
# rsh node01 ps -e f -o pid,ppid,pgrp,command --cols=100
 2212     1  2212 [sge_execd]
12064  2212 12064  \_ [sge_shepherd]
12065 12064 12065      \_ /bin/sh -f /sge_root/pvm/startpvm.sh -catch_rsh
12075 12065 12065          \_ /sge_root/pvm/bin/lx24-x86/start_pvm -h 4 -n node05
12076 12075 12065              \_ [qrsh <defunct>]

Our grid is running Red Hat 9 and our native PVM installation is version
3.4.4.  I thought this might be the issue, because Reuti's recipe calls for
version 3.4.5, so I went ahead and installed that in my home directory.  I
then repointed the PVM environment to that installation, but it still failed
and got stuck at the same place with the same "ps" output.  I've also updated
my .cshrc with the proper PVM_ROOT and PVM_ARCH, thinking that the
/etc/profile.d/pvm.csh that PVM version 3.4.4 installs was causing the
issue.  That didn't help, and I still get stuck at the qrsh <defunct> spot.

I'm seeing very little info in the messages files, but here is an example of
the repeated messages:

06/15/2006 11:36:39|execd|node05|W|reaping job "6506" ptf complains: Job does not exist
06/15/2006 12:26:25|execd|node05|E|shepherd of job 6507.1 exited with exit status = 10
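
For anyone wanting to pull the same kind of entries out of a larger messages
file, here is a minimal sketch.  It assumes every entry uses the
pipe-separated layout seen in the two lines above (timestamp, daemon, host,
one-letter severity, message).  The names ExecdEntry, parse_line and problems
are made up for illustration and are not part of Grid Engine.

```python
from typing import Iterable, List, NamedTuple, Optional


class ExecdEntry(NamedTuple):
    """One entry of a messages file, in the layout assumed above."""
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[ExecdEntry]:
    """Split one line into its five fields; return None if it does not fit."""
    # Split at most four times so a '|' inside the message text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return ExecdEntry(*parts)


def problems(lines: Iterable[str]) -> List[ExecdEntry]:
    """Keep only the warning (W) and error (E) entries."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.level in ("W", "E")]


# The two entries quoted in this message, used as sample input.
SAMPLE = [
    '06/15/2006 11:36:39|execd|node05|W|reaping job "6506" ptf complains: '
    'Job does not exist',
    '06/15/2006 12:26:25|execd|node05|E|shepherd of job 6507.1 exited with '
    'exit status = 10',
]

if __name__ == "__main__":
    for entry in problems(SAMPLE):
        print(entry.timestamp, entry.host, entry.level, entry.message)
```

To run it against a real file, pass the open file object to problems() in
place of SAMPLE; the location of the messages file depends on the
installation.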

We'd really appreciate any help!


