Wed Jan 12 20:38:46 GMT 2011
environment isn't being set up properly (PVM_ROOT, etc.) but as I stated, I
put that stuff in my dot files but it didn't help.
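
One way to see what a job actually gets, rather than what the dot files are meant to provide, is to print the variables from inside the job script. This is only a rough sketch: check_pvm_env is a made-up helper name, and it looks only at the two variables named in this thread (PVM_ROOT and PVM_ARCH).

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or PVM): report whether the
# PVM-related variables named in this thread are visible to the shell.
check_pvm_env() {
    missing=0
    for name in PVM_ROOT PVM_ARCH; do
        # Indirect variable lookup that works in plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is NOT set"
            missing=1
        fi
    done
    # Non-zero return means at least one variable was missing.
    return $missing
}

# Usage idea: call check_pvm_env near the top of the job script so the
# result lands in the job's output file.
```

If the output file shows a variable as not set, the dot files are probably not being read for the shell that the job runs under.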
On 6/15/06, Bernard Li <bli at bcgsc.ca> wrote:
> Have you checked /tmp/pvm* for pvmd log messages?
> *From:* Greg A [mailto:clusterman at gmail.com]
> *Sent:* Thursday, June 15, 2006 12:48
> *To:* users at gridengine.sunsource.net
> *Subject:* [GE users] pvm tight integration help
> We are having some difficulty getting PVM tight integration to work and we
> are hoping someone can help.
> Our test grid has a parallel queue set up with a couple different pvm
> environments defined. We created one to test loose integration and one to
> test tight. We followed the recipe that Reuti wrote and for some reason our
> qrsh is hanging and the jobs don't start on the slave nodes. Instead it
> tries to transfer the job to another node until all nodes Error out.
> We are using the tester_tight script along with the hello code downloaded
> from the site. We've also tried our pvm scripts and code but haven't had
> any success there either.
> Here is a "ps" output I captured on the master node after submitting a pvm
> job:
> # qsub -pe pvm-sf 4 tester_tight.sh
> # rsh node01 ps -e f -o pid,ppid,pgrp,command --cols=100
>  2212     1  2212 [sge_execd]
> 12064  2212 12064  \_ [sge_shepherd]
> 12065 12064 12065      \_ /bin/sh -f /sge_root/pvm/startpvm.sh -catch_rsh
> 12075 12065 12065          \_ /sge_root/pvm/bin/lx24-x86/start_pvm -h 4 -n
> 12076 12075 12065              \_ [qrsh <defunct>]
> Our grid is running Redhat 9 and our native pvm installation is version
> 3.4.4. I thought this may be an issue because Reuti's recipe calls for
> version 3.4.5 so I went ahead and installed that in my home directory. I
> then repointed the pvm environment to that source and it still failed and
> got stuck at the same place with the same "ps" output. I've also updated my
> .cshrc with the proper PVM_ROOT and PVM_ARCH thinking that the
> /etc/profile.d/pvm.csh that pvm version 3.4.4 installs was causing the
> issue. That didn't help and I still get stuck at the qrsh <defunct> spot.
> I'm seeing very little info in the messages files, but here is an example
> of the repeated message:
> 06/15/2006 11:36:39|execd|node05|W|reaping job "6506" ptf complains: Job
> not exist
> 06/15/2006 12:26:25|execd|node05|E|shepherd of job 6507.1 exited with exit
> status = 10
> We'd really appreciate any help!
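
For the /tmp/pvm* check suggested earlier in the thread, something along these lines could be run on each node. It is a sketch only: show_pvm_logs is a made-up helper name, the directory defaults to /tmp as suggested, and the number of lines shown is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: print the last lines of every pvm* file in a
# directory. Defaults to /tmp, the location suggested in this thread.
show_pvm_logs() {
    dir=${1:-/tmp}
    lines=${2:-20}
    found=0
    for f in "$dir"/pvm*; do
        # Skip anything that is not a regular file, including the
        # literal pattern left behind when the glob matches nothing.
        [ -f "$f" ] || continue
        found=1
        echo "==> $f <=="
        tail -n "$lines" "$f"
    done
    [ "$found" -eq 1 ] || echo "no pvm* files in $dir"
}

# Usage idea: show_pvm_logs /tmp 50
```

Running it right after a failed submission, before the files are cleaned up, gives the best chance of catching whatever pvmd wrote.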