[GE users] pvm tight integration help

Greg A clusterman at gmail.com
Mon Jun 19 22:57:43 BST 2006



Sorry for not responding sooner.  The messages I posted earlier were from
the .pe files, and there really wasn't anything in them other than that
message repeated.  Here is the message from the .po files, which is just
as repetitive:


-catch_rsh /sge_root/default/spool/node01/active_jobs/6540.1/pe_hostfile node01.domain.com /usr/share/pvm3
/sge_root/bin/lx24-x86/qrsh -V -inherit node01.domain.com env PVM_DPATH=/usr/share/pvm3/lib/pvmd3 PVM_TMP=$TMPDIR /usr/share/pvm3/lib/pvmd /tmp/6540.1.all.parallel.q/hostfile
 -f
startpvm.sh: startup failed - invoking cleanup script
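
One thing I may also try is re-running that qrsh line by hand from within
a job's environment on the master node, so I can see what pvmd itself
complains about rather than just the cleanup message.  Roughly along these
lines (the hostfile path is per-job and only exists while a job is actually
running, and the node name has to match an entry in that job's pe_hostfile):

  /sge_root/bin/lx24-x86/qrsh -V -inherit node01.domain.com env \
      PVM_DPATH=/usr/share/pvm3/lib/pvmd3 PVM_TMP=$TMPDIR \
      /usr/share/pvm3/lib/pvmd /tmp/6540.1.all.parallel.q/hostfile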


I'm going to take Bernard's advice and poke around to see if the hostnames
may be an issue.
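
In case it helps anyone following along, the check I have in mind is just
comparing what each exec host calls itself with the names SGE put into the
pe_hostfile, roughly like this (the paths and node names are the ones from
the job above, and the active_jobs entry only exists while the job runs):

  # what SGE handed to startpvm.sh for this job
  cat /sge_root/default/spool/node01/active_jobs/6540.1/pe_hostfile
  # what the node itself reports, and how it resolves the FQDN
  rsh node01 hostname
  rsh node01 'getent hosts node01.domain.com'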

-Greg


On 6/15/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>
> Am 15.06.2006 um 22:27 schrieb Greg A:
>
> > Yeah, sorry, I forgot to post that....
> >
> > [pvmd pid5627] 06/15 16:20:40 usage: pvmd3 [-ddebugmask] [-nhostname] [hostfile]
>
> This looks like an error message from PVM. Any additional information
> in the .po or .pe files? - Reuti
>
> BTW: The scripts in the Howto are not identical to the ones in the
> SGE distribution.
>
>
> > [pvmd pid5627] 06/15 16:20:40 pvmbailout(0)
> > libpvm [pid5572] /tmp/6540.1.all.parallel.q/pvmd.26385: No such file or directory
> > libpvm [pid5572] /tmp/6540.1.all.parallel.q/pvmd.26385: No such file or directory
> > libpvm [pid5572] /tmp/6540.1.all.parallel.q/pvmd.26385: No such file or directory
> > libpvm [pid5572]: pvm_mytid(): Can't contact local daemon
> >
> > From what I found searching around, this indicates that the PVM
> > environment isn't being set up properly (PVM_ROOT, etc.), but as I
> > stated, I put those settings in my dot files and it didn't help.
> >
> >
> > On 6/15/06, Bernard Li <bli at bcgsc.ca> wrote:
> > Have you checked /tmp/pvm* for pvmd log messages?
> >
> > Cheers,
> >
> > Bernard
> >
> > From: Greg A [mailto:clusterman at gmail.com]
> > Sent: Thursday, June 15, 2006 12:48
> > To: users at gridengine.sunsource.net
> > Subject: [GE users] pvm tight integration help
> >
> > We are having some difficulty getting PVM tight integration to work
> > and we are hoping someone can help.
> >
> > Our test grid has a parallel queue set up with a couple of different
> > PVM parallel environments defined.  We created one to test loose
> > integration and one to test tight.  We followed the recipe that Reuti
> > wrote, and for some reason our qrsh is hanging and the jobs don't
> > start on the slave nodes.  Instead SGE tries to transfer the job to
> > another node until all nodes error out.
> >
> > We are using the tester_tight script along with the hello code
> > downloaded from the site.  We've also tried our pvm scripts and
> > code but haven't had any success there either.
> >
> > Here is a "ps" output I captured on the master node after
> > submitting a pvm job.
> >
> > # qsub -pe pvm-sf 4 tester_tight.sh
> > # rsh node01 ps -e f -o pid,ppid,pgrp,command --cols=100
> >  2212     1  2212 [sge_execd]
> > 12064  2212 12064  \_ [sge_shepherd]
> > 12065 12064 12065      \_ /bin/sh -f /sge_root/pvm/startpvm.sh -catch_rsh
> > 12075 12065 12065          \_ /sge_root/pvm/bin/lx24-x86/start_pvm -h 4 -n node05
> > 12076 12075 12065              \_ [qrsh <defunct>]
> >
> > Our grid is running Red Hat 9 and our native PVM installation is
> > version 3.4.4.  I thought this might be an issue because Reuti's
> > recipe calls for version 3.4.5, so I went ahead and installed that
> > in my home directory.  I then repointed the PVM environment at that
> > installation, and it still failed and got stuck at the same place
> > with the same "ps" output.  I've also updated my .cshrc with the
> > proper PVM_ROOT and PVM_ARCH, thinking that the /etc/profile.d/pvm.csh
> > installed by PVM 3.4.4 was causing the issue.  That didn't help, and
> > I still get stuck at the qrsh <defunct> spot.
> >
> > I'm seeing very little info in the messages files, but here is an
> > example of the repeated messages:
> >
> > 06/15/2006 11:36:39|execd|node05|W|reaping job "6506" ptf complains: Job does not exist
> > 06/15/2006 12:26:25|execd|node05|E|shepherd of job 6507.1 exited with exit status = 10
> >
> > We'd really appreciate any help!
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>


