[GE users] pvm tight integration help

Reuti reuti at staff.uni-marburg.de
Fri Jun 16 00:50:20 BST 2006


On 15.06.2006, at 22:27, Greg A wrote:

> Yeah, sorry, I forgot to post that....
>
> [pvmd pid5627] 06/15 16:20:40 usage: pvmd3 [-ddebugmask] [-nhostname] [hostfile]

This is an error message from PVM itself: pvmd3 printed its usage line,
which means it was started with arguments it didn't understand. Any
additional information in the .po or .pe files? - Reuti

BTW: The scripts in the Howto are not identical to the ones in the  
SGE distribution.
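
With the default output paths, those files for job 6540 (the ID visible
in the pvmd socket path below) would sit in the job owner's home
directory; the names here are the SGE defaults and only a guess if
-o/-e were set:

   cat ~/tester_tight.sh.po6540
   cat ~/tester_tight.sh.pe6540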


> [pvmd pid5627] 06/15 16:20:40 pvmbailout(0)
> libpvm [pid5572] /tmp/6540.1.all.parallel.q/pvmd.26385: No such file or directory
> libpvm [pid5572] /tmp/6540.1.all.parallel.q/pvmd.26385: No such file or directory
> libpvm [pid5572] /tmp/6540.1.all.parallel.q/pvmd.26385: No such file or directory
> libpvm [pid5572]: pvm_mytid(): Can't contact local daemon
>
> From what I found searching around, this indicates that the PVM
> environment isn't being set up properly (PVM_ROOT, etc.), but as I
> stated, I put that stuff in my dot files and it didn't help.
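>
> (One quick sanity check, since dot files aren't always sourced for the
> kind of shell that rsh/qrsh starts: verify the variables actually
> arrive on a node, e.g.
>
>    rsh node01 'echo $PVM_ROOT $PVM_ARCH'
>
> node01 is just an example host from the ps output below.)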
>
>
> On 6/15/06, Bernard Li <bli at bcgsc.ca> wrote:
> Have you checked /tmp/pvm* for pvmd log messages?
>
> Cheers,
>
> Bernard
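>
> A minimal sketch of that check (pvml.<uid> is the standard per-user
> pvmd log file; the id lookup is just one way to find yours):
>
>    ls -l /tmp/pvm*
>    cat /tmp/pvml.`id -u`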
>
> From: Greg A [mailto:clusterman at gmail.com]
> Sent: Thursday, June 15, 2006 12:48
> To: users at gridengine.sunsource.net
> Subject: [GE users] pvm tight integration help
>
> We are having some difficulty getting PVM tight integration to work  
> and we are hoping someone can help.
>
> Our test grid has a parallel queue set up with a couple of different
> PVM parallel environments defined.  We created one to test loose
> integration and one to test tight integration.  We followed the
> recipe that Reuti wrote, but for some reason our qrsh hangs and the
> jobs don't start on the slave nodes.  Instead the job is passed from
> node to node until all nodes go into the error state.
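>
> For reference, the tight PE looks roughly like this (a sketch
> following Reuti's Howto; the slot count and allocation rule here are
> placeholders, and our exact start/stop arguments may differ):
>
>    pe_name            pvm-sf
>    slots              8
>    start_proc_args    /sge_root/pvm/startpvm.sh -catch_rsh \
>                       $pe_hostfile $host $PVM_ROOT
>    stop_proc_args     /sge_root/pvm/stoppvm.sh -catch_rsh $pe_hostfile
>    allocation_rule    $round_robin
>    control_slaves     TRUE
>    job_is_first_task  FALSE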
>
> We are using the tester_tight script along with the hello code
> downloaded from the site.  We've also tried our own PVM scripts and
> code, but haven't had any success there either.
>
> Here is "ps" output captured on the job's master node after
> submitting a PVM job:
>
> # qsub -pe pvm-sf 4 tester_tight.sh
> # rsh node01 ps -e f -o pid,ppid,pgrp,command --cols=100
>  2212     1  2212 [sge_execd]
> 12064  2212 12064  \_ [sge_shepherd]
> 12065 12064 12065      \_ /bin/sh -f /sge_root/pvm/startpvm.sh -catch_rsh
> 12075 12065 12065          \_ /sge_root/pvm/bin/lx24-x86/start_pvm -h 4 -n node05
> 12076 12075 12065              \_ [qrsh <defunct>]
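>
> If it helps to narrow down where it hangs, a minimal probe of the
> qrsh plumbing might look like this (a hypothetical sketch; node05
> stands in for whatever slave $PE_HOSTFILE lists, and -inherit only
> works from within a running parallel job):
>
>    #!/bin/sh
>    # submitted with: qsub -pe pvm-sf 2 probe.sh
>    # show the hosts SGE granted to this job
>    cat $PE_HOSTFILE
>    # issue the same kind of call start_pvm relies on for a slave
>    qrsh -inherit node05 hostname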
>
> Our grid is running Red Hat 9 and our native PVM installation is
> version 3.4.4.  I thought this might be the issue, because Reuti's
> recipe calls for version 3.4.5, so I went ahead and installed that
> in my home directory.  I then repointed the PVM environment to that
> installation, but it still failed and got stuck at the same place
> with the same "ps" output.  I've also updated my .cshrc with the
> proper PVM_ROOT and PVM_ARCH, thinking that the /etc/profile.d/
> pvm.csh installed by PVM 3.4.4 was causing the issue.  That didn't
> help either, and I still get stuck at the qrsh <defunct> spot.
>
> I'm seeing very little info in the messages files, but here is an
> example of the repeated messages:
>
> 06/15/2006 11:36:39|execd|node05|W|reaping job "6506" ptf complains: Job does not exist
> 06/15/2006 12:26:25|execd|node05|E|shepherd of job 6507.1 exited with exit status = 10
>
> We'd really appreciate any help!
>
