[GE users] pvm tight integration help

Bernard Li bli at bcgsc.ca
Thu Jun 15 21:30:01 BST 2006


Hi Greg:
 
If you get the pvmd usage message, I believe that is because you have a
hostname mismatch...  for instance, I've encountered this when SGE thinks
the hostname of the node should be "node01.domain", but when you ssh into
node01 and run "hostname", it returns just "node01" without the FQDN.
 
Perhaps look in that direction and see if that's the case...
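A quick way to compare the two views (just a sketch; "node01" is an example
name, and qconf needs to be run on a host where the SGE admin tools are
available):

	# on the node itself
	hostname
	hostname -f
	# the execution host names SGE has configured
	qconf -sel

If the names disagree, the $SGE_ROOT/<cell>/common/host_aliases file can be
used to map the short and fully qualified names onto each other.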
 
Cheers,
 
Bernard


________________________________

	From: Greg A [mailto:clusterman at gmail.com] 
	Sent: Thursday, June 15, 2006 13:27
	To: users at gridengine.sunsource.net
	Subject: Re: [GE users] pvm tight integration help
	
	
	Yeah, sorry, I forgot to post that....
	
	[pvmd pid5627] 06/15 16:20:40 usage: pvmd3 [-ddebugmask] [-nhostname] [hostfile]
	[pvmd pid5627] 06/15 16:20:40 pvmbailout(0)
	libpvm [pid5572] /tmp/6540.1.all.parallel.q/pvmd.26385: No such file or directory
	libpvm [pid5572] /tmp/6540.1.all.parallel.q/pvmd.26385: No such file or directory
	libpvm [pid5572] /tmp/6540.1.all.parallel.q/pvmd.26385: No such file or directory
	libpvm [pid5572]: pvm_mytid(): Can't contact local daemon
	
	From what I found searching around, this is an indication that the
PVM environment isn't being set up properly (PVM_ROOT, etc.), but as I
stated, I put that stuff in my dot files and it didn't help.
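	A trivial test job along these lines (just a sketch; the PE name
"pvm-sf" matches our setup and the script name is arbitrary) shows what the
job environment on the node actually contains:
	
	---- check_env.sh ----
	#!/bin/sh
	echo "PVM_ROOT=$PVM_ROOT"
	echo "PVM_ARCH=$PVM_ARCH"
	echo "TMPDIR=$TMPDIR"
	ls -l $TMPDIR
	----------------------
	
	# qsub -pe pvm-sf 4 check_env.sh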
	
	
	
	On 6/15/06, Bernard Li <bli at bcgsc.ca> wrote: 

		Have you checked /tmp/pvm* for pvmd log messages?
		 
		Cheers,
		 
		Bernard


________________________________

			From: Greg A [mailto:clusterman at gmail.com] 
			Sent: Thursday, June 15, 2006 12:48
			To: users at gridengine.sunsource.net
			Subject: [GE users] pvm tight integration help
			
			

		
		We are having some difficulty getting PVM tight
integration to work and we are hoping someone can help.
		
		Our test grid has a parallel queue set up with a couple
of different PVM parallel environments defined.  We created one to test
loose integration and one to test tight integration.  We followed the
recipe that Reuti wrote, but for some reason qrsh hangs and the jobs don't
start on the slave nodes.  Instead SGE tries to transfer the job to another
node until all nodes error out.
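		For reference, our PE for the tight case looks roughly
like this (a sketch showing only the fields that matter for tight
integration; the exact start_proc_args follow Reuti's Howto and our local
paths):
		
		# qconf -sp pvm-sf  (excerpt)
		start_proc_args    /sge_root/pvm/startpvm.sh -catch_rsh $pe_hostfile $host /sge_root/pvm
		control_slaves     TRUE
		
		control_slaves has to be TRUE, otherwise qrsh -inherit is
not allowed to start the slave pvmds.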
		
		We are using the tester_tight script along with the
hello code downloaded from the site.  We've also tried our own PVM scripts
and code, but haven't had any success there either.
		
		Here is a "ps" output I captured on the master node
after submitting a pvm job. 
		
		
		# qsub -pe pvm-sf 4 tester_tight.sh
		# rsh node01 ps -e f -o pid,ppid,pgrp,command --cols=100
		 2212     1  2212 [sge_execd]
		12064  2212 12064  \_ [sge_shepherd]
		12065 12064 12065      \_ /bin/sh -f /sge_root/pvm/startpvm.sh -catch_rsh
		12075 12065 12065          \_ /sge_root/pvm/bin/lx24-x86/start_pvm -h 4 -n node05
		12076 12075 12065              \_ [qrsh <defunct>]
		

		Our grid is running Red Hat 9 and our native PVM
installation is version 3.4.4.  I thought this might be an issue because
Reuti's recipe calls for version 3.4.5, so I went ahead and installed that
in my home directory.  I then repointed the PVM parallel environment at
that installation, but it still failed and got stuck at the same place
with the same "ps" output.  I've also updated my .cshrc with the proper
PVM_ROOT and PVM_ARCH, thinking that the /etc/profile.d/pvm.csh installed
by PVM 3.4.4 was causing the issue.  That didn't help, and I still get
stuck at the qrsh <defunct> spot.
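		The .cshrc additions are essentially along these lines (a
sketch; the install path is an example for a home-directory build, and
LINUX is PVM's architecture name for x86 Linux):
		
		setenv PVM_ROOT ${HOME}/pvm3
		setenv PVM_ARCH LINUX
		setenv PATH ${PATH}:${PVM_ROOT}/lib:${PVM_ROOT}/bin/${PVM_ARCH}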
		
		I'm seeing very little info in the messages files, but
here are examples of the repeated messages:
		
		
		06/15/2006 11:36:39|execd|node05|W|reaping job "6506" ptf complains: Job does not exist
		06/15/2006 12:26:25|execd|node05|E|shepherd of job 6507.1 exited with exit status = 10
		

		We'd really appreciate any help!  
		




