[GE users] pvm tight integration help

Bernard Li bli at bcgsc.ca
Thu Jun 15 20:57:54 BST 2006


Have you checked /tmp/pvm* for pvmd log messages?
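
As a rough sketch of that check (assuming pvmd uses its default temporary directory, where the daemon log is normally written as /tmp/pvml.<uid>; the function name show_pvm_logs is made up for illustration), something like this could be run on each node:

```shell
#!/bin/sh
# Sketch: print the tail of any PVM daemon files found in a directory.
# Assumption: pvmd uses its default temp location, so its files match /tmp/pvm*.
show_pvm_logs() {
    dir="${1:-/tmp}"
    found=0
    for f in "$dir"/pvm*; do
        # Skip anything that is not a regular file (including an unmatched glob).
        [ -f "$f" ] || continue
        found=1
        echo "== $f"
        tail -n 20 "$f"
    done
    [ "$found" -eq 1 ] || echo "no pvm files in $dir"
}
```

Calling show_pvm_logs with no argument looks in /tmp. On a remote node the same idea can be tried by hand, for example with: rsh node05 'ls -l /tmp/pvm*'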
 
Cheers,
 
Bernard


________________________________

	From: Greg A [mailto:clusterman at gmail.com] 
	Sent: Thursday, June 15, 2006 12:48
	To: users at gridengine.sunsource.net
	Subject: [GE users] pvm tight integration help
	
	
	We are having some difficulty getting PVM tight integration to
work and we are hoping someone can help.
	
	Our test grid has a parallel queue set up with a couple of
different PVM parallel environments defined.  We created one to test
loose integration and one to test tight integration.  We followed the
recipe that Reuti wrote, but for some reason our qrsh hangs and the jobs
don't start on the slave nodes.  Instead, it tries to transfer the job
to another node until all nodes error out.
	
	We are using the tester_tight script along with the hello code
downloaded from the site.  We've also tried our pvm scripts and code but
haven't had any success there either.
	
	Here is a "ps" output I captured on the master node after
submitting a pvm job. 
	
	
	# qsub -pe pvm-sf 4 tester_tight.sh
	# rsh node01 ps -e f -o pid,ppid,pgrp,command --cols=100
	 2212     1  2212 [sge_execd]
	12064  2212 12064  \_ [sge_shepherd]
	12065 12064 12065      \_ /bin/sh -f /sge_root/pvm/startpvm.sh -catch_rsh
	12075 12065 12065          \_ /sge_root/pvm/bin/lx24-x86/start_pvm -h 4 -n node05
	12076 12075 12065              \_ [qrsh <defunct>]
	

	Our grid is running Red Hat 9 and our native PVM installation is
version 3.4.4.  I thought this might be an issue because Reuti's recipe
calls for version 3.4.5, so I went ahead and installed that in my home
directory.  I then repointed the PVM environment to that source, and it
still failed, getting stuck at the same place with the same "ps" output.
I've also updated my .cshrc with the proper PVM_ROOT and PVM_ARCH,
thinking that the /etc/profile.d/pvm.csh that PVM version 3.4.4 installs
was causing the issue.  That didn't help, and I still get stuck at the
qrsh <defunct> spot.
	
	I'm seeing very little info in the messages files, but here is an
example of the repeated messages:
	
	
	06/15/2006 11:36:39|execd|node05|W|reaping job "6506" ptf complains: Job does not exist
	06/15/2006 12:26:25|execd|node05|E|shepherd of job 6507.1 exited with exit status = 10
	

	We'd really appreciate any help!  
	



