[GE users] Migration Issue + Conditional "failed receiving gdi request" error

McCalla, Mac macmccalla at hess.com
Mon Aug 13 20:22:59 BST 2007


Hi Jonathan
No, we use the classic spooling approach here (currently 6.0u7), and we
can migrate between the qmaster and shadow systems (most recently last
month) without doing anything more than running the migrate command.  Is
the environment variable setup the same when you log in to the
cerebro-rmn2 machine?  Is it possible that there is a mismatch in SGE
software levels between the two systems?  Are there any informative
messages in the shadow master messages file
($SGE_ROOT/$CELL/spool/qmaster/messages_shadowd.$hostname)?
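
The first and last of those checks could be scripted roughly as follows.
This is only a minimal sketch: the function names are made up for
illustration, it assumes the cell name is exported as SGE_CELL, and it
assumes the shadowd log lives at the path given above.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of Grid Engine.

# Print the SGE-related environment, sorted, so that the output from
# both hosts can be compared with diff.
sge_env() {
    env | grep '^SGE_' | sort
}

# Build the shadowd messages path for the host name given as $1.
# Assumes SGE_ROOT and SGE_CELL are already set in the environment.
shadowd_log() {
    echo "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages_shadowd.$1"
}

# Show the last lines of the shadowd log for a host, if the file exists.
show_shadowd_log() {
    log=$(shadowd_log "$1")
    if [ -f "$log" ]; then
        tail -n 20 "$log"
    else
        echo "no shadowd log at $log" >&2
        return 1
    fi
}
```

Saving the output of sge_env on each of the two hosts and diffing the
results should show any difference in the environment.  The host name
passed to show_shadowd_log has to match the suffix of the file actually
present in the spool directory, which may be a short or a fully
qualified name.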
 
Mac

________________________________

From: Jonathan Pierce [mailto:jonathan.pierce at loni.ucla.edu] 
Sent: Monday, August 13, 2007 12:38 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Migration Issue + Conditional "failed receiving
gdi request" error


Hi Reuti, 

Thank you for the reply.  The shadow master was set up as described, but
we are using the classic spooling method; is this expected behavior for
that choice?

Sincerely,
Jonathan

Jonathan Pierce
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonathan.pierce at loni.ucla.edu


On Aug 11, 2007, at 4:41 AM, Reuti wrote:


	Hi,

	Am 11.08.2007 um 04:08 schrieb Jonathan Pierce:


		We have two machines set up for qmaster responsibilities:
		cerebro-rmn1 and cerebro-rmn2.  Right now, cerebro-rmn1 is
		master.  If I attempt to qstat from cerebro-rmn2:

		<hdezmora at cerebro-rmn2.data> qstat
		error: commlib error: got read error (closing "cerebro-rmn1.data/qmaster/1")
		error: failed sending gdi request


	Did you set up the shadow master as described in
	http://gridengine.sunsource.net/howto/shadow.html, along with a
	BDB server?

	-- Reuti


		and the following appears in
		/usr/sge/loni/spool/qmaster/messages (loni is the cell):

		08/10/2007 18:44:21|qmaster|cerebro-rmn1|E|commlib error: got read error (closing "cerebro-rsn2.data/qstat/6")


		Now, this may be related to another issue we're
		experiencing.  Originally, when we tried to migrate services,
		the migration shut down the qmaster on the first machine but
		failed to start it on the second, leaving the grid engine in
		an unusable state until qmaster was manually started on one
		or the other.  The main issue was that the lock file was not
		being deleted.  We hacked the script as follows:

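		# Poll for the qmaster lock file; remove it as soon as it appears.
		lock_file_read_retries=15
		lock_file_read_count=0
		lock_file_found=0
		while [ "$lock_file_read_count" -lt "$lock_file_read_retries" ]; do
		   if [ -f "$qmaster_spool_dir/lock" ]; then
		      # Added line: delete the lock file ourselves.
		      rm "$qmaster_spool_dir/lock"
		      lock_file_found=1
		      break
		   fi
		   # Wait before the next check, then count the attempt.
		   sleep 5
		   lock_file_read_count=`expr $lock_file_read_count + 1`
		done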

		The defaults are lock_file_read_retries=10 and sleep 3; we
		raised both values and added the "rm $qmaster_spool_dir/lock"
		line.  I would assume that migrate should already work on its
		own, but we added this as a (hopefully) temporary fix.  I have
		included this information in case it helps anybody figure out
		what's wrong.  Any assistance would be greatly appreciated.
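
To see what state a failed migration leaves behind before removing the lock file by hand, something along the following lines could be used. It is a rough sketch, not taken from the startup script: the function name is invented, both directories are passed in explicitly, and it assumes that the act_qmaster file in the cell's common directory records the current master host.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine.
# Usage: migration_state <common_dir> <qmaster_spool_dir>
migration_state() {
    common_dir=$1
    spool_dir=$2

    # act_qmaster is assumed to name the host currently recorded as master.
    if [ -f "$common_dir/act_qmaster" ]; then
        master=$(cat "$common_dir/act_qmaster")
        echo "act_qmaster: $master"
    else
        echo "act_qmaster: missing"
    fi

    # This is the lock file that the modified loop above removes.
    if [ -f "$spool_dir/lock" ]; then
        echo "lock: present"
    else
        echo "lock: absent"
    fi
}
```

For the cell described here, the call would presumably be `migration_state /usr/sge/loni/common /usr/sge/loni/spool/qmaster`, where the common directory is inferred from the spool path quoted earlier rather than stated in the report.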

		Thank you very much,
		Jonathan





	





