[GE users] Migration Issue + Conditional "failed receiving gdi request" error

McCalla, Mac macmccalla at hess.com
Mon Aug 13 22:12:25 BST 2007


Not that I can think of right now... this still sounds like a basic
environment issue to me.  The only other situation in which I have seen
the "failed receiving gdi request" message is when the qmaster
communication processes are too busy doing something else and the
request times out.
 
Mac

________________________________

From: Jonathan Pierce [mailto:jonathan.pierce at loni.ucla.edu] 
Sent: Monday, August 13, 2007 3:24 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Migration Issue + Conditional "failed receiving
gdi request" error


Hi Mac, 

We upgraded our systems on Friday to 6.1u2, and much care was taken to
ensure all systems involved were upgraded appropriately (as we had
experienced this issue with 6.1, as well; sidenote: this is a new 'test'
cluster we've just recently configured).  Unfortunately, there's nothing
helpful in those files -- just messages regarding our scheduled startups
and shutdowns.  Is there any other place we could look for debug output?

Thank you very much,
Jonathan

Jonathan Pierce
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonathan.pierce at loni.ucla.edu


On Aug 13, 2007, at 12:22 PM, McCalla, Mac wrote:


	Hi Jonathan,
	No, we use the classic spooling approach here (currently 6.0u7),
	and can migrate (most recently last month) between qmaster and
	shadow systems without doing anything more than the migrate
	command.  Is the environment variable setup the same when logging
	into the cerebro-rmn2 machine?  Is it possible there is a mismatch
	in SGE software levels between the two systems?  Are there any
	informative messages in the shadow master messages file
	( $SGE_ROOT/$CELL/spool/qmaster/messages_shadowd.$hostname )?
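
One way to compare the environment on the two hosts is to print the settings a client would pick up on each and diff the results. The following is only a sketch: the function name `sge_env_summary` is hypothetical, and it assumes the usual Grid Engine layout in which `$SGE_ROOT/$SGE_CELL/common/act_qmaster` names the current master and an unset `SGE_CELL` means the cell `default`.

```shell
# Print the Grid Engine settings a client on this host would use.
# sge_env_summary is a hypothetical helper name, not part of Grid Engine.
sge_env_summary() {
    cell="${SGE_CELL:-default}"            # Grid Engine falls back to "default"
    common="${SGE_ROOT:-}/$cell/common"
    echo "host=$(uname -n)"
    echo "SGE_ROOT=${SGE_ROOT:-<unset>}"
    echo "SGE_CELL=$cell"
    echo "SGE_QMASTER_PORT=${SGE_QMASTER_PORT:-<unset>}"
    # act_qmaster holds the name of the host currently acting as master
    if [ -r "$common/act_qmaster" ]; then
        echo "act_qmaster=$(cat "$common/act_qmaster")"
    else
        echo "act_qmaster=<unreadable>"
    fi
}
```

Running the function on cerebro-rmn1 and on cerebro-rmn2, saving each output to a file, and diffing the two files should differ only in the `host=` line if the environments match.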
	 
	Mac

________________________________

	From: Jonathan Pierce [mailto:jonathan.pierce at loni.ucla.edu] 
	Sent: Monday, August 13, 2007 12:38 PM
	To: users at gridengine.sunsource.net
	Subject: Re: [GE users] Migration Issue + Conditional "failed
receiving gdi request" error
	
	
	Hi Reuti, 

	Thank you for the reply.  The shadow master was set up as
	described, but we are using the classic spooling method; is this
	expected behavior for that choice?

	Sincerely,
	Jonathan

	


	On Aug 11, 2007, at 4:41 AM, Reuti wrote:


		Hi,

		Am 11.08.2007 um 04:08 schrieb Jonathan Pierce:


			We have two machines set up for qmaster responsibilities:
			cerebro-rmn1 and cerebro-rmn2.  Right now, cerebro-rmn1 is
			master.  If I attempt to run qstat from cerebro-rmn2:

			<hdezmora at cerebro-rmn2.data> qstat
			error: commlib error: got read error (closing "cerebro-rmn1.data/qmaster/1")
			error: failed sending gdi request


		Did you set up the shadow master as described in
		http://gridengine.sunsource.net/howto/shadow.html, and a BDB server?

		-- Reuti


			and the following appears in /usr/sge/loni/spool/qmaster/messages
			(loni is the cell):

			08/10/2007 18:44:21|qmaster|cerebro-rmn1|E|commlib error: got read error (closing "cerebro-rsn2.data/qstat/6")


			Now, this may be related to another issue we're
			experiencing.  Originally, when we tried to migrate services,
			the migration shut down the qmaster but failed to start it on
			the second host, leaving the grid engine in an unusable state
			until qmaster was manually started on one host or the other.
			The main issue was that the lock file was not being deleted.
			We hacked the script as follows:

			lock_file_read_retries=15
			lock_file_read_count=0
			lock_file_found=0
			while [ $lock_file_read_count -lt $lock_file_read_retries ]; do
			   if [ -f $qmaster_spool_dir/lock ]; then
			      rm $qmaster_spool_dir/lock
			      lock_file_found=1
			      break
			   fi
			   sleep 5
			   lock_file_read_count=`expr $lock_file_read_count + 1`
			done

			where the defaults are lock_file_read_retries=10 and sleep 3;
			the "rm $qmaster_spool_dir/lock" line was added.  I would
			assume that migrate should already work on its own, but we
			added this as a (hopefully) temporary fix.  I am including
			this information in case it helps anybody figure out what is
			wrong.  Any assistance would be greatly appreciated.
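
For experimenting with the retry count and sleep interval outside the startup script, the loop above can be wrapped in a small function. This is a sketch of the workaround only, not of the stock migrate logic: the name `remove_lock_when_present` and its argument order are hypothetical, and the defaults of 10 retries and a 3-second sleep are the ones quoted in this message.

```shell
# Wait for SPOOL_DIR/lock to appear, then remove it.
# Usage: remove_lock_when_present SPOOL_DIR [RETRIES] [INTERVAL_SECONDS]
# Returns 0 if the lock was found and removed, 1 otherwise.
# Hypothetical helper; it mirrors the hacked loop, not the stock script.
remove_lock_when_present() {
    spool_dir="$1"
    retries="${2:-10}"      # default quoted in the thread
    interval="${3:-3}"      # default quoted in the thread (seconds)
    count=0
    while [ "$count" -lt "$retries" ]; do
        if [ -f "$spool_dir/lock" ]; then
            # Report failure if the file exists but cannot be removed
            rm -f "$spool_dir/lock" && return 0
            return 1
        fi
        sleep "$interval"
        count=$((count + 1))
    done
    return 1
}
```

Calling it as `remove_lock_when_present "$qmaster_spool_dir" 15 5` corresponds to the modified values shown above. The exit status makes it easy to tell a lock that never appeared from one that was removed, which the original loop only records in lock_file_found.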

			Thank you very much,
			Jonathan





	