[GE users] Migration Issue + Conditional "failed receiving gdi request" error

Jonathan Pierce jonathan.pierce at loni.ucla.edu
Mon Aug 13 21:23:36 BST 2007


Hi Mac,

We upgraded our systems to 6.1u2 on Friday and took care to upgrade
every system involved, since we had already seen this issue with 6.1.
(Side note: this is a new 'test' cluster that we configured only
recently.)  Unfortunately, there is nothing helpful in those files,
just messages about our scheduled startups and shutdowns.  Is there
any other place we could look for debug output?
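
For scanning the messages files themselves, a small filter on the severity field can help separate errors from the routine startup and shutdown entries. The sketch below assumes the pipe-separated layout visible in the qmaster log line quoted further down in this thread (date and time, component, host, level, message); filter_level is a hypothetical helper name, not part of Grid Engine, and whether the shadowd messages file uses the same layout is an assumption.

```shell
#!/bin/sh
# Sketch only: filter_level is a hypothetical helper, not part of Grid Engine.
# Assumes the pipe-separated layout seen in the quoted qmaster log line:
#   date time|component|host|level|message
filter_level() {
    level=$1
    shift
    # Print only the lines whose fourth field equals the requested level.
    awk -F'|' -v lvl="$level" '$4 == lvl' "$@"
}

# Example: show error-level entries from the qmaster spool directory
# quoted in this thread (adjust the path for your own cell):
# filter_level E /usr/sge/loni/spool/qmaster/messages*
```

The first argument is the level letter to match (E in the quoted log line); the remaining arguments are the files to read.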

Thank you very much,
Jonathan

Jonathan Pierce
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonathan.pierce at loni.ucla.edu


On Aug 13, 2007, at 12:22 PM, McCalla, Mac wrote:

> Hi Jonathan
> No, we use the classic spooling approach here (currently 6.0u7), and
> can migrate (most recently last month) between the qmaster and shadow
> systems with nothing more than the migrate command.  Is the
> environment variable setup the same when logging into the
> cerebro-rmn2 machine?  Is it possible there is a mismatch in SGE
> software levels between the two systems?  Are there any informative
> messages in the shadow master messages file?
> ( $SGE_ROOT/$CELL/spool/qmaster/messages_shadowd.$hostname )
>
> Mac
>
> From: Jonathan Pierce [mailto:jonathan.pierce at loni.ucla.edu]
> Sent: Monday, August 13, 2007 12:38 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Migration Issue + Conditional "failed  
> receiving gdi request" error
>
> Hi Reuti,
>
> Thank you for the reply.  The shadow master was set up as  
> described, but we are using the classic spooling method; is this  
> expected behavior for that choice?
>
> Sincerely,
> Jonathan
>
>
>
> On Aug 11, 2007, at 4:41 AM, Reuti wrote:
>
>> Hi,
>>
>> On 11.08.2007 at 04:08, Jonathan Pierce wrote:
>>
>>> We have two machines set up for qmaster responsibilities:
>>> cerebro-rmn1 and cerebro-rmn2.  Right now, cerebro-rmn1 is master.
>>> If I attempt to qstat from cerebro-rmn2:
>>>
>>> <hdezmora at cerebro-rmn2.data> qstat
>>> error: commlib error: got read error (closing "cerebro-rmn1.data/qmaster/1")
>>> error: failed sending gdi request
>>
>> Did you set up the shadow master as described in
>> http://gridengine.sunsource.net/howto/shadow.html and also a BDB server?
>>
>> -- Reuti
>>
>>> and the following appears in /usr/sge/loni/spool/qmaster/messages  
>>> (loni is the cell):
>>>
>>> 08/10/2007 18:44:21|qmaster|cerebro-rmn1|E|commlib error: got read error (closing "cerebro-rsn2.data/qstat/6")
>>>
>>>
>>> Now, this may be related to another issue we're experiencing.
>>> Originally, when we tried to migrate services, the migration shut
>>> down the qmaster on the first machine but failed to start it on the
>>> second, leaving Grid Engine unusable until qmaster was manually
>>> started on one or the other.  The main issue was that the lock file
>>> was not being deleted.  We hacked the script as follows:
>>>
>>>        # Poll for the qmaster lock file, up to 15 times, 5 seconds apart
>>>        lock_file_read_retries=15
>>>        lock_file_read_count=0
>>>        lock_file_found=0
>>>        while [ $lock_file_read_count -lt $lock_file_read_retries ]; do
>>>           if [ -f $qmaster_spool_dir/lock ]; then
>>>              # Added: remove the lock file that was not being deleted
>>>              rm $qmaster_spool_dir/lock
>>>              lock_file_found=1
>>>              break
>>>           fi
>>>           sleep 5
>>>           lock_file_read_count=`expr $lock_file_read_count + 1`
>>>        done
>>>
>>> where the defaults are lock_file_read_retries=10 and sleep 3; the
>>> "rm $qmaster_spool_dir/lock" line was added.  I would assume that
>>> migrate should already work on its own, but we added this as a
>>> (hopefully) temporary fix.  I've included this information in case
>>> it helps anybody figure out what's wrong.  Any assistance would be
>>> greatly appreciated.
>>>
>>> Thank you very much,
>>> Jonathan
>>>
>>>
>>>
>>
>>
>
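
The retry loop quoted above could be factored into a small function so that the retry count and sleep interval become parameters. A minimal sketch follows; wait_for_file and its arguments are hypothetical names, not taken from the shipped startup script, and unlike the modified script it only waits for the file and does not delete it.

```shell
#!/bin/sh
# Sketch only: wait_for_file is a hypothetical helper, not the shipped script.
# Polls for a file up to `retries` times, sleeping `interval` seconds
# after each unsuccessful check.
# Returns 0 as soon as the file exists, 1 if it never appears.
wait_for_file() {
    path=$1
    retries=$2
    interval=$3
    count=0
    while [ "$count" -lt "$retries" ]; do
        if [ -f "$path" ]; then
            return 0
        fi
        sleep "$interval"
        count=$((count + 1))
    done
    return 1
}

# Example with the defaults mentioned in the thread (10 retries, 3 seconds);
# qmaster_spool_dir is the variable used in the quoted script.
# if wait_for_file "$qmaster_spool_dir/lock" 10 3; then
#     echo "lock file present"
# else
#     echo "lock file did not appear" >&2
# fi
```

Keeping the wait separate from any removal of the lock file makes it easier to see which of the two steps is actually failing during a migration.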



