[GE users] Migration Issue + Conditional "failed receiving gdi request" error

Jonathan Pierce jonathan.pierce at loni.ucla.edu
Tue Aug 14 01:36:22 BST 2007


Thanks again to everybody who offered advice.  It turns out the qstat
error was caused by cerebro-rmn2 still having its MTU set to 9000 after
we had finished testing jumbo frames and had set the switch back to the
standard MTU (it's always the simple things...).  I tried removing the
'hack' from the sgemaster script, but migration still behaves as it did
before.  I'm going to spend some time debugging this; I suspect the
problem lies with the NFS config.  I'll update here if I reach a
conclusion.
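
For anyone chasing a similar symptom, here is a minimal sketch of how
such an MTU mismatch could be checked from the shell.  It assumes a
Linux host (MTU read from /sys/class/net) and the iputils ping, whose
"-M do" option forbids fragmentation.  The interface name eth0 and the
function names are illustrative assumptions, not part of our setup.

```shell
#!/bin/sh
# Sketch: detect an MTU mismatch between a host and the network path.
# Assumes Linux (/sys/class/net) and iputils ping ("-M do" = don't fragment).

# Print the configured MTU of an interface, e.g. "eth0" (hypothetical name).
iface_mtu() {
    cat "/sys/class/net/$1/mtu"
}

# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send full-size, non-fragmentable pings to a peer.
# $1 = peer host, $2 = MTU to test.  Returns non-zero if the pings fail,
# which is what happens when the path cannot carry packets of that size.
path_carries_mtu() {
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "$2")" "$1" >/dev/null 2>&1
}

# Example (hostname taken from this thread; interface name is an assumption):
#   iface_mtu eth0
#   path_carries_mtu cerebro-rmn1.data 9000 || echo "path does not carry MTU 9000"
```

If the interface reports 9000 but the full-size ping only succeeds at
1500, the host and the switch disagree about the MTU.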

Sincerely,
Jonathan

Jonathan Pierce
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonathan.pierce at loni.ucla.edu


On Aug 13, 2007, at 3:05 PM, McCalla, Mac wrote:

> Another thought: look into the $SGE_ROOT/util/dl scripts, which can be
> used to set the SGE debug level.  Try setting the level to 1 or 2 and
> then issuing the qstat command.  This may give you something useful.
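
A minimal sketch of that suggestion for a Bourne-style shell follows.
It assumes that $SGE_ROOT/util/dl.sh, once sourced, defines a shell
function named dl that takes the debug level; check the script in your
own installation before relying on this.  The wrapper name
enable_sge_debug and the log path are hypothetical.

```shell
#!/bin/sh
# Sketch: turn on SGE client debug output for the current shell session.
# Assumption: $SGE_ROOT/util/dl.sh defines a function "dl <level>".

enable_sge_debug() {
    # $1 = debug level, e.g. 1 or 2
    dl_script="${SGE_ROOT:-}/util/dl.sh"
    if [ ! -r "$dl_script" ]; then
        echo "cannot read $dl_script; is SGE_ROOT set?" >&2
        return 1
    fi
    # Source the helper so that dl can export its settings into this shell.
    . "$dl_script"
    dl "$1"
}

# Example (the log path is a hypothetical choice):
#   enable_sge_debug 2 && qstat 2>&1 | tee /tmp/qstat-debug.log
```

The script has to be sourced rather than executed, because the debug
settings must end up in the environment of the shell that runs qstat.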
>
> Mac
>
> From: Jonathan Pierce [mailto:jonathan.pierce at loni.ucla.edu]
> Sent: Monday, August 13, 2007 3:24 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Migration Issue + Conditional "failed  
> receiving gdi request" error
>
> Hi Mac,
>
> We upgraded our systems on Friday to 6.1u2, and much care was taken  
> to ensure all systems involved were upgraded appropriately (as we  
> had experienced this issue with 6.1, as well; sidenote: this is a  
> new 'test' cluster we've just recently configured).  Unfortunately,  
> there's nothing helpful in those files -- just messages regarding  
> our scheduled startups and shutdowns.  Is there any other place we  
> could look for debug output?
>
> Thank you very much,
> Jonathan
>
>
>
> On Aug 13, 2007, at 12:22 PM, McCalla, Mac wrote:
>
>> Hi Jonathan
>> No, we use the classic spooling approach here (currently 6.0u7), and
>> can migrate (most recently last month) between qmaster and shadow
>> systems without doing anything more than the migrate command.  Is
>> the environment variable setup the same when logging into the
>> cerebro-rmn2 machine?  Is it possible there is a mismatch in SGE
>> software levels between the two systems?  Are there any informative
>> messages in the shadow master messages file?
>> ( $SGE_ROOT/$CELL/spool/qmaster/messages_shadowd.$hostname )
>>
>> Mac
>>
>> From: Jonathan Pierce [mailto:jonathan.pierce at loni.ucla.edu]
>> Sent: Monday, August 13, 2007 12:38 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Migration Issue + Conditional "failed  
>> receiving gdi request" error
>>
>> Hi Reuti,
>>
>> Thank you for the reply.  The shadow master was set up as  
>> described, but we are using the classic spooling method; is this  
>> expected behavior for that choice?
>>
>> Sincerely,
>> Jonathan
>>
>>
>>
>> On Aug 11, 2007, at 4:41 AM, Reuti wrote:
>>
>>> Hi,
>>>
>>> On 11.08.2007 at 04:08, Jonathan Pierce wrote:
>>>
>>>> We have two machines set up for qmaster responsibilities:  
>>>> cerebro-rmn1 and cerebro-rmn2.  Right now, cerebro-rmn1 is  
>>>> master.  If I attempt to qstat from cerebro-rmn2:
>>>>
>>>> <hdezmora at cerebro-rmn2.data> qstat
>>>> error: commlib error: got read error (closing "cerebro-rmn1.data/qmaster/1")
>>>> error: failed sending gdi request
>>>
>>> Did you set up the shadow master as described in
>>> http://gridengine.sunsource.net/howto/shadow.html and a BDB server?
>>>
>>> -- Reuti
>>>
>>>> and the following appears in /usr/sge/loni/spool/qmaster/messages
>>>> (loni is the cell):
>>>>
>>>> 08/10/2007 18:44:21|qmaster|cerebro-rmn1|E|commlib error: got read error (closing "cerebro-rsn2.data/qstat/6")
>>>>
>>>>
>>>> Now, this may be related to another issue we're experiencing.
>>>> Originally, when we tried to migrate services, the migration shut
>>>> down the qmaster on the first machine but failed to start it on the
>>>> second, leaving Grid Engine in an unusable state until the qmaster
>>>> was started manually on one or the other.  The main issue was that
>>>> the lock file was not being deleted.  We hacked the script as
>>>> follows:
>>>>
>>>> lock_file_read_retries=15
>>>> lock_file_read_count=0
>>>> lock_file_found=0
>>>> while [ $lock_file_read_count -lt $lock_file_read_retries ]; do
>>>>    if [ -f $qmaster_spool_dir/lock ]; then
>>>>       rm $qmaster_spool_dir/lock
>>>>       lock_file_found=1
>>>>       break
>>>>    fi
>>>>    sleep 5
>>>>    lock_file_read_count=`expr $lock_file_read_count + 1`
>>>> done
>>>>
>>>> where the defaults are lock_file_read_retries=10 and sleep 3;  
>>>> the "rm $qmaster_spool_dir/lock" line was added.  I would assume  
>>>> that migrate should already work on its own, but we added this  
>>>> as a (hopeful) temporary fix.  Included this information in case  
>>>> it's helpful to anybody in figuring out what's wrong.  Any  
>>>> assistance would be greatly appreciated.
>>>>
>>>> Thank you very much,
>>>> Jonathan
>>>>
>>>>
>>>>
>>>
>>>
>>
>



