[GE users] Migration Issue + Conditional "failed receiving gdi request" error

Jonathan Pierce jonathan.pierce at loni.ucla.edu
Mon Aug 13 18:38:28 BST 2007


Hi Reuti,

Thank you for the reply.  The shadow master was set up as described,  
but we are using the classic spooling method; is this expected  
behavior for that choice?
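
For reference, this is roughly how the spooling method can be confirmed on a given cell. It is only a sketch: it assumes the usual layout where the cell directory holds common/bootstrap with a "spooling_method" line, and sge_spooling_method is a made-up helper name, not something shipped with Grid Engine.

```shell
# Hypothetical helper: print the spooling method recorded for a cell.
# Assumes <cell dir>/common/bootstrap contains a line such as
#   spooling_method  classic
sge_spooling_method() {
    bootstrap_file="$1/common/bootstrap"
    if [ ! -r "$bootstrap_file" ]; then
        echo "cannot read $bootstrap_file" >&2
        return 1
    fi
    # Print the second field of the spooling_method line
    awk '$1 == "spooling_method" { print $2 }' "$bootstrap_file"
}

# Example call for our cell (cell directory /usr/sge/loni):
#   sge_spooling_method /usr/sge/loni
```

The same common directory would normally also hold act_qmaster (the current master host) and shadow_masters (the hosts allowed to take over), so comparing those on both machines may be worthwhile.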

Sincerely,
Jonathan

Jonathan Pierce
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonathan.pierce at loni.ucla.edu


On Aug 11, 2007, at 4:41 AM, Reuti wrote:

> Hi,
>
> On 11.08.2007 at 04:08, Jonathan Pierce wrote:
>
>> We have two machines set up for qmaster responsibilities:
>> cerebro-rmn1 and cerebro-rmn2.  Right now, cerebro-rmn1 is master.  If I
>> attempt to qstat from cerebro-rmn2:
>>
>> <hdezmora at cerebro-rmn2.data> qstat
>> error: commlib error: got read error (closing "cerebro-rmn1.data/qmaster/1")
>> error: failed sending gdi request
>
> Did you set up the shadow master as described in
> http://gridengine.sunsource.net/howto/shadow.html and also a BDB server?
>
> -- Reuti
>
>> and the following appears in /usr/sge/loni/spool/qmaster/messages  
>> (loni is the cell):
>>
>> 08/10/2007 18:44:21|qmaster|cerebro-rmn1|E|commlib error: got read error (closing "cerebro-rsn2.data/qstat/6")
>>
>>
>> Now, this may be related to another issue we're experiencing.
>> Originally, when we tried to migrate services, the migration shut down
>> the qmaster but failed to start it on the second machine, leaving the
>> grid engine in an unusable state until qmaster was manually started on
>> one or the other.  The main issue was that the lock file was not
>> being deleted.  We hacked the script as follows:
>>
>>        lock_file_read_retries=15
>>        lock_file_read_count=0
>>        lock_file_found=0
>>        while [ $lock_file_read_count -lt $lock_file_read_retries ]; do
>>           if [ -f $qmaster_spool_dir/lock ]; then
>>              rm $qmaster_spool_dir/lock
>>              lock_file_found=1
>>              break
>>           fi
>>           sleep 5
>>           lock_file_read_count=`expr $lock_file_read_count + 1`
>>        done
>>
>> where the defaults are lock_file_read_retries=10 and sleep 3; the
>> "rm $qmaster_spool_dir/lock" line was added.  I would assume that
>> migrate should already work on its own, but we added this as a
>> (hopefully) temporary fix.  I've included this information in case it's
>> helpful to anybody in figuring out what's wrong.  Any assistance
>> would be greatly appreciated.
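
For anyone who wants to try the same workaround outside the stock script, the loop quoted above can be pulled into a small function. This is a minimal sketch under stated assumptions: wait_and_clear_lock is a hypothetical name, the retry count and delay are passed as arguments, and the rm is our own addition rather than anything the shipped migrate logic does.

```shell
# Sketch only: poll for a lock file and remove it once it shows up.
# wait_and_clear_lock is a hypothetical helper, not part of the SGE scripts.
wait_and_clear_lock() {
    lock_path="$1"        # e.g. $qmaster_spool_dir/lock
    retries="${2:-15}"    # how many times to poll (we used 15)
    delay="${3:-5}"       # seconds between polls (we used 5)
    count=0
    while [ "$count" -lt "$retries" ]; do
        if [ -f "$lock_path" ]; then
            rm -f "$lock_path"   # the step we added as a workaround
            return 0
        fi
        sleep "$delay"
        count=$((count + 1))
    done
    return 1                     # lock file never appeared
}

# Example call:
#   wait_and_clear_lock "$qmaster_spool_dir/lock" 15 5
```

The return status plays the role of lock_file_found in the quoted loop: 0 when the lock was seen and removed, 1 when the retries ran out.
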
>>
>> Thank you very much,
>> Jonathan