[GE users] Migration Issue + Conditional "failed receiving gdi request" error

Reuti reuti at staff.uni-marburg.de
Mon Aug 13 23:34:31 BST 2007


Hi,

Am 13.08.2007 um 19:38 schrieb Jonathan Pierce:

> Thank you for the reply.  The shadow master was set up as  
> described, but we are using the classic spooling method; is this  
> expected behavior for that choice?

So your spool and qmaster directory is on another machine. How is it
shared to the two qmaster hosts: NFSv2/3/4, and which mount options
did you use?
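
Just as a point of reference, a minimal sketch of what I mean; the
file server name, export path and options below are only placeholders
and guesses, not taken from your setup:

   # on the file server, /etc/exports
   /export/sge  cerebro-rmn1(rw,sync,no_root_squash)  cerebro-rmn2(rw,sync,no_root_squash)

   # on cerebro-rmn1 and cerebro-rmn2, e.g. in /etc/fstab (Linux syntax)
   fileserver:/export/sge  /usr/sge  nfs  rw,hard,intr,tcp,nfsvers=3  0 0

The important point is that the qmaster spool directory stays
writable from both qmaster candidates for whatever user the daemons
spool as.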

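Regarding the lock file hack further down in your mail: on a clean
shutdown the old sge_qmaster should remove its own lock file, so if
it stays behind I would first check whether the old daemon really
terminated and whether it can still write to the NFS-mounted spool
directory at all.  A rough, untested sketch of a more defensive
variant (the spool path is the one from your messages file, the rest
is my assumption):

   #!/bin/sh
   # Sketch only: remove a leftover lock file, but only if no
   # sge_qmaster is still running on this host.  It cannot check the
   # other qmaster candidate, so verify that one by hand first.
   QMASTER_SPOOL=/usr/sge/loni/spool/qmaster

   if [ -f "$QMASTER_SPOOL/lock" ]; then
      if pgrep -x sge_qmaster >/dev/null 2>&1; then
         echo "sge_qmaster still running here, leaving lock alone" >&2
      else
         echo "removing stale $QMASTER_SPOOL/lock"
         rm -f "$QMASTER_SPOOL/lock"
      fi
   fi

But this only papers over the symptom; the real question is why the
old qmaster does not remove the lock itself during the migration.
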
-- Reuti

> Sincerely,
> Jonathan
>
> Jonathan Pierce
> System Administrator
> Laboratory of Neuro Imaging, UCLA
> 635 Charles E. Young Drive South, Suite 225
> Los Angeles, CA 90095-7332
> Tel: 310.267.5076
> Cell: 310.487.8365
> Fax: 310.206.5518
> jonathan.pierce at loni.ucla.edu
>
>
> On Aug 11, 2007, at 4:41 AM, Reuti wrote:
>
>> Hi,
>>
>> Am 11.08.2007 um 04:08 schrieb Jonathan Pierce:
>>
>>> We have two machines set up for qmaster responsibilities:
>>> cerebro-rmn1 and cerebro-rmn2.  Right now, cerebro-rmn1 is master.
>>> If I attempt to qstat from cerebro-rmn2:
>>>
>>> <hdezmora at cerebro-rmn2.data> qstat
>>> error: commlib error: got read error (closing "cerebro-rmn1.data/qmaster/1")
>>> error: failed sending gdi request
>>
>> Did you set up the shadow master as described in
>> http://gridengine.sunsource.net/howto/shadow.html, and a BDB server?
>>
>> -- Reuti
>>
>>> and the following appears in /usr/sge/loni/spool/qmaster/messages  
>>> (loni is the cell):
>>>
>>> 08/10/2007 18:44:21|qmaster|cerebro-rmn1|E|commlib error: got read error (closing "cerebro-rsn2.data/qstat/6")
>>>
>>>
>>> Now, this may be related to another issue we're experiencing.
>>> Originally, when we tried to migrate services, the migration shut
>>> down the qmaster but failed to start it on the second host, leaving
>>> the grid engine in an unusable state until qmaster was manually
>>> started on one or the other.  The main issue was that the lock file
>>> was not being deleted.  We hacked the script as follows:
>>>
>>> lock_file_read_retries=15
>>> lock_file_read_count=0
>>> lock_file_found=0
>>> while [ $lock_file_read_count -lt $lock_file_read_retries ]; do
>>>    if [ -f $qmaster_spool_dir/lock ]; then
>>>       rm $qmaster_spool_dir/lock
>>>       lock_file_found=1
>>>       break
>>>    fi
>>>    sleep 5
>>>    lock_file_read_count=`expr $lock_file_read_count + 1`
>>> done
>>>
>>> where the defaults are lock_file_read_retries=10 and sleep 3; the
>>> "rm $qmaster_spool_dir/lock" line is what we added.  I would assume
>>> that migrate should already work on its own, but we added this as a
>>> (hopefully) temporary fix.  I've included this information in case
>>> it's helpful to anybody in figuring out what's wrong.  Any
>>> assistance would be greatly appreciated.
>>>
>>> Thank you very much,
>>> Jonathan
>>>
>>> Jonathan Pierce
>>> System Administrator
>>> Laboratory of Neuro Imaging, UCLA
>>> 635 Charles E. Young Drive South, Suite 225
>>> Los Angeles, CA 90095-7332
>>> Tel: 310.267.5076
>>> Cell: 310.487.8365
>>> Fax: 310.206.5518
>>> jonathan.pierce at loni.ucla.edu
>>>
>>>
>>
>

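P.S.: Just to double check the shadow setup itself, a minimal sketch
of the relevant pieces; paths assume the usual layout with SGE_ROOT
/usr/sge and cell "loni", and $ARCH is a placeholder for your binary
architecture:

   # /usr/sge/loni/common/shadow_masters : primary qmaster first,
   # then the shadow host(s), one hostname per line
   cerebro-rmn1
   cerebro-rmn2

   # start the shadow daemon on the shadow host(s)
   /usr/sge/bin/$ARCH/sge_shadowd

Clients like qstat look up the current master in
/usr/sge/loni/common/act_qmaster; if that file lags behind after a
takeover (NFS attribute caching can do that), you can see exactly the
kind of commlib/gdi errors you quoted.
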
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



