[GE users] Migration Issue + Conditional "failed receiving gdi request" error

Hugo R. Hernández-Mora hugo.hernandez at loni.ucla.edu
Sun Aug 12 18:58:04 BST 2007



Reuti,
Yes, we configured the shadow hosts as described in the documentation you
referenced. We are not using a BDB server; we are using the classic
spooling method. The execution hosts use a local spool directory rather
than an NFS spool directory, but every host in the grid (qmaster, shadow,
and execution hosts) mounts the NFS filesystem /usr/sge.
- Hugo
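
For anyone who wants to confirm the same settings on their own grid, here is a minimal sketch that reads them from the cell's bootstrap file. It assumes SGE_ROOT is /usr/sge and SGE_CELL is loni (inferred from the paths in this thread) and that the file sits at $SGE_ROOT/$SGE_CELL/common/bootstrap. The helper name bootstrap_value is made up for illustration and is not part of SGE.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SGE): print the value stored for
# one key in a bootstrap-style file, where each line is "key  value".
bootstrap_value() {
    # $1 = path to the bootstrap file, $2 = key to look up
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Assumed defaults, taken from the paths mentioned in this thread.
bootstrap="${SGE_ROOT:-/usr/sge}/${SGE_CELL:-loni}/common/bootstrap"

# Only report when the file is actually readable on this host.
if [ -r "$bootstrap" ]; then
    method=$(bootstrap_value "$bootstrap" spooling_method)
    spool=$(bootstrap_value "$bootstrap" qmaster_spool_dir)
    echo "spooling_method:   $method"
    echo "qmaster_spool_dir: $spool"
fi
```

Running it on both the master and the shadow host should print the same two values if both really see the same /usr/sge over NFS.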

Reuti wrote:
> Hi,
>
> On 11.08.2007 at 04:08, Jonathan Pierce wrote:
>
>> We have two machines set up for qmaster responsibilities: 
>> cerebro-rmn1 and cerebro-rmn2.  Right now, cerebro-rmn1 is master.  
>> If I attempt to qstat from cerebro-rmn2:
>>
>> <hdezmora at cerebro-rmn2.data> qstat
>> error: commlib error: got read error (closing 
>> "cerebro-rmn1.data/qmaster/1")
>> error: failed sending gdi request
>
> Did you set up the shadow master as described in
> http://gridengine.sunsource.net/howto/shadow.html, and are you using a
> BDB server?
>
> -- Reuti
>
>> and the following appears in /usr/sge/loni/spool/qmaster/messages 
>> (loni is the cell):
>>
>> 08/10/2007 18:44:21|qmaster|cerebro-rmn1|E|commlib error: got read 
>> error (closing "cerebro-rsn2.data/qstat/6")
>>
>>
>> Now, this may be related to another issue we're experiencing.  
>> Originally, when we tried to migrate services, it shut down the 
>> qmaster, but failed to start it on the second, leaving the grid 
>> engine in an unusable state until qmaster was manually started on one 
>> or the other.  The main issue was that the lock file was not being 
>> deleted.  We hacked the script as follows:
>>
>> lock_file_read_retries=15
>> lock_file_read_count=0
>> lock_file_found=0
>> while [ $lock_file_read_count -lt $lock_file_read_retries ]; do
>>    if [ -f $qmaster_spool_dir/lock ]; then
>>       rm $qmaster_spool_dir/lock
>>       lock_file_found=1
>>       break
>>    fi
>>    sleep 5
>>    lock_file_read_count=`expr $lock_file_read_count + 1`
>> done
>>
>> The defaults are lock_file_read_retries=10 and sleep 3; we raised them
>> to 15 and 5 and added the "rm $qmaster_spool_dir/lock" line.  I would
>> assume that migrate should already work on its own, but we added this
>> as a (hopefully) temporary fix.  I am including this information in
>> case it helps anybody figure out what's wrong.  Any assistance would
>> be greatly appreciated.
>>
>> Thank you very much,
>> Jonathan
>>
>> Jonathan Pierce
>> System Administrator
>> Laboratory of Neuro Imaging, UCLA
>> 635 Charles E. Young Drive South, Suite 225
>> Los Angeles, CA 90095-7332
>> Tel: 310.267.5076
>> Cell: 310.487.8365
>> Fax: 310.206.5518
>> jonathan.pierce at loni.ucla.edu
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
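
The retry loop quoted above can also be written as a standalone function, which makes the timing easier to reason about and to try outside the startup script. This is only a sketch of the workaround as described in this thread, not the stock SGE script; the function name and its three arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of the SGE startup scripts.
# Polls for <spool_dir>/lock up to <retries> times, sleeping <interval>
# seconds after each miss. Removes the file and returns 0 as soon as it
# is seen; returns 1 if it never appears.
remove_lock_when_present() {
    spool_dir=$1
    retries=$2
    interval=$3
    count=0
    while [ "$count" -lt "$retries" ]; do
        if [ -f "$spool_dir/lock" ]; then
            rm -f "$spool_dir/lock"
            return 0
        fi
        sleep "$interval"
        count=`expr "$count" + 1`
    done
    return 1
}
```

Called as remove_lock_when_present "$qmaster_spool_dir" 15 5, it gives up after at most 15 x 5 = 75 seconds if the lock file never appears, compared with 10 x 3 = 30 seconds for the default values.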

-- 
Hugo R. Hernandez-Mora
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Fax: 310.206.5518
hugo.hernandez at loni.ucla.edu
-- 

"If your efforts have been met with indifference, do not be discouraged: the sun puts on a marvelous show every morning while most people are still asleep."




